Hybrid Subword-Character Representation for Robust Sentiment Classification on Multilingual and Code-Mixed Indonesian Text
DOI: https://doi.org/10.70062/jeci.v2i1.276
Keywords: Code-Mixing, Multilingual NLP, Robustness, Sentiment Analysis, XLM-R
Abstract
User-generated Indonesian text frequently exhibits code-mixing with English (“Indonglish”), informal spelling, elongation, and keyboard typos. These phenomena break subword tokenization assumptions and may degrade multilingual Transformer performance in deployment. This paper studies a hybrid representation that fuses XLM-R sentence features with a character-level CharCNN branch designed to capture orthographic patterns and mitigate character noise. We evaluate (i) a standard XLM-R fine-tuning baseline, (ii) an ablation that removes the character branch (NusaX only), and (iii) the proposed hybrid model on two datasets: NusaX-Senti (12 regional languages) and Indonglish (Indonesian–English code-mixed sentiment). Beyond clean test performance, we introduce a controlled robustness protocol by injecting character-level perturbations with probability p=0.18 and measuring performance drop. Results show that the XLM-R baseline achieves the best clean Macro-F1 on both datasets, while the hybrid model substantially improves robustness on Indonglish by reducing Macro-F1 drop from 0.030 to 0.007 under noise. We analyze common error confusions and discuss when character-aware features help or harm across languages.
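The fusion described in the abstract admits a straightforward concatenation-based implementation. The PyTorch sketch below assumes an XLM-R encoder pooled at its first token and a small multi-kernel CharCNN over raw character indices; the layer sizes, kernel widths, character vocabulary, and single linear classification head are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the hybrid XLM-R + CharCNN classifier described in the
# abstract. Hyperparameters and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn
from transformers import XLMRobertaModel


class CharCNN(nn.Module):
    """Convolutions over character embeddings, max-pooled per kernel size."""

    def __init__(self, n_chars=256, emb_dim=64, n_filters=128, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.out_dim = n_filters * len(kernel_sizes)

    def forward(self, char_ids):                 # char_ids: (batch, max_chars)
        x = self.emb(char_ids).transpose(1, 2)   # (batch, emb_dim, max_chars)
        # Max-pool each convolution over the character axis, then concatenate.
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)           # (batch, out_dim)


class HybridSentimentClassifier(nn.Module):
    """XLM-R sentence features concatenated with CharCNN orthographic features."""

    def __init__(self, num_labels=3, model_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained(model_name)
        self.char_branch = CharCNN()
        hidden = self.encoder.config.hidden_size + self.char_branch.out_dim
        self.classifier = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, num_labels))

    def forward(self, input_ids, attention_mask, char_ids):
        # Sentence-level feature: hidden state of the first (<s>) token.
        sent = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        char = self.char_branch(char_ids)        # character ids, e.g. UTF-8 bytes
        return self.classifier(torch.cat([sent, char], dim=1))
```

Fusing by simple concatenation keeps the character branch cheap and leaves the subword encoder untouched, so the classifier head can weigh orthographic evidence alongside the contextual representation; other fusion schemes (gating, attention) would also fit this interface.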
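The robustness protocol (character-level perturbations applied with probability p=0.18) can be reproduced in spirit with a simple per-character corruptor. The concrete edit operations below (transposition, deletion, random substitution, elongation) are assumptions about what "keyboard typos and elongation" might include; the paper's exact noise model may differ.

```python
# Minimal sketch of the controlled perturbation protocol: each alphabetic
# character is independently corrupted with probability p = 0.18. The set of
# edit operations is an assumption, not the paper's exact implementation.
import random


def perturb(text, p=0.18, seed=None):
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isalpha() and rng.random() < p:
            op = rng.choice(["swap", "delete", "substitute", "elongate"])
            if op == "swap" and i + 1 < len(text):
                out.append(text[i + 1])          # transpose with the next character
                out.append(ch)
                i += 2
                continue
            if op == "delete":                   # drop the character entirely
                i += 1
                continue
            if op == "substitute":               # replace with a random letter
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
            else:                                # elongate, e.g. "bagus" -> "baguusss"
                out.append(ch * rng.randint(2, 4))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Example: evaluate the same trained model on clean and perturbed copies of the
# test set and report the Macro-F1 drop.
print(perturb("filmnya bagus banget, so good!", seed=13))
```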