From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection
Published in The International Conference on Knowledge and Systems Engineering (KSE), 2020
Recommended citation: Q. H. Pham, V. Anh Nguyen, L. B. Doan, N. N. Tran and T. M. Thanh, "From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection," 2020 12th International Conference on Knowledge and Systems Engineering (KSE), 2020, pp. 37-42, doi: 10.1109/KSE50997.2020.9287406. https://ieeexplore.ieee.org/document/9287406
Abstract
Natural language processing (NLP) is a fast-growing field of artificial intelligence. Since the Transformer was introduced by Google in 2017, a large number of language models such as BERT, GPT, and ELMo have been inspired by this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully-designed pipeline to mitigate problems of the datasets such as lack of training data and imbalanced data. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune the PhoBERT on our dataset by re-training the model on the Masked Language Model (MLM) task; then, we employ its encoder for text classification. In order to preserve pre-trained weights while learning new feature representations, we further utilize different training techniques: layer freezing, block-wise learning rate, and label smoothing. Our experiments proved that our proposed pipeline boosts the performance significantly, achieving a new state-of-the-art on Vietnamese Hate Speech Detection (HSD) campaign 2 with 0.7221 F1 score.
Citation
@INPROCEEDINGS{9287406, author={Pham, Quang Huu and Anh Nguyen, Viet and Doan, Linh Bao and Tran, Ngoc N. and Thanh, Ta Minh}, booktitle={2020 12th International Conference on Knowledge and Systems Engineering (KSE)}, title={From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection}, year={2020}, volume={}, number={}, pages={37-42}, doi={10.1109/KSE50997.2020.9287406}}