PerBOLD: A Big Dataset of Persian Offensive Language on Instagram Comments

Article type: Research article

Authors

1 Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran.

2 Institute for Humanities and Cultural Studies, Ghom, Iran.

Abstract

Easy access to social media enables users to express their opinions and ideologies about various topics, such as news, videos, and personalities, freely, without fear, and often in an offensive manner. Detecting comments that contain offensive language on social media platforms is a vital task, and it relies on a complete and comprehensive tagged dataset. Therefore, in this paper, we introduce and make publicly available PerBOLD, a new Persian comment dataset collected from Instagram, a popular platform among Iranians. We follow a two-level manual annotation process to determine whether a comment contains offensive language and to assign fine-grained tags for the different types of offensive language. Furthermore, we present and analyze some interesting aspects of the data.

Authors [English]

  • Maryam Khodabakhsh 1
  • F. Jafarinejad 1
  • M. Rahimi 1
  • M. Ghayoomi 2

Keywords

  • Natural language processing
  • offensive language
  • social media
  • annotation