Using generative adversarial networks to increase the classification efficiency of imbalanced user reviews

Article type: Research article

Authors

Associate Professor, Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran

Abstract

Text generation methods use artificial intelligence to produce natural-language text automatically. One application of text generation is text classification. Many real-world problems involve imbalanced textual data, which can reduce classification performance. One approach to the imbalanced-data problem is oversampling the minority class. Given the progress of generative adversarial networks (GANs) in data generation, these networks can be used to produce the text samples needed for oversampling. Generating text with GANs is difficult because text is discrete. Despite their potential, these networks have rarely been studied as a solution to imbalanced textual data. This paper examines the effect of using the SentiGAN network to address imbalance in user reviews, with the aim of improving classification performance. After the proposed method and the evaluation framework are presented, four classification algorithms are run on the data, and several evaluation metrics are computed and analyzed before and after oversampling. The results are also compared with traditional and recent oversampling methods. Oversampling with the proposed method increases accuracy, precision, specificity, and F-score for the minority class, both relative to the imbalanced data and in comparison with the other oversampling methods.

Article title [English]

Using generative adversarial networks to increase the classification efficiency of imbalanced user reviews

Authors [English]

  • B. Javid
  • H. Mashayekhi
Faculty of Computer Engineering, Shahrood University of Technology, Shahrood, Iran

Abstract [English]

Text generation methods use artificial intelligence to generate natural-language text automatically. One application of text generation is text classification. Many real-world problems involve imbalanced textual data, which can reduce classification efficiency. One approach to the imbalanced-data problem is oversampling the minority class. Given the progress of generative adversarial networks (GANs) in data generation, these networks can be used to generate text samples for oversampling. Generating text with GANs is a complex problem because text is discrete. Despite their potential, the use of these networks to address imbalanced textual data has rarely been investigated. This article examines the effect of using the SentiGAN network to address imbalance in user reviews, with the aim of improving classification efficiency. To evaluate the proposed method, four classification algorithms were run on the data before and after oversampling with traditional, recent, and SentiGAN-based methods, and evaluation metrics were calculated. Oversampling with SentiGAN increased the accuracy, precision, specificity, and F-score of the minority class (class zero) compared with both the imbalanced data and the data oversampled by the other methods.
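
The evaluation loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hypothetical `random_oversample` duplicates existing minority reviews as a simple stand-in for reviews generated by SentiGAN, and `minority_metrics`, the toy reviews, and the choice of label 0 as the minority class are all invented for the example.

```python
import random
from collections import Counter


def random_oversample(texts, labels, minority_label, seed=0):
    """Balance two classes by duplicating randomly chosen minority samples.

    Assumption: this is a placeholder for a generator. In the setting the
    abstract describes, the extra samples would come from SentiGAN instead.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_size = max(counts.values())
    minority = [t for t, y in zip(texts, labels) if y == minority_label]
    needed = majority_size - counts[minority_label]
    extra = [rng.choice(minority) for _ in range(needed)]
    return list(texts) + extra, list(labels) + [minority_label] * needed


def minority_metrics(y_true, y_pred, positive=0):
    """Accuracy, precision, specificity and F-score, with `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "specificity": specificity,
        "f_score": f_score,
    }


# Toy data: one negative review (label 0, minority) and four positive ones.
reviews = ["bad", "good", "great", "fine", "nice"]
sentiments = [0, 1, 1, 1, 1]
balanced_reviews, balanced_sentiments = random_oversample(
    reviews, sentiments, minority_label=0
)
print(Counter(balanced_sentiments))
```

In a full experiment, each classifier would be trained once on the original training set and once on the balanced one, and `minority_metrics` would be computed on the same untouched test set in both cases so that the two runs are comparable.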

Keywords [English]

  • Generative adversarial networks (GAN)
  • imbalanced text classification
  • oversampling
  • imbalanced text
  • classification