Topic Detection on COVID-19 Tweets: A Comparative Study on Clustering and Transfer Learning Models

Article type: Research article

Authors

  • E. Zafarani-Moattar 1
  • M. R. Kangavari 2
  • A. M. Rahmani 1
1 Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
2 Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Abstract

Automatic topic detection has become indispensable in social media analysis because of the large volume of text that users generate. Clustering-based methods are among the most important and current approaches to topic detection, and this research surveys that category broadly. The paper therefore studies the three main components of clustering-based topic detection: embedding methods, distance metrics, and clustering algorithms. Transfer learning, and with it pretrained language models and word embeddings, has attracted considerable attention in recent years. Given the importance of the embedding step, the paper compares the efficiency of five embedding methods, ranging from earlier to more recent ones. Two commonly used distance metrics and five clustering algorithms that are prominent in topic detection were implemented by the authors for the study. Because COVID-19 has been a trending topic on social networks in recent years, the dataset consists of one month of tweets collected with COVID-19-related hashtags. More than 7500 experiments were performed to tune the adjustable parameters. All combinations of embedding methods, distance metrics, and clustering algorithms (50 combinations) were then evaluated using the Silhouette metric. The results show that T5 strongly outperforms the other embedding methods, cosine distance is slightly better than the other distance metric, and DBSCAN is superior to the other clustering algorithms.
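
As a rough illustration of the evaluation step described in the abstract, the sketch below scores one clustering with the Silhouette coefficient under two distance metrics. It is not the authors' implementation. The abstract names cosine distance but not the second metric, so Euclidean distance is an assumption here. The toy two-dimensional vectors stand in for real tweet embeddings, the hand-assigned labels stand in for a clustering algorithm's output, and all function and variable names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """Cosine distance, i.e. 1 minus the cosine similarity (vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette(points, labels, dist):
    """Mean Silhouette coefficient; assumes at least two distinct cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Toy stand-ins for tweet embeddings and cluster assignments (illustrative only).
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
labels = [0, 0, 1, 1]

for name, dist in [("euclidean", euclidean), ("cosine", cosine_distance)]:
    print(name, round(silhouette(embeddings, labels, dist), 3))
```

In a full comparison, the same scoring function would be applied to every combination of embedding method, distance metric, and clustering algorithm, and the combinations ranked by their scores. A score near 1 indicates compact, well-separated clusters, while a negative score suggests that points sit closer to another cluster than to their own.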

Keywords

  • Topic Detection
  • Transfer Learning
  • Embedding Methods
  • Distance Metrics
  • Clustering Methods
  • COVID-19