DynamicEvoStream: An EvoStream based Algorithm for Dynamically Determining The Number of Clusters in Data Streams

نوع مقاله : علمی-پژوهشی

نویسندگان

Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran

چکیده

EvoStream is a stream clustering algorithm which gradually clusters data in the idle times of the stream. In comparison with other algorithms in this field, EvoStream has a lower computation overload in the offline phase and has better accuracy. Also, in this algorithm, the number of clusters is taken as constant whereas in an authentic stream this number varies with the complexity of input data. In this work, we present DynamicEvoStream as an improved version of the original EvoStream. In this algorithm, we detect and exploit variations in the distribution and speed of the stream.  Also, we modified the cleanup function to merge overlapping clusters. Therefore, in contrast to the basic EvoStream, DynamicEvoStream identifies the number of clusters in a dynamic manner. Also, the speed of evolutionary steps is increased while improving the quality of the clusters. Finally, experiments using DynamicEvoStream on different streams showed that it can cluster the stream up to four times faster than the original EvoStream with fewer computation and memory resources. In the worst case, the quality of clusters is competitive to the original EvoStream, however improves the quality of clusters up to 30% in the majority of cases.

کلیدواژه‌ها


عنوان مقاله [English]

DynamicEvoStream: An EvoStream based Algorithm for Dynamically Determining The Number of Clusters in Data Streams

نویسندگان [English]

  • Z. Amighi
  • M. Yousef Sanati
  • M. Dezfoulian
Department of Computer Engineering, Faculty of Engineering, Bu-Ali Sina University, Hamedan, Iran
چکیده [English]

EvoStream is a stream clustering algorithm which gradually clusters data in the idle times of the stream. In comparison with other algorithms in this field, EvoStream has a lower computation overload in the offline phase and has better accuracy. Also, in this algorithm, the number of clusters is taken as constant whereas in an authentic stream this number varies with the complexity of input data. In this work, we present DynamicEvoStream as an improved version of the original EvoStream. In this algorithm, we detect and exploit variations in the distribution and speed of the stream.  Also, we modified the cleanup function to merge overlapping clusters. Therefore, in contrast to the basic EvoStream, DynamicEvoStream identifies the number of clusters in a dynamic manner. Also, the speed of evolutionary steps is increased while improving the quality of the clusters. Finally, experiments using DynamicEvoStream on different streams showed that it can cluster the stream up to four times faster than the original EvoStream with fewer computation and memory resources. In the worst case, the quality of clusters is competitive to the original EvoStream, however improves the quality of clusters up to 30% in the majority of cases.

کلیدواژه‌ها [English]

  • Data stream
  • Evostream Algorithm
  • Data stream clustering
  • Dynamic clustering
  • Online Clustering
[1] Shayesteh, V. Hakami, S.A. Mostafavi, A. Akbari Azirani, “A Novel Trust Computation Scheme for Internet of Things Applications”, Tabriz Journal of Electrical Engineering, vol 50, no. 2, pp. 743-755, 2020 (in persian).
[2] Chen, S. Mao, Y. Liu, “Big Data: A Survey”, Mobile Networks and Applications, vol. 19, pp. 171-209, 2014.
[3] Kolajo, O. Daramola, A. Adebiyi, “Big data stream analysis: a systematic literature review”, Journal of Big Data, vol.47, no. 6, 2019.
[4] Gurusamy, S. Kannan, K. Nandhini, "The Real Time Big Data Processing Framework: Advantages and Limitations", International Journal of Computer Sciences and Engineering, vol. 5, no. 12, pp. 305-312, 2017.
[5] Namiot, “On Big Data Stream Processing”, International Journal of Open Information Technologies, vol. 3, no. 8, pp. 48-51, 2015.
[6] Carnein, H. Trautmann, "EvoStream Evolutionary Stream Clustering Utilizing Idle Times", Big Data Research, vol. 14, pp. 101–111, 2018.
[7] K. Jain, M.N. Murty, P.J. Flynn, “Data clustering: a review”, ACM Computing Surveys, vol.31, no. 3, pp. 264-323, 1999.
[8] Ester, H.P. Kriegel, J. Sander, X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise”, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996.
[9] Behravan , S.H. Zahiri, S.M. Razavi, R. Trasarti, “Using Gray Wolf Optimization Algorithm in Big Data Clustering”, Tabriz Journal of Electrical Engineering, vol. 50, no. 1, 2020 (in persian).
[10] Leite DF, Costa P, Gomide F (2009) Evolving granular classification neural networks, IJCNN 2009, pp 1736–1743
[11] Smith, D. Alahakoon, “Growing self-organizing map for online continuous clustering”, Studies in computational intelligence, vol. 204, pp. 49–83, 2009.
[12] Dang, V. Lee, W. Ng, A. Ciptadi, K. Ong, “An EM-based algorithm for clustering data streams in sliding windows”, Lecture notes in computer science, vol. 5463, pp 230–235, 2009.
[13] Guha, N. Mishra, R. Motwani, L. O’Callaghan, “Clustering data streams”, Proceding 41st Annual Symposium Foundations Computer Science, 2000.
[14] Cao, M. Ester, W. Qian, and A. Zhou, “Density-based clustering over an evolving data stream with noise”, Proceeding of the Sixth SIAM Conference on Data Mining, 2006.
[15] Zhou, F. Cao, Y. Yan, C. Sha, X. He, “Distributed Data Stream Clustering: A Fast EM-Based Approach”, IEEE 23rd International Conference on Data Engineering, pp. 736-745, 2007.
[16] Tu, Y. Chen, “Stream data clustering based on grid density and Attraction”, ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 3, pp. 1–27, 2009.
[17] H. Dang, V.C. Lee, W.K. Ng, K. Ong, “Incremental and Adaptive Clustering Stream Data Over Sliding Window”, 20th International Conference on Database and Expert Systems Applications. pp. 660-674, 2009.
[18] P. Barddal, H.M. Gomes, F. Enembreck, “SNCStream: A Social Network-Based Data Stream Clustering Algorithm”, Proceedings of the 30th Annual ACM Symposium on Applied Computing, pp.935–940, 2015.
[19] Khalilian, N. Mustapha, N. Sulaiman, “Data Stream Clustering by Divide and Conquer Approach Based on Vector Model”, Journal of Big Data, vol. 3, no. 1, 2016.
[20] Hyde, P. Angelov, A. MacKenzie, “Fully Online Clustering of Evolving Data Streams Into Arbitrarily Shaped Clusters”, Information Sciences, vol. 382, pp. 96–114, 2017.
[21] Gu, P. Angelov, D. Kangin, J. Principe, “Self-Organised Direction Aware Data Partitioning Algorithm”, Information Sciences, vol. 423, pp. 80–95, 2017.
[22] Su, Y. Li, X. Zhao, “Data Stream Clustering by fast Density-Peak-Search”, Statistics and its Interface, vol. 11, no. 1, pp. 183–189, 2018.
[23] Ahmed, G. Dalkılıç, Y. Erten, “DGStream: High quality and efficiency stream clustering algorithm”, Expert Systems with Applications, vol. 141, 2020,
[24] Fahy, S. Yang and M. Gongora, "Ant Colony Stream Clustering: A Fast Density Clustering Algorithm for Dynamic Data Streams", IEEE Transactions on Cybernetics, vol. 49, no. 6, pp. 2215-2228, 2019.
[25] Longnguyen, Y. Kwongwoon,W. Keongng, "A Survey On Data Stream Clustering And Classification", Knowledge And Information Systems, vol. 45, pp. 535–569, 2014.
[26] Mansalis, E. Ntoutsi, N. Pelekis, Y. Theodoridis, "An Evaluation of Data Stream Clustering Algorithms", Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 11, no. 4, 2018.
[27] Carnein, H. Trautmann, "Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms", Business & Information Systems Engineering, vol. 61, no. 3, pp. 277–297, 2019.
[28] Mohiuddin, "Data Summarization: A Survey", Knowledge and Information Systems, vol. 58, pp 249-273, 2019.
[29] Whitley, Darrell. "A genetic algorithm tutorial", Statistics and Computing, vol.4, no.2, pp. 65–85, 1994.
[30] M. Spears, K.A. De Jong, T. Bäck, D.B. Fogel, H. de Garis, “An overview of evolutionary computation”, Lecture Notes in Computer Science, vol. 667, pp. 442-459, 1993.
[31] Maulik, S. Bandyopadhyay, “Genetic algorithm-based clustering technique”, Pattern Recognition, vol. 33, no. 9, pp. 1455–1465, 2000.
[32] C. Aggarwal, J. Han, J. Wang, P.S. Yu, “A framework for projected clustering of high dimensional data streams”, Proceedings of the thirtieth international conference on very large data bases, vol. 30, pp. 852–863, 2004.
[33] López-Ibáñez, J. Dubois-Lacoste, L. Pérez Cáceres, T. Stützle, M. Birattari, “The irace package: iterated racing for automatic algorithm configuration”, Operation Research Perspectives, vol. 3, pp. 43–58, 2016