Using Grey Wolf Optimization Algorithm in Big Data Clustering

Authors

1 Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran

2 KDD lab, ISTI-CNR, Pisa, Italy

Abstract

The huge amount of data created constantly with increasing rate from different sources such as smart phones, social media, imaging technologies and etc. becomes difficult to be analyzed by conventional data analytic tools. For this reason a new field of research called Big Data Analytics is growing faster in the research and industrial communities. Clustering big datasets is one of the important challenges which attracts more and more attentions among researchers. In this paper first a method for non-automatic big data clustering (when the number of clusters is known) and then a two-stage method for big data automatic clustering (able in finding the number of clusters) based on grey wolf optimization algorithm are introduced.  In the first stage the algorithm tries to find the number of clusters using a tree structure and in the second stage the main algorithm searches the solution space to find the position of centroids. The methodology is tested on 13 synthetics and 2 real big mobility datasets. The achieved results show its effectiveness in big data clustering.

Keywords


[1]      S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, "Big data, analytics and the path from insights to value," MIT sloan management review, vol. 52, p. 21, 2011.
[2]      S. Cheng, Y. Shi, Q. Qin, and R. Bai, "Swarm intelligence in big data analytics," in International Conference on Intelligent Data Engineering and Automated Learning, 2013, pp. 417-426.
[3]      J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets: Cambridge University Press, 2014.
[4]      A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM computing surveys (CSUR), vol. 31, pp. 264-323, 1999.
[5]      J. A. Hartigan, "Clustering algorithms (probability & mathematical statistics)," ed: John Wiley & Sons Inc New York, 1975.
[6]      R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Transactions on neural networks, vol. 16, pp. 645-678, 2005.
[7]      J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, pp. 100-108, 1979.
[8[ احسان نادرنژاد، حمید حسن‌پور و حسین میارنعیمی، «استفاده از مشخصه‌های آماری داده‌ها و پردازش بلوکی برای قطعه بندی تصاویر»، فصلنامه مهندسی برق دانشگاه تبریز، دوره‌ی ۳۹ شماره‌ی ۱، صفحه‌ی ۴۸ تا ۵۷، بهار ۱۳۸۸.
[9]      A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, and T. Herawan, "Big data clustering: a review," in International Conference on Computational Science and Its Applications, 2014, pp. 707-720.
[10]      R. C. Eberhart, Y. Shi, and J. Kennedy, Swarm intelligence: Elsevier, 2001.
[11]      S. Mirjalili, S. M. Mirjalili, and A. Lewis, "Grey wolf optimizer," Advances in engineering software, vol. 69, pp. 46-61, 2014.
[12]      A. Sinha and P. K. Jana, "A novel K-means based clustering algorithm for big data," in Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on, 2016, pp. 1875-1879.
[13]      M. Jain and C. Verma, "Adapting k-means for Clustering in Big Data," International Journal of Computer Applications, vol. 101, pp. 19-24, 2014.
[14]      A. Saini, J. Minocha, J. Ubriani, and D. Sharma, "New approach for clustering of big data: DisK-means," in Computing, Communication and Automation (ICCCA), 2016 International Conference on, 2016, pp. 122-126.
[15]      D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," presented at the Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, New Orleans, Louisiana, 2007.
[16]      S. H. Razavi, E. O. M. Ebadati, S. Asadi, and H. Kaur, "An efficient grouping genetic algorithm for data clustering and big data analysis," in Computational Intelligence for Big Data Analysis, ed: Springer, 2015, pp. 119-142.
[17]      S. Saitta, B. Raphael, and I. F. Smith, "A comprehensive validity index for clustering," Intelligent Data Analysis, vol. 12, pp. 529-548, 2008.
[18]      A. Abraham, S. Das, and S. Roy, "Swarm intelligence algorithms for data clustering," in Soft computing for knowledge discovery and data mining, ed: Springer, 2008, pp. 279-313.
[19]      S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics letters, vol. 34, pp. 2176-2177, 1998.
[20]      X. Cui, T. E. Potok, and P. Palathingal, "Document clustering using particle swarm optimization," in Swarm Intelligence Symposium, 2005. SIS 2005. Proceedings 2005 IEEE, 2005, pp. 185-191.
[21]      M. G. Omran, A. Salman, and A. P. Engelbrecht, "Dynamic clustering using particle swarm optimization with application in image segmentation," Pattern Analysis and Applications, vol. 8, p. 332, 2006.
[22]      U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1650-1654, 2002.
[23]      C. Zhang, D. Ouyang, and J. Ning, "An artificial bee colony approach for clustering," Expert Systems with Applications, vol. 37, pp. 4761-4767, 2010.
[24]      G. Krishnasamy, A. J. Kulkarni, and R. Paramesran, "A hybrid approach for data clustering based on modified cohort intelligence and K-means," Expert Systems with Applications, vol. 41, pp. 6009-6016, 2014.
[25]      Y. Lu, B. Cao, C. Rego, and F. Glover, "A Tabu search based clustering algorithm and its parallel implementation on Spark," Applied Soft Computing, vol. 63, pp. 97-109, 2018.
[26]      D. Karaboga and C. Ozturk, "A novel clustering approach: Artificial Bee Colony (ABC) algorithm," Applied soft computing, vol. 11, pp. 652-657, 2011.
[27]      X. Han, L. Quan, X. Xiong, M. Almeter, J. Xiang, and Y. Lan, "A novel data clustering algorithm based on modified gravitational search algorithm," Engineering Applications of Artificial Intelligence, vol. 61, pp. 1-7, 2017.
[28]      A. Banharnsakun, "A MapReduce-based artificial bee colony for large-scale data clustering," Pattern Recognition Letters, vol. 93, pp. 78-84, 2017.
[29]      C. Muro, R. Escobedo, L. Spector, and R. Coppinger, "Wolf-pack (Canis lupus) hunting strategies emerge from simple rules in computational simulations," Behavioural processes, vol. 88, pp. 192-197, 2011.
[30]      T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics-theory and Methods, vol. 3, pp. 1-27, 1974.
[31]      School of Computing University of Eastern Finland. "clustering basic benchmarks,” June 10, 2018; https://cs.joensuu.fi/sipu/datasets/.
[32]      W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical association, vol. 66, pp. 846-850, 1971.
[33]      D. Pelleg and A. W. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," in Icml, 2000, pp. 727-734.
[34]      Z. F. Knops, J. A. Maintz, M. A. Viergever, and J. P. Pluim, "Normalized mutual information based registration using k-means clustering and shading correction," Medical image analysis, vol. 10, pp. 432-439, 2006.
[35]      J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine learning research, vol. 7, pp. 1-30, 2006.
[36]      P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems," Pattern Recognition, vol. 39, pp. 761-775, 2006.
[37]      I. Kärkkäinen and P. Fränti, Dynamic local search algorithm for the clustering problem: University of Joensuu, 2002.
[38]      P. Fränti, R. Mariescu-Istodor, and C. Zhong, "XNN graph," in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 2016, pp. 207-217.
[39]      P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE transactions on pattern analysis and machine intelligence, vol. 28, pp. 1875-1881, 2006.
[40]       سمیرا رفیعی، پرهام مرادی، «بهبود عملکرد الگوریتم خوشه‌بندی فازی سی-مینز با وزن‌دهی اتوماتیک و محلی ویژگی‌ها»، مجله‌ی مهندسی برق دانشگاه تبریز، دوره‌ی ۴۶، صفحه‌ی ۷۵ تا ۸۶، تابستان ۱۳۹۵.
[41]      I. Aljarah and S. A. Ludwig, "Parallel particle swarm optimization clustering algorithm based on mapreduce methodology," in Nature and biologically inspired computing (NaBIC), 2012 fourth world congress on, 2012, pp. 104-111.
[42]      B. Wu, G. Wu, and M. Yang, "A mapreduce based ant colony optimization approach to combinatorial optimization problems," in Natural Computation (ICNC), 2012 Eighth International Conference on, 2012, pp. 728-732.
[43]      J. Li, X. Hu, Z. Pang, and K. Qian, "A parallel ant colony optimization algorithm based on fine-grained model with GPU-acceleration," International Journal of Innovative Computing, Information and Control, vol. 5, pp. 3707-3716, 2009.
[44]      D.-W. Huang and J. Lin, "Scaling populations of a genetic algorithm for job shop scheduling problems using MapReduce," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, 2010, pp. 780-785.