استفاده از الگوریتم بهینه‌سازی گرگ خاکستری در خوشه‌یابی کلان‌داده‌ها

نویسندگان

1 دانشکده مهندسی برق و کامپیوتر- دانشگاه بیرجند

2 آزمایشگاه استخراج اطلاعات و داده‌کاوی- موسسه علوم و فناوری اطلاعات- پیزا- ایتالیا

چکیده

امروزه حجم بسیار زیادی از اطلاعات و داده‌ها از منابع مختلف نظیر گوشی‌های هوشمند، شبکه‌های اجتماعی، تکنولوژی‌های عکاسی و سایر منابع تولید می‌شود. بررسی و پردازش این حجم عظیم از اطلاعات چالش دهه‌های اخیر است که به آن کلان‌داده گفته می‌شود. یکی از روش‌های پرکاربرد استخراج اطلاعات، خوشه‌یابی است. خوشه‌یابیِ کلان‌داده‌ها چالش بزرگی است که توجه بسیاری از محققین را به خود جلب کرده است. در این پژوهش ابتدا یک روش خوشه‌یابی غیر خودکار (برای حالتی که تعداد خوشه‌ها از قبل مشخص است) و سپس یک روش خوشه‌یابی خودکار (قادر به یافتن تعداد خوشه‌ها) با استفاده از الگوریتم بهینه‌سازی گرگ خاکستری برای خوشه‌یابی کلان‌داده‌ها ارائه شده است. روش خوشه‌یابی خودکار یک روش دو مرحله‌ایست که در مرحله‌ی اول یک ساختار درخت گونه از الگوریتم مورد نظر برای یافتن تعداد خوشه‌ها اجرا می‌شود و در مرحله‌ی دوم الگوریتم اصلی فضا را برای یافتن موقعیت مراکز خوشه‌ها جست‌وجو می‌کند. عملکرد روش ارائه شده بر روی ۱۳ مجموعه داده‌ی مصنوعی و ۲ مجموعه کلان‌داده‌ی واقعی مربوط به مسیرهای طی شده  توسط خودروها در سطح شهر پیزا مورد ارزیابی قرار گرفته و نتایج آن بررسی شده است. نتایج به دست آمده نشان از دقت بالای این الگوریتم در خوشه‌یابی داده‌های بزرگ و حجیم دارد.

کلیدواژه‌ها


عنوان مقاله [English]

Using Grey Wolf Optimization Algorithm in Big Data Clustering

نویسندگان [English]

  • I. Behravan 1
  • S. H. Zahiri 1
  • S. M. Razavi 1
  • R. Trasarti 2
1 Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
2 KDD lab, ISTI-CNR, Pisa, Italy
چکیده [English]

The huge amount of data created constantly with increasing rate from different sources such as smart phones, social media, imaging technologies and etc. becomes difficult to be analyzed by conventional data analytic tools. For this reason a new field of research called Big Data Analytics is growing faster in the research and industrial communities. Clustering big datasets is one of the important challenges which attracts more and more attentions among researchers. In this paper first a method for non-automatic big data clustering (when the number of clusters is known) and then a two-stage method for big data automatic clustering (able in finding the number of clusters) based on grey wolf optimization algorithm are introduced.  In the first stage the algorithm tries to find the number of clusters using a tree structure and in the second stage the main algorithm searches the solution space to find the position of centroids. The methodology is tested on 13 synthetics and 2 real big mobility datasets. The achieved results show its effectiveness in big data clustering.

کلیدواژه‌ها [English]

  • Big data
  • Automatic clustering
  • Swarm intelligence methods
  • Grey wolf optimization algorithm
[1]      S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, "Big data, analytics and the path from insights to value," MIT sloan management review, vol. 52, p. 21, 2011.
[2]      S. Cheng, Y. Shi, Q. Qin, and R. Bai, "Swarm intelligence in big data analytics," in International Conference on Intelligent Data Engineering and Automated Learning, 2013, pp. 417-426.
[3]      J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets: Cambridge University Press, 2014.
[4]      A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM computing surveys (CSUR), vol. 31, pp. 264-323, 1999.
[5]      J. A. Hartigan, "Clustering algorithms (probability & mathematical statistics)," ed: John Wiley & Sons Inc New York, 1975.
[6]      R. Xu and D. Wunsch, "Survey of clustering algorithms," IEEE Transactions on neural networks, vol. 16, pp. 645-678, 2005.
[7]      J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, pp. 100-108, 1979.
[8[ احسان نادرنژاد، حمید حسن‌پور و حسین میارنعیمی، «استفاده از مشخصه‌های آماری داده‌ها و پردازش بلوکی برای قطعه بندی تصاویر»، فصلنامه مهندسی برق دانشگاه تبریز، دوره‌ی ۳۹ شماره‌ی ۱، صفحه‌ی ۴۸ تا ۵۷، بهار ۱۳۸۸.
[9]      A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, and T. Herawan, "Big data clustering: a review," in International Conference on Computational Science and Its Applications, 2014, pp. 707-720.
[10]      R. C. Eberhart, Y. Shi, and J. Kennedy, Swarm intelligence: Elsevier, 2001.
[11]      S. Mirjalili, S. M. Mirjalili, and A. Lewis, "Grey wolf optimizer," Advances in engineering software, vol. 69, pp. 46-61, 2014.
[12]      A. Sinha and P. K. Jana, "A novel K-means based clustering algorithm for big data," in Advances in Computing, Communications and Informatics (ICACCI), 2016 International Conference on, 2016, pp. 1875-1879.
[13]      M. Jain and C. Verma, "Adapting k-means for Clustering in Big Data," International Journal of Computer Applications, vol. 101, pp. 19-24, 2014.
[14]      A. Saini, J. Minocha, J. Ubriani, and D. Sharma, "New approach for clustering of big data: DisK-means," in Computing, Communication and Automation (ICCCA), 2016 International Conference on, 2016, pp. 122-126.
[15]      D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," presented at the Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, New Orleans, Louisiana, 2007.
[16]      S. H. Razavi, E. O. M. Ebadati, S. Asadi, and H. Kaur, "An efficient grouping genetic algorithm for data clustering and big data analysis," in Computational Intelligence for Big Data Analysis, ed: Springer, 2015, pp. 119-142.
[17]      S. Saitta, B. Raphael, and I. F. Smith, "A comprehensive validity index for clustering," Intelligent Data Analysis, vol. 12, pp. 529-548, 2008.
[18]      A. Abraham, S. Das, and S. Roy, "Swarm intelligence algorithms for data clustering," in Soft computing for knowledge discovery and data mining, ed: Springer, 2008, pp. 279-313.
[19]      S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics letters, vol. 34, pp. 2176-2177, 1998.
[20]      X. Cui, T. E. Potok, and P. Palathingal, "Document clustering using particle swarm optimization," in Swarm Intelligence Symposium, 2005. SIS 2005. Proceedings 2005 IEEE, 2005, pp. 185-191.
[21]      M. G. Omran, A. Salman, and A. P. Engelbrecht, "Dynamic clustering using particle swarm optimization with application in image segmentation," Pattern Analysis and Applications, vol. 8, p. 332, 2006.
[22]      U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1650-1654, 2002.
[23]      C. Zhang, D. Ouyang, and J. Ning, "An artificial bee colony approach for clustering," Expert Systems with Applications, vol. 37, pp. 4761-4767, 2010.
[24]      G. Krishnasamy, A. J. Kulkarni, and R. Paramesran, "A hybrid approach for data clustering based on modified cohort intelligence and K-means," Expert Systems with Applications, vol. 41, pp. 6009-6016, 2014.
[25]      Y. Lu, B. Cao, C. Rego, and F. Glover, "A Tabu search based clustering algorithm and its parallel implementation on Spark," Applied Soft Computing, vol. 63, pp. 97-109, 2018.
[26]      D. Karaboga and C. Ozturk, "A novel clustering approach: Artificial Bee Colony (ABC) algorithm," Applied soft computing, vol. 11, pp. 652-657, 2011.
[27]      X. Han, L. Quan, X. Xiong, M. Almeter, J. Xiang, and Y. Lan, "A novel data clustering algorithm based on modified gravitational search algorithm," Engineering Applications of Artificial Intelligence, vol. 61, pp. 1-7, 2017.
[28]      A. Banharnsakun, "A MapReduce-based artificial bee colony for large-scale data clustering," Pattern Recognition Letters, vol. 93, pp. 78-84, 2017.
[29]      C. Muro, R. Escobedo, L. Spector, and R. Coppinger, "Wolf-pack (Canis lupus) hunting strategies emerge from simple rules in computational simulations," Behavioural processes, vol. 88, pp. 192-197, 2011.
[30]      T. Caliński and J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics-theory and Methods, vol. 3, pp. 1-27, 1974.
[31]      School of Computing University of Eastern Finland. "clustering basic benchmarks,” June 10, 2018; https://cs.joensuu.fi/sipu/datasets/.
[32]      W. M. Rand, "Objective criteria for the evaluation of clustering methods," Journal of the American Statistical association, vol. 66, pp. 846-850, 1971.
[33]      D. Pelleg and A. W. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," in Icml, 2000, pp. 727-734.
[34]      Z. F. Knops, J. A. Maintz, M. A. Viergever, and J. P. Pluim, "Normalized mutual information based registration using k-means clustering and shading correction," Medical image analysis, vol. 10, pp. 432-439, 2006.
[35]      J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine learning research, vol. 7, pp. 1-30, 2006.
[36]      P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems," Pattern Recognition, vol. 39, pp. 761-775, 2006.
[37]      I. Kärkkäinen and P. Fränti, Dynamic local search algorithm for the clustering problem: University of Joensuu, 2002.
[38]      P. Fränti, R. Mariescu-Istodor, and C. Zhong, "XNN graph," in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), 2016, pp. 207-217.
[39]      P. Franti, O. Virmajoki, and V. Hautamaki, "Fast agglomerative clustering using a k-nearest neighbor graph," IEEE transactions on pattern analysis and machine intelligence, vol. 28, pp. 1875-1881, 2006.
[40]       سمیرا رفیعی، پرهام مرادی، «بهبود عملکرد الگوریتم خوشه‌بندی فازی سی-مینز با وزن‌دهی اتوماتیک و محلی ویژگی‌ها»، مجله‌ی مهندسی برق دانشگاه تبریز، دوره‌ی ۴۶، صفحه‌ی ۷۵ تا ۸۶، تابستان ۱۳۹۵.
[41]      I. Aljarah and S. A. Ludwig, "Parallel particle swarm optimization clustering algorithm based on mapreduce methodology," in Nature and biologically inspired computing (NaBIC), 2012 fourth world congress on, 2012, pp. 104-111.
[42]      B. Wu, G. Wu, and M. Yang, "A mapreduce based ant colony optimization approach to combinatorial optimization problems," in Natural Computation (ICNC), 2012 Eighth International Conference on, 2012, pp. 728-732.
[43]      J. Li, X. Hu, Z. Pang, and K. Qian, "A parallel ant colony optimization algorithm based on fine-grained model with GPU-acceleration," International Journal of Innovative Computing, Information and Control, vol. 5, pp. 3707-3716, 2009.
[44]      D.-W. Huang and J. Lin, "Scaling populations of a genetic algorithm for job shop scheduling problems using MapReduce," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, 2010, pp. 780-785.