خوشه‌بندی مبتنی بر فاصله چگالی گوسی تطبیقی (AGDD)

نوع مقاله : علمی-پژوهشی

نویسندگان

Department of Computer Engineering, Yazd University, Yazd, Iran.

چکیده

روش‌های خوشه‌بندی مبتنی بر فاصله، نمونه‌ها را با بهینه‌سازی یک معیار کلی دسته‌بندی کرده و خوشه‌هایی بیضوی با اندازه‌های تقریباً برابر ایجاد می‌کنند. در مقابل، تکنیک‌های خوشه‌بندی مبتنی بر چگالی با بهینه‌سازی یک معیار محلی، خوشه‌هایی با اشکال و اندازه‌های دلخواه تشکیل می‌دهند. اکثر این روش‌ها دارای چندین ابرپارامتر هستند و عملکرد آن‌ها به شدت به تنظیم این ابرپارامترها وابسته است. اخیراً روشی به نام فاصله چگالی گوسی (GDD) ارائه شده که معیار محلی را بر اساس ویژگی‌های فاصله و چگالی نمونه‌ها بهینه می‌کند. روش GDD  بدون پارامتر آزاده بوده و قادر است خوشه‌هایی با اشکال و اندازه‌های مختلف را شناسایی کند. با این حال، ممکن است نتواند خوشه‌های مناسب را به دلیل تداخل نمونه‌های خوشه‌بندی‌شده در برآورد ویژگی‌های چگالی و فاصله نمونه‌های باقی‌مانده،  شناسایی کند. در این پژوهش، روشی به نام فاصله چگالی گوسی تطبیقی (AGDD) معرفی شده که با به‌روزرسانی تطبیقی پارامترها در طول خوشه‌بندی، اثرات نامناسب نمونه‌های خوشه‌بندی‌شده را حذف می‌کند. این روش پایدار است و می‌تواند خوشه‌هایی با اشکال، اندازه‌ها و چگالی‌های مختلف را بدون افزودن پارامترهای اضافی شناسایی کند. معیارهای فاصله که عدم شباهت بین نمونه‌ها را محاسبه می‌کنند، می‌توانند بر عملکرد خوشه‌بندی تأثیر بگذارند. تأثیر استفاده از معیارهای فاصله مختلف نیز در این روش تحلیل شده است. نتایج آزمایش‌های انجام‌شده روی چندین مجموعه داده‌ی شناخته‌شده، کارایی روش پیشنهادی AGDD را در مقایسه با سایر روش‌های معروف خوشه‌بندی نشان می‌دهد.

کلیدواژه‌ها


عنوان مقاله [English]

Adaptive Gaussian Density Distance for Clustering

نویسندگان [English]

  • M. Yazdian-Dehkordi
  • F. Nadi
  • S. Abbasi
CoDepartment of Computer Engineering, Yazd University, Yazd, Iran.mputer Engineering, Faculty of Engineering, Yazd University, Yazd, Iran
چکیده [English]

Distance-based clustering methods categorize samples by optimizing a global criterion, finding ellipsoid clusters with roughly equal sizes. In contrast, density-based clustering techniques form clusters with arbitrary shapes and sizes by optimizing a local criterion. Most of these methods have several hyper-parameters, and their performance is highly dependent on the hyper-parameter setup. Recently, a Gaussian Density Distance (GDD) approach was proposed to optimize local criteria in terms of distance and density properties of samples. GDD can find clusters with different shapes and sizes without any free parameters. However, it may fail to discover the appropriate clusters due to the interfering of clustered samples in estimating the density and distance properties of remaining unclustered samples. Here, we introduce Adaptive GDD (AGDD), which eliminates the inappropriate effect of clustered samples by adaptively updating the parameters during clustering. It is stable and can identify clusters with various shapes, sizes, and densities without adding extra parameters. The distance metrics calculating the dissimilarity between samples can affect the clustering performance. The effect of different distance measurements is also analyzed on the method. The experimental results conducted on several well-known datasets show the effectiveness of the proposed AGDD method compared to the other well-known clustering methods.

کلیدواژه‌ها [English]

  • Density-based Clustering
  • Distance-based Clustering
  • Gaussian Density

[1] P. Saini, J. Kaur, and S. Lamba, “A Review on Pattern Recognition Using Machine Learning,” Lecture Notes in Mechanical Engineering, pp. 619–627, 2021.

[2] C. Li, F. Kulwa, J. Zhang, Z. Li, H. Xu, and X. Zhao, “A review of clustering methods in microorganism image analysis,” Advances in Intelligent Systems and Computing, vol. 1186, Springer, pp. 13–25, 2021.

[3] M. Subramaniam, A. Kathirvel, E. Sabitha, and H. A. Basha, “Modified firefly algorithm and fuzzy c-mean clustering based semantic information retrieval,” Journal of Web Engineering, vol. 20, no. 1, pp. 33–52, 2021.

[4] A. R., Sardar, and R. Havangi, “Performance improvement of automatic clustering algorithm of colored images through preprocessing using Self-Organizing Maps (SOM) neural network,” Tabriz Journal of Electrical Engineering, vol. 47, no. 3, pp. 1073-1082, 2017.

[5] M. Kearns, Y. Mansour, and A. Y. Ng, “An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering,” Learning in Graphical Models, Springer Netherlands, pp. 495–520, 1998.

[6] D. J. Bora and D. A. K. Gupta, “A Comparative study Between Fuzzy Clustering Algorithm and Hard Clustering Algorithm,” International Journal of Computer Trends and Technology, vol. 10, no. 2, pp. 108–113, 2014.

[7] S, Rafiei, and P. Moradi, “Improving Performance of Fuzzy C-means Clustering Algorithm using Automatic Local Feature Weighting.,” Tabriz Journal of Electrical Engineering, vol. 46, no. 2, 2016.

[8] Rasmussen and E. M, “Clustering algorithms.,” Information Retrieval: data Structures & algorithms, vol. 419, p. 442, 1992, Accessed: Apr. 29, 2021.

[9] S. Kaushik, D. Kundu, S. Ghosh, S. Das, and A. Abraham. “Data clustering using multi-objective differential evolution algorithms. .” Fundamenta Informaticae, vol. 97, no. 4, pp. 381-403, 2009.

[10] J. Ak and R. Dubes, “Algorithms for clustering data. Englewood Cliff,” NJ Prentice Hall, 1988, Accessed: Apr. 29, 2021.

[11] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–296, 1967, Accessed: Apr. 29, 2021.

[12] Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. “A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects.” Engineering Applications of Artificial Intelligence, vol. 110, pp. 104743, 2022.

[13] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996, Accessed: Apr. 29, 2021.

[14] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, “OPTICS: Ordering Points to Identify the Clustering Structure,” ACM Sigmod record, vol. 28, no. 2, pp. 49–60, 1999.

[15] Ram A, Jalal S, Jalal A S, Kumar M. “A density based algorithm for discovering density varied clusters in large spatial databases,” International Journal of Computer Applications, vol. 3, no. 6, pp. 1-4, 2010.

[16] A. Smiti and Z. Elouedi, “DBSCAN-GM: An improved clustering method based on Gaussian Means and DBSCAN techniques,” INES 2012 - IEEE 16th International Conference on Intelligent Engineering Systems, pp. 573–578, 2012.

[17] C. Fraley and A. E. Raftery, “Model-based clustering, discriminant analysis, and density estimation,” Journal of the American Statistical Association, vol. 97, no. 458, pp. 611–631, 2002.

[18] C. Fraley, A. E. Raftery, T. B. Murphy, and L. Scrucca, “mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” Technical Report, Department of Statistics, University of Washington, no. 597, 2012.

[19] Jenni, V. R., Dua, A., Shobha, G., Shetty, J., & Dev, R. “Hybrid Density-based Adaptive Clustering using Gaussian Kernel and Grid Search,” Recent Trends on Electronics, Information, Communication & Technology (RTEICT), pp. 221-226, 2021.

[20] E. Güngör and A. Özmen, “Distance and density based clustering algorithm using Gaussian kernel,” Expert Systems with Applications, vol. 69, pp. 10–20, 2017.

[21] A. S. Shirkhorshidi, S. Aghabozorgi, and T. Ying Wah, “A Comparison study on similarity and dissimilarity measures in clustering continuous data,” PLoS One, vol. 10, no. 12, 2015.

[22] C. Luo, Y. Li, and S. M. Chung, “Text document clustering based on neighbors,” Data & Knowledge Engineering, vol. 68, no. 11, pp. 1271–1288, 2009.

[23] K. Bache and M. Lichman, “UCI machine learning,”. https://ergodicity.net/2013/07/, accessed Apr. 29, 2021.

[24] “UCI machine learning repository,” Jul. 10, 2020. http://archive.ics.uci.edu/ml (accessed Apr. 29, 2021).

[25] A. K. Jain and M. H. C. Law, “Data Clustering : A User ’ s Dilemma,” pp. 1–10, 2005.

[26] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah, “Time-series clustering--a decade review,” Information Systems, vol. 53, pp. 16–38, 2015.

[27] J. C. Bezdek and N. R. Pal, “Some New Indexes of Cluster Validity,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 28, no. 3, pp. 301-315, 1998. Accessed: Apr. 29, 2021.

[28] H. Cui, M. Xie, Y. Cai, X. Huang, and Y. Liu, “Cluster validity index for adaptive clustering algorithms; Cluster validity index for adaptive clustering algorithms,” IET Communications, vol. 8, no. 13, pp. 2256–2263, 2013.

[29] R. Kashef and M. S. Kamel, “Enhanced bisecting k-means clustering using intermediate cooperation,” Pattern Recognition, vol. 42, no. 11, pp. 2557-2569, 2009.

[30] J. Lipor and L. Balzano, “Clustering quality metrics for subspace clustering.” Pattern Recognition, vol. 104, pp. 107328, 2020.

[31] M. Gaurav and S. K. Mohanty, “A fast hybrid clustering technique based on local nearest neighbor using minimum spanning tree,” Expert Systems with Applications, vol. 132, pp. 28-43, 2019.

[32] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever, “Mutual-Information-Based Registration of Medical Images: A Survey,” IEEE Transactions on Medical Imaging, vol. 22, no. 8, 2003.

[33] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

[34] J. C. Rojas-Thomas, M. Santos, and M. Mora, “New internal index for clustering validation based on 

  1. [1] P. Saini, J. Kaur, and S. Lamba, “A Review on Pattern Recognition Using Machine Learning,” Lecture Notes in Mechanical Engineering, pp. 619–627, 2021.

    [2] C. Li, F. Kulwa, J. Zhang, Z. Li, H. Xu, and X. Zhao, “A review of clustering methods in microorganism image analysis,” Advances in Intelligent Systems and Computing, vol. 1186, Springer, pp. 13–25, 2021.

    [3] M. Subramaniam, A. Kathirvel, E. Sabitha, and H. A. Basha, “Modified firefly algorithm and fuzzy c-mean clustering based semantic information retrieval,” Journal of Web Engineering, vol. 20, no. 1, pp. 33–52, 2021.

    [4] A. R., Sardar, and R. Havangi, “Performance improvement of automatic clustering algorithm of colored images through preprocessing using Self-Organizing Maps (SOM) neural network,” Tabriz Journal of Electrical Engineering, vol. 47, no. 3, pp. 1073-1082, 2017.

    [5] M. Kearns, Y. Mansour, and A. Y. Ng, “An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering,” Learning in Graphical Models, Springer Netherlands, pp. 495–520, 1998.

    [6] D. J. Bora and D. A. K. Gupta, “A Comparative study Between Fuzzy Clustering Algorithm and Hard Clustering Algorithm,” International Journal of Computer Trends and Technology, vol. 10, no. 2, pp. 108–113, 2014.

    [7] S, Rafiei, and P. Moradi, “Improving Performance of Fuzzy C-means Clustering Algorithm using Automatic Local Feature Weighting.,” Tabriz Journal of Electrical Engineering, vol. 46, no. 2, 2016.

    [8] Rasmussen and E. M, “Clustering algorithms.,” Information Retrieval: data Structures & algorithms, vol. 419, p. 442, 1992, Accessed: Apr. 29, 2021.

    [9] S. Kaushik, D. Kundu, S. Ghosh, S. Das, and A. Abraham. “Data clustering using multi-objective differential evolution algorithms. .” Fundamenta Informaticae, vol. 97, no. 4, pp. 381-403, 2009.

    [10] J. Ak and R. Dubes, “Algorithms for clustering data. Englewood Cliff,” NJ Prentice Hall, 1988, Accessed: Apr. 29, 2021.

    [11] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–296, 1967, Accessed: Apr. 29, 2021.

    [12] Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. “A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects.” Engineering Applications of Artificial Intelligence, vol. 110, pp. 104743, 2022.

    [13] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231, 1996, Accessed: Apr. 29, 2021.

    [14] M. Ankerst, M. M. Breunig, H. P. Kriegel, and J. Sander, “OPTICS: Ordering Points to Identify the Clustering Structure,” ACM Sigmod record, vol. 28, no. 2, pp. 49–60, 1999.

    [15] Ram A, Jalal S, Jalal A S, Kumar M. “A density based algorithm for discovering density varied clusters in large spatial databases,” International Journal of Computer Applications, vol. 3, no. 6, pp. 1-4, 2010.

    1. Bhattacharjee and P. Mitra, “A survey of density based clustering algorithms,” Frontiers of Computer Science, vol. 15, no. 1. Higher Education Press Limited Company, 2021.

    [16] A. Smiti and Z. Elouedi, “DBSCAN-GM: An improved clustering method based on Gaussian Means and DBSCAN techniques,” INES 2012 - IEEE 16th International Conference on Intelligent Engineering Systems, pp. 573–578, 2012.

    [17] C. Fraley and A. E. Raftery, “Model-based clustering, discriminant analysis, and density estimation,” Journal of the American Statistical Association, vol. 97, no. 458, pp. 611–631, 2002.

    [18] C. Fraley, A. E. Raftery, T. B. Murphy, and L. Scrucca, “mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation.” Technical Report, Department of Statistics, University of Washington, no. 597, 2012.

    [19] Jenni, V. R., Dua, A., Shobha, G., Shetty, J., & Dev, R. “Hybrid Density-based Adaptive Clustering using Gaussian Kernel and Grid Search,” Recent Trends on Electronics, Information, Communication & Technology (RTEICT), pp. 221-226, 2021.

    [20] E. Güngör and A. Özmen, “Distance and density based clustering algorithm using Gaussian kernel,” Expert Systems with Applications, vol. 69, pp. 10–20, 2017.

    [21] A. S. Shirkhorshidi, S. Aghabozorgi, and T. Ying Wah, “A Comparison study on similarity and dissimilarity measures in clustering continuous data,” PLoS One, vol. 10, no. 12, 2015.

    [22] C. Luo, Y. Li, and S. M. Chung, “Text document clustering based on neighbors,” Data & Knowledge Engineering, vol. 68, no. 11, pp. 1271–1288, 2009.

    [23] K. Bache and M. Lichman, “UCI machine learning,”. https://ergodicity.net/2013/07/, accessed Apr. 29, 2021.

    [24] “UCI machine learning repository,” Jul. 10, 2020. http://archive.ics.uci.edu/ml (accessed Apr. 29, 2021).

    [25] A. K. Jain and M. H. C. Law, “Data Clustering : A User ’ s Dilemma,” pp. 1–10, 2005.

    [26] S. Aghabozorgi, A. S. Shirkhorshidi, and T. Y. Wah, “Time-series clustering--a decade review,” Information Systems, vol. 53, pp. 16–38, 2015.

    [27] J. C. Bezdek and N. R. Pal, “Some New Indexes of Cluster Validity,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 28, no. 3, pp. 301-315, 1998. Accessed: Apr. 29, 2021.

    [28] H. Cui, M. Xie, Y. Cai, X. Huang, and Y. Liu, “Cluster validity index for adaptive clustering algorithms; Cluster validity index for adaptive clustering algorithms,” IET Communications, vol. 8, no. 13, pp. 2256–2263, 2013.

    [29] R. Kashef and M. S. Kamel, “Enhanced bisecting k-means clustering using intermediate cooperation,” Pattern Recognition, vol. 42, no. 11, pp. 2557-2569, 2009.

    [30] J. Lipor and L. Balzano, “Clustering quality metrics for subspace clustering.” Pattern Recognition, vol. 104, pp. 107328, 2020.

    [31] M. Gaurav and S. K. Mohanty, “A fast hybrid clustering technique based on local nearest neighbor using minimum spanning tree,” Expert Systems with Applications, vol. 132, pp. 28-43, 2019.

    [32] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever, “Mutual-Information-Based Registration of Medical Images: A Survey,” IEEE Transactions on Medical Imaging, vol. 22, no. 8, 2003.

    [33] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

    [34] J. C. Rojas-Thomas, M. Santos, and M. Mora, “New internal index for clustering validation based