An Efficient Method for the Parallel Implementation of the Hierarchical-Trie Packet Classification Algorithm on a Graphics Processing Unit

Authors

Bu-Ali Sina University, Hamedan

Abstract

Packet classification is a fundamental process in network processors. In this process, incoming packets are classified into specific flows by matching them against a set of filters. Although software implementations of classification algorithms cost less and are more extensible than hardware implementations, they are slower. In this paper, we exploit the parallel processing capability of graphics processing units to accelerate the hierarchical-trie packet classification algorithm and propose several scenarios based on the architecture of their global and shared memories. The implementation results for these scenarios, while confirming the computed time and memory complexities, show that the performance of the scenarios that partition the filter set into sub-trees no larger than the shared memory and copy them into it is lower than that of the scenario that keeps the entire data structure in global memory. The performance of these scenarios increases as the number of sub-trees and duplicated filters decreases. In addition, the scenario that can fit the hierarchical trie and its corresponding filter set into shared memory, without any partitioning, is the best scenario. The experimental results show that, on the same GPU, the throughput achieved by this scenario improves by up to 2.1 times compared with the existing methods.
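As a rough illustration of the hierarchical-trie (H-trie) lookup summarized above, the sketch below shows one possible node layout and a two-field (source/destination prefix) search. The TrieNode layout, the restriction to two 32-bit address fields, and the omission of rule priorities are simplifying assumptions made for this example; they are not taken from the paper's implementation.

#include <cstdint>

// Assumed node layout for a binary hierarchical trie (H-trie): a source-prefix
// trie in which every node that terminates a valid source prefix points to a
// destination-prefix sub-trie whose nodes carry filter identifiers.
struct TrieNode {
    int child[2];   // indices of the 0/1 children, -1 if absent
    int dstRoot;    // root of the destination sub-trie, -1 if no source prefix ends here
    int filterId;   // filter stored at this destination-trie node, -1 if none
};

// Longest-prefix search in one destination sub-trie.
__host__ __device__ int matchDstTrie(const TrieNode* nodes, int root, uint32_t dst)
{
    int best = -1;
    int cur  = root;
    for (int bit = 31; cur != -1; --bit) {
        if (nodes[cur].filterId != -1) best = nodes[cur].filterId;  // deepest match so far
        if (bit < 0) break;                                         // all 32 bits consumed
        cur = nodes[cur].child[(dst >> bit) & 1u];
    }
    return best;
}

// H-trie search: walk the source trie along the packet's source address and,
// at every node where a source prefix ends, search the attached destination
// sub-trie; the match from the most specific source prefix is kept
// (rule priorities are ignored in this sketch).
__host__ __device__ int classifyPacket(const TrieNode* nodes, int srcRoot,
                                        uint32_t src, uint32_t dst)
{
    int best = -1;
    int cur  = srcRoot;
    for (int bit = 31; cur != -1; --bit) {
        if (nodes[cur].dstRoot != -1) {
            int f = matchDstTrie(nodes, nodes[cur].dstRoot, dst);
            if (f != -1) best = f;
        }
        if (bit < 0) break;
        cur = nodes[cur].child[(src >> bit) & 1u];
    }
    return best;
}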

Keywords


Article Title [English]

An Efficient Method for Parallel Implementation of H-Trie Packet Classification Algorithm on GPU

Abstract [English]

Packet classification is a fundamental process in network processors. In this process, incoming packets are classified into distinct flows by matching them against a set of filters. Software implementations of packet classification algorithms, though lower in cost and more scalable than hardware implementations, are slower. In this paper, we use the parallel processing capability of graphics processors to accelerate the Hierarchical-Trie (H-trie) packet classification algorithm and propose different scenarios based on the architecture of their global and shared memories. The results of implementing these scenarios, while confirming the computed time and memory complexities, show that the performance of the scenarios that partition the filter set into sub-trees no larger than the shared memory and copy them into it is lower than that of the scenario that keeps the entire data structure in global memory. The performance of these scenarios increases as the number of sub-trees and duplicated filters decreases. Moreover, the scenario that can keep the hierarchical trie and the corresponding filters in shared memory, without any partitioning, is the best one. The experimental results show that, on the same GPU, this scenario achieves a throughput of approximately 2.1 times that of the existing methods.
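The CUDA sketch below contrasts the two memory scenarios compared in the abstract: one kernel traverses an H-trie kept entirely in device global memory, while the other first stages the (sufficiently small) trie in per-block shared memory. It reuses the assumed TrieNode layout from the earlier sketch, re-declared here so the snippet compiles on its own; MAX_NODES and the one-thread-per-packet mapping are illustrative assumptions, not the paper's implementation.

#include <cstdint>

struct TrieNode {                     // same assumed layout as the earlier sketch
    int child[2], dstRoot, filterId;
};

// Full H-trie search for one packet: source-trie walk plus a destination
// sub-trie search at every node where a source prefix ends.
__device__ int classifyPacket(const TrieNode* nodes, int srcRoot,
                              uint32_t src, uint32_t dst)
{
    int best = -1;
    int cur  = srcRoot;
    for (int bit = 31; cur != -1; --bit) {
        if (nodes[cur].dstRoot != -1) {
            int d = nodes[cur].dstRoot;
            for (int b = 31; d != -1; --b) {
                if (nodes[d].filterId != -1) best = nodes[d].filterId;
                if (b < 0) break;
                d = nodes[d].child[(dst >> b) & 1u];
            }
        }
        if (bit < 0) break;
        cur = nodes[cur].child[(src >> bit) & 1u];
    }
    return best;
}

// Global-memory scenario: the whole trie stays in device global memory and
// each thread classifies one packet.
__global__ void classifyGlobal(const TrieNode* nodes, int srcRoot,
                               const uint32_t* src, const uint32_t* dst,
                               int* out, int nPackets)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPackets) out[i] = classifyPacket(nodes, srcRoot, src[i], dst[i]);
}

// Shared-memory scenario: each block cooperatively copies the trie into
// on-chip shared memory before classifying its packets. MAX_NODES is an
// assumed bound chosen so the copy (16 bytes per node) stays within the
// roughly 48 KB of shared memory available per block on typical GPUs.
#define MAX_NODES 2048

__global__ void classifyShared(const TrieNode* gNodes, int nNodes, int srcRoot,
                               const uint32_t* src, const uint32_t* dst,
                               int* out, int nPackets)
{
    __shared__ TrieNode sNodes[MAX_NODES];
    // Cooperative copy; assumes nNodes <= MAX_NODES.
    for (int n = threadIdx.x; n < nNodes; n += blockDim.x)
        sNodes[n] = gNodes[n];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPackets) out[i] = classifyPacket(sNodes, srcRoot, src[i], dst[i]);
}

When the trie does not fit in shared memory, it has to be partitioned into sub-trees, and filters whose prefixes span several sub-trees are duplicated; this duplication is the overhead the abstract attributes to the partitioned scenarios.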

Keywords [English]

  • Packet classification
  • H-trie algorithm
  • graphics processing unit
  • CUDA
  • memory hierarchy
  • complexity
  • performance