Real-time enhanced efficient thread level parallelism scheme for performance improvement in heterogeneous edge computing

Indra Gandhi; Kannan G.; P. K. Jawahar

doi:10.31893/multiscience.2024145

Keywords:

parallel computing

high performance computing

GPU programming

Abstract

Smart applications increasingly rely on high-performance heterogeneous embedded computing devices to process large volumes of data. Packing processors of different architectures into a system on chip offers a substantial potential gain in computing power, but the full processing power of such a heterogeneous edge computing processor can be harnessed only if the target software is configured to use all of the processing elements. The proposed Enhanced Efficient Thread Level Parallelism (EETLP) scheme is implemented using CUDA on a CPU-GPU based heterogeneous edge computing platform and evaluated with matrix multiplication at different matrix sizes. The experimental results show that, for a matrix size of 1024x1024, Efficient Thread Level Parallelism (ETLP) on a quad-core CPU reduces execution time by 71%, and EETLP reduces it by 99%, compared with Basic Sequential Execution (BSE). In terms of speedup, EETLP achieves a 5.5Kx speedup over ETLP and a 19Kx speedup over BSE on the CPU.
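
The percentage reductions and the speedup factors quoted above are two views of the same measurements, and a short calculation can relate them. The sketch below is illustrative only and is not the authors' code. It assumes that "speedup" means baseline execution time divided by new execution time, that 5.5Kx and 19Kx mean roughly 5,500 and 19,000, and that all figures are rounded. The function names are hypothetical.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Speedup (T_base / T_new) implied by a percentage cut in execution time."""
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def reduction_from_speedup(speedup: float) -> float:
    """Percentage cut in execution time implied by a speedup factor."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 100.0 * (1.0 - 1.0 / speedup)


# Rounded figures quoted in the abstract for the 1024x1024 case.
ETLP_REDUCTION_PCT = 71.0      # ETLP vs. BSE
EETLP_OVER_ETLP = 5_500.0      # assumed reading of "5.5Kx"
EETLP_OVER_BSE = 19_000.0      # assumed reading of "19Kx"

# A 71% reduction corresponds to a speedup of about 3.45x.
etlp_over_bse = speedup_from_reduction(ETLP_REDUCTION_PCT)

# Dividing the two EETLP speedups gives an independent ETLP-vs-BSE estimate,
# because (T_BSE / T_EETLP) / (T_ETLP / T_EETLP) = T_BSE / T_ETLP.
etlp_over_bse_from_speedups = EETLP_OVER_BSE / EETLP_OVER_ETLP

# A 19,000x speedup is a reduction of more than 99.99%, which rounds to the
# "99%" quoted in the abstract.
eetlp_reduction_pct = reduction_from_speedup(EETLP_OVER_BSE)

print(f"ETLP over BSE from 71% reduction: {etlp_over_bse:.2f}x")
print(f"ETLP over BSE from speedup ratio: {etlp_over_bse_from_speedups:.2f}x")
print(f"EETLP reduction vs. BSE:          {eetlp_reduction_pct:.3f}%")
```

Under these assumptions the two ETLP-versus-BSE estimates agree to within rounding, which suggests the reported reductions and speedups are mutually consistent.
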

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

How to cite

Gandhi, I., G., K., & Jawahar, P. K. (2024). Real-time enhanced efficient thread level parallelism scheme for performance improvement in heterogeneous edge computing. Multidisciplinary Science Journal, 6(9), 2024145. https://doi.org/10.31893/multiscience.2024145
