Title page for etd-0004122-135347
Title
具動態處理器分派之超多純量指令分析器設計
Design of Instruction Analyzer with Dynamically Dispatching Processor Mechanism in Hyperscaler Architecture
Department
Year, semester
Language
Degree
Number of pages
141
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-12-29
Date of Submission
2022-01-04
Keywords
multiple cores, many cores, hyperscaler, instruction-level parallelism (ILP), group management unit (GMU), loop semantics, circular buffer
Statistics
This thesis has been viewed 99 times and downloaded 0 times.
Abstract
As demand for high-performance computing grows, multi-core processors are used ever more widely, and multiple-core designs are evolving toward many-core designs. How to use these cores effectively to reach peak computing performance is therefore a major issue in multi-core architecture. The Hyperscaler Architecture proposed by the MPD Lab is a many-core architecture that flexibly integrates multiple processors so that a single thread can execute in a superscalar computing mode. The computing efficiency of the whole system depends on how well the thread's instruction-level parallelism (ILP) keeps the processors busy in that mode, and the Instruction Analyzer (IA) is the main unit that dispatches instructions to a processor group for superscalar execution. In the conventional scheme, however, processor grouping is fixed: the group is configured when the thread starts, and its cores are not reconfigured as the thread's ILP changes during execution. This static grouping strategy limits the thread's execution performance and keeps overall processor utilization from being optimized.
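The cost of a fixed group size can be illustrated with a toy model. The sketch below is not the thesis's simulation model: it assumes every instruction takes one cycle on one core, ignores reconfiguration overhead and core availability, and uses a made-up ILP trace. The function name `cycles_and_utilization` and all values are hypothetical.

```python
from math import ceil


def cycles_and_utilization(ilp_trace, group_sizes):
    """Toy model (assumption): each window holds `ilp` independent
    single-cycle instructions that run on a group of `cores` cores."""
    busy = 0        # core-cycles spent executing instructions
    allocated = 0   # core-cycles reserved by the group
    cycles = 0
    for ilp, cores in zip(ilp_trace, group_sizes):
        window_cycles = ceil(ilp / cores)
        cycles += window_cycles
        busy += ilp
        allocated += cores * window_cycles
    return cycles, busy / allocated


# Hypothetical per-window ILP degrees of one thread.
ilp_trace = [1, 4, 2, 4, 1, 1]

# Static grouping: two cores for the whole run.
fixed_cycles, fixed_util = cycles_and_utilization(ilp_trace, [2] * len(ilp_trace))

# Dynamic grouping: group size follows each window's ILP degree.
dyn_cycles, dyn_util = cycles_and_utilization(ilp_trace, list(ilp_trace))

print(fixed_cycles, round(fixed_util, 4))
print(dyn_cycles, round(dyn_util, 4))
```

In this toy trace the fixed group both idles cores in low-ILP windows and needs extra cycles in high-ILP windows; the numbers say nothing about the thesis's measured results.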
To improve the computing performance of multi-core processors, this thesis proposes an Instruction Analyzer with a dynamic processor-dispatching mechanism, built on the Hyperscaler architecture. The analyzer dynamically measures each thread's ILP degree, while a Group Management Unit (GMU) handles and records the grouping information of the chip's cores under an appropriate operating system (OS) thread-scheduling strategy. Together they adjust the number of processors in a group to match the thread's ILP demand at run time, which raises the thread's execution efficiency. The system has three major parts. The first handles loop-program generation and instruction cache-line prefetching: a circular buffer designed around the needs of the loop program prefetches instruction cache lines for later instruction analysis. The second, the ILP Degree Detection Unit, analyzes the dependences among the instructions in the instruction window and places each detected ILP degree in an ILP Queue and in the Instruction Level Parallelism Register (ILPR). The third, the GMU, handles and records the OS-level group-thread information and, by analyzing the instructions of the OS-level group threads, reallocates cores among groups to reconfigure them.
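The second part can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the thesis's hardware design: instructions are reduced to hypothetical `(dest, sources)` register tuples, only read-after-write dependences are tracked, the ILP degree of a window is taken to be the widest dependence level, and the ILPR value is taken to be the maximum of a bounded FIFO. The abstract does not define any of these details.

```python
from collections import deque


def ilp_degree(window):
    """Return the widest dependence level of `window`.

    `window` is a list of (dest, sources) register tuples in program
    order. Only read-after-write dependences are considered (assumption).
    """
    level_of_writer = {}  # register -> level of its most recent writer
    width = {}            # level -> number of instructions at that level
    for dest, sources in window:
        # An instruction sits one level below its deepest producer.
        level = 1 + max((level_of_writer.get(r, 0) for r in sources), default=0)
        width[level] = width.get(level, 0) + 1
        if dest is not None:
            level_of_writer[dest] = level
    return max(width.values(), default=0)


# Hypothetical four-instruction window.
window = [
    ("r1", ("r2", "r3")),  # independent
    ("r4", ("r1",)),       # reads r1 written above
    ("r5", ("r6",)),       # independent
    ("r7", ("r8",)),       # independent
]

# Bounded FIFO standing in for the ILP Queue (depth 4 is an assumption).
ilp_queue = deque(maxlen=4)
for w in (window, window[:2], window[2:]):
    ilp_queue.append(ilp_degree(w))

# Assumption: the register holds the largest degree currently queued.
ilpr = max(ilp_queue)
print(list(ilp_queue), ilpr)
```

A group-management policy could then compare `ilpr` with the current group size to decide whether to request or release cores; how the GMU actually makes that decision is described in the thesis body, not in this abstract.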
To verify the design, benchmark programs were compiled on a Raspberry Pi, and the resulting ARM assembly and machine code served as input for the functional verification and synthesis of the ILP Degree Detection Unit hardware. The parameters obtained were then fed into a software model of the whole system to verify execution performance. Compared with a static core-grouping schedule, scheduling with an appropriate OS-level core-grouping strategy improved execution performance by roughly 12% to 40% across benchmark loop programs of different sizes.
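Table of Contents

Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1. Motivation
1.2. Research Objectives
1.3. Thesis Organization
Chapter 2 Related Work
2.1. Introduction to the Hyperscalar Architecture
2.1.1. Instruction Analyzer
2.2. Introduction to the ARM Instruction Set Architecture
2.2.1. ARM/THUMB Instructions
2.2.2. Instruction Format and Generation on the Keil C Platform
2.2.3. Instruction Format and Generation on the Raspberry Pi Platform
2.3. Introduction to GMU System-Level Instructions
2.4. Multi-Level Circular Buffer Architecture
2.5. Loop Semantic Analysis
Chapter 3 Hyperscaler Instruction Analyzer with Dynamic Processor Dispatching
3.1. Overall Architecture
3.1.1. System Architecture Design Concept
3.1.2. System Architecture
3.2. Instruction ILP Detection Unit
3.2.1. Two-Level Instruction Circular Buffer
3.2.2. Instruction Type Identification
3.2.3. Instruction Operands and Operators
3.2.4. Data Dependences Between Instructions
3.2.5. Instruction Control Dependences
3.2.6. Computing the Degree of Parallelism from Instruction Dependences
3.2.7. Instruction ILP FIFO Queue
3.2.8. Finding the Queue Extremum
3.2.9. Group Instruction ILPR Register
3.2.10. Software Simulation of the Instruction ILP Detection Unit
3.2.11. Hardware Simulation of the Instruction ILP Detection Unit
3.3. GMU System Instruction Unit
3.3.1. New CRTG Instruction
3.3.2. New AILPR Instruction
3.3.3. System Process Scheduling
Chapter 4 System Architecture Simulation and Analysis
4.1. Software Simulation of the System Architecture
4.1.1. Software Simulation Flow
4.1.2. Test Programs
4.2. Design Compiler Synthesis of the ILP Detection Unit Hardware Module
4.3. Analysis and Discussion of Results
Chapter 5 Conclusions and Future Directions
References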



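Full text
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.
Thesis access permission: user-defined availability date
Available:
Campus: available
Off-campus: available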


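Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up this information for printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available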
