Table of Contents

YEAR 2017

THESIS / UNIVERSITÉ DE RENNES 1
under the seal of Université Bretagne Loire
for the degree of
DOCTEUR DE L’UNIVERSITÉ DE RENNES 1
Specialization: Computer Science
Mathématiques et Sciences et Technologies de l’Information et de la Communication (MATHSTIC)

presented by Andrea Mondelli
prepared at the research unit INRIA (Institut National de Recherche en Informatique et Automatique), Université de Rennes 1

Revisiting Wide Superscalar Microarchitecture

Thesis defended in Rennes on 12 September 2017 before the jury composed of:
Steven Derrien, Professor at Université de Rennes 1 / President
Bernard Goossens, Professor at Université de Perpignan / Reviewer
Smail Niar, Professor at Université de Valenciennes / Reviewer
Karine Heydemann, Associate Professor (Maître de conférences) / Examiner
André Seznec, Research Director, INRIA Rennes / Thesis supervisor
Pierre Michaud, Research Scientist (Chargé de recherche), INRIA Rennes / Thesis co-supervisor

Dear Mary, do you [know who] will be on the boats? I’m still in Gaza, waiting for you. I will be at the boat to greet you. Stay human. Vik.
Vittorio Arrigoni

Acknowledgements

Contents

Summary in French

1 Introduction
    1.1 Purpose of this work
    1.2 Contributions
    1.3 Organization

2 State of the Art
    2.1 From Pipeline to Superscalar
    2.2 Performance Technique of Superscalar Processors
        2.2.1 Instruction Level Parallelism
        2.2.2 The Branch Predictor
        2.2.3 The Register Renaming
        2.2.4 Out-of-Order Execution
    2.3 Wide-issue complexity
        2.3.1 Limits of performance scaling
        2.3.2 Impact of critical components
        2.3.3 Front-end bandwidth
        2.3.4 Level-one data cache
        2.3.5 Bypass network paths
        2.3.6 Load/Store queues
        2.3.7 Reduce the complexity by limiting pipeline activity
    2.4 Clustering
        2.4.1 Clustered microarchitectures
        2.4.2 Issue Buffer
        2.4.3 Steering
        2.4.4 Write Specialization
        2.4.5 Bypass network and intercluster delay
        2.4.6 Clustering in Commercial Superscalar Processors
        2.4.7 Clustered VLIW/DSP architectures
    2.5 Energy saving exploiting loops
        2.5.1 Saving loop energy in the front-end
        2.5.2 Using a dedicated cache for loops
        2.5.3 Saving loop energy in the back-end
        2.5.4 Loop accelerators
        2.5.5 Industrial adoption of loop cache solutions

3 Wide Issue Clustered Microarchitecture
    3.1 A case of increasing single-thread IPC
    3.2 Experimental Framework
        3.2.1 Simulation Setup
        3.2.2 Benchmarks
        3.2.3 Baseline Microarchitecture
    3.3 Potential IPC gains from a more complex superscalar microarchitecture
    3.4 Dual-Clustered Configurations
        3.4.1 Dual-Cluster with baseline instruction window size
        3.4.2 Dual-cluster with double instruction window
        3.4.3 Analysis
        3.4.4 Possible steps toward the proposed dual-cluster configuration
    3.5 Energy Considerations
        3.5.1 Static EPI
        3.5.2 Gating intercluster communications for reduced dynamic EPI
    3.6 Summary

4 Exploiting loops for reducing energy in a superscalar out-of-order core
    4.1 Baseline and Experimental Setup
    4.2 Loop Buffer and Loop Detector
        4.2.1 Description
        4.2.2 Loop buffer size
        4.2.3 Tuning MinIter
    4.3 Redundant Micro-Op Removal
        4.3.1 Proposed mechanism
        4.3.2 Identification of redundant micro-ops
        4.3.3 Modification of register renaming
        4.3.4 Loads and stores
        4.3.5 Simulation Results
        4.3.6 Compiler Optimization impact: a case study
    4.4 Reducing the energy of loads
        4.4.1 Speculative load execution
        4.4.2 DL1/STQ gating?
        4.4.3 Store Queue and DL1 gating in Loop Mode
        4.4.4 STQ gating alone
        4.4.5 DL1 gating alone
        4.4.6 STQ gating and DL1 gating combined
        4.4.7 Gating the memory dependence predictor table
    4.5 Summary

5 Conclusion
    5.1 The superscalar architecture of the future
    5.2 Exploiting loops for power consumption
    5.3 Perspectives

Bibliography
Author’s Publications
List of Acronyms
List of Figures

Description:

because of leakage currents and temperature, the process feature size of processors has reached its .. When this job is entirely done by the compiler, we call it Static Instruction-Level Parallelism. Overall, the clustered paradigm was also used in DSP compilers [Des98] and embedded domain

Andrea Mondelli PDF

135 Pages·2017·2.57 MB·English
