Extending Amber with Virtual Memory

Jakob Dagsland Knutsen

Master of Science in Computer Science
Submission date: January 2015
Supervisor: Donn Morrison, IDI

Norwegian University of Science and Technology
Department of Computer and Information Science

Abstract

Since the birth of modern computer architecture, computer performance has witnessed exponential growth, driven mainly by increased transistor density. In the last decade, performance advancement has become increasingly difficult as computers are limited by power budgets due to problems with heat dissipation. As transistor technology continues to advance, full utilization of the available silicon is becoming ever more challenging. As a result, new avenues of design exploration have emerged to track down alternative routes for continued performance enhancement. One such research field concerns heterogeneous computer architectures, where cores that excel at different areas of computation are combined on a single chip in order to provide performance scaling and the assignment of process workloads to optimally suited hardware.

The SHMAC project applies heterogeneous research in a single-ISA environment, where various hardware is deployed in a matrix of interchangeable tiles, with the only general-purpose processor tile being an ARMv4T-compliant Amber processor core. While programs and operating systems have been proven to run successfully on the processor core, their practical use is limited by the available memory space, as the core does not feature a memory management system.

This thesis introduces a memory management unit into the Amber processor, enabling virtual memory support at the hardware level. The final contribution is Vilma: an ARMv4T-compliant core featuring a memory management unit, verified using hardware simulation and an assembly test suite.

Sammendrag

Since modern computer architecture first came into being, computer performance has seen exponential growth, driven first and foremost by increased transistor density. Over the last decade, performance improvements have become steadily harder to realize due to limited power budgets caused by problems with heat dissipation. As transistor technology continues to advance, it is becoming even more challenging to fully utilize the available silicon. As a consequence, new directions in system design exploration are emerging, with the goal of finding alternative routes for continued performance improvement. One of these research areas concerns heterogeneous computer architecture, where cores that excel at different areas of computation are combined on a single chip. This is intended to provide performance scaling and the assignment of process workloads to optimally suited hardware.

The SHMAC project applies heterogeneous research in a single-ISA environment, where various types of hardware are placed in a matrix of interchangeable tiles, the only processor tile being an ARMv4T-compliant Amber processor core. Although programs and operating systems have been shown to run without problems on the processor core, their practical use is limited by the available memory space, since the core lacks a working memory management system.

This thesis introduces a memory management unit into the Amber processor, enabling virtual memory support at the hardware level. The final product is Vilma: an ARMv4T-compliant core containing a memory management unit. The unit has been verified using hardware simulation and an assembly test framework.

Problem Description

Task: Improving the AMBER-based Processor Core for the Heterogeneous SHMAC Multi-Core Prototype

Current multi-core processors are constrained by energy. Consequently, it is not possible to improve performance further without increasing energy efficiency. A promising option for making increasingly energy-efficient CMPs is to include processors with different capabilities.
This improvement in energy efficiency can then be used to increase performance or lower energy consumption.

Currently, it is unclear how system software should be developed for heterogeneous multi-core processors. A main challenge is that little heterogeneous hardware exists. It is possible to use simulators, but their performance overhead is a significant limitation. An alternative strategy that promises the best of both worlds is to leverage reconfigurable logic to instantiate various heterogeneous computer architectures. These architectures are fast enough to be useful for investigating system software implementation strategies. At the same time, the reconfigurable logic offers the flexibility to explore a large part of the heterogeneous processor design space.

The Single-ISA Heterogeneous MAny-core Computer (SHMAC) project aims to develop an infrastructure for instantiating diverse heterogeneous architectures on FPGAs. A prototype has already been developed. The current processor model is a 5-stage in-order pipelined AMBER processor. To ease software development and increase heterogeneity, this implementation should be extended. The absence of virtual memory (VM) support is currently a major limitation to SHMAC software development.

The main task in this assignment is the development and implementation of virtual memory support for Amber. Specifically, the student shall design and implement a memory management unit (MMU) in hardware and ensure that the software component (e.g., operating system) manages the virtual address space and assignments to virtual memory from real memory. An important part of a virtual memory system is a mechanism for paging, such that programs can make use of secondary storage. This has until now been a bottleneck in SHMAC, which has limited main memory. It is expected that the student will conduct a review of VM implementations, including in heterogeneous settings.
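The core of the MMU's job described above can be illustrated with a small software sketch. The following is a toy model, not the thesis implementation: it mimics the kind of two-level table walk an ARM-style MMU performs for 4 KB pages, with page tables modelled as Python dictionaries instead of memory reads, and all names chosen for illustration only.

```python
# Toy model of a two-level virtual-to-physical address translation,
# sketched after the ARM small-page scheme (illustrative, not the
# hardware design from this thesis).

PAGE_SHIFT = 12   # 4 KB pages: VA[11:0] is the page offset
L2_BITS = 8       # 256 entries per second-level (coarse) table
L1_BITS = 12      # 4096 entries in the first-level table

def translate(va, l1_table):
    """Translate a 32-bit virtual address via a two-level page table.

    l1_table maps a first-level index to a second-level dict, which
    maps a second-level index to a physical page frame number.
    A missing entry raises KeyError, standing in for a page fault.
    """
    l1_index = va >> (PAGE_SHIFT + L2_BITS)                # VA[31:20]
    l2_index = (va >> PAGE_SHIFT) & ((1 << L2_BITS) - 1)   # VA[19:12]
    offset = va & ((1 << PAGE_SHIFT) - 1)                  # VA[11:0]
    frame = l1_table[l1_index][l2_index]   # two memory reads in hardware
    return (frame << PAGE_SHIFT) | offset

# Map virtual page 0 (l1 index 0, l2 index 0) to physical frame 0x80.
page_tables = {0: {0: 0x80}}
print(hex(translate(0x00000123, page_tables)))  # -> 0x80123
```

Without a TLB, every memory access would pay for the two extra table reads, which is why a translation cache is a standard companion to this walk.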
Preface

This thesis is submitted to the Norwegian University of Science and Technology.

Acknowledgements

I would like to thank my supervisor Donn Morrison, and co-supervisor Antonio Garcia Guirado, for their guidance. I would also like to thank Benjamin Bjørnseth for putting up with my questions and always striving to give thorough answers. Special thanks go to Isa Agnete Halmøy Fredriksen for her support throughout this semester. I am truly grateful.

J.D.K.

Contents

List of Figures
List of Tables
Listings
1 Introduction
  1.1 The Dark Silicon Effect
  1.2 SHMAC
  1.3 Requirements
    1.3.1 Address translation
    1.3.2 Resource balanced
    1.3.3 Portable
    1.3.4 Verified
  1.4 Contributions
  1.5 Report Outline
2 Background
  2.1 Virtual Memory
    2.1.1 Memory Hierarchy
    2.1.2 Memory Management
    2.1.3 Introducing Virtual Memory
    2.1.4 Memory Management Unit
    2.1.5 Virtual Memory in Operating Systems
  2.2 Fast Address Translation
    2.2.1 Translation Lookaside Buffer
  2.3 ARM MMU
    2.3.1 About the MMU Architecture and System Overview
    2.3.2 MMU Control
    2.3.3 Translation
    2.3.4 Page Faults and Access Control
    2.3.5 Walking the Page Table
    2.3.6 TLB and Caches
  2.4 Virtual Memory in Multicore Systems
  2.5 Amber
    2.5.1 Wishbone Bus
3 Vilma
  3.1 Design Decisions
  3.2 Architectural Overview
  3.3 MMU Implementation
    3.3.1 MMU Schematic
  3.4 TLB Implementation
  3.5 Translation Table Walk Hardware
  3.6 Access Control Hardware
  3.7 Coprocessor
4 Evaluation & Discussion
  4.1 FPGA Resource Usage
  4.2 Verification
    4.2.1 MMU table walk
  4.3 Bugs in the pipeline
    4.3.1 Load hazard
    4.3.2 Store hazard
  4.4 Discussion
5 Conclusions and Further Work
  5.1 Assignment Requirements
    5.1.1 Address translation
    5.1.2 Resource balanced
    5.1.3 Portable
    5.1.4 Verified
  5.2 Further work
    5.2.1 Investigate coprocessor bugs
    5.2.2 Increase TLB associativity
    5.2.3 Support entry invalidation
    5.2.4 Support other data structures
    5.2.5 Make Caches Virtually Indexed, Physically Tagged
Bibliography
A Synthesis Report
  A.1 Vilma Xilinx Mapping Report File
  A.2 Amber Xilinx Mapping Report File