ebook img

Taxim: A Toolchain for Automated and Configurable Simulation for Embedded Multiprocessor Design PDF

0.93 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Taxim: A Toolchain for Automated and Configurable Simulation for Embedded Multiprocessor Design

Taxim: A Toolchain for Automated and Configurable Simulation for Embedded Multiprocessor Design Gorker Alp Malazgirt Deniz Candas Arda Yurdakul DepartmentofComputer IstanbulerGymnasium DepartmentofComputer Engineering 34112Fatih,Istanbul,Turkey Engineering BogaziciUniversity [email protected] BogaziciUniversity 34342Bebek,Istanbul,Turkey 34342Bebek,Istanbul,Turkey [email protected] [email protected] 6 ABSTRACT 1 0 Multicoreembeddedsystemshavebeenconstantlyresearched 2 toimprovetheefficiencybychangingcertainmetrics,suchas processor,memory,cachehierarchiesandtheircacheconfig- n urations. Using Multi2Sim and McPAT simulators in com- a J bination allows the user to design various multiprocessing architectures and estimate performance, power, area and 3 timing metrics. However, the design time required to sim- 1 ulate these systems is daunting and prone to human error. In this paper, we introduce Taxim, a toolchain that can ] C automaticallycreaterequestedmulticoreon-chiptopologies D along with minimizing the simulation time due to repeti- tive tasks between architectural power, energy and timing . s simulations. Taxim’sdecision-tree-basedtopologysynthesis c tool creates processor configuration files that can be highly [ erroneouswhengeneratedmanually. Thetoolchainalsoau- 1 tomates the steps from design entry to output report ex- v traction by running automation scripts, and listing the re- 1 sults. Our experiments show that multiprocessing archi- 4 tectures with 32 cores and irregular cache hierarchies are Figure 1: Comparison of manual and automatic ex- 3 more than 1k lines of code in Multi2Sim’s processor con- periment generation with respect to time, creating 3 figuration format and Taxim can create such a file in less 4-core multiprocessor topologies 0 than 10 milliseconds. The source code is freely available at . https://github.com/bouncaslab/TaXim/. 1 0 ergy efficiency, and production costs that are dominated by 6 CCSConcepts thedesigntimeandthechiparea. Thesimulationtoolsare 1 •Computersystemsorganization→Multicorearchi- essential for designing complex systems because these tools v: tectures; Embedded hardware; Superscalar architectures; help the designer in searching the design space and with- i out experimenting on physical implementations, which are X Keywords time consuming and costly. In the embedded systems de- r sign domain, cache [16], memory [15], architectural [18],[3] a Multiprocessors, embedded systems, simulation, decision- and energy [12] simulators exist so as to evaluate the en- trees ergy or performance of processor architectures in the large designexplorationspace. Thus,simulatorsarecoupledwith 1. INTRODUCTION design space exploration (DSE) tools [22],[7] where optimal designs are sought using algorithmic approaches based on Thecurrenttechnologicalchallengefacedwithembedded designer’s requirements. In general, this procedure consists systems is balancing for varying domains performance, en- ofalargenumberofiterationsbetweenDSE’soptimizerand the simulators, and continues until sufficiently good results areachieved. Hence,ingeneral,ateachiterationthedesign under simulation undergoes hardware or software modifica- tions and this process requires automation. Multiprocessor design exploration requires generation of various multiprocessor topologies. These topologies do not HIP3ES’16January18–20,2016,Prague,CzechRepublic only differ in cache organizations and number of cores, but they also vary in the number of caches, cache levels, and ACMISBN000-0-0000-0000-0. DOI:000.0000/0000 data/instructioncachesharingmetrics. Specifically,formul- tiprocessing with many cores and caches, manual cache- processor-memory connectivity creation is time consuming and prone to human error. In this work we present Taxim, a toolchain that auto- matesarchitecturegenerationandthesimulationofembed- ded multiprocessing systems using Multi2Sim [18] and Mc- Pat [12] simulators that have been used extensively in re- searchandDSEtools. Multi2Simoffersthedesignerafully customizable environment that simulates both the devices and their interactions. McPat on the other hand allows a detailed exploration of crucial efficiency metrics. Automated architecture generation is fast and not prone tofactorssuchasfatigue. InFigure1,wepresentthenum- berofexperimentcreationwithrespecttotime. Experiment creation task includes the generation of 4-core multiproces- sortopologyliststhathavelessthan200linesofMulti2Sims [22]architecturerepresentationfiles,automationscriptsand simulation output preparation. Based on our observations on a group of three students, we have measured that the rateofdesigngenerationdecreasesovertimeduetofatigue. Thus, However, with an automated tool, after an initial setuptimetospecifydesignmetricsandprepareatopology list, experiment preparation continues at a constant pace. In this paper, Taxim’s contributions can be summarized Figure 2: Decision tree that represents different as: pathways taken by Taxim to create a given topol- 1. In Taxim, we introduce a decision tree based method ogy for that creates processor multi-core processing archi- tectures by using decision trees. Thus, Taxim speeds up design time considerably when compared with the First,wepresentanautomatedprocessorarchitecturegen- manual generation of the architectures. erationmethodbasedonadecisiontree. Itallowsustogen- erate very complicated multi core architectures with multi- 2. Itprovidesanautomatedtoolchainthatallowsdesign- ple levels of cache hierarchy with ring bus network support ers to run multiple simulations using Multi2Sim [18] whennecessary. Automaticarchitecturegenerationtoolalso andMcPat[12]. Inaddition,Taximisbuilttoprepare validatescardinalityandconnectivityofgeneratedarchitec- structured output reports of numerous results such as tures. Thisallowsautomaticgenerationmoreadvantageous processor datapath specifics, cache line utilization to over manual generation specifically for complex designs. leakage power consumption. Design space exploration Second, we present an automatic toolchain, which auto- (DSE) tools [22, 7] can utilize these reports. mates the step from design entry to output, report extrac- tion for architectural and power, area and timing simula- 3. While setting up the environment for multiple sim- tions. Inthiswork,unlikeexistingresearchworkswepresent ulations, the designer might enter some parameters thesestepsdecoupledfromothertopicssuchasdesignspace wrong. Hence,thesimulationsmighthavetogetaborted. exploration [21, 19, 10] or simulation methods [19]. In this Yet,theexistenceoferroneoussimulationset-upsmight way,ourtool-kitcanbeusedasaplug-inbythedesignspace truly slow down the completion time of multiple sim- exploration tools. Since the environment enables multiple ulations. In order to prevent erroneous simulations concurrent simulations, the simulation time can be consid- due to human factor, we introduce intelligent control erablyreducedinmulti-coremachinesandsupercomputers. in Taxim. In this way, multiple simulations can be Obviouslythiswillreducethedesignexplorationtimewhen carried out smoothly in the shortest possible time. Taxim is coupled with DSE tools. We present Taxim in the following way. We first present Inthedesignspaceexplorationliterature,therehavebeen relatedworksinthenextsection. Section3detailsourarchi- plentyofstudiesthatrequirelargenumberofsimulationsfor tecture generation and validation method. In Section 4 we findingparetosetorthesingle”best”designpointfromsin- detail Taxim’s process flow. Section 5 presents the experi- gleormultipledesignparameters[6,17,19,10,1,11,21,5, mentalresultsthatcompareTaximandmanualapproaches. 8]. Forthisreason,thisworkexhibitsanautomaticarchitec- We conclude the paper in Section 6 with our conclusions. ture topology generation method that can be used in DSE systems that employ Multi2Sim [22] and McPat [4]. Along- side automated architecture generation, automated design 2. RELATEDWORK entry to extraction of output reports allows DSE tools to Simulatorshavebeenanimportantpartofdesigningpro- processdesignpointsinarapidandcontinuousmanner. The cessing systems [3, 20, 6, 14, 17, 8]. The first step before authors in [1] explore energy consumptions of various two runningasimulationisdesignentry,thatallowdesignersto level cache configurations using McPat simulations. There customize the hardware and the benchmark that will exe- has not been any automation regarding the architectural cute. Our work differs from available works as follows: simulationsduetothefactthattheunderlyingprocessorar- chitectureisfixed. Similarly,theworkin[11]explorescache hierarchy by estimations and simulations. However, cache topologyhasbeenfixed,thusautomatedcachetopologygen- Algorithm 1: Architecture generation from string rep- eration has been omitted. However, our work first presents resentation automaticarchitecturegeneration,thenexplainstherestof Input: An architecture’s string representation the tasks that automate whole simulation process from de- Output: Multi2Sim processor configuration file signentrytooutputextraction,usingMulti2SimandMcPat Step 1: Parse given architectural string and tokenize environments. cores, instruction caches and data caches; Authors in [6] have presented a design space exploration Step 2: Determine number of cores, then determine of computer architectures that searches in a large design the number of first, second and third level instruction space. The automated tool customizes cache architectures and data caches, group them with respect to levels; and processor internals, however there has not been any Step 3: Traverse the decision tree and select architectural exploration such as connectivity of different architecture type from available types: (Regular, cachelevels,cachesharingandprocessorsinMulti2Simand Semi-Hybrid, Hybrid, 2nd Level Bypass and 3rd Level McPat. In addition, the generation scripts have only been Bypass); limited to single-core architectures. The work in [8] takes Step 4: For each core in the system: a range of high and low-level parameters to improve accu- Connect each core to an empty first level data cache; racy in the design of a multiprocessor system on a chip. If there does not exist an empty first level data cache; Authors automate generation of processor specifics such as Connect each core to the least connected first level data theinstructionset,numberofpipelinesandpipelinestages. cache; However, cache topology is fixed whereas our architecture Connect each core to an empty first level instruction generationtoolalsoautomatesthecachetopologycreation. cache; Cache hierarchy exploration engine presented by Yessin If there does not exist an empty first level data cache; etal[21]firstdeterminestheboundariesofthedesignspace Connect each core to the least connected first level by determining the ranges of the design variables and pre- instruction cache; pares the memory traces through simulation. Next, an op- Step 5: For each level in the cache hierarchy: timization algorithm iteratively determines the most suited Connecteachdata/instructioncachetoanemptyupper cache hierarchy from the given benchmark set. By using level data cache an automated toolchain similar to Taxim, the optimization If there is no empty cache; unit of the system can request additional simulations, thus Connect each core to the least connected upper level the feedback loop can be expanded with more simulation data cache; results. Similarly, the work in [19] provides an mpsoc sim- If there is no upper level cache; ulation environment where designers can provide automa- Connect each cache to the memory; tion scripts to generate simulations with various design pa- Step 6: If BP exists in the string representation: rameters via the parameter file. The presented environ- For each last level memory component connect a ment allows embedding different memory units with differ- network switch: ent Network-On-Chip (NOC) architecture support. In this For each incoming cache connection to each: last level way, the simulation tool can be connected to various de- memory signautomationtools. Thedesignspacecanbeenlargedby For each cache level in the system do separately for combiningsimulationandestiomationof architectureswith each data and instruction cache: tolerable error margins. The authors in [10] also provides Group data/instruction cache connections in two and amethodforperformanceestimationofpipelinedmultipro- connect to a switch; cessor system-on-chip architectures. Analytical models are Connect each switch pair with a new switch; combined with simulation data and exploration is handled If total number of caches is odd, connect the remaining on the aggregated design space. The design of the experi- cache to one of the pairing cache switch; ments requires numerous experiments therefore both simu- Connect each switch pairs to the memory; lationandestimationrequiresarchitecturalpreparationand application settings. Thus, when the number of architec- turesandbenchmarksincreases,atoolchainlikeTaximplays animportantroleforautomatingarchitecturalpreparation, simulation/estimation and results extraction tasks. Symbol Definition C Core Count 3. DETAILSOFARCHITECTUREGENER- Shared Data/Instruction ATION L<x> Level 1, 2 or 3 Cache In this section we detail the architecture generation and IL<x> Instruction Level 1 or 2 Cache validationmethods. OurnomenclatureispresentedinTable DL<x> Data Level 1 or 2 Cache 1. We define the number of cores with C. When data and BP Exists a bypass connection instruction caches are shared, we represent these caches as L. When instruction and data caches are separate, they are Table 1: Nomenclature of the Topology Represen- specifiedasILandDLrespectively. Ifthereexistsabypass tation connection [13], it is shown with BP. We can show usage of the nomenclature in two examples: • 2C 4L1 1L2 explains that there exists two cores, four of decision tree is to extract the maximum available infor- L1 shared caches and one L2 shared cache mation from the string representation and separate domain information parsing from architecture generation. In Algo- • 2C 2DL1 2IL1 1DL2 BP shows us that there exists rithm 1, we show the architecture generation procedure of two cores with two data L1 cache, two instruction L1 Taxim. caches, one data L2 cache and there exists a bypass Algorithm1startsbyparsingtheinputthatisthetopol- connection from instruction caches to the memory ogy string representation. After string representation is parsed, Taxim tokenizes the cores, instruction/data caches Thisnomenclatureisfollowedtocreatetopologyrepresen- and their levels. Based on this information, the decision tations of designated architectures. The designer can enter tree is traversed and type information of the topology is in string format to receive configurations recognizable by extracted that is shown in Line 3. Then, the algorithm the simulator. The designer can also opt to work with pro- proceeds by connecting each core to first level data caches. vided topology sets that are available in the library of the Basedonavailabledatacachesandinstructioncaches,each toolchain. The topology creation tool relies on strictly de- core is either connected to an available cache, or to a cache fined”TopologyCreationRules”asoutlinedin[13]. Theuser that has the least amount of connection. After all cores are may specify the desired topologies in a text file by entering connected, first level cache connections occur. Each cache their names according to the designated nomenclature. levelconnectstoanupperlevelcachethatisfreeorhasthe Architecture Generation: Duringarchitecturegenera- leastamountofconnection. Atthehighestcachelevel,each tion, Taxim first tries to extract the type of topology by cache is connected to the main memory. These steps are parsing the names of the architectures. The parser also representedbetweenLine4and5. Basedonthesesteps,the checks compliance of the architecture name to the ”Topol- connectivityofthetopologycanbedefinedasthefollowing: ogy Generation Rules”[13]. Common design rule errors are Letncbethenumberofcomponentsatleveli,mcbethe missingunderscores,wrongorderofcaches,missingorextra the number of components at level y such that i < y, and letters that do not conform to generation rules. After pars- from Topology Generation Rules [13], nc>=mc ing the nomenclature without any errors, a simple decision Components at level i must be connected to components tree is employed as shown in Figure 2. The decision tree inlevelyandaftereachcomponentatleveliisconnectedto (DT)guidesthegenerationtoolforthetypeofroutingthat anavailablecomponentatlevely,thenumberofconnections should be applied to construct the given topology. c created by Algorithm 1 becomes: The decision tree consists of four rules to call five differ- iy ent methods. These methods are: Regular Fat Tree, Semi- c =(cid:98)nc/mc(cid:99) (1) iy Hybrid, Hybrid, 2nd Level Bypass and 3rd Level Bypass. If the number of components is odd, the remaining connec- Thegenerationrulescanbesummarizedasfollowsandthey tionsareconnectedinjectively,andthenumberofremaining are derived from ”Topology Creation Rules”[13]: connections cr become: iy • Rule1: Taximchecksifthereexistsseparatedataand cr =nc%mc (2) instructioncaches. Ifnot,thegeneratoremploysReg- iy ularmethodwhereVonNeumannorHarvardarchitec- Thus,thetotalnumberofconnectionsbetweenleveliandy tures can be employed ct becomes: iy • Rule 2: Taxim checks if there exists second level data ctiy =ciy∗mc+criy (3) unit. Absence of second level data caches employs Aftercore/cacheconnectionsareestablished,Taximchecks Semi-Hybrid method. In this method, the generated if there are any bypass representation in the given topol- architectures are hybrid architectures because there ogy representation in Line 6. If there exists any bypassing, existssharedinstructioncachesatthefirstlevel. Nev- Taximstartstoaddnetworkswitchesforformingthebypass ertheless, above first level, all data and instruction structure. The first step is to connect a network switch to caches are shared without any bypassing thelastlevelmemorysuchasthemainmemoryorlastlevel caches. Then, it starts grouping data caches into pairs. If • Rule3: Taximchecksifgiventopologyrepresentation thenumberofdatacachesisodd,thelastremainingcacheis has BP tag that stands for by-passing a cache. Ab- connectedtoapairmakingwhichmakesitatriple. Then,a senceofBPtagmeansthatthegivenarchitectureisa newswitchisaddedtoaneachpairofcaches. Similarly,each hybridarchitectureanddatacachesareseparatefrom switch pair is connected to a new switch until all switches instruction caches at all levels arecovered. Iftotalnumberofcachepairswitchesisodd,it • Rule4: Taximchecksifthereisasecondlevelinstruc- connectstoanexistingcacheswitchpair,makingitatriple. tion cache. The absence of second level data cache Afterdatacacheswitchesarecreated,samestepsareapplied employs the 2nd level bypass method that generates for instruction caches as well. When all data/instruction architectureswithfirstlevelinstructioncachesbypass- cache switches are generated, these switches are connected ing second level cache to main memory. Finally, the to the main memory switch. 3rd level bypass method is employed that means sec- Figure4visualizesthenetworkcreationprocessexplained ond level instruction caches are bypassed third level in Algorithm 1. After Taxim connects DL2 and IL1 caches caches. to main memory in Figure 3A, first a switch is connected to the main memory. Then, DL2 caches are grouped and Themethodtypes,briefexplanationsandatwocoresam- they are connected to a switch. Next, IL1 caches are also pletopologiesareshowninFigure3. Thesemethodsarede- grouped and connected with a switch. The main memory rived from ”Topology Creation Rules”[13]. The advantage switch is then connected with the cache grouping switches. Figure 3: Comparison and explanation of topology methods and their examples Figure 4: Logical representation (left) and imple- mentation (right) of the network of 5C 5DL1 2IL1 - 2DL2 BP Figure 6: Topology of 5C 5DL1 2IL1 2DL2 BP Ar- chitecture given topology string representation is tokenized and cores, data/instruction caches in every cache level are extracted, cardinality of these tokens are compared with the cardinal- ity of the generated cores and caches. For determining the connectivity of the caches and cores in general, the valida- Figure 5: The switch network connections for a by- tion method traverses each component in the generated ar- pass structure, there exists no more than three con- chitecture topology and computes the number connections nections from lower level to upper level in the hier- described in Equations (1), (2) and (3). archy Asanexample,inFigure6,thereexists5Coresand5DL1 caches, thus each core is connected to a DL1 cache. How- ever, there are 2 IL1 caches, thus after each cache accepts InFigure5,networkswitchconnectionsofamorecompli- (cid:98)5/2=2(cid:99) connections, the remaining 5%2 = 1 connection cated topology, 24C 24DL1 12IL1 6DL2 2L3 BP (a second must be connected to the first available cache in the cache levelbypass)isshown. First,anL3cacheisconnectedwith list,increasingthenumberofconnectionsto3foroneofthe a switch. Pairs of IL1 caches are connected with a switch IL1 caches. Similarly, at the second cache level, the con- and three IL1 switches are connected to a switch and then nectivity between 5 DL1 and 2 DL2 are validated the same connected to the the L3 switch. Three DL2 switches are way. connected with a switch and then this switch is connected to the L3 switch. These connections are repeated for the 4. TAXIMTOOLFLOW second L3 cache, starting with creating a switch for the L3 cache. In this section we present the process flow of Taxim and As a design example, let us assume that the designer detailtheimportantparts. Figure7showstheprocessflow. requests the creation of 5C 5DL1 2IL1 2DL2 BP which is The first step is defined as design entry. In this step the showninFigure6. Accordingtothedecisiontree,firstrule designer determines design and simulation specific proper- istocheckwhetherthistopologyhasafirstleveldatacache ties. In the second step, Taxim’s stencil scripts read design (DL1) component. If it is available, the requested topology entryfilesandcreatethesimulationenvironmentandarchi- can be a regular type. Next, the system checks whether a tectural topologies that are used by Multi2Sim and McPat second level data cache (DL2) component also exists and if [12]. Third step is composed of simulation with intelligent not,itwouldbeasemi-hybridmemoryconfiguration. Inthe controlthatallowsearlyterminationofpoorperformingde- thirdrule,itischeckedwhetherthesuffixBPhasbeenadded signwithrespecttoauserdetermineddesignparametersuch to the nomenclature. If true, it is checked whether there is as latency. The last step includes extraction of simulation asecondlevelinstructionunit. Sinceinourcasethereisno results. Results are also stored for later usage for analysis second level instruction unit, the system identifies that we purposes. have a type ”Level 2 Bypass”. The bypassing occurs from The outlined steps are intended to guide the user from second level instruction caches to the main memory. designerspecificationstoaggregatedresultsofnumerouspa- The topology generation algorithm connects each core in rameters that are selected by the designer. As the diagram the architecture to a single DL1. Then, each core is con- displays,thedesignerrequirementsregulatemostofthepro- nected to a free IL1, however there are only two IL1 in the cess flow. The input files are tailored according to chosen configuration. Hence, Core 1, 3 and 5 are connected to one topologiesandbenchmarks,whichallowMulti2SimandMc- IL1 and Core 2 and 4 are connected to the other IL1 as Pat[12]toexaminetheireffectiveness. Theacquireddatais designatedbythealgorithm. Afterthecoresareconnected, filtered through selected parameters to extract the data for first level data caches are connected to second level caches. the further evaluation of the designer. TherearetwoDL2inthetopology,thereforethreeDL1are 4.1 DesignEntry connected the first DL2 and the other two are connected to the second DL1. Then, there are not any third level Inthisstep,designerentersallthefilesthatTaximneeds caches in the topology, therefore all DL2 are connected to forrunningsimulations. AsshowninFigure8,thesearethe the memory. Since there is no IL2 in the topology, all IL1 benchmarks, their input data sets, processor architectures areconnectedtothememory,bypassingsecondlevelcaches. and design parameters that user aims to collect. Bench- Checkingcomponentcardinalityandconnectivityofthegen- marks should be entered in executable form with their re- eratedarchitecturevalidatesthegeneratedtopologies. After quired input data sets. Unlike some other platforms in the Figure 8: An example environment creation script used by Taxim to allow experimentation in succes- Figure 7: Process flow of Taxim representing all sion steps from user input until results extraction Component Number of Lines designer to conduct experiments in parallel by calling mul- tiple instances in the terminal or adding related scripts in Core 6 the same file. L1Cache 5-6 L2andL3Cache 6-7 4.3 SimulationandIntelligentControl MainMemory 5-6 In this step, Taxim starts Multi2Sim [18] architectural CacheGeometryDefinitions 5-15 simulation. Following the architectural simulation, power, Cache-Core&Cache-CacheConnection 4 area and timing simulations require McPAT [12] simulator. SingleRingBusNetworkElement 14 InordertocustomizeMcPat[12],Multi2Simsimulationre- ports are extracted. McPat [12] requires simulation output Table 2: Lines of code required per component that represents all the components that are present in the architecture. These are datapath details of each core such as the number of instructions in the pipeline stages, cache literature [3],[11], Taxim accepts plain text format for ar- and memory communication details such as cache miss/hit chitecturegenerationanddesignparameterselection. Thus, ratio, and interconnection network details such as sent and these files can also be edited manually. received packets etc. The output of McPat [12] simulation 4.2 SimulationPreparation are located in designer designated locations and the results are reported together with performance outputs. The de- Inordertosupportabroadrangeofsystems,Taximpro- signer does not need to do any additional work in order to videsautomationbashscriptsthatcanbeexecutedonoper- extractMcPat[12]output. Theoutputdataextractionand atingsystemsthatsupportbash. Taximprovidesinputand reporting is explained in Section 4.5. outputlocationswhichMulti2Sim[18]andMcPat[12]sim- Taxim introduces two simple methods for early termina- ulators read inputs and write output values. Automation tion of simulation: scripts are initially in stencils that are found in the envi- ronment. They are processed after the designer manually • While setting up the environment for multiple sim- completes them. As an example, in Figure 8, we present ulations, the designer might enter some parameters a stencil which relates the scripts to the Design Entry files wrong. Hence, the simulations have to be aborted. that are shown as Benchmarks, Topologies and Design Pa- Taxim prevents designers to simulate erroneous tests. rameters. In the Design Entry step, the designer enters the When a very large number of experiments are con- locations of the mentioned design entry files, thus Taxim figured with many benchmarks, input data sets and populates the stencil scripts. This design also allows the topologies, the designer can opt for first testing the benchmarks,theirinputandthetopologiesbyrunning is understandable that the complexity between topologies a functional simulation that is a lot faster than cycle is not entirely dependent on the amount of cores, but the accurate simulation. If there exists no errors during amount of connections that have to be defined for further functional simulation, then the cycle accurate simula- modules. Hence, 18C 6L1 3L2 is very easy to code, but tion starts. 6C 4DL1 3IL1 2L2 is a bigger challenge due to varying in- put and output streams between caches. The same is true • The designer can provide a design parameter and a forcomplicatednetworkstoo,as33C 23DL1 17IL1 12DL2 - satisfactioncondition. Iftheconditionisnotmetafter 4L3 BP has a higher line count than 37C 28DL1 19IL1 - a predetermined simulation time which is chosen by 13DL2 8IL2 5L3 BP. This is mainly due to their network thedesigner,Taximdropsthesimulationandproceeds files, since 33C 23DL1 17IL1 12DL2 4L3 BP 17 IL1 caches with the next topology in the simulation queue. For bypasssecondandthirdlevelandconnecttomemory,whereas example, user can put a condition on Latency metric. in37C 28DL1 19IL1 13DL2 8IL2 5L3 BP8IL2bypassthird IfthedesiredLatencyisnotmetafteracertainamount levelcacheandconnecttothememory. AccordingtoTable of simulation time, Taxim stops simulating the design III,networkswitchesthatareusedinbypasstopologiescon- under test. sume the most lines of code, therefore 33C 23DL1 17IL1 - 12DL2 4L3 BPconsumesmorelinesthan37C 28DL1 19IL1 - 4.4 ReportingandStorage 13DL2 8IL2 5L3 BP. Taximextractsnecessaryoutputparametersfromthesim- Table3alsoillustratesthat,managingevensimpletopolo- ulation output files. Taxim recognizes all design parame- gies requires diligence for many connections, and creating tersthatMulti2Sim[18]andMcPat[12]provide. Currently, topologies with higher complexity manually is prone to er- Taxim supports Comma Separated Values (csv) and plain rors. Hence, by automating the creation and validation, we textfileoutputstostorethesedesignparameters. Theout- can eliminate mistakes, optimize the topology creation pro- put locations and designated design parameters are config- cess and interconnections of topologies themselves. ured by the designer in Step 1 that is shown in Figure 7. Intelligent Simulation Control: The intelligent sim- Thegeneratedresultscanbeusedforexploration[13],aug- ulation control capabilities of Taxim that are explained in mented to DSE tools [22] or stored in databases. Section4.3,havesavedconsiderableamountsimulationtime in our experiments. As an example, Vips [2] benchmark on 5. EXPERIMENTS 4-corebypassandhybridarchitecturestakes15hoursonav- erage to complete in our simulation test bed. Using early In this section, we present how our automated tool im- terminationmethod,wehaveoptedtosimulateVipsbench- proves the overall simulation process that can be used as a marksontopologieswhichyieldinstructionspercycle(IPC) standalone or part of a DSE tool. We have used Taxim in value greater or equal to 1.0. After thirty minutes of initial exploring various symmetric multiprocessing systems [13]. simulation,topologiesthatdonotmeettheIPCcriteriahave Duringourexperimentsin[13],500designpointshavebeen beeneliminated. Thus,10topologiesoutof64hadIPCless consideredforsimulation. Thesesimulationsinvolvedaround than1.0. Remaining54topologieshavebeensimulatedwith 100differenttopologiesand5differenttestcasesfromPAR- cycleaccuratesimulation. Thiseliminationhasreducedthe SEC [2] and MiBench [9] benchmarks. overall simulation time by 12.2 5.1 TopologyGenerationMetrics AnotheraspectofTaxim’sintelligentcontrolistoprevent wastingsimulationtimeduetoerroneousdesignentrybythe The lines of simulation codes generated by the topology designer. Thisisovercomebyfirstrunningfunctionalsimu- generationalgorithmisproportionaltothethetopology. As lationoneachtopologybeforethecycle-accuratesimulation anexample, acomplex topologysuchas 20C 10DL1 4IL1 - [18]. The functional simulation emulates underlying archi- 5DL2 2L3 BP requires much more lines of code (LOC) for tectureandtakessignificantlylessamountoftimecompared connectivity than a simple 2C 2DL1 2IL1 2DL2 2IL2 1L3 to detailed simulation time. In our testbed, x264 detailed architecture. Taxim applies Algorithm 1 in order to build simulation on a 4-core architecture takes 12 hours on aver- Multi2Sim[18]topologyconfigurationfileofthe20-corear- age to complete whereas the functional simulation takes 6 chitecture. Thus,allthecomponentsandconnectivitystruc- minutes. Assuming that 60 simulations will be carried out turesaredefined. InTable2,wepresentthenumberoflines sequentially, and 10simulations are faulty due toerroneous thatisrequiredtodescribecomponentsandconnectionsbe- designentrybythedesigner. Ifpre-checkfunctionalsimula- tween them. Hence, increasing number of components and tionhadnotbeenopted,120hoursofsimulationtimewould bypasstopologiestakemuchmorelinesofcodetorepresent. be wasted. However, by opting pre-check, the designer can According to Table 3, the sample 20-Core design takes 443 fix the faults after functional simulations. Thus, in this ex- LOC. However, for 2-core architecture, it is only 106 LOC. perimental scenario, intelligent pre-check simulation saved Thus, the automation Taxim provides will be more visible 114 hours overall simulation time. and helpful for future non-regular complex topologies. To further illustrate this point, we have created a basic 6. CONCLUSION andverycomplicatedmodelofeachtopologymethodusing Taxim. Creating a topology list took less than five minutes In this paper, we present Taxim, a toolchain that auto- and all the topologies were created within 70 milliseconds. mates architectural, power, area and timing simulation us- Manual construction would take five to thirty minutes for ing Multi2Sim [18] and McPat [12] tools. Taxim provides each basic topology and an indefinite time for very compli- significantspeedupforcreatingprocessorarchitecturesand cated topologies. output generation compared to manual methods. Decision Comparing the LOC required by 13C 9L1 5L2 3L3 and treebasedautomatictopologygenerationallowsgenerating 18C 9DL1 6IL1 3L2, which is only a 5 line difference, it complexmulticorearchitecturesfromsimpletextrepresen- Component Basic Topology LOC Generation Time (ms) Regular 2C2L12L21L3 76 9.09 Semi-Hybrid 4C4DL12IL11L2 83 8.9 Hybrid 3C3DL13IL13DL21IL21L3 122 8.74 2ndLevelBypass 2C2DL12IL11DL21L3BP 130 9.24 3rdLevelBypass 4C4DL12IL12DL22IL21L3BP 182 9.28 Component Complex Topology LOC Generation Time (ms) Regular 13C9L15L23L3 227 8.64 Semi-Hybrid 18C9DL16IL13L2 232 8.65 Hybrid 17C11DL18IL15DL23IL22L3 321 9.09 2ndLevelBypass 32C23DL117IL112DL24L3BP 1105 9.39 3rd Level Bypass 37C 28DL1 19IL1 13DL2 8IL2 5L3 BP 979 9.28 Table 3: Examples of for each topology, followed by the amount of lines and generation times tations. Intelligent control mechanism embedded in Taxim [6] R. Chis, M. Vintan, and L. Vintan. Multi-objective prevents wasting simulation time. Taxim source code is dse algorithms’ evaluations on processor optimization. availableforuseathttps://github.com/bouncaslab/TaXim/. In Intelligent Computer Communication and Inourexperimentswithtaxim,wehaveshownthatcomplex Processing (ICCP), 2013 IEEE International multiprocessingarchitecturescangetmorethanathousand Conference on, pages 27–33. IEEE, 2013. linesinMulti2Sim’srepresentationformat,whichisverydif- [7] V. Desmet, S. Girbal, A. Ramirez, A. Vega, and ficult and slow to generate manually. However, Taxim can O. Temam. Archexplorer for automatic design space generate complex architectures in milliseconds. Alongside exploration. IEEE micro, (5):5–15, 2010. automatic processor generation, intelligent control prevents [8] R. Devigo, L. Duenha, R. Azevedo, and R. Santos. designers to save valuable simulation time due to human Multiexplorer: A tool set for multicore system-on-chip errors. design exploration. In Application-specific Systems, Architectures and Processors (ASAP), 2015 IEEE 7. ACKNOWLEDGMENTS 26th International Conference on, pages 160–161. IEEE, 2015. Funding from the Turkish Ministry of Development un- [9] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. der the TAM Project, number 2007K120610 and Bogazici Austin, T. Mudge, and R. B. Brown. Mibench: A free, University Scientific Projects number 7060 was received. commercially representative embedded benchmark suite. In Workload Characterization, 2001. WWC-4. 8. REFERENCES 2001 IEEE International Workshop on, pages 3–14. IEEE, 2001. [1] A. Bengueddach, B. Senouci, S. Niar, and B. Beldjilali. Energy consumption in reconfigurable [10] H. Javaid, A. Ignjatovic, and S. Parameswaran. mpsoc architecture: two-level caches optimization Performance estimation of pipelined multiprocessor oriented approach. In Design and Test Symposium system-on-chips (mpsocs). Parallel and Distributed (IDT), 2013 8th International, pages 1–6. IEEE, 2013. Systems, IEEE Transactions on, 25(8):2159–2168, 2014. [2] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: Characterization and [11] Z. J. Jia, A. D. Pimentel, M. Thompson, T. Bautista, architectural implications. In Proceedings of the 17th and A. Nu´n˜ez. Nasa: A generic infrastructure for international conference on Parallel architectures and system-level mp-soc design space exploration. In compilation techniques, pages 72–81. ACM, 2008. Embedded Systems for Real-Time Multimedia (ESTIMedia), 2010 8th IEEE Workshop on, pages [3] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, 41–50. IEEE, 2010. A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. [12] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. ACM SIGARCH Computer Architecture News, Tullsen, and N. P. Jouppi. Mcpat: an integrated 39(2):1–7, 2011. power, area, and timing modeling framework for multicore and manycore architectures. In [4] H. Calborean, R. Jahr, T. Ungerer, and L. Vintan. Microarchitecture, 2009. MICRO-42. 42nd Annual Optimizing a superscalar system using multi-objective IEEE/ACM International Symposium on, pages design space exploration. In Proceedings of the 18th 469–480. IEEE, 2009. International Conference on Control Systems and Computer Science (CSCS), Bucharest, Romania, [13] G. A. Malazgirt, B. Kiyan, D. Candas, K. Erdayandi, volume 1, pages 339–346, 2011. and A. Yurdakul. Exploring embedded symmetric multiprocessing with various on-chip architectures. In [5] T. E. Carlson, W. Heirman, and L. Eeckhout. Sniper: Embedded and Ubiquitous Computing (EUC), 2015 exploring the level of abstraction for scalable and IEEE 13th International Conference on, pages 1–8, accurate parallel multi-core simulation. In Proceedings 2015. of 2011 International Conference for High Performance Computing, Networking, Storage and [14] R. P. Mohanty, A. K. Turuk, and B. Sahoo. Analysis, page 52. ACM, 2011. Performance evaluation of multi-core processors with varied interconnect networks. In Advanced Computing, mpsoc simulation environment for dynamic Networking and Security (ADCONS), 2013 2nd application processing. In Computer and Information International Conference on, pages 7–11. IEEE, 2013. Technology (CIT), 2010 IEEE 10th International [15] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. Conference on, pages 1880–1886. IEEE, 2010. Dramsim2: A cycle accurate memory system [20] K. Yan and X. Fu. Energy-efficient cache design in simulator. Computer Architecture Letters, 10(1):16–19, emerging mobile platforms: the implications and 2011. optimizations. In Proceedings of the 2015 Design, [16] D. Tarjan, S. Thoziyoor, and N. P. Jouppi. Cacti 4.0. Automation & Test in Europe Conference & Technical report, Technical Report HPL-2006-86, HP Exhibition, pages 375–380. EDA Consortium, 2015. Laboratories Palo Alto, 2006. [21] G. Yessin, A.-H. Badawy, V. Narayana, D. Mayhew, [17] C. Thompson, M. Gould, and N. Topham. High speed T.ElGhazawi,etal.”cere”: Acacherecommendation cycle approximate simulation for cache-incoherent engine: Efficient evolutionary cache hierarchy design mpsocs. In Embedded Computer Systems: space exploration. In High Performance Computing Architectures, Modeling, and Simulation (SAMOS and Communications, 2014 IEEE 6th Intl Symp on XIII), 2013 International Conference on, pages 88–95. Cyberspace Safety and Security, 2014 IEEE 11th Intl IEEE, 2013. Conf on Embedded Software and Syst (HPCC, CSS, [18] R. Ubal, J. Sahuquillo, S. Petit, and P. Lopez. ICESS), 2014 IEEE Intl Conf on, pages 566–573. Multi2sim: A simulation framework to evaluate IEEE, 2014. multicore-multithreaded processors. In Computer [22] V.Zaccaria,G.Palermo,F.Castro,C.Silvano,andG.Mar- Architecture and High Performance Computing, 2007. iani.Multicubeexplorer: Anopensourceframeworkforde- SBAC-PAD 2007. 19th International Symposium on, sign space exploration of chip multi-processors. In Archi- pages 62–68, 2007. tecture of Computing Systems (ARCS), 2010 23rd Interna- [19] N. Ventroux, A. Guerre, T. Sassolas, L. Moutaoukil, tional Conference on, pages 1–7. VDE, 2010. G. Blanc, C. Bechara, and R. David. Sesam: An

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.