DTIC ADA441135: Exact Analysis of the Cache Behavior of Nested Loops PDF

Exact Analysis of the Cache Behavior of Nested Loops*

Siddhartha Chatterjee†, Erin Parker†, Philip J. Hanlon‡, Alvin R. Lebeck§

† Department of Computer Science, The University of North Carolina, Chapel Hill, NC 27599-3175
‡ Department of Mathematics, University of Michigan, Ann Arbor, MI 48109
§ Department of Computer Science, Duke University, Durham, NC 27708

* This work was supported in part by DARPA Grant DABT63-98-1-0001, NSF Grants EIA-97-26370 and CDA-95-12356, NSF Career Award MIP-97-02547, The University of North Carolina at Chapel Hill, Duke University, and an equipment donation through Intel Corporation's Technology for Education 2000 Program. Erin Parker is supported by a Lawrence Livermore Computer Science Graduate Fellowship. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI 2001 Snowbird, UT USA
Copyright 2001 ACM 0-89791-88-6/97/05..$5.00

Report Documentation Page (Standard Form 298, Rev. 8-98, prescribed by ANSI Std Z39-18; Form Approved OMB No. 0704-0188)
    1. Report date: 2001
    4. Title and subtitle: Exact Analysis of the Cache Behavior of Nested Loops
    7. Performing organization name and address: Defense Advanced Research Projects Agency, 3701 North Fairfax Drive, Arlington, VA 22203-1714
    12. Distribution/availability statement: Approved for public release; distribution unlimited
    13. Supplementary notes: The original document contains color images.
    14. Abstract: see report
    16. Security classification of report, abstract, and this page: unclassified
    18. Number of pages: 12

ABSTRACT

We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kinds of misses as well as the state of the cache at the end of the loop nest. We use existing tools to simplify these formulas and to count cache misses. The model is powerful enough to handle imperfect loop nests and various flavors of non-linear array layouts based on bit interleaving of array indices. We also indicate how to handle modest levels of associativity, and how to perform limited symbolic analysis of cache behavior. The complexity of the formulas relates to the static structure of the loop nest rather than to its dynamic trip count, allowing our model to gain efficiency in counting cache misses by exploiting repetitive patterns of cache behavior. Validation against cache simulation confirms the exactness of our formulation. Our method can serve as the basis for a static performance predictor to guide program and data transformations to improve performance.

1. INTRODUCTION

The growing gap between processor cycle time and main memory access time makes efficient use of the memory hierarchy ever more important for performance-oriented programs. Many computations running on modern machines are often limited by the response of the memory system rather than by the speed of the processor. Caches are an architectural mechanism designed to bridge this speed gap, by satisfying the majority of memory accesses with low latency and at close to processor speed. However, programs must exhibit good locality of reference in their memory access sequences in order to realize the performance benefit of caches.

Optimizing compilers attempt to speed up programs by performing semantics-preserving code transformations. Loop transformations such as iteration space tiling [62] are a major source of performance benefits. They restructure loop iterations in ways that make the memory reference sequence more cache-friendly. The theory of loop transformations is well-developed in terms of deciding the legality of a proposed transformation and generating code for the transformed loop. However, models of the expected performance gains of performing a given loop transformation are less well-developed [19, 38, 45, 48, 50, 51, 61]. Where such models exist, they are often heuristic or approximate. For example, tiling requires the choice of tile sizes, and the performance of a loop nest is typically a non-smooth function of the extents of the loop bounds, the tile sizes, and the cache parameters [13, 19, 38]. The model we develop in this paper can be used to quantitatively determine the number of cache misses of a proposed transformation without explicit simulation. Ultimately, such a model could be used to guide the choice of parameters in such program transformations.

A complementary method for improving sequential program performance that has been investigated in recent years is that of transforming the memory layout of its data structures. Such data layout transformations can vary in complexity; examples include transposition and stride reordering [32], array merging [39], intra- and inter-array padding [50, 51], data copying [38], and non-linear array layouts [14]. Once again, proper choice of parameter values is of paramount importance in getting good performance out of such transformations, but the models guiding this optimization are often inexact. For instance, Rivera and Tseng [50, 51] use heuristics to determine inter-array pad. However, there is empirical evidence that almost every choice of pad can be catastrophically bad for a program as simple as matrix transposition [16]. Better models are clearly needed to guide such optimizations. Our work in this paper is a step in this direction.

An aggressive form of data optimization is the use of certain families of non-linear array layouts that are based on interleaving the bits in the binary expansion of the row and column indices of arrays. Previous studies have demonstrated performance gains as well as robustness of performance resulting from the use of such layouts [14, 15]. Yet it is difficult to ascertain, short of simulation, the memory behavior of a program given a particular data layout. This paper works towards building an analytical model of cache behavior for such layouts that can provide insight into the relationship between such data layouts and memory behavior.

Our model is an alternative to the well-known Cache Miss Equations (CME) model of Ghosh et al. [26]. Compared to CME, our model has the following strengths and weaknesses.

• Our model is exact as a consequence of our use of Presburger arithmetic as the underlying formalism. Ghosh et al. [26] use the abstraction of reuse vectors to simplify the analysis. Reuse vectors do not exist for all loop nests, and certainly do not exist in the presence of non-linear array layouts.

• Our model accurately determines the state of the cache at the end of executing a loop nest. This functionality is important for accurately counting compulsory misses [30], for rapidly leap-frogging up to a certain point in the computation, and for handling multiple loop nests.

• Our model handles imperfect loop nests in addition to perfect loop nests. We apply a transformation of Ahmed et al. [2, 3] to an imperfect loop nest, thereby converting it to a perfect loop nest with guards on statements. Ghosh et al. [26] consider only a single perfect loop nest.

• Our model handles a variety of array layout functions, from row- and column-major to non-linear. We will subsequently refer to row- and column-major layouts as canonical layouts [17]. The formulation for non-linear layouts is new, to the best of our knowledge.

• Our model handles caches with modest levels of associativity in a natural way. While Ghosh et al. [25] can handle set-associative caches, their solution method is equivalent to simulation in the worst case.

• Our model is capable of symbolic analysis. This is a direct consequence of our use of the Presburger formalism. For example, we can simplify a formula for the cross-interference between two arrays while keeping the difference of their starting addresses symbolic. The simplified formula can be rapidly evaluated for specific values of this variable.

• The enhanced capabilities of our model come at the cost of computational complexity, in the form of super-exponential worst case behavior of algorithms for satisfiability checking and quantifier elimination of Presburger formulas [60]. While we have a prototype implementation of our model as a SUIF [55] pass, and the analysis and formula generation portions of the implementation are acceptably efficient, significant improvements are necessary to the robustness and efficiency of the simplification and counting parts.

Compared to explicit simulation, our formulas capture temporal patterns of cache behavior that may not be apparent in simulation. Moreover, an analytical cache model provides deeper insight into the behavior than what may be learned from simulation. We anticipate that such information will make it possible to guide the choice of data layouts that optimize cache behavior. We validate the results of all our formulas against simulation in Section 4, thereby confirming their exactness.

The remainder of this paper is structured as follows. Section 2 reviews background material for discussing our approach to the cache analysis problem: basics of cache memory (Section 2.1), a new classification of cache misses (Section 2.2), the polyhedral model (Section 2.3), and Presburger arithmetic (Section 2.4). Section 3 constructs our model. Section 4 provides some preliminary results obtained using our cache analysis model. Section 5 discusses related work. Section 6 presents conclusions and future work.

2. BACKGROUND

This section provides background material and defines notation for the remainder of the paper.

2.1 Basics of memory hierarchies

We assume a simplified memory hierarchy that processes one memory access at a time, with no distinction between memory reads and writes.

2.1.1 Cache structure

The structure of a single level of a memory hierarchy (a cache) is generally characterized by three parameters [30]: Associativity, Block size, and Capacity. Capacity and block size are in units of the minimum memory access size (usually one byte). A cache can hold a maximum of $C$ bytes. However, due to physical constraints, the cache is divided into cache frames of size $B$ that contain $B$ contiguous bytes of memory, called a memory block. The associativity $A$ specifies the number of different frames in which a memory block can reside. If a block can reside in any frame (i.e., $A = C/B$), the cache is said to be fully associative; if $A = 1$, the cache is direct-mapped; otherwise, the cache is $A$-way set associative. A cache set is the group of frames in which a memory block can reside, and the number of cache sets, $S$, is given by $S = C/(AB)$.

We assume a two-level memory hierarchy, consisting of an $A$-way set associative cache with block size of $B$ bytes and total capacity of $C$ bytes followed by main memory. We also assume that main memory is large enough to hold all the data referenced by the program. The function $\mathcal{B}$ converts a memory byte address into a memory block address (with $\mathcal{B}(a) = \lfloor a/B \rfloor$). The function $\mathcal{S}$ converts a memory block address to the cache set to which it maps (thus, $\mathcal{S}(b) = b \bmod S$).

2.1.2 Cache dynamics

For an access to memory address $m$, the cache controller determines whether memory block $\mathcal{B}(m)$ is resident in any of the $A$ cache frames in cache set $\mathcal{S}(\mathcal{B}(m))$. If the memory block is resident, a cache hit is said to occur, and the cache satisfies the access after its access latency. If the memory block is not resident, a cache miss is said to occur.

The state of the cache represents the memory block(s) contained in each set of the cache at any point during a program's execution. Thus, in a direct-mapped cache where each set holds one frame, the cache state $\mathcal{C}$ maps set $s$ to the address of the memory block contained there. In general, $\mathcal{C}$ is a map from cache sets to the sets of memory blocks that they contain. $\mathcal{C}(s)$ is empty for a cache set $s$ to which no block has been mapped.

2.2 Classification of cache misses

From an architectural standpoint, cache misses fall into one of three classes: compulsory, capacity, and conflict [30]. Capacity and conflict misses are often combined and called replacement misses. This classification is extremely useful for understanding the role of capacity and associativity in the performance of a cache; however, it does not have the property of composability.

Consider two program fragments $P_1$ and $P_2$, where, for $i \in \{1, 2\}$, fragment $P_i$ incurs $C_i$ cold misses and $R_i$ replacement misses. Now consider the program fragment $P_{12} \stackrel{\mathrm{def}}{=} P_1; P_2$ formed by sequential composition of $P_1$ and $P_2$, and suppose that it incurs $C_{12}$ cold misses and $R_{12}$ replacement misses. There is no simple relation connecting the misses of the whole to the misses of the parts. In particular, $C_{12} + R_{12} \neq C_1 + R_1 + C_2 + R_2$. Composition is a fundamental operation in the construction of programs and in the definition of programming language semantics. As we wish to count cache misses for individual program fragments and their compositions, we propose a different classification that is composable.
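The lack of composability of the cold/replacement classification can be observed on a toy example. The following is a minimal sketch, not taken from the paper: it assumes a direct-mapped cache that starts empty for every fragment analyzed in isolation, treats a miss as "cold" when the block has never been referenced before within the fragment, and uses hypothetical names (`block_of`, `set_of`, `count_misses`) and hypothetical parameters ($B = 4$ bytes, $S = 2$ sets).

```python
def block_of(addr, block_size):
    """B(a) = floor(a / B): byte address -> memory block address."""
    return addr // block_size


def set_of(block, num_sets):
    """S(b) = b mod S: memory block address -> cache set."""
    return block % num_sets


def count_misses(trace, block_size, num_sets):
    """Run a byte-address trace through an initially empty direct-mapped
    cache and return (cold_misses, replacement_misses)."""
    state = {}    # cache state: set index -> resident memory block
    seen = set()  # blocks referenced so far (used to tell cold from replacement)
    cold = replacement = 0
    for addr in trace:
        b = block_of(addr, block_size)
        s = set_of(b, num_sets)
        if state.get(s) != b:          # block not resident: a miss
            if b in seen:
                replacement += 1       # block was evicted earlier
            else:
                cold += 1              # first reference to this block
        state[s] = b
        seen.add(b)
    return cold, replacement


# Hypothetical fragments, given directly as byte-address traces.
P1 = [0, 8, 0]
P2 = [0, 4]
c1, r1 = count_misses(P1, block_size=4, num_sets=2)
c2, r2 = count_misses(P2, block_size=4, num_sets=2)
c12, r12 = count_misses(P1 + P2, block_size=4, num_sets=2)
print((c1, r1), (c2, r2), (c12, r12))
print(c12 + r12, "vs", c1 + r1 + c2 + r2)
```

In this sketch the composite fragment incurs fewer misses than the sum over its parts, because the first access of `P2` finds its block already resident when it runs after `P1`; the per-fragment counts carry no information that would let one recover this.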
We classify misses from a program fragment into the following two classes.

• Interior misses are those data references that are guaranteed to miss, independent of the initial cache state when the fragment begins execution. In other words, given the code, the array layouts, and the structural parameters of the cache, such misses can be identified/enumerated/counted by analyzing the fragment in isolation.

• Potential boundary misses are those data references that may either hit or miss, depending on the initial cache state when the fragment begins execution. The potential occurrence of such misses can be identified by analyzing the fragment in isolation, but the actual occurrence of the miss can be determined only after considering the initial cache state.

Another equivalent view of this classification is that we can statically examine a program fragment in isolation and place each data memory access that it makes into one of three categories: those that are guaranteed to hit, those that are guaranteed to miss (interior misses), and those that could hit or miss depending on the initial cache state (potential boundary misses). In a second step, we further partition the potential boundary misses into hits and misses by resolving them against the cache state when the program fragment starts executing. We call these misses boundary misses. It follows that, in order to compose program fragments, we also need to determine the state of the cache after executing a program fragment. For a given program fragment $P$ and an initial cache state $\mathcal{C}$, we will let $\Psi(P, \mathcal{C})$ denote the final cache state after fragment $P$ has completed execution.

Theorem 2.1 Let program fragment $P_1$ executing from initial cache state $\mathcal{C}_0$ incur $I_1$ interior misses and $B_1(\mathcal{C}_0)$ boundary misses and produce final cache state $\mathcal{C}_1 = \Psi(P_1, \mathcal{C}_0)$. Let program fragment $P_2$ executing from initial cache state $\mathcal{C}_1$ incur $I_2$ interior misses and $B_2(\mathcal{C}_1)$ boundary misses and produce final cache state $\mathcal{C}_2 = \Psi(P_2, \mathcal{C}_1)$. Let program fragment $P_{12} \stackrel{\mathrm{def}}{=} P_1; P_2$ executing from initial cache state $\mathcal{C}_0$ incur $I_{12}$ interior misses and $B_{12}(\mathcal{C}_0)$ boundary misses and produce final cache state $\mathcal{C}_{12} = \Psi(P_{12}, \mathcal{C}_0)$. Then the following relations hold.

\[ I_{12} + B_{12}(\mathcal{C}_0) = I_1 + B_1(\mathcal{C}_0) + I_2 + B_2(\mathcal{C}_1) \]
\[ \mathcal{C}_{12} = \mathcal{C}_2 \]

PROOF. The proof follows immediately from the semantics of program composition and from the deterministic nature of the program fragments and of the cache.

Theorem 2.1 has several important consequences.

• The theorem enables the analysis of cache misses of a composite program fragment in terms of the cache miss behavior of its parts. Each part can be analyzed in isolation, and the results of these analyses can be combined using cache states. We will show later how to efficiently propagate cache state across a program fragment.

• Stronger assertions, like $I_{12} = I_1 + I_2$, do not hold in general.

• The theorem is silent about the nature of program fragments $P_1$ and $P_2$ or about how to calculate boundary and interior misses for them. In the remainder of the paper, we will choose loop nests as our atomic program fragments and use Presburger formulas to codify the various kinds of misses.

• The theorem provides additional leverage if symbolic analysis of the atomic program fragments is possible. For example, block-recursive codes [4] employ multiple dynamic instances of the same loop nest differing only in the starting addresses of the data arrays on which they operate. Symbolic analysis of such fragments would allow the cost of analysis to be amortized over multiple uses of the resulting formulas.

• Note that boundary misses for a fragment are bounded from above by the cache footprint of the data structures it accesses, which is in turn bounded from above by the number of cache frames. This number is typically much smaller than the number of interior misses. We could therefore avoid the calculation of cache state and approximate the number of cache misses of the composite program by $I_1 + I_2$, with an accompanying error bound.

2.3 The polyhedral model

Our model for analyzing cache behavior of loop nests is based on the well-known polyhedral model [20]. The program fragment whose cache behavior we are trying to analyze is a nested normalized loop with $d$ levels of nesting, numbered $0$ through $d-1$ from outermost to innermost. We first consider perfect loop nests; we will extend the model to imperfect loop nests in Section 3.4. The upper bound $U_j$ of $\iota_j$, the loop control variable (LCV) for loop $j$, is an affine function of the LCVs $\iota_0$ through $\iota_{j-1}$. The iteration space $\mathcal{I}$ is the set of all valid combinations of LCV values that are within the bounds of the loop nest. The notation $\ell = [\ell_0, \ldots, \ell_{d-1}]^T$ denotes a generic point in the iteration space. The iteration space possesses a total order $\prec$, which in the polyhedral model is the lexicographic ordering. The order specifies the temporal order in which the iteration points in the iteration space are executed.

The loop accesses elements of arrays $Y^{(0)}$ through $Y^{(m-1)}$. Array variable $Y^{(i)}$ has $d_i$ dimensions, with $n_j$ being the extent of the array in the $(j+1)$th dimension. The data index space $\mathcal{D}_i$ corresponding to array $Y^{(i)}$ is the Cartesian product $[0, n_0 - 1] \times \cdots \times [0, n_{d_i - 1} - 1]$.

The statements in the loop body make $k$ references to array variables. The $i$th reference $R_i$ has three components: $N_i$, the name of the array referenced (so that $N_i = Y^{(j)}$ for some $j \in [0, m-1]$); $F_i$, the index expression of the reference, which identifies the coordinates of the array element accessed by this reference at iteration point $\ell$; and $S_h$, the statement that contains reference $R_i$. To include statement $S_h$ in the definition of reference $R_i$ may seem excessive at this point, but it will be useful in Section 3.4 when we consider imperfect loop nests. The index expression $F_i$ is constrained to be an affine function of $\ell$ in each of its components. Thus, $F_i$ is a function from the iteration space $\mathcal{I}$ to the data index space $\mathcal{D}_{N_i}$.

Borrowing terminology from Ghosh et al. [26], we call a static instance of a memory read or write a reference, and a dynamic instance of that read or write an access. A reference and an iteration point uniquely define an access. The total order $\prec$ on iterations almost induces a similar total order on accesses; however, two accesses in the same iteration need to be ordered as well. We compose the total order $\prec$ on the iteration space and the order among references of an iteration to define a total order "precedes" (written $\lhd$) among accesses. Thus, access $(R_i, u)$ precedes access $(R_j, v)$ iff $(u \prec v) \vee (u = v \wedge i < j)$.

Several quantities are associated with array $Y^{(i)}$: a layout function $\mathcal{L}_i$, which is a 1-1 map from $\mathcal{D}_i$ into the memory address space $\mathbb{Z}^+_0$; $\mu_i$, the starting byte address of the array; and $\beta_i$, the number of bytes per array element. Applying $\mathcal{L}_i$ to an element of the array produces an offset, and multiplying the offset by $\beta_i$ gives the byte offset from the starting address of the array in memory. Adding this offset to $\mu_i$ then gives the byte address of the element.

Putting all of this notation together, we have the objects of interest and their mathematical representations shown in Table 1.

    Object                                        Mathematical representation
    ------------------------------------------    ------------------------------------------------------
    An iteration point                            $\ell$
    $i$th array reference                         $R_i = (Y^{(j)}, F_i, S_h)$
    Access made by $R_i$ at $\ell$                $(R_i, \ell)$
    Array element accessed by $R_i$ at $\ell$     $e_i = Y^{(j)}[F_i(\ell)]$
    Byte address of $e_i$                         $m_i = \mu_j + \mathcal{L}_j(F_i(\ell)) \cdot \beta_j$
    Block address of $m_i$                        $b_i = \mathcal{B}(m_i)$
    Cache set to which $b_i$ maps                 $s_i = \mathcal{S}(b_i)$

Table 1: Table of notation.
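The chain of representations in Table 1 (index expression, byte address, block address, cache set) can be traced with a few lines of code. This is a minimal sketch under stated assumptions rather than the authors' implementation: the layout function is taken to be row-major, the index expression is given as an integer matrix with no constant term, and the function names and all numeric parameters below are hypothetical.

```python
def apply_index_expression(F, point):
    """F_i(l): apply a linear index expression, given as a list of rows,
    to an iteration point (constant terms are omitted in this sketch)."""
    return [sum(c * x for c, x in zip(row, point)) for row in F]


def row_major_offset(coords, extents):
    """Row-major layout function: element coordinates -> element offset."""
    offset = 0
    for c, n in zip(coords, extents):
        offset = offset * n + c
    return offset


def locate(F, point, extents, mu, beta, block_size, num_sets):
    """Follow Table 1: coordinates, byte address m, block b, cache set s."""
    coords = apply_index_expression(F, point)
    m = mu + row_major_offset(coords, extents) * beta  # m = mu + L(F(l)) * beta
    b = m // block_size                                # b = B(m)
    s = b % num_sets                                   # s = S(b)
    return coords, m, b, s


# A two-dimensional reference of the form A[i, k] inside a loop nest (i, j, k):
F_A = [[1, 0, 0],
       [0, 0, 1]]
coords, m, b, s = locate(F_A, point=(2, 5, 3), extents=(8, 8),
                         mu=1000, beta=8, block_size=32, num_sets=16)
print(coords, m, b, s)  # [2, 3] 1152 36 4
```

Swapping `row_major_offset` for another 1-1 offset function is the only change needed to model a different layout in this sketch.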
Example 1. Consider the following loop nest for matrix multiplication, which we present in a stylized pseudo-code in an attempt to remain language-neutral.

    do i = 0, n-1
      do j = 0, n-1
        do k = 0, n-1
    S0:   C[i,j] = A[i,k]*B[k,j]+C[i,j]
        end
      end
    end

This loop nest has depth $d = 3$. The LCVs are $\iota_0 = i$, $\iota_1 = j$, and $\iota_2 = k$. The loop nest accesses three arrays: $Y^{(0)} = A$, $Y^{(1)} = B$, and $Y^{(2)} = C$. Each array is two-dimensional, so that $\mathcal{D}_0 = \mathcal{D}_1 = \mathcal{D}_2 = [0, n-1] \times [0, n-1]$. There are four array references: $R_0 = A[i,k]$, $R_1 = B[k,j]$, $R_2 = C[i,j]$ (the read access), and $R_3 = C[i,j]$ (the write access). The index expressions of the four references are

\[ F_0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad F_1 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, \quad F_2 = F_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}. \]

All references are contained in statement $S_0$.

2.4 Presburger arithmetic

Presburger arithmetic [31] is the subset of first order logic corresponding to the theory of integers with addition. Presburger formulas consist of affine constraints on integer variables, which can be either constraints of equality or inequality. The constraints are linked by the logical operators $\neg$, $\wedge$ and $\vee$, and the quantifiers $\forall$ and $\exists$. It has been used to model various aspects of programming languages, as well as in other areas such as timing verification [6, 7]. We use Presburger formulas to define polytopes whose contents describe interesting events like cache misses.

Presburger arithmetic is decidable; however, a quantifier elimination decision procedure has a superexponential upper bound on performance. More precisely, the truth of a sentence of length $n$ can be determined within $2^{2^{2^{pn}}}$ time, for some constant $p > 1$ [46]. The bound is tight [60]. Bounded quantifier elimination has worst-case upper and lower bounds of $\Theta(2^{2^{n}})$ [60]. The complexity is related to the number of alternating blocks of $\forall$ and $\exists$ quantifiers [52] as well as to the numerical values of the integer constants and their co-primality relationships.

We use the Omega library [34] to manipulate and simplify our Presburger formulas, and have found its methods reasonably efficient for our applications.

3. THE CACHE ANALYSIS MODEL

The problem of central interest to us is the following.

    Given a cache configuration as in Section 2.1, a loop nest $L$ meeting the conditions of Section 2.3, the layout functions of the arrays accessed in $L$, and an initial cache state $\mathcal{C}_{in}$:
    • count the interior misses incurred by $L$;
    • count the boundary misses incurred by $L$;
    • find the cache state $\mathcal{C}_{out}$ after execution of $L$.

A simple strategy to accomplish all of these goals is through simulation of the code. This is precisely what cache simulators [29, 40, 54, 56] do. The main drawback of simulation is its slowness: it takes time proportional to the running time of the code, usually with a significant multiplicative factor (10 to 100 is typical). In the matrix multiplication kernel of Example 1, this time is $\Theta(n^3)$. Our goal is to develop much faster algorithms, whose existence is suggested by the regularity of the array access patterns and the limited number of cache sets to which they map.

Section 3.1 provides the basic Presburger formulas necessary to describe the cache events in Section 3.2. Section 3.3 discusses how we count cache misses, given such Presburger formulas. Section 3.4 extends our model to analyze imperfect loop nests. Section 3.5 shows how to extend our formula for interior misses to handle modest levels of associativity. Section 3.6 reviews array layouts based on bit interleaving, and provides the Presburger formulas to describe them. Section 3.7 discusses issues related to physically indexed caches.

3.1 Describing cache structure using Presburger formulas

We now present the basic formulas that will be combined in Section 3.2 to describe cache events. The translations are mostly straightforward or well-known [18, 49].

3.1.1 Valid iteration point

The predicate $\ell \in \mathcal{I}$ describes the fact that iteration point $\ell = [\ell_0, \ldots, \ell_{d-1}]^T$ belongs to the iteration space.

\[ \ell \in \mathcal{I} \stackrel{\mathrm{def}}{=} \bigwedge_{i=0}^{d-1} 0 \le \ell_i < U_i \tag{1} \]

3.1.2 Lexicographical ordering of accesses

When considering all accesses that occur before access $(R_v, m)$, we include any access occurring at an iteration $\ell$, such that $\ell \prec m$. To be complete, we must also include any access made at iteration $m$ by a reference that occurs before $R_v$. The predicate $(R_u, \ell) \lhd (R_v, m)$ describes the fact that the memory access made by reference $R_u$ at iteration $\ell$ precedes the memory access made by $R_v$ at $m$.

\[ (R_u, \ell) \lhd (R_v, m) \stackrel{\mathrm{def}}{=} \ell \in \mathcal{I} \wedge m \in \mathcal{I} \wedge \Bigl( \bigvee_{i=0}^{d-1} \bigl( \ell_i < m_i \wedge \bigwedge_{j=0}^{i-1} \ell_j = m_j \bigr) \vee \bigl( \bigwedge_{j=0}^{d-1} \ell_j = m_j \wedge u < v \bigr) \Bigr) \tag{2} \]
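3.1.3 Mapping memory locations to cache sets

Let $A$ = associativity, $B$ = block size, $C$ = capacity, and $S = C/(AB)$ = number of cache sets. Then memory location $m$ maps to cache set $s = \lfloor m/B \rfloor \bmod S$. This can be translated to the following Presburger formula, where the auxiliary variable $w$ represents the "cache wraparound". Suppose that $Y^{(x)}$ is the array referencing memory location $m$, and let $\alpha_x$ be the number of elements in $Y^{(x)}$.

\[ \mathrm{Map}(m, w, s) \stackrel{\mathrm{def}}{=} 0 \le s < S \wedge B(wS + s) \le m < B(wS + s) + B \wedge \mu_x - B < B(wS + s) < \mu_x + \beta_x \alpha_x \tag{3} \]

The last clause in formula (3) bounds the possible values of $w$, and is used to bound certain directions of the underlying polytope that would otherwise be unconstrained. This bounding is needed for efficiency in the counting step that follows formula simplification. The quantity $B(wS + s)$ represents the address of the first byte in the block containing memory location $m$, which must be within the memory locations containing array $Y^{(x)}$. However, if the starting address $\mu_x$ is not aligned on a memory block boundary, asserting that $\mu_x \le B(wS + s)$ is wrong. As shown below, the address of the first byte in the memory block containing $Y^{(x)}$'s first element may actually be less than $\mu_x$. Restricting $w$ such that $\mu_x - B < B(wS + s)$ is correct whether or not the starting address $\mu_x$ is aligned on a memory block boundary.

[Figure: memory address line marking $\mu_x - B$, $\mu_x$, $B(wS+s)$, $m$, and $B(wS+s)+B$.]

3.1.4 Data layouts in memory

Row- and column-major layouts are easily expressed using Presburger formulas. Consider reference $R_u = (Y^{(x)}, F_u, S_h)$ and iteration point $\ell$. Let $F_u(\ell) = [i_0, \ldots, i_{d_x - 1}]^T$.

\[ \bigl(m = \text{Row-maj}(F_u(\ell), \mu_x)\bigr) \stackrel{\mathrm{def}}{=} m \ge 0 \wedge m = \mu_x + \Bigl( \sum_{j=0}^{d_x - 2} \Bigl( \prod_{k=j+1}^{d_x - 1} n_k \Bigr) i_j + i_{d_x - 1} \Bigr) \beta_x \tag{4} \]

\[ \bigl(m = \text{Col-maj}(F_u(\ell), \mu_x)\bigr) \stackrel{\mathrm{def}}{=} m \ge 0 \wedge m = \mu_x + \Bigl( i_0 + \sum_{j=1}^{d_x - 1} \Bigl( \prod_{k=0}^{j-1} n_k \Bigr) i_j \Bigr) \beta_x \tag{5} \]

Section 3.6 discusses nonlinear data layouts.

3.2 Describing cache behavior using Presburger formulas

The various pieces described in Section 3.1 fit together to describe events in the cache. We now construct Presburger formulas for interior misses, boundary misses, and cache state, as defined in Sections 2.1 and 2.2. We consider direct-mapped caches for now, and extend the formulation to set-associative caches in Section 3.5.

3.2.1 Interior misses

To identify a cache miss, Ghosh et al. [26] rely on the notion of a most recent access of a memory block, which they obtain through reuse vectors. This abstraction is valid when the array index expressions are uniformly generated in addition to being affine in the LCVs. We avoid this condition by dispensing with the notion of a most recent access in our formulas.

To determine if an access to a memory block $b$ results in an interior miss, it is enough to know two things: that there is an earlier access to a different memory block mapping to the same cache set as $b$; and that there is no access to $b$ between this earlier access and the current access to $b$. Let reference $R_u = (Y^{(x)}, F_u, S_p)$ at iteration point $i$ access memory block $b_u$, and let reference $R_v = (Y^{(y)}, F_v, S_q)$ at iteration point $j$ access memory block $b_v$. Suppose that access $(R_v, j)$ precedes access $(R_u, i)$, recalling the "precedes" relation $\lhd$ from Section 3.1.2; that $b_u$ and $b_v$ are distinct memory blocks; but that both $b_u$ and $b_v$ map to the same cache set $s$. Then, access $(R_u, i)$ suffers an interior miss if there does not exist a reference $R_w = (Y^{(z)}, F_w, S_r)$ at iteration $k$ accessing memory block $b_w$, such that $(R_v, j) \lhd (R_w, k) \lhd (R_u, i)$ and $b_u = b_w$. The following formula expresses this condition.

\[
\begin{aligned}
(R_u, i) \in \mathrm{IntMiss}(L) \stackrel{\mathrm{def}}{=}\; & i \in \mathcal{I} \wedge \exists d, s : \mathrm{Map}(\mathcal{L}_x(F_u(i)), d, s) \wedge {} \\
& \exists e, j, v : (R_v, j) \lhd (R_u, i) \wedge \mathrm{Map}(\mathcal{L}_y(F_v(j)), e, s) \wedge {} \\
& \neg\bigl(\exists k, w : (R_v, j) \lhd (R_w, k) \lhd (R_u, i) \wedge \mathrm{Map}(\mathcal{L}_z(F_w(k)), d, s)\bigr) \wedge d \neq e
\end{aligned}
\tag{6}
\]

Note that it is not necessary to have $Y^{(z)} = Y^{(x)}$ in order to have $(R_u, i)$ and $(R_w, k)$ access the same memory block. This flexibility accommodates the possibility of array aliasing.

3.2.2 Boundary misses

Recall that boundary misses are those that are dependent on the initial cache state. Therefore, we are interested only in those accesses that are the first to map to a cache set during the execution of the loop nest. For all other accesses, the cache set already contains a memory block accessed during the execution of the loop nest, and initial cache state is irrelevant. To determine an actual boundary miss for an access that is the first to map to the cache set, it simply remains to check if the memory block accessed is resident in the initial cache state of the set.

An access $(R_u = (Y^{(x)}, F_u, S_p), i)$ to memory block $b_u$ suffers a boundary miss if there does not exist an access $(R_v, j)$ preceding $(R_u, i)$ and accessing a memory block $b_v$ mapping to the same cache set, and $b_u$ is not in the initial cache state $\mathcal{C}_{in}$ at set $s$. Note that, unlike in the formula for interior misses, there is no constraint $b_u \neq b_v$.

\[
\begin{aligned}
(R_u, i) \in \mathrm{BoundMiss}(L, \mathcal{C}_{in}) \stackrel{\mathrm{def}}{=}\; & i \in \mathcal{I} \wedge \exists d, s : \mathrm{Map}(\mathcal{L}_x(F_u(i)), d, s) \wedge {} \\
& \neg\bigl(\exists e, j, v : (R_v, j) \lhd (R_u, i) \wedge \mathrm{Map}(\mathcal{L}_y(F_v(j)), e, s)\bigr) \wedge {} \\
& \mathcal{B}(\mathcal{L}_x(F_u(i))) \notin \mathcal{C}_{in}(s)
\end{aligned}
\tag{7}
\]

3.2.3 Cache state

If the loop nest $L$ contains no memory access mapping to set $s$, the final cache state of set $s$, $\mathcal{C}_{out}(s)$, is the same as the initial cache state $\mathcal{C}_{in}(s)$. Otherwise, the final cache state of set $s$ is the address of the memory block that is not subsequently replaced by an access to a block of memory mapping to the same cache set $s$.

\[
\begin{aligned}
(\mathcal{C}_{out} = \Psi(L, \mathcal{C}_{in})) \stackrel{\mathrm{def}}{=}\; \forall s \in [0, S-1] :\; & \bigl(\exists i : i \in \mathcal{I} \wedge (\exists d : \mathrm{Map}(\mathcal{L}_x(F_u(i)), d, s) \wedge {} \\
& \quad \neg(\exists e, j, v : (R_u, i) \lhd (R_v, j) \wedge \mathrm{Map}(\mathcal{L}_y(F_v(j)), e, s)) \wedge {} \\
& \quad \mathcal{C}_{out}(s) = \mathcal{B}(\mathcal{L}_x(F_u(i))))\bigr) \vee {} \\
& \bigl(\neg(\exists e : \mathrm{Map}(\mathcal{L}_x(F_u(i)), e, s)) \wedge \mathcal{C}_{out}(s) = \mathcal{C}_{in}(s)\bigr)
\end{aligned}
\tag{8}
\]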
3.3 Counting cache misses

We use the Omega Calculator [33, 34] to simplify the formulas above by manipulating integer tuple relations and sets. After simplification, we are left with formulas defining a union of polytopes (see Figure 4 for an example). The number of integer points in this union is the number of misses. We use PolyLib [42] to operate on such unions. We first convert the union into a disjoint union of polytopes, and then use Ehrhart polynomials to count the number of integer points [18] in each polytope.

3.4 Extension to imperfect loop nests

Extending our model to imperfect loop nests involves two steps.

1. We use the transformations of Ahmed et al. [2, 3] to convert an imperfect loop nest into a perfect loop nest with guards on statements.

2. We extend the notion of a valid iteration point to that of a valid access.

For each statement of the loop nest, Ahmed et al. define a statement iteration space whose dimension is the number of loops that contain the statement. The product space for the loop nest is a linearly independent subspace of the Cartesian product of all the statement iteration spaces. Affine embedding functions map a point in a statement iteration space to a point in the product space. When multiple statements map to the same iteration point in product space, they are executed in program order. In relation to the product space, embeddings represent guards on statements, mapping a statement from its place outside the innermost loop to a valid place inside the innermost loop. We emphasize that the guards are conceptual, and for analysis only. They do not result in run-time conditional tests in the generated code.

Kelly and Pugh [35, 36] and Lim and Lam [41] have presented other algorithms that embed imperfect loop nests into perfect loop nests, with similar end results. The details of the embedding algorithms are not important for our purpose. Our use of the framework of Ahmed et al. merely reflects our greater familiarity with their work.

Figure 1(a) is an improved version of Example 1, in which the loop-invariant reference C[i,j] is hoisted out of the k-loop and stored in a scalar x that can be register-resident. In this imperfect loop nest, statements S0 and S2 occur outside of the innermost loop. Let $i_X$ denote the loop index variable $i$ pertaining to statement SX. Then $i_0 \times j_0$ and $i_2 \times j_2$ are the statement iteration spaces of statements S0 and S2, respectively. The following embedding functions

\[ F_0\left( \begin{bmatrix} i_0 \\ j_0 \end{bmatrix} \right) = \begin{bmatrix} i_0 \\ j_0 \\ 0 \end{bmatrix}, \qquad F_2\left( \begin{bmatrix} i_2 \\ j_2 \end{bmatrix} \right) = \begin{bmatrix} i_2 \\ j_2 \\ n-1 \end{bmatrix} \]

map points in these statement iteration spaces to points in product space $[i, j, k]^T$. It is clear how the guards on statements S0 and S2 of Figure 1(b) accomplish this. Statement S1 is already in the innermost loop, and requires no guard on it.

    (a)
    do i = 0, n-1
      do j = 0, n-1
    S0: x = C[i,j]
        do k = 0, n-1
    S1:   x = A[i,k]*B[k,j] + x
        end
    S2: C[i,j] = x
      end
    end

    (b)
    do i = 0, n-1
      do j = 0, n-1
        do k = 0, n-1
    S0:   if (k == 0) x = C[i,j]
    S1:   x = A[i,k]*B[k,j] + x
    S2:   if (k == n-1) C[i,j] = x
        end
      end
    end

Figure 1: (a) An imperfect loop nest for matrix multiplication. (b) The product space version with guards.

The second part of the extension is to ensure that our model can handle array references that are guarded in this manner. We accomplish this effect by extending our notion of a valid iteration point (Section 3.1) to that of a valid access.

Let $R_u = (Y^{(x)}, F_u, S_h)$ be the $u$th reference with $0 \le u < k$. Let $G_h(i)$ be the guard of statement $S_h$ in the product space version of the loop nest. We assume that the guards are expressible in Presburger arithmetic. For Figure 1(b), $G_0 = (i_2 = 0)$, $G_1 = \mathrm{true}$, and $G_2 = (i_2 = n-1)$. Then $(R_u = (Y^{(x)}, F_u, S_h), i)$ is a valid access if $i$ belongs to the iteration space, and $G_h(i)$ holds. The predicate $(R_u, i) \in \bar{\mathcal{I}}$ represents this fact,

\[ (R_u, i) \in \bar{\mathcal{I}} \stackrel{\mathrm{def}}{=} i \in \mathcal{I} \wedge 0 \le u < k \wedge G_h(i) \tag{9} \]

With this extension, the formulas from Section 3.2 apply directly, with every occurrence of $i \in \mathcal{I}$ replaced by $(R_u, i) \in \bar{\mathcal{I}}$.

3.5 Associativity

We currently handle associativity in a straightforward manner, assuming a Least Recently Used replacement policy. From Section 3.2.1, we simply need to allow at least $A$ distinct accesses preceding $(R_u, i)$ to unique memory blocks, such that there is no access $(R_w, k)$ accessing the same memory block as $(R_u, i)$ and $(R_{v_0}, j_0) \lhd (R_w, k) \lhd (R_u, i)$ (where $(R_{v_0}, j_0)$ is the earliest of at least $A$ references to unique memory blocks). The following Presburger formula expresses interior misses for an $A$-way set-associative cache.

\[
\begin{aligned}
(R_u, i) \in \mathrm{IntMiss}(L) \stackrel{\mathrm{def}}{=}\; & i \in \mathcal{I} \wedge \exists d, s : \mathrm{Map}(\mathcal{L}_x(F_u(i)), d, s) \wedge {} \\
& \exists e_0, j_0, v_0 : (R_{v_0}, j_0) \lhd (R_u, i) \wedge \mathrm{Map}(\mathcal{L}_{y_0}(F_{v_0}(j_0)), e_0, s) \wedge {} \\
& \Bigl(\exists e_1, \ldots, e_{A-1} : \Bigl( \bigwedge_{a=1}^{A-1} \exists j_a, v_a : (R_{v_0}, j_0) \lhd (R_{v_a}, j_a) \lhd (R_u, i) \wedge \mathrm{Map}(\mathcal{L}_{y_a}(F_{v_a}(j_a)), e_a, s) \Bigr) \wedge {} \\
& \quad d \neq e_0 \neq \cdots \neq e_{A-1} \Bigr) \wedge {} \\
& \neg\bigl(\exists k, w : (R_{v_0}, j_0) \lhd (R_w, k) \lhd (R_u, i) \wedge \mathrm{Map}(\mathcal{L}_z(F_w(k)), d, s)\bigr)
\end{aligned}
\tag{10}
\]

This method will handle modest values of $A$, and the complexity of the formulas certainly increases with $A$. Presburger formulas for cache state and boundary misses with associativity $A$ are non-obvious, and will require more work to construct.

3.6 Array layouts based on bit interleaving

Previous work [14, 15, 21] suggests that non-linear data layouts provide better cache performance than canonical layout functions in some numerical codes. Such layout functions are described in terms of interleavings of the bits in the binary expansions of the array coordinates rather than as affine functions of the numerical values of these quantities. We describe such bit interleavings and provide formulations of these layouts in Presburger arithmetic.

In developing the model of alternative array layouts, we assume that $n_j = 2^{q_j}$ for some $q_j$ (where $d_x$ is the number of coordinates in an array $Y^{(x)}$ and $j \in [0, d_x - 1]$). Therefore, the bit representation of an array index will have $q_j$ bits, with the least significant bit (LSB) numbered $0$ and the most significant bit (MSB) numbered $q_j - 1$. We identify the binary sequence $s_{q_j - 1} \ldots s_0$ with the non-negative integer $s = \sum_{i=0}^{q_j - 1} s_i 2^i$. We denote by $B_{q_j}$ the set of all binary sequences of length $q_j$, and extend the above identification to identify $B_{q_j}$ with the interval $[0, 2^{q_j} - 1]$.

The following formula maps [...] to memory location $m$. Let $F_u(\ell) = [i_0, \ldots, i_{d_x - 1}]^T$, and let [...]
m p= j=0 qj M =[mp(cid:0)1;(cid:1)(cid:1)(cid:1) ;m0] We descBribqje a family of nonl[i0n;ea2r la(cid:0)yo1u]t functions parameter- P Interleave ized by a single parameter , as follows. An - def (m= (Fu(‘);(cid:22)x;Z((cid:27)))) = interleaving, ,isasequence(cid:27)oflength (where(q0;:::;qdx(cid:0)1)) dx(cid:0)1 overthealpha(cid:27)bet conptaining p’s=. Itdie=s0cribqeis 9moff;mp(cid:0)1;:::;m0 : theorderinwhichfb0it;s:f:r:o;m(dtxhe(cid:0)1)agrraycoordinaqteisiarePinterleaved 06mp(cid:0)1;:::;m0 61^m>0^ tolinearizethearrayinmemoryd.x m=(cid:22)x+moff(cid:12)x^Fu(‘)=Z((cid:27))M^ Anarraylayoutfunctionsasamapfrom arraycoordinatestoa p(cid:0)1 (11) memoryaddress.Therefore,givenan dx -interleaving k moff = mk2 ,defineamap (q0;:::;qdx(cid:0)1) Xk=0 (cid:27) Data layouts such as X-Morton and U-Morton [15] require an X-ORoperationinadditiontobitinterleaving. (Notethatthisfor- (cid:2):Bq0 (cid:2)(cid:1)(cid:1)(cid:1)(cid:2)Bqdx(cid:0)1 !Bp malism applies only to arrays.) The additional X-OR op- in the following way. If (i) (i) (i) (i) eration can also be exprnes(cid:2)sednas a Presburger formula on the bit ,then x = xqi(cid:0)is1t:h:e:xse1quxe0nce2obBtaqinie8dib2y (0) (dx(cid:0)1) representation. r[0ep;dlaxci(cid:0)ng1]the th(cid:2)(xfrom;:th::e;rxightwit)h . Weextendthisnota- (u) tiontoconsidejr uasamapfrom xj 3.7 Physicallyaddressed caches to p (cid:2)byidentifyingno[0n;-n2eq0ga(cid:0)tiv1e](cid:2)int(cid:1)e(cid:1)g(cid:1)e(cid:2)rs[a0n;d2qtdhxe(cid:0)ir1b(cid:0)i- Thetechniques described thus faroperate on virtual addresses. n1a]rye[0x;p2ans(cid:0)ion1s].Wecall themixingfunctionindexedby .Note However,manysystemsutilizephysicalindexedcaches(e.g.,sec- that for(cid:2)any . (cid:27) ondlevelcaches)whoseperformanceishighlydependentonpage (cid:2)(0;:::;0)=0 (cid:27) placement. 
Fortunately,mostoperatingsystemsemploypagecol- Example2 Let , ( ), ( ), oringtechniques thatminimizethiseffect[37]bycreatingvirtual and dx.=Th2enn0 = 16 q0 = 4 n1 = 16 q1 = 4 tophysicalpagemappingssuchthatthevirtualandphysicalcache (cid:27)=10110010 indexareidentical. Itmayalsobepossibletoextendouranalysis to include the effects of page placement; we leave this as future (cid:2)(12;5)=(cid:2)(1100;0101)=01101010=106: research. Example3 Let , ( ), ( ), ( ),daxnd=le3t n0 = 8 q0 =. T3henn1 = 8 q1 = 3 4. RESULTS n2 =4 q2 =2 (cid:27)=21102001 In this section, we present and interpret cache behavior as ob- (cid:2)(3;7;1)=(cid:2)(011;111;001)=01101111=111: tained by our method on five model problems, and validate them TheideabehindtranslatingsuchadatalayoutintoaPresburger againstcachemisscountsproducedbya(specially-written)cache formula is to define the bit values of the binary expansion of the simulator. Unless otherwise specified, we use a direct-mapped memory addressusingPresburgerarithmetic. Consideragainref- cache with capacity 4096 bytes and block size of 32 bytes that erence and iteration point . Forevery where Ru = (Y(x,);leFtu;Sh) . Then isan ‘ nj- isinitiallyempty. Weassume thatalldataarrayscontaindouble- interlea0vi6ng.j T<hednxwe cnajn=com2qpjute the f(cid:27)ollowin(gq0;:::;qdmx(cid:0)at1ri)x precisionnumbers(sothat iseightbytes),andthatallarraysare th dx (cid:2)p linearizedincolumn-major(cid:12)order. Thetotalnumberofmissesfor . Letting ,the columnof consistsof in Z((cid:27))th g = (cid:27)f f th Z((cid:27)) 2e eacharraymatchupexactlybetweenourmodelandthesimulator the position,where isthe fromtheright,andzerosin inallcases,buttheirpartitioningdiffers. Weexplaintheimplica- evergy other position. (cid:27)f can bee thgought of asa transformation tionsofthisdifferenceinSection4.1. thatwhenappliedtothZe(b(cid:27)in)aryexpansionofamemoryaddress , producesthecoordinatesofthearrayelementat . 
m Problem1(Matrixmultiplication) We count boundary and in- m terior misses for each array for the matrix multiplication kernel Example4 Giventhat , ( ), ( ), showninExample1,underfourscenarios. ( ),and dx =3 n0 =,8 q0 =3 n1 =8 q1 =3 n2 =4 q2 =2 (cid:27)=12102010 1. Problemsize ,theleadingdimensionofeacharrayis ,andthethrnee=arr2a1ysareadjacenttoeachotherinmemory 0 0 0 4 0 2 0 1 naddressspace(i.e., , ,and ). Z((cid:27))=2 4 0 2 0 0 0 1 0 3 Weshowresultsfor(cid:22)aAlls=ix0po(cid:22)ssBib=lep(cid:12)enrm2utatio(cid:22)nCso=fth2e(cid:12)lno2op 0 2 0 0 1 0 0 0 4 5 orders,frombothourapproachandfromexplicitcachesim- Themodelcorrectlyclassifiesallthemissesinthefirstloopnestas ulation. This is representative of a code where both the it- boundary misses. Thecache contains allof arrayCatthe endof eration space and the data arrays are tiled. Placing the ar- thefirstloopnest,soallofthemissesofCinthesecondloopnest rays back-to-back causes two memory blocks to be shared areinteriormisses. Figure3 graphically represents cache stateat between arrays. Figure2(a)tabulatestheresults. Thejki theendofthecomputation. looporderisseentobesubstantiallysuperiorintermsoftotal misses. Problem3(Imperfectloopnest) Wecountboundaryandinterior missesforeacharrayfortheimperfectloopversionofthematrix 2. Problem size , the leading dimension ofeach array multiplication kernel of Figure 1 with , with the leading is ,andthethnre=ea2rr0aysareadjacenttoeachotherinmem- dimensionofeacharraybeing .Thisdenm=ons2tr1ateshowthemodel orynaddressspace. Weshowresultsforallsixpossibleper- handlesimperfectloopnests.Wneshowtwoscenarios. mutations of the loop orders, from both our approach and Thefirstscenario hasthethreearraysadjacent toeach otherin from explicit cache simulation. This scenario is similar to memoryaddressspace.Themisscountsareasfollows. theprevious one, butthere isnosharing ofmemory blocks betweenarrays. Figure2(b)tabulatestheresults. 
Thenum- A B C(read) C(write) berofmissesissomewhatsmaller, and thejkilooporder Bnd 28 92 8 0 winsagain. Int 521 866 383 0 Total 549 958 391 0 3. Problemsize ,theleadingdimensionofeacharrayis Cold 110 110 111 0 ,andthethrnee=ar2ra1yscollideincachespace(i.e., , Repl 439 848 280 0 n , and ). Thisrepresents a(cid:22)siAtua=tio0n w(cid:22)Bher=et4he09a6rraysd(cid:22)oCno=tus8e1t9h2ecacheeffectively(occupying ThesignificantobservationisthatnoneofthewritereferencestoC only111ofthe128cachesets). Weshowresultsforallsix miss, eventhough therearemanyreferences toAandBbetween possiblepermutationsofthelooporders,frombothourap- thereadandthewritereferencetoC[i,j]. Thetotalnumberof proachandfromexplicitcachesimulation. Figure2(c)tab- missesisidenticaltothatofProblem1,scenario1. ulatestheresults. Thenumberofmissesrisesdramatically, Thesecond scenario hasthearrayscolliding inthecache. The asexpected; thejkilooporder produces thefewestcache misscountsareasfollows. misses,butnotbyaslargeamargin. A B C(read) C(write) 4. Problem size , the leading dimension ofeach array Bnd 20 90 1 0 is (for n = 20 ), and the three arrays are adjacent Int 980 648 440 441 tokenach othker2infm1;e2m;3ogry address space. This represents a Total 1000 738 441 441 situationwheretheiterationspaceistiledbutthedataisnot Cold 111 111 111 0 reorganized, resultinginthedatatilesnotbeingcontiguous Repl 889 627 330 441 inmemory space. Weshow only theijkloop order. Fig- ure 2(d) tabulates the results. The total number of misses NoweveryreadandwritereferencetoC[i,j]misses. However, foreacharraychangewiththeleadingdimension,although the total number of cache misses issignificantly smaller than the differentarraysbehavedifferently. corresponding caseinProblem1, scenario3, showing thebenefit ofallocatingC[i,j]inaregister. 
Problem2(Multipleloopnests) Wecountboundaryandinterior Problem4(Set-associativecache) We count interior misses for missesforeacharrayforthefollowingvariationonthematrixmul- eacharrayforthematrixmultiplicationkernelshowninExample1, tiplicationkernel. usingtwo-wayassociativecaches.Thelayoutconstraintsareiden- ticaltothoseinProblem1,scenario2. Thisdemonstrateshowthe do i = 0, n-1 /* Loop nest 1 */ modelhandlesassociativity. do j = 0, n-1 C[i,j] = 0 Bothscenariosconsideratwo-wayassociativecachewithblock end size of 32 bytes that is initally empty. The cache has a capacity end of 4096 bytes in the first scenario and 8192 bytes in the second do i = 0, n-1 /* Loop nest 2 */ scenario.Themisscountsareasfollows. do j = 0, n-1 do k = 0, n-1 C[i,j] = A[i,k]*B[k,j] + C[i,j] AC =4B096 C AC =8B192 C end end Bnd 128 256 end Int 75 773 213 8 0 36 Total 1189 300 Thelayout constraintsareidenticaltothose inProblem1, sce- Cold 100 100 100 100 100 100 nario 1. Thisdemonstrates how the model handles multiple loop Repl 0 757 132 0 0 0 nests. Themisscountsareasfollows. Thetotalnumberofboundary missesineachscenarioisdeter- minedbythenumberofcache framesinthefootprintofallthree arraysincache. Foreverycache framethatistouched duringthe A B C Loop Bnd Int Tot Bnd Int Tot Bnd Int Tot matrixmultiplicationkernel, thefirstinstanceofamemory block 1 0 0 0 0 0 0 111 0 111 beingmappedtothecacheframeincursaboundarymisssincethe 2 28 521 549 92 866 958 0 383 383 cache is initially empty. 
In the first scenario, there are 64 cache Loop A B C Grand order Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Total ijk 28 521 549 110 439 92 866 958 110 848 8 383 391 111 280 1898 ikj 18 445 463 110 353 85 1985 2070 110 1960 25 1563 1588 111 1477 4121 (a) jik 108 590 698 110 588 18 502 520 110 410 2 109 111 111 0 1329 jki 104 355 459 110 349 18 167 185 110 75 6 207 213 111 102 857 kij 2 184 186 110 76 34 1644 1678 110 1568 92 1624 1716 111 1605 3580 kji 9 297 306 110 196 31 436 467 110 357 88 530 618 111 507 1391 Loop A B C Grand order Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Total ijk 25 405 430 100 330 85 661 746 100 646 18 298 316 100 216 1492 ikj 23 349 372 100 272 73 1533 1606 100 1506 32 1205 1237 100 1137 3215 (b) jik 97 409 506 100 406 28 345 373 100 273 3 97 100 100 0 979 jki 95 261 356 100 256 28 131 159 100 59 5 160 165 100 65 680 kij 13 146 159 100 59 33 1276 1309 100 1209 82 1254 1336 100 1236 2804 kji 16 220 236 100 136 31 352 383 100 283 81 404 485 100 385 1104 Loop A B C Grand order Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Total ijk 21 964 985 111 874 90 1799 1889 111 1778 0 2393 2393 111 2282 5267 ikj 1 864 865 111 754 110 1846 1956 111 1845 0 2556 2556 111 2445 5377 (c) jik 107 578 685 111 574 4 1900 1904 111 1793 0 2123 2123 111 2012 4712 jki 111 558 669 111 558 0 1789 1789 111 1678 0 2232 2232 111 2121 4690 kij 5 545 550 111 439 20 1866 1886 111 1775 86 2299 2385 111 2274 4821 kji 6 577 583 111 472 5 1823 1828 111 1717 100 2229 2329 111 2218 4740 A B C Grand Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Bnd Int Tot Cold Repl Total (d) k1 25 405 430 100 330 85 661 746 100 646 18 298 316 100 216 1492 2 40 305 345 100 245 71 1198 1269 100 1169 17 227 244 100 144 3350 3 40 449 489 100 389 68 1119 1187 100 1087 20 311 331 100 231 5357 Figure2: Misscountsfromourapproach(BndandInt)andfromcachesimulation(ColdandRepl). (a)Problem1,scenario1. 
(b) Problem1,scenario2.(c)Problem1,scenario3.(d)Problem1,scenario4. ArrayA ArrayB ArrayC Figure3:CachestateattheendofthecomputationdescribedinProblem2.Theshadedblocksarecache-resident.Thereareexactly 128shadedmemoryblocks. ArraysAandBshareablock,asdoarraysBandC.Theblockwiththeheavyoutlineineacharray mapstocacheset0.
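The mixing function and the matrix $Z(\sigma)$ of Section 3.6 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names `mix`, `z_matrix`, and `unmix` are hypothetical, $\sigma$ is passed as a string of single decimal digits (so at most ten coordinates), and `unmix` works on the element offset rather than on a full byte address.

```python
def mix(sigma, coords, widths):
    """Sketch of the mixing function Theta_sigma (hypothetical helper).

    sigma  : interleaving as a digit string, leftmost character = MSB
    coords : array coordinates x^(0), ..., x^(d-1)
    widths : bit widths q_0, ..., q_{d-1}
    """
    # sigma must contain exactly q_u copies of the digit u.
    assert all(sigma.count(str(u)) == q for u, q in enumerate(widths))
    seen = [0] * len(coords)          # how many u's were met so far, from the right
    result = 0
    for pos, ch in enumerate(reversed(sigma)):
        u = int(ch)
        bit = (coords[u] >> seen[u]) & 1   # bit j of x^(u), with j = seen[u]
        result |= bit << pos
        seen[u] += 1
    return result


def z_matrix(sigma, d):
    """Sketch of Z(sigma): column f holds 2^e in row sigma_f, zeros elsewhere."""
    p = len(sigma)
    z = [[0] * p for _ in range(d)]
    seen = [0] * d
    for f in range(p - 1, -1, -1):    # scan sigma from the right
        g = int(sigma[f])
        z[g][f] = 1 << seen[g]        # sigma_f is the e-th g from the right
        seen[g] += 1
    return z


def unmix(sigma, d, offset):
    """Apply Z(sigma) to M = [m_{p-1}, ..., m_0]^T, the bits of `offset`."""
    p = len(sigma)
    bits = [(offset >> (p - 1 - f)) & 1 for f in range(p)]
    return [sum(row[f] * bits[f] for f in range(p)) for row in z_matrix(sigma, d)]
```

With these definitions, `mix("10110010", (12, 5), (4, 4))` reproduces Example 2, `mix("21102001", (3, 7, 1), (3, 3, 2))` reproduces Example 3, and `z_matrix("12102010", 3)` yields the rows shown in Example 4. `unmix` inverts `mix`, which is the relation $F_u(\ell) = Z(\sigma) M$ used in formula (11).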
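The footprint claims in Problem 1 (two shared memory blocks when the $n = 21$ arrays are adjacent, no sharing for $n = 20$, and only 111 of the 128 cache sets used when the arrays collide) can be checked with a small Python sketch of the column-major layout of equation (5) and the default direct-mapped cache of Section 4. The helper names below are hypothetical, and the sketch only enumerates memory blocks and cache sets; it does not simulate misses or reproduce the Presburger analysis.

```python
BLOCK = 32                 # block size in bytes (Section 4 default)
CAPACITY = 4096            # cache capacity in bytes (Section 4 default)
SETS = CAPACITY // BLOCK   # direct-mapped cache: 128 sets
BETA = 8                   # bytes per double-precision element


def col_major_address(base, index, dims, beta=BETA):
    """Byte address of element `index`, following the column-major form of (5)."""
    offset, stride = 0, 1
    for i, n in zip(index, dims):
        offset += i * stride   # the first coordinate varies fastest
        stride *= n
    return base + offset * beta


def blocks_touched(base, n, lead=None):
    """Set of memory blocks holding an n x n array with leading dimension `lead`."""
    lead = n if lead is None else lead
    return {col_major_address(base, (i, j), (lead, n)) // BLOCK
            for i in range(n) for j in range(n)}


def sets_touched(blocks):
    """Cache sets that a collection of memory blocks maps to."""
    return {b % SETS for b in blocks}


# Problem 1, scenario 1: n = 21, arrays A, B, C placed back-to-back.
n = 21
a = blocks_touched(0, n)
b = blocks_touched(BETA * n * n, n)
c = blocks_touched(2 * BETA * n * n, n)
shared = len(a & b) + len(b & c)   # blocks straddling two arrays
distinct = len(a | b | c)          # distinct blocks in the whole footprint
```

Under these assumptions `shared` is 2 and `distinct` is 331, which agrees with the sum of the Cold columns (110 + 110 + 111) in Figure 2(a). Calling `sets_touched(blocks_touched(4096, 21))` gives 111 sets, the figure quoted for scenario 3.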
