
CacheBleed: A Timing Attack on OpenSSL Constant Time RSA

21 Pages·2016·0.77 MB·English


Yuval Yarom¹, Daniel Genkin², and Nadia Heninger³

¹ The University of Adelaide and NICTA, [email protected]
² Technion and Tel Aviv University, [email protected]
³ University of Pennsylvania, [email protected]

Abstract. The scatter-gather technique is a commonly implemented approach to prevent cache-based timing attacks. In this paper we show that scatter-gather is not constant time. We implement a cache timing attack against the scatter-gather implementation used in the modular exponentiation routine in OpenSSL version 1.0.2f. Our attack exploits cache-bank conflicts on the Sandy Bridge microarchitecture. We have tested the attack on an Intel Xeon E5-2430 processor. For 4096-bit RSA our attack can fully recover the private key after observing 16,000 decryptions.

Keywords: side-channel attacks, cache attacks, cryptographic implementations, constant-time, RSA

1 Introduction

1.1 Overview

Side-channel attacks are a powerful method for breaking theoretically secure cryptographic primitives. Since the first works by Kocher [33], these attacks have been used extensively to break the security of numerous cryptographic implementations. At a high level, it is possible to distinguish between two types of side-channel attacks, based on the methods used by the attacker: hardware-based attacks, which monitor the leakage through measurements (usually using dedicated lab equipment) of physical phenomena such as electromagnetic radiation [43], power consumption [31, 32], or acoustic emanation [22], and software-based attacks, which do not require additional equipment but rely instead on the attacker software running on or interacting with the target machine. Examples of the latter include timing attacks, which measure timing variations of cryptographic operations [7, 16, 17], and cache attacks, which observe cache access patterns [40, 41, 49].

In 2005, Percival [41] published a cache attack that targeted the OpenSSL [39] 0.9.7c implementation of RSA.
In this attack, the attacker and the victim programs are colocated on the same machine and processor, and thus share the same processor cache. The attack exploits the structure of the processor cache by observing minute timing variations due to cache contention. The cache consists of fixed-size cache lines. When a program accesses a memory address, the cache-line-sized block of memory that contains this address is stored in the cache and is available for future use. The attack traces the changes that the victim program execution makes in the cache and, from this trace, the attacker is able to recover the private key used for the decryption.

In order to implement the modular exponentiation routine required for performing RSA public and secret key operations, OpenSSL 0.9.7c uses a sliding-window exponentiation algorithm [11]. This algorithm precomputes some values, called multipliers, which are used throughout the exponentiation. The access pattern to these precomputed multipliers depends on the exponent, which, in the case of decryption and digital signature operations, should be kept secret. Because each multiplier occupies a different set of cache lines, Percival [41] was able to identify the accessed multipliers and from that recover the private key. To mitigate this attack, Intel implemented a countermeasure that changes the memory layout of the precomputed multipliers. The countermeasure, often called scatter-gather, interleaves the multipliers in memory to ensure that the same cache lines are accessed irrespective of the multiplier used [14]. While this countermeasure ensures that the same cache lines are always accessed, the offsets of the accessed addresses within these cache lines depend on the multiplier used and, ultimately, on the private key.

Both Bernstein [7] and Osvik et al. [40] have warned that accesses to different offsets within cache lines may leak information through timing variations due to cache-bank conflicts. To facilitate concurrent access to the cache, the cache is often divided into multiple cache banks.
Concurrent accesses to different cache banks can always be handled; however, each cache bank can only handle a limited number of concurrent requests, often a single request at a time. A cache-bank conflict occurs when too many requests are made concurrently to the same cache bank. In the case of a conflict, some of the conflicting requests are delayed. While timing variations due to cache-bank conflicts are documented in the Intel Optimization Manual [28], no attack exploiting these has ever been published. In the absence of a demonstrated risk, Intel continued to contribute code that uses scatter-gather to OpenSSL [23, 24] and to recommend the use of the technique for side-channel mitigation [12, 13]. Consequently, the technique is in widespread use in the current versions of OpenSSL and its forks, such as LibreSSL [35] and BoringSSL [10]. It is also used in other cryptographic libraries, such as the Mozilla Network Security Services (NSS) [38].

1.2 Our Contribution

In this work we present CacheBleed, the first side-channel attack to systematically exploit cache-bank conflicts. In Section 3 we describe how CacheBleed creates contention on a cache bank and measures the timing variations due to conflicts, and in Section 4 we use CacheBleed to attack the scatter-gather implementation of OpenSSL's modular exponentiation routine. After observing 16,000 RSA decryptions or signing operations, we are able to recover 60% of the secret exponent bits. To find the remaining bits we adapt the Heninger-Shacham algorithm [25] to the information we collect with CacheBleed. In order to achieve full key extraction, our attack requires about two CPU hours. Parallelizing across multiple CPUs, we achieved key extraction in only a few minutes. See Section 5 for a more complete discussion.

1.3 Targeted Software and Hardware

Software. In this paper we target the modular exponentiation operation as implemented in OpenSSL version 1.0.2f, which was the latest version of OpenSSL prior to our disclosure to OpenSSL. As mentioned above, similar (and thus potentially vulnerable) code can be found in several forks of OpenSSL such as LibreSSL [35] and BoringSSL [10].
Other cryptographic libraries, such as the Mozilla Network Security Services (NSS) [38], use similar techniques and may be vulnerable as well.

Hardware. Our attacks exploit cache-bank conflicts present in the Intel Sandy Bridge processor family. We ran our experiments on an Intel Xeon E5-2430 processor, which is a six-core Sandy Bridge machine with a 2.20GHz clock. Our target machine is running CentOS 6.7 installed with its default parameters and with huge pages enabled.

Disclosure and Mitigation. We have reported our results to the developers of OpenSSL, LibreSSL, NSS, and BoringSSL. We worked with the OpenSSL developers to evaluate and deploy countermeasures to prevent the attacks described in this paper (CVE-2016-0702). These countermeasures were subsequently incorporated into OpenSSL 1.0.2g and BoringSSL. The LibreSSL development team notified us that they are still working on a patch. The current version (2.4.0) appears to remain vulnerable. For NSS, our attack was documented under Mozilla bug 1252035. The bug documentation indicates that the fix is scheduled to be included in version 3.24.

2 Background

2.1 OpenSSL's RSA Implementation

RSA [44] is a public-key cryptosystem which supports both encryption and digital signatures. To generate an RSA key pair, the user generates two prime numbers $p, q$ and computes $N = pq$. Next, given a public exponent $e$ (OpenSSL uses $e = 65537$), the user computes the secret exponent $d \equiv e^{-1} \bmod \varphi(N)$. The public key is the integers $e$ and $N$ and the secret key is $d$ and $N$. In textbook RSA encryption, a message $m$ is encrypted by computing $m^e \bmod N$ and a ciphertext $c$ is decrypted by computing $c^d \bmod N$.

RSA-CRT. RSA decryption is often implemented using the Chinese remainder theorem (CRT), which provides a speedup over exponentiation mod $N$. Instead of computing $c^d \bmod N$ directly, RSA-CRT splits the secret key $d$ into two parts $d_p = d \bmod (p-1)$ and $d_q = d \bmod (q-1)$, and then computes two parts of the message as $m_p = c^{d_p} \bmod p$ and $m_q = c^{d_q} \bmod q$. The message $m$ can then be recovered from $m_p$ and $m_q$ using Garner's formula [21]:

$$h = (m_p - m_q)(q^{-1} \bmod p) \bmod p \quad \text{and} \quad m = m_q + hq.$$

The main operation performed during RSA decryption is the modular exponentiation, that is, calculating $a^b \bmod k$ for some secret exponent $b$. Several algorithms for modular exponentiation have been suggested. In this work we are interested in the two algorithms that OpenSSL has used.

Algorithm 1: Fixed-window exponentiation

    input : window size $w$, base $a$, modulus $k$, $n$-bit exponent $b = \sum_{i=0}^{\lceil n/w \rceil - 1} 2^{wi} \cdot b_i$
    output: $a^b \bmod k$

    // Precomputation
    $a_0 \leftarrow 1$
    for $j = 1, \ldots, 2^w - 1$ do
        $a_j \leftarrow a_{j-1} \cdot a \bmod k$
    end

    // Exponentiation
    $r \leftarrow 1$
    for $i = \lceil n/w \rceil - 1, \ldots, 0$ do
        for $j = 1, \ldots, w$ do
            $r \leftarrow r^2 \bmod k$
        end
        $r \leftarrow r \cdot a_{b_i} \bmod k$
    end
    return $r$

Fixed-Window Exponentiation. In the fixed-window exponentiation algorithm, also known as m-ary exponentiation, the $n$-bit exponent $b$ is represented as a $\lceil n/w \rceil$-digit integer in base $2^w$ for some chosen window size $w$. That is, $b$ is rewritten as $b = \sum_{i=0}^{\lceil n/w \rceil - 1} 2^{wi} \cdot b_i$ where $0 \le b_i < 2^w$. The pseudocode in Algorithm 1 demonstrates the fixed-window exponentiation algorithm. In the first step, the algorithm precomputes a set of multipliers $a_j = a^j \bmod k$ for $0 \le j < 2^w$. It then scans the base-$2^w$ representation of $b$ from the most significant digit ($b_{\lceil n/w \rceil - 1}$) to the least significant ($b_0$). For each digit $b_i$ it squares an intermediate result $w$ times and then multiplies the intermediate result by $a_{b_i}$. Each of the square or multiply operations is followed by a modular reduction.

Sliding-Window Exponentiation. The sliding-window algorithm represents the exponent $b$ as a sequence of digits $b_i$ such that $b = \sum_{i=0}^{n-1} 2^i \cdot b_i$, with $b_i$ being either 0 or an odd number $0 < b_i < 2^w$. The algorithm first precomputes $a_1, a_3, \ldots, a_{2^w - 1}$ as in the fixed-window case. It then scans the exponent from the most significant to the least significant digit. For each digit, the algorithm squares the intermediate result. For a non-zero digit $b_i$, it also multiplies the intermediate result by $a_{b_i}$.
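
As an illustration of Algorithm 1, the following is a minimal Python sketch of fixed-window exponentiation. The function name `fixed_window_pow` and the default window size are choices made here for illustration, and the sketch makes no attempt at constant-time behaviour; it only mirrors the precomputation and the square-and-multiply structure described above, and assumes a non-negative exponent.

```python
def fixed_window_pow(a: int, b: int, k: int, w: int = 5) -> int:
    """Compute a**b mod k with fixed-window (2^w-ary) exponentiation.

    Illustrative only: b must be non-negative, and nothing here is
    constant time.
    """
    if k == 1:
        return 0

    # Precomputation: multipliers[j] = a^j mod k for 0 <= j < 2^w.
    multipliers = [1]
    for _ in range(1, 1 << w):
        multipliers.append(multipliers[-1] * a % k)

    # Split b into base-2^w digits, least significant digit first.
    mask = (1 << w) - 1
    digits = []
    while b > 0:
        digits.append(b & mask)
        b >>= w

    # Exponentiation: scan digits from most to least significant.
    r = 1
    for digit in reversed(digits):
        for _ in range(w):
            r = r * r % k          # w squarings per digit
        # The table index `digit` is the secret-dependent value b_i.
        r = r * multipliers[digit] % k
    return r
```

The lookup `multipliers[digit]` is the access whose index depends on the exponent, which is why an attacker who learns which multiplier is used learns the digits of the exponent.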

The main advantages of the sliding-window algorithm over the fixed-window algorithm are that, for the same window size, sliding window needs to precompute half the number of multipliers, and that fewer multiplications are required during the exponentiation. The sliding-window algorithm, however, leaks the position of the non-zero multipliers to adversaries who can distinguish between squaring and multiplication operations. Furthermore, the number of squaring operations between consecutive multipliers may leak the values of some zero bits. Up to version 0.9.7c, OpenSSL used sliding-window exponentiation. As part of the mitigation of the Percival [41] cache attack, which exploits these leaks, OpenSSL changed their implementation to use the fixed-window exponentiation algorithm.

Since both algorithms precompute a set of multipliers and access them throughout the exponentiation, a side-channel attack that can discover which multiplier is used in the multiplication operations can recover the digits $b_i$ and from them obtain the secret exponent $b$.

2.2 The Intel Cache Hierarchy

We now turn our attention to the cache hierarchy in modern Intel processors. The cache is a small, fast memory that exploits the temporal and spatial locality of memory accesses to bridge the speed gap between the faster CPU and slower memory. In the processors we are interested in, the cache hierarchy consists of three levels of caching. The top level, known as the L1 cache, is the closest to the execution core and is the smallest and the fastest cache. Each successive cache level is larger and slower than the preceding one, with the last-level cache (LLC) being the largest and slowest.

Cache Structure. The cache stores fixed-sized chunks of memory called cache lines. Each cache line holds 64 bytes of data that come from a 64-byte aligned block in memory. The cache is organized as multiple cache sets, each consisting of a fixed number of ways. A block of memory can be stored in any of the ways of a single cache set.
For the higher cache levels, the mapping of memory blocks to cache sets is done by selecting a range of address bits. For the LLC, Intel uses an undisclosed hash function to map memory blocks to cache sets [30, 37, 50]. The L1 cache is divided into two sub-caches: the L1 data cache (L1-D), which caches the data the program accesses, and the L1 instruction cache (L1-I), which caches the code the program executes. In multi-core processors, each of the cores has a dedicated L1 cache. However, multithreaded cores share the L1 cache between the two threads.

Cache Sizes. In the Intel Sandy Bridge microarchitecture, each of the L1-D and L1-I caches has 64 sets and 8 ways, for a total capacity of 64 · 8 · 64 = 32,768 bytes. The L2 cache has 512 sets and 8 ways, with a size of 256KiB. The L2 cache is unified, storing both data and instructions. Like the L1 cache, each core has a dedicated L2 cache. The L3 cache, or the LLC, is shared by all of the cores of the processor. It has 2,048 sets per core, i.e. the LLC of a four-core processor has 8,192 cache sets. The number of ways varies between processor models and ranges between 12 and 20. Hence the size of the LLC of a small dual-core processor is 3MiB, whereas the LLC of an 8-core processor can be in the order of 20MiB. The Intel Xeon E5-2430 processor we used for our experiments is a 6-core processor with a 20-way LLC of size 15MiB. More recent microarchitectures support more cores and more ways, yielding significantly larger LLCs.

Cache Lookup Policy. When the processor attempts to access data in memory, it first looks for the data in the L1 cache. In a cache hit, the data is found in the cache. Otherwise, in a cache miss, the processor searches for the data in the next level of the cache hierarchy. By measuring the time to access data, a process can distinguish cache hits from misses and identify whether the data was cached prior to the access.
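
The capacities quoted under Cache Sizes all follow from sets × ways × line size. The short Python sketch below reproduces that arithmetic; the helper names `cache_bytes` and `llc_bytes` are introduced here for illustration only, and the parameters are the ones stated in the text.

```python
LINE_SIZE = 64          # bytes per cache line
KIB = 1024
MIB = 1024 * KIB


def cache_bytes(sets: int, ways: int, line_size: int = LINE_SIZE) -> int:
    """Capacity of a set-associative cache in bytes."""
    return sets * ways * line_size


def llc_bytes(cores: int, ways: int, sets_per_core: int = 2048) -> int:
    """Capacity of a Sandy Bridge style LLC with 2,048 sets per core."""
    return cache_bytes(sets_per_core * cores, ways)


l1_size = cache_bytes(sets=64, ways=8)        # L1-D or L1-I
l2_size = cache_bytes(sets=512, ways=8)       # unified L2
llc_e5_2430 = llc_bytes(cores=6, ways=20)     # Xeon E5-2430

print(l1_size, l2_size // KIB, llc_e5_2430 // MIB)
```

With these inputs the L1 capacity comes out at 32,768 bytes, the L2 at 256KiB, and the six-core, 20-way LLC at 15MiB, matching the figures above.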

2.3 Microarchitectural Side-Channel Attacks

In this section we review related works on microarchitectural side-channel timing attacks. These attacks exploit timing variations that are caused by contention on microarchitectural hardware resources in order to leak information on the usage of these resources, and indirectly on the internal operation of the victim. Acıiçmez and Seifert [5] distinguish between two types of channels: those that rely on a persistent state and those that exploit a transient state. Persistent-state channels exploit the limited storage space within the targeted microarchitectural resource. Transient-state channels, in contrast, exploit the limited bandwidth of the targeted element.

Persistent-State Attacks. The PRIME+PROBE attack [40, 41] is an example of a persistent-state attack. The attack exploits the limited storage space in cache sets to identify the sets used for the victim's data. The attacker preloads data to the cache and allows the victim to execute before measuring the time to access the preloaded data. When the victim accesses its data it is loaded into the cache, replacing some of the attacker's preloaded data. Accessing data that has been replaced will take longer than accessing data still in the cache. Thus the attacker can identify the cache sets that the victim has accessed. Persistent-state channels have targeted the L1 data cache [7, 15, 40, 41], the L1 instruction cache [1, 4, 51], the branch prediction buffer [2, 3], the last-level cache [27, 29, 36, 46, 49], and DRAM open rows [42]. The PRIME+PROBE attack was used to recover the accessed multipliers in the sliding-window exponentiation of OpenSSL 0.9.7c [41] and of GnuPG 1.4.18 [27, 36].

Transient-State Attacks. Transient-state channels have been investigated mostly within the context of covert channels, where a Trojan process tries to covertly exfiltrate information. The idea dates back to Lampson [34], who suggests that processes can leak information by modifying their CPU usage.
Covert channels were also observed with shared bus contention [26, 48]. Acıiçmez and Seifert [5] are the first to publish a side-channel attack based on a transient state. The attack monitors the usage of the multiplication functional unit in a hyperthreaded processor. Monitoring the unit allows an attacker to distinguish between the square and the multiply phases of modular exponentiation. The attack was tested on a victim running fixed-window exponentiation, so no secret information was obtained.

Another transient-state channel uses bus contention to leak side-channel information [47]. By monitoring the capacity of the memory bus allocated to the attacker, the attacker is able to distinguish the square and the multiply steps. Because the attack of [47] was only demonstrated in a simulator, the question of whether actual hardware leaks such high-resolution information is still open.

2.4 Scatter-Gather Implementation

One of the countermeasures Intel recommends against side-channel attacks is to avoid secret-dependent memory access at coarser than cache line granularity [12, 13]. This approach is manifested in the patch Intel contributed to the OpenSSL project to mitigate the Percival [41] attack. The patch¹ changes the layout of the multipliers in memory.
Instead of storing the data of each of the multipliers in consecutive bytes in memory, the new layout scatters each multiplier across multiple cache lines [14]. Before use, the fragments of the required multiplier are gathered to a single buffer which is used for the multiplication. Figure 1 contrasts the conventional memory layout of the multipliers with the layout used in the scatter-gather approach. This scatter-gather design ensures that the order of accessing cache lines when performing a multiplication is independent of the multiplier used.

¹ https://github.com/openssl/openssl/commit/46a643763de6d8e39ecf6f76fa79b4d04885aa59

Conventional layout:

    offset      0          1          2          ...   63
    Line 0      M0[0]      M0[1]      M0[2]      ...   M0[63]
    Line 1      M0[64]     M0[65]     M0[66]     ...   M0[127]
    Line 2      M0[128]    M0[129]    M0[130]    ...   M0[191]
    Line 3      M1[0]      M1[1]      M1[2]      ...   M1[63]
    Line 4      M1[64]     M1[65]     M1[66]     ...   M1[127]
    Line 5      M1[128]    M1[129]    M1[130]    ...   M1[191]
    Line 6      M2[0]      M2[1]      M2[2]      ...   M2[63]
    Line 7      M2[64]     M2[65]     M2[66]     ...   M2[127]
    Line 8      M2[128]    M2[129]    M2[130]    ...   M2[191]
    ...
    Line 191    M63[128]   M63[129]   M63[130]   ...   M63[191]

Scatter-gather layout:

    offset      0          1          2          ...   63
    Line 0      M0[0]      M1[0]      M2[0]      ...   M63[0]
    Line 1      M0[1]      M1[1]      M2[1]      ...   M63[1]
    Line 2      M0[2]      M1[2]      M2[2]      ...   M63[2]
    Line 3      M0[3]      M1[3]      M2[3]      ...   M63[3]
    Line 4      M0[4]      M1[4]      M2[4]      ...   M63[4]
    Line 5      M0[5]      M1[5]      M2[5]      ...   M63[5]
    Line 6      M0[6]      M1[6]      M2[6]      ...   M63[6]
    Line 7      M0[7]      M1[7]      M2[7]      ...   M63[7]
    Line 8      M0[8]      M1[8]      M2[8]      ...   M63[8]
    ...
    Line 191    M0[191]    M1[191]    M2[191]    ...   M63[191]

Fig. 1. Conventional (first table) vs. scatter-gather (second table) memory layout.

Because Intel cache lines are 64 bytes long, the maximum number of multipliers that can be used with scatter-gather is 64. For large exponents, increasing the number of multipliers reduces the number of multiply operations performed during the exponentiations. Gopal et al. [23] suggest dividing the multipliers into 16-bit fragments rather than into bytes. This improves performance by allowing loads of two bytes in a single memory access, at the cost of reducing the maximum number of multipliers to 32.
Gueron [24] recommends 32-bit fragments, thus reducing the number of multipliers to 16. He shows that the combined savings from the reduced number of memory accesses and the smaller cache footprint of the multipliers outweigh the performance loss due to the added multiplications required with fewer multipliers.

The OpenSSL Scatter-Gather Implementation. The implementation of exponentiation in the current version of OpenSSL (1.0.2f) deviates slightly from the layout described above. For 2048-bit and 4096-bit key sizes the implementation uses a fixed-window algorithm with a window size of 5, requiring 32 multipliers. Instead of scattering the multipliers in each cache line, the multipliers are divided into 64-bit fragments, scattered across groups of four consecutive cache lines. (See Figure 2.) That is, the table that stores the multipliers is divided into groups of four consecutive cache lines. Each group of four consecutive cache lines stores one 64-bit fragment of each multiplier. To avoid leaking information on the particular multiplier used in each multiplication, the gather process accesses all of the cache lines and uses a bitmask pattern to select the ones that contain fragments of the required multiplier. Furthermore, to avoid copying the multiplier data, the implementation combines the gather operation with the multiplication. This spreads the access to the scattered multiplier across the multiplication.

    offset      0-7             8-15            ...   56-63
    Line 0      M0[0-7]         M1[0-7]         ...   M7[0-7]
    Line 1      M8[0-7]         M9[0-7]         ...   M15[0-7]
    Line 2      M16[0-7]        M17[0-7]        ...   M23[0-7]
    Line 3      M24[0-7]        M25[0-7]        ...   M31[0-7]
    Line 4      M0[8-15]        M1[8-15]        ...   M7[8-15]
    Line 5      M8[8-15]        M9[8-15]        ...   M15[8-15]
    Line 6      M16[8-15]       M17[8-15]       ...   M23[8-15]
    Line 7      M24[8-15]       M25[8-15]       ...   M31[8-15]
    Line 8      M0[16-23]       M1[16-23]       ...   M7[16-23]
    ...
    Line 95     M24[184-191]    M25[184-191]    ...   M31[184-191]

Fig. 2. The memory layout of the multipliers table in OpenSSL.

Key-Dependent Memory Accesses.
Because the fragments of each multiplier are stored in a fixed offset within the cache lines, all of the scatter-gather implementations described above have memory accesses that depend on the multiplier used and thus on the secret key. For a pure scatter-gather approach, the multiplier is encoded in the low bits of the addresses accessed during the gather operation. For the case of OpenSSL's implementation, only the three least significant bits of the multiplier number are encoded in the address, while the other two bits are used as the index of the cache line within the group of four cache lines that contains the fragment.

We note that because these secret-dependent accesses are at a finer than cache line granularity, the scatter-gather approach has been considered secure against side-channel attacks [24].

2.5 Intel L1 Cache Banks

With the introduction of superscalar computing in Intel processors, cache bandwidth became a bottleneck for processor performance. To alleviate the issue, Intel introduced a cache design consisting of multiple banks [6]. Each of the banks serves part of the cache line specified by the offset in the cache line. The banks can operate independently and serve requests concurrently. However, each bank can only serve one request at a time. When multiple accesses to the same bank are made concurrently, only one access is served, while the rest are delayed until the bank can handle them.

Fog [18] notes that cache-bank conflicts prevent instructions from executing simultaneously on Pentium processors. Delays due to cache-bank conflicts are also documented for other processor versions [19, 20, 28].

Both Bernstein [7] and Osvik et al. [40] mention that cache-bank conflicts cause timing variations and warn that these may result in a timing channel which may leak information about low address bits. Tromer et al. [45] note that while scatter-gather has no secret-dependent accesses to cache lines, it does have secret-dependent access to cache banks.
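
The observation that the cache lines accessed are key-independent while the offsets within them are not can be illustrated with the layouts of Figures 1 and 2. The Python sketch below is a model derived from those figures only, not code taken from OpenSSL; the function names and the 192-byte multiplier size are assumptions read off the figures. It computes the table-relative address of a multiplier fragment and shows which part of the multiplier index ends up in the offset within the cache line.

```python
LINE_SIZE = 64  # bytes per cache line


def pure_scatter_addr(k: int, i: int) -> int:
    """Byte i of multiplier k in the Fig. 1 scatter-gather layout:
    line i, offset k (0 <= k < 64)."""
    return i * LINE_SIZE + k


def openssl_addr(k: int, f: int) -> int:
    """First byte of 8-byte fragment f of multiplier k in the Fig. 2
    layout (0 <= k < 32): groups of four lines, eight fragments per line."""
    line = 4 * f + k // 8        # top two bits of k pick the line in the group
    offset = 8 * (k % 8)         # low three bits of k pick the offset
    return line * LINE_SIZE + offset


def lines_touched_pure(k: int, mult_bytes: int = 192) -> set:
    """Cache lines read when gathering multiplier k (pure layout)."""
    return {pure_scatter_addr(k, i) // LINE_SIZE for i in range(mult_bytes)}


# Same lines regardless of k, but a k-dependent offset inside each line.
same_lines = lines_touched_pure(0) == lines_touched_pure(63)
offsets = {k: pure_scatter_addr(k, 0) % LINE_SIZE for k in (0, 1, 63)}
print(same_lines, offsets)
```

In this model, multipliers 3 and 11 of the OpenSSL layout share the same offset within their cache lines and differ only in which line of the group holds them. Since, as described in Section 2.4, the gather reads every line of the group and selects with a bitmask, only the low three bits of the multiplier index remain visible in the addresses.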
Bernstein and Schwabe [8] demonstrate timing variations due to conflicts between read and write instructions on addresses within the same cache bank and suggest these may affect cryptographic software. However, although the risk of side-channel attacks based on cache-bank conflicts has been identified long ago, no attacks exploiting them have ever been published.

3 The CacheBleed Attack

We now proceed to describe CacheBleed, the first side-channel attack to systematically exploit cache-bank conflicts. The attack identifies the times at which a victim accesses data in a monitored cache bank by measuring the delays caused by contention on the cache bank.

In our attack scenario, we assume that the victim and the attacker run concurrently on two hyperthreads of the same processor core. Thus, the victim and the attacker share the L1 data cache. Recall that the Sandy Bridge L1 data cache is divided into multiple banks and that the banks cannot handle concurrent load accesses. The attacker issues a large number of load accesses to a cache bank and measures the time to fulfill these accesses. If during the attack the victim also accesses the same cache bank, the victim accesses will contend with the attacker for cache bank access, causing delays in the attack. Hence, when the victim accesses the monitored cache bank the attack will take longer than when the victim accesses other cache banks.

To implement CacheBleed we use the code in Listing 1. The bulk of the code (Lines 4–259) consists of 256 addl instructions that read data from addresses that are all in the same cache bank. (The cache bank is selected by the low bits of the memory address in register r9.) We use four different destination registers to avoid contention on the registers themselves. Before starting the accesses, the code takes the value of the current cycle counter (Line 1) and stores it in register r10 (Line 2). After performing 256 accesses, the previously stored value of the cycle counter is subtracted from the current value, resulting in the number of cycles that passed during the attack.
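
The claim that all the loads fall in the same cache bank can be checked against the displacements that are visible in Listing 1. The Python sketch below takes only the twelve displacements printed in the listing (the elided ones are left out rather than guessed) and a hypothetical value for register r9. It assumes that the L1 set index is taken from the address bits directly above the 6-bit line offset, which is consistent with the 64-set, 64-byte-line L1 described in Section 2.2 but is not stated explicitly in the text.

```python
LINE_SIZE = 64   # bytes per cache line
L1_SETS = 64     # sets in the Sandy Bridge L1-D (Section 2.2)

# Displacements visible in Listing 1; the elided middle part is omitted.
SHOWN_DISPLACEMENTS = [
    0x000, 0x040, 0x080, 0x0c0, 0x100, 0x140, 0x180, 0x1c0,
    0xf00, 0xf40, 0xf80, 0xfc0,
]

BASE = 0x7F0000001008  # hypothetical value of r9, chosen for illustration


def line_offset(addr: int) -> int:
    """Offset of the address within its 64-byte cache line."""
    return addr % LINE_SIZE


def l1_set(addr: int) -> int:
    """L1 set index, assuming it is address bits 6..11."""
    return (addr // LINE_SIZE) % L1_SETS


addresses = [BASE + d for d in SHOWN_DISPLACEMENTS]
offsets = {line_offset(a) for a in addresses}
sets = {l1_set(a) for a in addresses}
print(offsets, len(sets))
```

Every displacement is a multiple of 0x40, so each load keeps the within-line offset of r9 and therefore targets the same bank, while landing in a different cache line and, under the stated assumption, a different L1 set.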

      1  rdtscp
      2  movq %rax, %r10
      3
      4  addl 0x000(%r9), %eax
      5  addl 0x040(%r9), %ecx
      6  addl 0x080(%r9), %edx
      7  addl 0x0c0(%r9), %edi
      8  addl 0x100(%r9), %eax
      9  addl 0x140(%r9), %ecx
     10  addl 0x180(%r9), %edx
     11  addl 0x1c0(%r9), %edi
         ...
    256  addl 0xf00(%r9), %eax
    257  addl 0xf40(%r9), %ecx
    258  addl 0xf80(%r9), %edx
    259  addl 0xfc0(%r9), %edi
    260
    261  rdtscp
    262  subq %r10, %rax

Listing 1. Cache-Bank Collision Attack Code

We run the attack code on an Intel Xeon E5-2430 processor, a six-core Sandy Bridge processor with a clock rate of 2.20GHz. Figure 3 shows the histogram of the running times of the attack code under several scenarios.²

² For clarity, the presented histograms show the envelope of the measured data.

Scenario 1: Idle. In the first scenario, idle hyperthread, the attacker is the only program executing on the core. That is, one of the two hyperthreads executes the attack code while the other hyperthread is idle. As we can see, the attack takes around 230 cycles, clearly showing that the Intel processor is superscalar and that the cache can handle more than one access in a CPU cycle.

Scenario 2: Pure Compute. The second scenario has a victim running a computation on the registers, without any memory access. As we can see, access in this scenario is slower than when there is no victim. Because the victim does not perform memory accesses, cache-bank conflicts cannot explain this slowdown. Hyperthreads, however, share most of the resources of the core, including the execution units, read and write buffers and the register allocation and renaming resources [20]. Contention on any of these resources can explain the slowdown we see when running a pure-compute victim.

Scenario 3: Pure Memory. At the other extreme is the pure memory victim, which continuously accesses the cache bank that the attacker monitors.
As we can see, the attack code takes almost twice as long to run in this scenario. The distribution of attack times is completely distinct from any of the other scenarios. Hence identifying the victim in this scenario is trivial. This scenario is, however, not realistic: programs usually perform some calculation.

Scenarios 4 and 5: Mixed Load. The last two scenarios aim to measure a slightly more realistic scenario. In this case, one in four victim operations is a memory access, where all of these memory accesses are to the same cache bank. In this scenario we measure both the case that the victim accesses the monitored cache bank (mixed-load) and the case that there is no cache-bank contention between the victim and the attacker (mixed-load-NC). We see that the two scenarios are distinguishable, but there is some overlap between the two distributions. Consequently, a single measurement may be insufficient to distinguish between the two scenarios.

In practice, even this mixed-load scenario is not particularly realistic. Typical programs will access memory in multiple cache banks. Hence the differences between measurement distributions may be much smaller than those presented in Figure 3. In the next section we show how we overcome this limitation and correctly identify a small bias in the cache-bank access patterns of the victim.
