PREPRINT

Vectorization of Multibyte Floating Point Data Formats

Andrew Anderson (Lero, Trinity College Dublin, [email protected])
David Gregg (Lero, Trinity College Dublin, [email protected])

ABSTRACT

We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a general-purpose processor (GPP). Exploiting native vector hardware allows us to support reduced precision floating point with low overhead. We demonstrate that supporting reduced precision in the compiler, as opposed to using a library approach, can yield a low overhead solution for GPPs.

CCS Concepts

• Computer systems organization → Single instruction, multiple data; • Software and its engineering → Software performance; Data types and structures; • Theory of computation → Vector / streaming algorithms

Keywords

Approximate Computing; Floating Point; Multiple Precision; SIMD; Vector Architecture

1. MOTIVATION

It has long been recognized that different applications and algorithms need different amounts of floating point precision to achieve accurate results [3,13,16]. For example, 64-bit double precision floating point is often needed for scientific applications, whereas 32-bit single precision is sufficient for most graphics and computer vision applications. Modern GPUs and embedded processors increasingly support 16-bit floating point for applications that are highly tolerant of approximation, such as speech recognition [6].

In some contexts it is possible to customize the level of precision precisely to the application. For example, field-programmable gate arrays (FPGAs) can be used to implement customized floating point with bit-level control over the size of the mantissa and exponent [5,16]. More recently, Schkufza et al. [14] have shown how superoptimization can be used to generate iterative floating point algorithms that guarantee a given level of accuracy. For example, an exponential function may return a 64-bit floating point value in which at least the first, say, 40 bits of the mantissa are guaranteed to be accurate.

A problem with customizing floating point precision on general-purpose processors (GPPs) is that most support only two standard types: IEEE-754 single precision (binary32) and double precision (binary64). If an operation needs more precision than binary32 and less than binary64, the developer has no choice but to use binary64.

In this paper we propose a compiler-based mechanism for supporting several non-standard multibyte floating point memory formats, such as 24-, 40-, or 48-bit floating point. By multibyte we mean that these formats differ in length from standard IEEE-754 formats by multiples of 8 bits. By using these types, a developer can reduce the precision of floating point data in memory, resulting in reduced storage requirements for their data.

An increasingly important factor in the design of computing systems is energy consumption. In embedded computing systems the energy consumed by data movement can be much greater than the energy used to perform arithmetic [4]. According to Gustafson [7], a typical 64-bit multiply-add operation costs around 200 pJ, while reading 64 bits from cache costs 800 pJ and a 64-bit main memory read costs 12,000 pJ. Reducing the amount of data movement is therefore essential to reducing the energy consumed by an application, and in particular to reducing the number of costly last-level cache misses.

Customizing the precision of floating point data to the needs of the application is one way to reduce data movement.
It is straightforward to implement a C/C++ data type representing, for example, 24-bit floating point values (Figure 1). However, we have found that the performance of such code can be extraordinarily poor when operating on arrays of data. We therefore propose a technique to generate vectorized code that performs a number of adjacent reads or writes together, along with the required packing or unpacking.

Contributions

• We present a practical representation of multibyte floating point values that can easily be converted to and from the next largest natively-supported type.
• We propose a compiler vectorization scheme for packing and unpacking our floating point types, enabling their use in vectorized floating point computations.
• We demonstrate experimentally that our techniques provide a low-overhead way to support customized floating point types on general-purpose processors.

2. CUSTOMIZING FLOATING POINT

The IEEE-754 2008 standard [17] defines a number of finite binary representations of real numbers at different resolutions (16, 32, and 64 bits, among others). Each format encodes floating-point values using three binary integers: sign, exponent, and mantissa, which specify a point on the real number line following a general formula:

    v = (-1)^s \times \Bigl(1 + \sum_{i=1}^{M} m_i \, 2^{-i}\Bigr) \times 2^{\,e - \mathrm{bias}}

Here v is the real number obtained by evaluating the formula for a floating-point value with sign s, mantissa m of length M bits, and exponent e. The value bias is an integer constant which differs for each format. Different formats (binary32, binary64, etc.) use different numbers of bits for the exponent and mantissa components.

The exponent determines a power of two by which the rest of the number is multiplied, while the mantissa represents a fractional quantity in the interval [1,2) obtained by summing successively smaller powers of two, starting from 2^{-1} and continuing up to the length of the mantissa. If the ith mantissa bit is set, the corresponding power of two is present in the sum. For normalized numbers (those with a nonzero exponent), the leading 1 is not explicitly stored in the mantissa, but is implied.
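For concreteness, the following worked example (ours, not taken from the original text) decodes one binary32 bit pattern with the formula above, using the binary32 parameters M = 23 and bias = 127:

    \texttt{0x3FC00000}: \quad s = 0, \quad e = 01111111_2 = 127, \quad m_1 = 1,\; m_2 = \dots = m_{23} = 0

    v = (-1)^0 \times \bigl(1 + 2^{-1}\bigr) \times 2^{127 - 127} = 1.5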
The structure of the IEEE-754 binary encoding means that a change in an exponent bit strongly influences the resulting value, while a change in a mantissa bit has less influence. Furthermore, mantissa bits fall off exponentially in significance: a change in any bit of the mantissa has a greater effect on the resulting value than a change in the next-least-significant bit.

These observations lead naturally to a scheme for representing values at precisions between those specified by the IEEE-754 standard: varying the number of low-order mantissa bits. Previous proposals based around this concept have typically made use of either customized floating point hardware support via FPGA [5] or high-level data reorganization in massively parallel systems [9]. In this paper, we demonstrate how reduced precision floating point representations can be used on general-purpose processors without requiring any special hardware support.

The rest of our paper is organized as follows. First, in Section 3, we discuss our scheme for representing floating point numbers with reduced precision in memory, and a straightforward implementation of the scheme in scalar code using a library of datatypes implementing different memory representations of floating point numbers with different precision. Next, in Section 4, we discuss aspects of contemporary GPP architecture which confound the straightforward library approach, and propose a vectorized, compiler-accelerated approach. Section 5 discusses rounding, an important aspect of implementing floating point. Section 6 presents an experimental evaluation of both library and compiler-accelerated schemes on a recent general-purpose processor. Section 7 discusses related work, and Section 8 concludes.

3. REDUCED PRECISION ON GPPS

It is possible to emulate floating point operations in software using integer instructions. However, each floating point operation requires many integer instructions, so emulation is slow. In particular, the IEEE-754 standard has several special values and ranges, such as not-a-number (NaN), positive and negative zero, infinities, and sub-normal numbers, each of which adds to the complexity of software emulation [15]. Modern processors provide hardware floating point units which dramatically reduce the time and energy required for floating point computations. However, these units normally support just two floating point types, typically binary32 (float) and binary64 (double).

3.1 Our Approach

We propose a set of non-standard multibyte floating point types that we refer to as flytes. Rather than emulating computation on flytes in software, we convert them to and from the next largest natively supported type. Thus, a 24-bit flyte (flyte24) is converted to binary32 before computation, and the binary32 result is converted back to flyte24 afterwards.

We need to solve two problems: (1) efficiently loading and storing non-standard data sizes, such as 24-bit data, where there is no hardware support for such operations, and (2) quickly converting between built-in floating point types and our non-standard types.

In general, converting between floating point formats has many special cases. In particular, converting between formats with different-sized exponents may cause numbers to overflow to infinity or underflow to or from sub-normal numbers. Dealing correctly with these cases in software is complicated and slow. Our solution is that in all our flyte types, the size of the exponent is equal to the size of the exponent of the next largest built-in type. For example, in our flyte16 and flyte24 types, the exponent has eight bits, just like binary32. This dramatically reduces the complexity of conversions.

Efficiently supporting non-standard floating point types using this approach creates two types of problems. The first is supporting the loading, storing, and conversion of reduced precision types with acceptably low overhead; this is the topic of the majority of this paper. The second type of problem is that performing computation in one floating point type and storing values in a less precise type introduces issues, such as double rounding, that complicate numerical analysis. We make every effort to be clear about this latter group of problems, but in most cases we do not have comprehensive solutions. Double rounding in particular is a topic of extensive study, and we refer the reader to the work of Boldo and Melquiond [2] for an in-depth discussion.

Our techniques are aimed squarely at problems where some approximation is acceptable and the developer has a good understanding of exactly how much precision is required. Our main contribution is to show how to implement multibyte floating point formats efficiently; the question of whether to use them in any particular algorithm depends on the numerical properties of the algorithm.
3.2 Simple Scalar Code Approach

Figure 1 shows a simple implementation of the flyte24 type in C++. It relies on the bit field facility in C/C++ to specify that the num field contains a 24-bit integer. It also uses the GCC packed attribute to indicate to the compiler that arrays of the type should be packed to exactly 24 bits, rather than padded to 32 bits. Figure 1 also shows a routine for converting from flyte24 to float (i.e. binary32): the 24-bit pattern stored in the flyte24 variable is scaled up to 32 bits, padded with zeros, and the resulting 32-bit pattern is returned as a float.

    class flyte24 {
    private:
      unsigned num:24;               // 24-bit bit field holding the flyte24 pattern
    public:
      operator float() {
        u32 temp = num << 8;         // widen: pad the low 8 bits with zeros
        return(cast_u32_to_f32(temp)); // reinterpret the bit pattern as binary32
      };
      ...
    } __attribute__((packed));       // array elements occupy exactly 3 bytes

    Figure 1: Simple implementation of flyte24 in C++

The code sketched in Figure 1 can be used to implement programs with arrays of flyte24 values, but it is very slow. Figure 6a shows a comparison of the execution time of several BLAS kernels using flyte24 and other flyte and IEEE-754 types. The order-of-magnitude difference in execution time is the result of (1) the cost of converting between flyte24 and binary32 before and after each computation, and (2) the cost of loading and storing individual 24-bit values.

In particular, storing data to successive elements of packed flyte24 arrays can result in sequences of overlapping 3-byte stores aligned only to 3-byte boundaries. Load/store hardware in GPPs is not designed to deal with such operations, which results in extraordinarily slow execution.
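The store direction is not shown in Figure 1. The following sketch of an assignment operator for the same class (our sketch, not the paper's library code) illustrates where those stores come from; it assumes round-towards-zero (Section 5.1) and reuses the helper naming of Figure 5:

    // Sketch: a write path inside class flyte24 from Figure 1.
    flyte24 &operator=(float f) {
      u32 temp = cast_f32_to_u32(f);  // reinterpret the binary32 bits
      num = temp >> 8;                // truncate: keep sign, 8-bit exponent,
                                      // and the top 15 mantissa bits
      return *this;                   // the 24-bit bit-field write becomes a
                                      // packed, possibly misaligned 3-byte store
    }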
Table 1 summarizes our proposed set of multibyte formats for floating-point values, which preserve the sign and exponent fields of the corresponding IEEE-754 representations.

    Table 1: flyte storage formats for IEEE-754 types.

                                        flyte layout (bits)
    IEEE-754 type    flyte format    Sign    Exp.    Mant.
    binary32         16-bit          1       8       7
    binary32         24-bit          1       8       15
    binary32         32-bit          1       8       23
    binary64         40-bit          1       11      28
    binary64         48-bit          1       11      36
    binary64         56-bit          1       11      44
    binary64         64-bit          1       11      52

4. ACCESS IN REDUCED PRECISION

Reading and writing reduced precision representations might be expected to incur a significant performance penalty due to the overheads outlined in Section 3.2. Particular concerns are (1) the overhead of conversion between data formats, and (2) the overheads of memory access to arrays of datatypes that may have non-power-of-two byte width, where memory movements may be overlapping and misaligned. Although these overheads are encountered both when reading and when writing reduced-precision representations, there are important differences between the two cases.

4.1 Reading In Reduced Precision

Modern instruction set architectures typically have native support for data movement using types with power-of-two byte widths, usually 1, 2, 4, and 8 bytes. Since our flyte types differ in width from standard IEEE-754 types by multiples of 8 bits, we can always fetch a flyte with a single read using a wider native type (e.g. a 4-byte read for a flyte24).

Since we propose to store flytes packed consecutively in arrays without padding, the majority of such accesses will be misaligned. Specifically, a consecutive flyte array access will only be aligned to the next largest native type once every lcm(native_bits, flyte_bits) / flyte_bits array elements; for example, for flyte24 accessed with 4-byte reads this is lcm(32, 24)/24 = 4, so only every fourth element is 4-byte aligned. Unaligned access can cause extra cache misses versus aligned access because the accessed item may span a cache line boundary. The strategy of using overlapping accesses at the next largest native type allows us to utilize the vectorization approach of Anderson et al. [1] (Section 4.3) for our vector memory accesses. Also, the conversion process when reading flytes is relatively simple, requiring only that the read data be shifted and padded with zero bits.
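As a concrete scalar illustration of this strategy, the following minimal sketch (ours, assuming the little-endian packed layout of Figure 1) fetches a single flyte24 with one unaligned 4-byte read and widens it to binary32. The read touches one byte beyond the element, so the final element of an array would need separate handling:

    #include <stdint.h>
    #include <string.h>

    static inline float load_flyte24(const unsigned char *p) {
      uint32_t w;
      memcpy(&w, p, 4);   // single unaligned 4-byte read; the top byte
                          // belongs to the following element
      w <<= 8;            // discard the stray byte, zero-pad the low 8 bits
      float f;
      memcpy(&f, &w, 4);  // reinterpret the bit pattern as binary32
      return f;
    }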
4.2 Writing In Reduced Precision

Writing data to a reduced-precision format is more complex than reading, both in terms of the memory movement (since memory locations are modified), and because when writing, the number format narrows and precision is lost.

Loss of precision is a natural consequence of working with floating point numbers. The precise result of a floating point computation can be a real number which cannot be exactly encoded in a finite representation (for example, it might require infinitely many digits). In these cases, loss of precision is necessary to make the result representable. The IEEE-754 standard specifies several methods which are used to round numbers to a nearby representable value. One straightforward way to perform rounding is to simply truncate extra digits (i.e. bits, in a binary representation); this is the standard round-to-zero rounding mode [17]. Other rounding modes are specified by the IEEE-754 standard, including round-to-nearest-even and round-towards-infinity (positive or negative).

4.3 Vectorized Reading and Writing

We propose a compiler-based approach which can greatly reduce the overhead of operating on flyte arrays using automatic vectorization. Our approach uses vector instructions to load, convert, and store flytes. We generate vectorized code to load a number of flyte values at once, and unpack those values in parallel to the next largest IEEE type using vector shuffle and blend instructions. We use vectorization not only to amortize the cost of converting each element (as it does for other operations), but also to help overcome the penalties associated with flyte types' unnaturally aligned memory accesses.

By restricting the size of a flyte to be a multiple of 8 bits, we ensure that widely available, fast byte-level vector reorganization instructions can be used. Finally, when computation is complete, we again use vector instructions to convert back to flyte types. This may involve a rounding step when reducing the precision of a standard IEEE-754 format to a flyte type, followed by packing the data in vector registers before the results are written to memory.

Vectorized loading and storing of packed data elements that do not correspond to any native machine type presents additional challenges over scalar memory movement. Since vector lengths on modern GPPs are usually a power of two bytes, vectorized access to flyte arrays often leads to the splitting of data elements across vector memory operations, where the leading bytes of an element are accessed by one operation and the trailing bytes by the next. Figure 2 displays one such scenario: storing, in flyte24 format, computational results produced in binary32.

    [Figure 2 is a diagram, omitted here. It shows eight f32 values held in 128-bit
    vector registers, the same eight values after rounding to f24 (one per 32-bit
    lane), and finally the eight f24 values packed consecutively, with one element
    straddling the register boundary.]

    Figure 2: Layout of data in 128-bit (16-byte) vector registers: (top) before
    format conversion, (center) after rounding to the desired size, and (bottom)
    the desired layout in memory. Note that the desired memory layout requires
    data elements to straddle the boundary between vector registers.

While vectorized reading is not significantly more complicated than scalar reading, vectorized writing has additional issues. A straightforward approach to vectorized writing of unpadded flyte data could pack as many consecutive flyte elements in a vector register as would fit, and perform a store with the trailing unused byte positions predicated off. Subsequent stores, if there are more than one, could overlap prior stores in these unused positions, so that the data is consecutive in memory. Due to the structure of load/store hardware in GPPs, this approach is likely to be extremely inefficient.

Our vectorized approach to storing values in reduced precision format works by packing all the rounded data elements to be stored consecutively in a set of vector registers, which are mapped to a set of consecutive, non-overlapping vector memory movements (shown in Figure 2). We use a two-phase permute and blend approach. Vector permute instructions are initially used to compact the rounded data in each register into memory order, and to align the contents of some registers so that the leading data element in each register is located at the position of the first unused byte in the previous register. Next, the compacted vector registers are combined using vector blend instructions until a number of fully packed vector registers result.

The resulting registers can be stored without overlap, and data elements are correctly split across register boundaries. If the data written cannot be packed perfectly into full vector registers, some vector stores may be partial stores, with some additional implementation concerns. These are described in detail in the vectorization approach of Anderson et al. [1].
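To make the read direction concrete, the following minimal sketch (ours, not the compiler-generated code described above) unpacks four packed flyte24 values into four binary32 lanes with a single SSSE3 byte shuffle. The 16-byte load deliberately overlaps the following elements, as discussed in Section 4.1; the pack direction uses the inverse shuffle followed by blends:

    #include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

    static inline __m128 load4_flyte24(const unsigned char *p) {
      __m128i raw = _mm_loadu_si128((const __m128i *)p); // unaligned 16-byte load
      // Move each 3-byte element into the top 3 bytes of its 32-bit lane and
      // zero the low byte (a shuffle index with the high bit set yields zero).
      const __m128i shuf = _mm_setr_epi8(
          -1, 0, 1, 2,   -1, 3, 4, 5,
          -1, 6, 7, 8,   -1, 9, 10, 11);
      return _mm_castsi128_ps(_mm_shuffle_epi8(raw, shuf));
    }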
4.4 Controlling Format Conversion

Programming language designers have three conflicting goals when deciding the rules for evaluating floating point expressions. Evaluating expressions at higher precision may result in more accurate answers, which suggests that higher precision should be used if it is available. On the other hand, programmers like to get bit-exact identical results from their program regardless of which compiler is used, which suggests that the language should strictly define the precision of each floating point operation. The IEEE-754 standard stipulates [17, §11] that conforming languages should support reproducible programming, and defines circumstances under which programs should have numerically identical results across compliant platforms. Finally, giving the compiler the freedom to choose precision may result in more efficient code.

Controlling exactly when format conversion occurs is an important part of using non-standard formats such as our flytes. Languages such as C99 [8] provide features aimed at providing consistency in the output of the same program on different platforms. In particular, the programmer can specify the rules determining the precision of intermediate values in expressions using the FLT_EVAL_METHOD facility. The programmer can also use the pragma directive STDC FP_CONTRACT to enable or disable contraction: atomic evaluation of floating point expressions which can use higher precision and omit rounding errors implied by the source code and FLT_EVAL_METHOD [8, §6.5]. Both facilities modify behaviour at the expression level. For example, the programmer can choose to have all subexpressions within an expression tree evaluated at the exact width of the widest operand to each operator (by setting FLT_EVAL_METHOD = 0).

Such strict constraints on the types of all intermediate floating point values can result in very poor performance for flytes. For example, flyte24 is stored in memory as a 24-bit value, but we must convert it to binary32 before performing any operations. In a larger expression of flyte24 values, the generated code is likely to be much faster if all intermediate values can be kept in binary32 rather than being converted down to flyte24 after every operation.

C99 also provides a feature that allows floating point operations to be performed at a higher precision than the source values or result: contraction of floating point expressions, which is controlled by the standard pragma directive FP_CONTRACT. In all our experiments, FP_CONTRACT is on.

One problem not addressed by C99's floating point support is the issue of using higher precision values across statements. C99 requires that the value stored in a variable must be converted to the type of that variable at the point where it is written to storage.

We propose a type qualifier AT_LEAST which is used to tag a floating point type, informing the compiler that it is free to use a higher precision to store results of that type, rather than converting strictly to the precision of the storage type. This allows the use of higher-precision types in code like that in Figure 3, where accumulation into a variable would otherwise result in many lossy conversions.

    flyte24 sum(flyte24 * a, int size)
    {
      AT_LEAST flyte24 sum = 0.0;
      for(int i = 0; i < size; i++) {
        sum = sum + a[i];
        /*
          without AT_LEAST, C99 will truncate
          sum here in every loop iteration
        */
      }
      return sum;
    }

    Figure 3: Example of the use of the AT_LEAST qualifier to allow accumulation
    in a higher precision than the input/output.

The code in Figure 3 shows the variable sum of type AT_LEAST flyte24, which the compiler can represent as any floating point type with at least the precision of a flyte24. In general, the next largest native type is a good choice for a use of the AT_LEAST qualifier.
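To illustrate what the qualifier permits, the following is a sketch (ours) of code a compiler could reasonably generate for Figure 3 when sum is tagged AT_LEAST: the accumulator stays in binary32 across iterations and is narrowed only once, on return. The helper names flyte24_to_float and float_to_flyte24 are hypothetical:

    // Hypothetical lowering of Figure 3's sum() under AT_LEAST (sketch).
    flyte24 sum(flyte24 *a, int size)
    {
      float acc = 0.0f;                        // accumulator kept at binary32
      for (int i = 0; i < size; i++)
        acc = acc + flyte24_to_float(a[i]);    // widen each element on load
      return float_to_flyte24(acc);            // single narrowing conversion
    }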
5. ROUNDING

When numbers represented in IEEE-754 floating point format are used in computations (such as addition and subtraction), the natural result of the computation is often a real number which is not exactly representable in the finite representation. In these cases, the standard specifies a way to round these numbers to a nearby representable value.

Computations on flytes are performed using the next largest standard floating point size. After each operation, the built-in floating point type performs its own rounding. However, a question arises when we convert from IEEE floating point types to flytes: should we round again during conversion? Double rounding is discussed in considerable depth by Boldo and Melquiond [2], and we do not reproduce their arguments here. Note that our aim in this paper is simply to demonstrate that a low-overhead implementation of customized floating point types is feasible on general purpose processors. Any compiler framework which might implement our proposed scheme should take the necessary steps to ensure that the transformations applied to convert between floating point representations are correct; in the remainder of this section, we describe some low-overhead mechanisms which can be used to implement them.

5.1 Round-towards-zero

The simplest approach is to round by truncating the lower mantissa bits, an option which is known as round-towards-zero in IEEE 754. Rounding towards zero is simple to implement, but IEEE floating point has a number of special cases that we must treat correctly.

In IEEE-754 a NaN value has an exponent consisting entirely of ones, and a mantissa value that is non-zero. If the non-zero part of the mantissa is in the lower bits, truncating those bits may cause the entire mantissa to become zero. This would change the value from NaN to a value with all ones in the exponent and a zero mantissa, which represents infinity in IEEE-754. However, there are two types of NaNs: signalling NaNs, which cause a floating point exception, and quiet NaNs, which indicate an invalid value without causing an exception. In IEEE-754 binary floating point formats, quiet NaNs are distinguished from signalling NaNs by the value of the most significant bit of the mantissa, which is preserved by truncation of up to M-1 bits.

In IEEE-754, a subnormal number has a zero exponent, and the mantissa represents a very small fixed-point number. Truncating the final bits of a sub-normal number may cause its value to change to zero. This is correct behaviour, since the non-zero part of the number is too small to represent in our smaller flyte type, and zero is the closest representable value.

5.2 Round-to-nearest

Round-towards-zero is simple to implement, but it can result in large errors. For example, if round-towards-zero were applied in decimal, the number 9.9 would be rounded to 9, rather than 10. The maximum error can be reduced by rounding to the nearest representable number rather than simply truncating. A special case is where the number to be rounded is exactly between two values. To break the tie, IEEE-754 specifies that exact ties should be rounded to the nearest even number.

Figure 4 shows a 32-bit floating point number being rounded to our 24-bit representation using round-to-nearest, where exact ties between two values are rounded to the nearest even value. Rounding to nearest even is expressed in terms of the relationship between three bits (Guard, Round, and Sticky) around the point where the number is rounded. There are no hardware instructions in conventional processors for rounding floating point values to flytes, so we must round in software.

    [Figure 4 is a diagram of the binary32 bit layout (Sign, Exponent, Mantissa)
    with the pre-guard (P), guard (G), round (R), and sticky (S) positions marked
    around the rounding point, together with the following decision table.]

    G  R  S    Result
    0  X  X    Down
    1  0  0    Tie
    1  0  1    Up
    1  1  0    Up
    1  1  1    Up

    Figure 4: Rounding to nearest even. Marked bit positions correspond to Guard,
    Round, and Sticky bits. To avoid overflow into the exponent when rounding to
    nearest even in software, the pre-guard bit (P) must also be inspected.
In the most straightforward case, rounding to nearest even simply involves computing the nearest representable number. An easy way to do this is shown in Figure 5. The code in Figure 5 adds half of a unit of least precision (ULP) in the new, smaller format to the value in the existing, larger format.

    flyte24 round_to_nearest(float num)
    {
      u32 temp = cast_f32_to_u32(num);
      // round by adding 0.5 ULP
      temp = temp + 128;
      // truncate last eight bits
      temp = temp >> 8;
      return cast_u32_to_flyte24(temp);
    }

    Figure 5: Heuristically rounding to nearest even by adding half of a unit of
    least precision (ULP).

As long as the number to be rounded is not exactly between two representable values in the new format, this will result in correct round-to-nearest-even. Implementing precise round-to-nearest-even requires checking for this tied special case, and rounding accordingly. In addition, there are numerous special cases which must be checked to correctly implement IEEE-754-mandated behaviour.
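For reference, precise round-to-nearest-even can be sketched as follows (our sketch, written in the style of Figure 5 and using the guard/round/sticky positions of Figure 4; NaNs, infinities, and exponent overflow still require the separate handling described in Section 5.3):

    flyte24 round_to_nearest_even(float num)
    {
      u32 bits    = cast_f32_to_u32(num);
      u32 guard   = (bits >> 7) & 1;   // most significant discarded bit (G)
      u32 rest    =  bits & 0x7F;      // remaining discarded bits (R and S)
      u32 lsb     = (bits >> 8) & 1;   // least significant kept bit
      u32 rounded =  bits >> 8;        // truncate (round towards zero)
      if (guard && (rest || lsb))      // round up; exact ties go to even
        rounded = rounded + 1;
      return cast_u32_to_flyte24(rounded);
    }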
5.3 Treatment of Special Values

The IEEE-754 floating point standard has special values and ranges, as previously outlined in Section 2. Rounding interacts in different ways with these values and ranges, and behaviour which may be appropriate for some scenarios may not be for others. The application programmer must make a choice of rounding approach based on the information available. We describe the behaviour of each of our proposed rounding approaches here with respect to the IEEE-754 special values and ranges.

5.3.1 Normalized numbers

When a normalized number is being rounded, an infinity occurs when the rounded value is so large that the exponent is all ones after rounding (overflow). This is a natural consequence of conversion from a larger to a smaller finite representation. However, when a normalized number is so small that its exponent is all zeros after rounding (underflow), it does not get rounded directly to zero, but instead to a subnormal number. Underflow is gradual, and a number will underflow to zero only when it is so small that both exponent and mantissa are all zeros after rounding.

Normalized numbers may therefore naturally be rounded to several different classes of value. Very large positive or negative numbers may go to infinity when rounded, and very small numbers may become subnormal. This behaviour conforms to IEEE-754 semantics.

5.3.2 Subnormal numbers

Subnormal numbers are distinguished by a zero exponent, and lack the implied leading 1 in their mantissa. They represent numbers very close to zero. The expected behaviour of format conversion for subnormal numbers is slightly complicated by the issues of overflow and underflow. The closest value in the target representation for a very large subnormal number may be a very small normalized number (overflow), while the closest value for a very small subnormal number may be zero (underflow). For subnormal numbers, rounding may therefore validly cause the class of the value to change, either by underflow to zero or by overflow to a small normalized number.

5.3.3 Infinities

Positive and negative infinities are encoded with an exponent which is all 1s and a zero mantissa. There are only two values in this class, which are distinguished from each other by their sign. The expected behaviour of format conversion for infinities is a correctly signed infinity in the target format.

5.3.4 NaN values

NaN values represent the result of expressions of indeterminate form, which cannot be computed, such as ∞−∞. The expected behaviour of format conversion of a NaN value is a NaN value in the target format. However, NaN is not a singular value but a value range: NaN values occupy a range of bit patterns distinguished by an exponent which is all ones and any nonzero value in the mantissa. Since the mantissa is truncated by conversion to a shorter format, some NaN values cannot be represented after down-conversion, and indeed truncation may cause the non-zero part of the mantissa to be lost entirely.

However, as described in Section 5.1, IEEE-754 non-signalling (or quiet) NaNs always have a 1 in the most significant bit of their mantissa. Thus, although the exact mantissa value of a NaN may change after truncation, a quiet NaN cannot become a non-NaN value through truncation. Signalling NaNs are different: the non-zero bits of a signalling NaN's mantissa may be entirely in the lower bits, so a signalling NaN may be corrupted to become an infinity value by truncation. Thus, signalling NaNs should not be used with flyte values, unless the signal handler is modified to place a non-zero value in the higher-order bits of the NaN.

When using our add-half-an-ULP heuristic, a further problem can arise with non-signalling NaNs. If the mantissa value of the non-signalling NaN is the maximum representable value after truncation, then adding even half an ULP will cause an overflow from the mantissa into the exponent, and potentially into the sign bit, if the exponent is all ones. Our slowest and most correct round-to-nearest mode checks for this case and corrects the value if necessary; our fast heuristic round-to-nearest approach (Figure 5) does not. However, despite extensive testing we have never seen a case where the floating point unit creates such a pathological NaN value as the result of arithmetic.
6. EXPERIMENTAL EVALUATION

We benchmarked the performance of our proposed scalar code implementation from Section 3.2 and the vectorized implementation from Section 4.3. Experiments were run on 64-bit Linux with a 4.2-series kernel, using a machine with 16GB of RAM and an Intel Core i5-3450 (Ivy Bridge) processor. We followed Intel's guidelines for benchmarking short programs on this architecture [12]. Figures 6 and 7 present the results of benchmarking. In our experiments, we use rounding towards zero, FP_CONTRACT is on, and variables which are used to accumulate are declared as AT_LEAST the target precision.

Problem size in the experimental figures refers to the number of elements in arrays in BLAS operations. For vector-vector operations (BLAS Level 1) this is the number of elements in a vector, while for operations involving matrices (BLAS Level 2 and 3) it is the number of elements in a row or column of a square matrix, so that the total number of data elements is the square of the problem size.

    [Figure 6 comprises two panels of bar charts, omitted here: (a) scalar code
    (the approach shown in Figure 1) and (b) 128-bit SSE code (the approach shown
    in Figure 2). Each panel plots normalized performance (clock cycles per data
    element) for double, flyte56, flyte48, flyte40, float, flyte24, and flyte16
    on the BLAS Level 1 routines dot-product, max, min, magnitude, and scale, the
    Level 2 routines gemv and gemv-unroll2, and the Level 3 routines gemm and
    gemm-unroll2.]

    Figure 6: Variation of performance with precision of memory representation
    across a number of BLAS kernels. Performance displayed as normalized execution
    time (cycles per data element); lower is better. Overhead versus native types
    can be read as the difference between any flyte type and the next largest
    native type.

6.1 Overheads: SIMD vs Non-SIMD

The large overhead of scalar access in Figure 6a is due to the design mismatch between our non-standard use case and the typical structure of a GPP scalar datapath, discussed in more detail in Section 3.2. The GPP datapath is ill equipped to handle simultaneous misaligned accesses to data stored in packed non-power-of-two multibyte formats. In contrast, our compiler-based vectorized approach (Figure 6b) marshals and unmarshals this data into consecutive, non-overlapping, power-of-two length accesses, which are the best case for performance using the (SSE) vector datapath.

Overheads in the scalar implementation are in the mid to high tens of cycles per accessed element, while in the vectorized implementation, overheads versus native IEEE-754 types are on the order of a single cycle per data element. In computationally heavy programs like magnitude this overhead is effectively hidden by instruction-level parallelism (Figure 6b).

The relatively high overhead of flyte40 and flyte24 in the scale benchmark in Figure 6b is due to two factors. First, the alignment of the accesses is odd (5 and 3 bytes, respectively), meaning that data must be shuffled between vectors rather than simply within vectors, as with other types. Furthermore, the benchmark performs an in-place update of the data, where reads and writes overlap, which reduces the available instruction-level parallelism. However, the overhead is still on the order of 3 cycles per data element in the worst case, which may be perfectly acceptable in many scenarios in return for a 37.5% reduction in memory traffic. Indeed, as can be seen from our results in Figure 7a, flyte40 is only marginally slower overall considering BLAS Level 1 programs, and significantly reduces second-last and last-level cache misses for BLAS Level 2 and Level 3 programs.

6.2 Effect of Unrolling Loops

The benchmark gemv-unroll2 in Figure 6b is the BLAS Level 2 GEMV kernel unrolled twice to increase the amount of data movement per vectorized loop iteration. In this benchmark, data transfer accounts for a large portion of total execution time. The benchmark demonstrates that a win-win is possible: the choice of a reduced-precision memory representation can actually increase overall performance versus the next largest IEEE-754 type, while also reducing memory requirements, even on a general purpose processor without special hardware support for non-standard floating point.

The benchmark gemm-unroll2 in Figure 6b is the BLAS Level 3 GEMM kernel unrolled twice to increase the amount of data movement per vectorized loop iteration. In effect, unrolling is equivalent to a simple 1-dimensional loop tiling with tile size 2×VF, where VF is the vectorization factor. In this benchmark too, data transfer accounts for a large portion of total execution time. However, unlike GEMV, GEMM exhibits a slowdown when unrolled. We inspected the performance data and found that the number of last-level cache misses was significantly elevated when GEMM was unrolled. In this case, unrolling simply introduces too much data movement in the inner loop. It is likely that less naïve tilings could mitigate this effect; however, we aim only to show the effect of using reduced precision types.
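As an illustration of the 2×VF tiling (our sketch, shown on a simple in-place scale kernel over plain binary32 data rather than GEMV or GEMM for brevity; the flyte variants additionally unpack on load and pack on store):

    #include <xmmintrin.h>

    void scale_unroll2(float *x, float a, int n) {
      __m128 va = _mm_set1_ps(a);
      int i = 0;
      for (; i + 8 <= n; i += 8) {              // one tile = 2 x VF = 8 elements
        __m128 v0 = _mm_loadu_ps(x + i);
        __m128 v1 = _mm_loadu_ps(x + i + 4);
        _mm_storeu_ps(x + i,     _mm_mul_ps(v0, va));
        _mm_storeu_ps(x + i + 4, _mm_mul_ps(v1, va));
      }
      for (; i < n; i++)                        // scalar remainder
        x[i] *= a;
    }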
6.3 Effect on Cache Behaviour

The primary effect of using shorter data representations is seen in the behaviour of the second-last and last-level caches. Using a shorter data representation means that each individual memory operation (i.e. a cache line read or write from DRAM) stores or retrieves a larger proportion of the data elements, resulting in fewer cache misses overall. Reducing the number of last-level cache misses in particular has a large effect on performance, and on energy efficiency [7].

6.3.1 BLAS Level 1

Figure 7 displays a summary of the absolute performance (cycles per data element) as well as the second-last and last-level cache behaviour for our benchmark programs. As in Figure 6, the programs are divided into three categories, one for each BLAS level.

    [Figure 7 comprises three rows of line plots, omitted here, with one column per
    BLAS level and problem size on the x-axis: (a) normalized absolute performance
    (clock cycles per element processed, geometric mean across all benchmarks in
    the level), (b) second-last (L2) cache misses per element processed, and (c)
    last-level cache misses per element processed. Each plot compares double,
    flyte56, flyte48, flyte40, float, flyte24, and flyte16; lower is better.]

    Figure 7: Summary of absolute performance (cycles per processed element) and
    cache behaviour (cache misses at L2 and L3) for our vectorized BLAS benchmark
    programs as problem size increases. Performance for each BLAS level is measured
    as the geometric mean performance across all the programs in that level. The
    programs in each BLAS level are shown in Figure 6b.

For BLAS Level 1 programs, we see that the variation in absolute performance remains small as problem size increases. BLAS Level 1 programs are mostly compute-bound, so using smaller types does not significantly affect performance. Figures 7b and 7c show that, for BLAS Level 1, there is very little difference in the number of second-last and last-level cache misses from using shorter types.

6.3.2 BLAS Level 2

For BLAS Level 2 programs, we see that initially there is little variation in performance at small problem sizes. However, as the problem size increases, the effect of using smaller types becomes apparent: we see a large variation in performance once the problem size exceeds the capacity of the L1 cache.

In Figure 7a, we see that for BLAS Level 2 programs, float and double initially outperform their reduced-precision representations, but once the problem size grows large enough, the situation is reversed. Moreover, smaller representations outperform larger ones in general. At the largest problem size in our experiments, GEMV on flyte40 outperforms GEMV on double by 23.5% (Figure 7a, center graph).

Figures 7b and 7c display the large reduction in second-last and last-level cache misses for BLAS Level 2 programs. In some cases, the reduction is as much as 6× fewer last-level cache misses (compare flyte40 and double in Figure 7c, center graph).
6.3.3 BLAS Level 3

For BLAS Level 3, we again see that using smaller types results in many fewer second-last and last-level cache misses, although the reduction is less pronounced for flyte40: 5-byte accesses are frequently split across cache lines, causing two misses, which offsets the reduction in misses from simply transferring less data overall. However, we again see that performance closely tracks cache behaviour; our straightforward implementation of GEMM is heavily memory-bound, so this is not surprising. Overall, we see a significant reduction in cache misses for smaller types: as much as 4.5× fewer last-level cache misses comparing flyte56 and double (Figure 7c, rightmost graph).

7. RELATED WORK

Much prior work discusses reduced precision floating point [3,10,13]. Jenkins et al. [9] evaluate a reduced-precision scheme using GPPs in an extreme-scale computing context. They do not utilize SIMD, address only reads, and convert in a pre-pass.

Many approaches use FPGAs or otherwise customized hardware; notably Tong et al. [16], who propose customizing ALUs to support short-mantissa representations. More recently, De Dinechin et al. [5] also propose custom hardware to support reduced precision. Ou et al. accelerate mixed-precision floating point using a vector processor with a customized datapath [11]. Our approach, since it targets GPPs, is necessarily less flexible than FPGA or custom hardware based approaches. However, it is precisely because GPPs are so widely deployed that reduced precision support on GPPs is attractive.

Prior work on loading only portions of floating point numbers by Jenkins et al. [9] does not perform an explicit rounding step, but directly truncates values. They perform a byte-level transpose on a matrix of floating point numbers stored in memory, chopping off some number of trailing bytes of each number. This conversion approach is semantically equivalent to rounding to zero. Jenkins et al. evaluate the error introduced by rounding to zero and find it acceptable for their purposes. In contrast, using our proposed techniques, the rounding step in Figure 2 can be implemented in any way which is suitable for the needs of the application, including rounding towards zero, rounding to nearest even, or rounding to odd as proposed by Boldo and Melquiond [2].

8. CONCLUSION

In this paper we propose flytes, a scheme representing floating-point data in memory at precisions along a continuum between IEEE-754 types, converting to and from standard IEEE-754 types to perform computations. We propose a method for converting between IEEE-754 floating point and flytes, and show how it can be accelerated using vectorization on general purpose processors, without requiring special hardware support. Our proposed technique handles both reads and writes, and supports reduced precision floating point memory representations with very low overhead.

Our investigation shows that reducing the precision of floating point data in memory, and using SIMD operations as the basis of a compiler-accelerated scheme for performing conversions, presents a low-overhead path to supporting customized floating point on commodity general purpose processors.

Acknowledgements

We would like to thank Ayal Zaks for many helpful comments on an initial draft of this paper. We would also like to thank the reviewers of PACT 2016 for their close attention, which helped us greatly in improving the presentation. This work was supported by Science Foundation Ireland grant 12/IA/1381, and also by Science Foundation Ireland grant 10/CE/I1855 to Lero, the Irish Software Research Centre (www.lero.ie).
9. REFERENCES

[1] A. Anderson, A. Malik, and D. Gregg. Automatic vectorization of interleaved data revisited. ACM Transactions on Architecture and Code Optimization (TACO), 12(4):50, 2015.
[2] S. Boldo and G. Melquiond. When double rounding is odd. In 17th IMACS World Congress, page 11, 2004.
[3] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, J. Langou, P. Luszczek, and S. Tomov. Exploiting mixed precision floating point hardware in scientific computations. In High Performance Computing Workshop, 2006.
[4] W. J. Dally, J. Balfour, D. Black-Shaffer, J. Chen, R. C. Harting, V. Parikh, J. Park, and D. Sheffield. Efficient embedded computing. IEEE Computer, (7), 2008.
[5] F. De Dinechin, C. Klein, and B. Pasca. Generating high-performance custom floating-point pipelines. In International Conference on Field Programmable Logic and Applications. IEEE, 2009.
[6] P. R. Dixon, T. Oonishi, and S. Furui. Fast acoustic computations using graphics processors. In International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009.
[7] J. Gustafson. Exascale: Power, cooling, reliability, and future arithmetic. HPC User Forum, Seattle, September 2010.
[8] British Standards Institution. The C standard: incorporating Technical Corrigendum 1: BS ISO/IEC 9899:1999. John Wiley, 2003.
[9] J. Jenkins, E. R. Schendel, S. Lakshminarasimhan, D. Boyuka, T. Rogers, S. Ethier, R. Ross, S. Klasky, N. F. Samatova, et al. Byte-precision level of detail processing for variable precision analytics. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 2012.
[10] M. O. Lam, J. K. Hollingsworth, B. R. de Supinski, and M. P. LeGendre. Automatically adapting programs for mixed-precision floating-point computation. In International Conference on Supercomputing. ACM, 2013.
[11] A. Ou, K. Asanovic, and V. Stojanovic. Mixed precision vector processors. Technical Report No. UCB/EECS-2015-265, 2015.
[12] G. Paoloni. How to benchmark code execution times on Intel IA-32 and IA-64 instruction set architectures. Intel Corporation, 2010.
[13] C. Rubio-González, C. Nguyen, H. D. Nguyen, J. Demmel, W. Kahan, K. Sen, D. H. Bailey, C. Iancu, and D. Hough. Precimonious: Tuning assistant for floating-point precision. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC). ACM, 2013.
[14] E. Schkufza, R. Sharma, and A. Aiken. Stochastic optimization of floating-point programs with tunable precision. ACM SIGPLAN Notices, 49(6), 2014.
[15] N. Sidwell and J. Myers. Improving software floating point support. In Proceedings of the GCC Developers' Summit, 2006.
[16] J. Y. F. Tong, D. Nagle, R. Rutenbar, et al. Reducing power by optimizing the necessary precision/range of floating-point arithmetic. IEEE Transactions on Very Large Scale Integration Systems, 8(3), 2000.
[17] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo, et al. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, 2008.