Table of Contents

Handbook of Floating-Point Arithmetic
Second Edition

Jean-Michel Muller • Nicolas Brunie • Florent de Dinechin • Claude-Pierre Jeannerod • Mioara Joldes • Vincent Lefèvre • Guillaume Melquiond • Nathalie Revol • Serge Torres

Jean-Michel Muller, CNRS-LIP, Lyon, France
Nicolas Brunie, Kalray, Grenoble, France
Florent de Dinechin, INSA-Lyon-CITI, Villeurbanne, France
Claude-Pierre Jeannerod, Inria-LIP, Lyon, France
Mioara Joldes, CNRS-LAAS, Toulouse, France
Vincent Lefèvre, Inria-LIP, Lyon, France
Guillaume Melquiond, Inria-LRI, Orsay, France
Nathalie Revol, Inria-LIP, Lyon, France
Serge Torres, ENS-Lyon-LIP, Lyon, France

ISBN 978-3-319-76525-9
ISBN 978-3-319-76526-6 (eBook)
https://doi.org/10.1007/978-3-319-76526-6
Library of Congress Control Number: 2018935254
Mathematics Subject Classification: 65Y99, 68N30
© Springer International Publishing AG, part of Springer Nature 2010, 2018

Contents

List of Figures
List of Tables
Preface

Part I  Introduction, Basic Definitions, and Standards

1 Introduction
  1.1 Some History
  1.2 Desirable Properties
  1.3 Some Strange Behaviors
    1.3.1 Some famous bugs
    1.3.2 Difficult problems

2 Definitions and Basic Notions
  2.1 Floating-Point Numbers
    2.1.1 Main definitions
    2.1.2 Normalized representations, normal and subnormal numbers
    2.1.3 A note on underflow
    2.1.4 Special floating-point data
  2.2 Rounding
    2.2.1 Rounding functions
    2.2.2 Useful properties
  2.3 Tools for Manipulating Floating-Point Errors
    2.3.1 Relative error due to rounding
    2.3.2 The ulp function
    2.3.3 Link between errors in ulps and relative errors
    2.3.4 An example: iterated products
  2.4 The Fused Multiply-Add (FMA) Instruction
  2.5 Exceptions
  2.6 Lost and Preserved Properties of Real Arithmetic
  2.7 Note on the Choice of the Radix
    2.7.1 Representation errors
    2.7.2 A case for radix 10
  2.8 Reproducibility

3 Floating-Point Formats and Environment
  3.1 The IEEE 754-2008 Standard
    3.1.1 Formats
    3.1.2 Attributes and rounding
    3.1.3 Operations specified by the standard
    3.1.4 Comparisons
    3.1.5 Conversions to/from string representations
    3.1.6 Default exception handling
    3.1.7 Special values
    3.1.8 Recommended functions
  3.2 On the Possible Hidden Use of a Higher Internal Precision
  3.3 Revision of the IEEE 754-2008 Standard
  3.4 Floating-Point Hardware in Current Processors
    3.4.1 The common hardware denominator
    3.4.2 Fused multiply-add
    3.4.3 Extended precision and 128-bit formats
    3.4.4 Rounding and precision control
    3.4.5 SIMD instructions
    3.4.6 Binary16 (half-precision) support
    3.4.7 Decimal arithmetic
    3.4.8 The legacy x87 processor
  3.5 Floating-Point Hardware in Recent Graphics Processing Units
  3.6 IEEE Support in Programming Languages
  3.7 Checking the Environment
    3.7.1 MACHAR
    3.7.2 Paranoia
    3.7.3 UCBTest
    3.7.4 TestFloat
    3.7.5 Miscellaneous

Part II  Cleverly Using Floating-Point Arithmetic

4 Basic Properties and Algorithms
  4.1 Testing the Computational Environment
    4.1.1 Computing the radix
    4.1.2 Computing the precision
  4.2 Exact Operations
    4.2.1 Exact addition
    4.2.2 Exact multiplications and divisions
  4.3 Accurate Computation of the Sum of Two Numbers
    4.3.1 The Fast2Sum algorithm
    4.3.2 The 2Sum algorithm
    4.3.3 If we do not use rounding to nearest
  4.4 Accurate Computation of the Product of Two Numbers
    4.4.1 The 2MultFMA Algorithm
    4.4.2 If no FMA instruction is available: Veltkamp splitting and Dekker product
  4.5 Computation of Residuals of Division and Square Root with an FMA
  4.6 Another splitting technique: splitting around a power of 2
  4.7 Newton–Raphson-Based Division with an FMA
    4.7.1 Variants of the Newton–Raphson iteration
    4.7.2 Using the Newton–Raphson iteration for correctly rounded division with an FMA
    4.7.3 Possible double roundings in division algorithms
  4.8 Newton–Raphson-Based Square Root with an FMA
    4.8.1 The basic iterations
    4.8.2 Using the Newton–Raphson iteration for correctly rounded square roots
  4.9 Radix Conversion
    4.9.1 Conditions on the formats
    4.9.2 Conversion algorithms
  4.10 Conversion Between Integers and Floating-Point Numbers
    4.10.1 From 32-bit integers to floating-point numbers
    4.10.2 From 64-bit integers to floating-point numbers
    4.10.3 From floating-point numbers to integers
  4.11 Multiplication by an Arbitrary-Precision Constant with an FMA
  4.12 Evaluation of the Error of an FMA

5 Enhanced Floating-Point Sums, Dot Products, and Polynomial Values
  5.1 Preliminaries
    5.1.1 Floating-point arithmetic models
    5.1.2 Notation for error analysis and classical error estimates
    5.1.3 Some refined error estimates
    5.1.4 Properties for deriving validated running error bounds
  5.2 Computing Validated Running Error Bounds
  5.3 Computing Sums More Accurately
    5.3.1 Reordering the operands, and a bit more
    5.3.2 Compensated sums
    5.3.3 Summation algorithms that somehow imitate a fixed-point arithmetic
    5.3.4 On the sum of three floating-point numbers
  5.4 Compensated Dot Products
  5.5 Compensated Polynomial Evaluation

6 Languages and Compilers
  6.1 A Play with Many Actors
    6.1.1 Floating-point evaluation in programming languages
    6.1.2 Processors, compilers, and operating systems
    6.1.3 Standardization processes
    6.1.4 In the hands of the programmer
  6.2 Floating Point in the C Language
    6.2.1 Standard C11 headers and IEEE 754-1985 support
    6.2.2 Types
    6.2.3 Expression evaluation
    6.2.4 Code transformations
    6.2.5 Enabling unsafe optimizations
    6.2.6 Summary: a few horror stories
    6.2.7 The CompCert C compiler
  6.3 Floating-Point Arithmetic in the C++ Language
    6.3.1 Semantics
    6.3.2 Numeric limits
    6.3.3 Overloaded functions
  6.4 FORTRAN Floating Point in a Nutshell
    6.4.1 Philosophy
    6.4.2 IEEE 754 support in FORTRAN
  6.5 Java Floating Point in a Nutshell
    6.5.1 Philosophy
    6.5.2 Types and classes
    6.5.3 Infinities, NaNs, and signed zeros
    6.5.4 Missing features
    6.5.5 Reproducibility
    6.5.6 The BigDecimal package
  6.6 Conclusion

Part III  Implementing Floating-Point Operators

7 Algorithms for the Basic Operations
  7.1 Overview of Basic Operation Implementation
  7.2 Implementing IEEE 754-2008 Rounding
    7.2.1 Rounding a nonzero finite value with unbounded exponent range
    7.2.2 Overflow
    7.2.3 Underflow and subnormal results
    7.2.4 The inexact exception
    7.2.5 Rounding for actual operations
  7.3 Floating-Point Addition and Subtraction
    7.3.1 Decimal addition
    7.3.2 Decimal addition using binary encoding
    7.3.3 Subnormal inputs and outputs in binary addition
  7.4 Floating-Point Multiplication
    7.4.1 Normal case
    7.4.2 Handling subnormal numbers in binary multiplication
    7.4.3 Decimal specifics
  7.5 Floating-Point Fused Multiply-Add
    7.5.1 Case analysis for normal inputs
    7.5.2 Handling subnormal inputs
    7.5.3 Handling decimal cohorts
    7.5.4 Overview of a binary FMA implementation
  7.6 Floating-Point Division
    7.6.1 Overview and special cases
    7.6.2 Computing the significand quotient
    7.6.3 Managing subnormal numbers
    7.6.4 The inexact exception
    7.6.5 Decimal specifics
  7.7 Floating-Point Square Root
    7.7.1 Overview and special cases
    7.7.2 Computing the significand square root
    7.7.3 Managing subnormal numbers
    7.7.4 The inexact exception
    7.7.5 Decimal specifics
  7.8 Nonhomogeneous Operators
    7.8.1 A software algorithm around double rounding
    7.8.2 The mixed-precision fused multiply-and-add
    7.8.3 Motivation
    7.8.4 Implementation issues

8 Hardware Implementation of Floating-Point Arithmetic
  8.1 Introduction and Context
    8.1.1 Processor internal formats
    8.1.2 Hardware handling of subnormal numbers
    8.1.3 Full-custom VLSI versus reconfigurable circuits (FPGAs)
    8.1.4 Hardware decimal arithmetic
    8.1.5 Pipelining
  8.2 The Primitives and Their Cost
    8.2.1 Integer adders
    8.2.2 Digit-by-integer multiplication in hardware
    8.2.3 Using nonstandard representations of numbers
    8.2.4 Binary integer multiplication
    8.2.5 Decimal integer multiplication
    8.2.6 Shifters
    8.2.7 Leading-zero counters
    8.2.8 Tables and table-based methods for fixed-point function approximation
  8.3 Binary Floating-Point Addition
    8.3.1 Overview
    8.3.2 A first dual-path architecture
    8.3.3 Leading-zero anticipation
    8.3.4 Probing further on floating-point adders
  8.4 Binary Floating-Point Multiplication
    8.4.1 Basic architecture
    8.4.2 FPGA implementation
    8.4.3 VLSI implementation optimized for delay
    8.4.4 Managing subnormals
  8.5 Binary Fused Multiply-Add
    8.5.1 Classic architecture
    8.5.2 To probe further
  8.6 Division and Square Root
    8.6.1 Digit-recurrence division
    8.6.2 Decimal division
  8.7 Beyond the Classical Floating-Point Unit
    8.7.1 More fused operators
    8.7.2 Exact accumulation and dot product
    8.7.3 Hardware-accelerated compensated algorithms
  8.8 Floating-Point for FPGAs
    8.8.1 Optimization in context of standard operators
    8.8.2 Operations with a constant operand
    8.8.3 Computing large floating-point sums
    8.8.4 Block floating point
    8.8.5 Algebraic operators
    8.8.6 Elementary and compound functions
  8.9 Probing Further

9 Software Implementation of Floating-Point Arithmetic
  9.1 Implementation Context
    9.1.1 Standard encoding of binary floating-point data
    9.1.2 Available integer operators
    9.1.3 First examples
    9.1.4 Design choices and optimizations
  9.2 Binary Floating-Point Addition
    9.2.1 Handling special values
    9.2.2 Computing the sign of the result
    9.2.3 Swapping the operands and computing the alignment shift
    9.2.4 Getting the correctly rounded result
  9.3 Binary Floating-Point Multiplication
    9.3.1 Handling special values
    9.3.2 Sign and exponent computation
    9.3.3 Overflow detection
    9.3.4 Getting the correctly rounded result
  9.4 Binary Floating-Point Division
    9.4.1 Handling special values
    9.4.2 Sign and exponent computation
    9.4.3 Overflow detection
    9.4.4 Getting the correctly rounded result
  9.5 Binary Floating-Point Square Root
    9.5.1 Handling special values
    9.5.2 Exponent computation
    9.5.3 Getting the correctly rounded result
  9.6 Custom Operators

10 Evaluating Floating-Point Elementary Functions
  10.1 Introduction
    10.1.1 Which accuracy?
    10.1.2 The various steps of function evaluation
  10.2 Range Reduction
    10.2.1 Basic range reduction algorithms
    10.2.2 Bounding the relative error of range reduction
    10.2.3 More sophisticated range reduction algorithms
    10.2.4 Examples
  10.3 Polynomial Approximations
    10.3.1 L2 case
    10.3.2 L∞, or minimax, case
    10.3.3 “Truncated” approximations
    10.3.4 In practice: using the Sollya tool to compute constrained approximations and certified error bounds
  10.4 Evaluating Polynomials
    10.4.1 Evaluation strategies
    10.4.2 Evaluation error
  10.5 The Table Maker’s Dilemma
    10.5.1 When there is no need to solve the TMD
    10.5.2 On breakpoints
    10.5.3 Finding the hardest-to-round points
  10.6 Some Implementation Tricks Used in the CRlibm Library
    10.6.1 Rounding test
    10.6.2 Accurate second step
    10.6.3 Error analysis and the accuracy/performance tradeoff
    10.6.4 The point with efficient code

Part IV  Extensions

11 Complex Numbers
  11.1 Introduction
  11.2 Componentwise and Normwise Errors
  11.3 Computing ad ± bc with an FMA
  11.4 Complex Multiplication
    11.4.1 Complex multiplication without an FMA instruction
    11.4.2 Complex multiplication with an FMA instruction
  11.5 Complex Division
    11.5.1 Error bounds for complex division
    11.5.2 Scaling methods for avoiding over-/underflow in complex division
  11.6 Complex Absolute Value
    11.6.1 Error bounds for complex absolute value
    11.6.2 Scaling for the computation of complex absolute value
  11.7 Complex Square Root
    11.7.1 Error bounds for complex square root
    11.7.2 Scaling techniques for complex square root
  11.8 An Alternative Solution: Exception Handling

Handbook of Floating-Point Arithmetic (PDF)

638 Pages·2018·3.11 MB·English

by Brunie
