Chapter 2

Roundoff Errors

This chapter is concerned with a fundamental source of imperfection in numerical computing: roundoff errors. Such errors arise due to the intrinsic limitation of the finite precision representation of numbers (except for a restricted set of integers) in computers.

Different audiences may well require different levels of depth and detail on the present topic. If you want just the essentials, read Section 2.1; then you may skip the rest of this chapter on a first reading. What you must accept, then, is that the representation of each real number and each elementary operation (such as + or *) in floating point arithmetic introduces a small, "random" relative error: up to about the rounding unit in today's standard floating point systems. Several technical issues that are only mentioned in Section 2.1 are explained in Section 2.2, where the details of floating point systems are given.

The small representation errors as well as errors that arise upon carrying out elementary operations such as addition and multiplication are typically harmless unless they accumulate or get magnified in an unfortunate way during the course of a possibly long calculation. We discuss roundoff error accumulation as well as ways to avoid or reduce such effects in Section 2.3. Finally, in Section 2.4, the IEEE standard for floating point arithmetic, which is implemented in any nonspecialized hardware, is briefly described.

2.1 The essentials

This section summarizes what we believe all our students should know as a minimum about floating point arithmetic. Let us start with a motivating example.

Example 2.1. Scientists and engineers often wish to believe that the numerical results of a computer calculation, especially those obtained as output of a software package, contain no error, or at least not a significant or intolerable one. But careless numerical computing does occasionally lead to trouble.
Note: The word "essentials" is not synonymous with "easy". If you find the description below too terse for comfort, then please read on into the rest of the chapter for more motivation, detail, and explanation.

One of the more spectacular disasters was the Patriot missile failure in Dhahran, Saudi Arabia, on February 25, 1991, which resulted in 28 deaths. This failure was ultimately traced back to poor handling of roundoff errors in the missile's software. The web page http://www.ima.umn.edu/~arnold/disasters/patriot.html contains the details of this story. For a large collection of software bugs, see http://wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugse.html.

Computer representation of real numbers

Computer memory has a finite capacity. This obvious fact has far-reaching implications on the representation of real numbers, which in general do not have a finite uniform representation. How should we then represent a real number on the computer in a way that can be sensibly implemented in hardware?

Any real number x is accurately representable by an infinite sequence of digits.⁴ Thus, we can write

    x = ±(1.d_1 d_2 d_3 ... d_{t-1} d_t d_{t+1} ...) × 2^e,

where e is an integer exponent. The (possibly infinitely many) binary digits d_i each have a value 0 or 1. The decimal value of this mantissa is

    1.d_1 d_2 d_3 ... = 1 + d_1/2 + d_2/2^2 + d_3/2^3 + ...

For instance, the binary representation x = -(1.10100...) × 2^1 has in decimal the value x = -(1 + 1/2 + 1/8) × 2 = -3.25.

Of course, it should be clear that the choice of a binary representation is just one of many possibilities, indeed a convenient choice when it comes to computers. To represent any real number on the computer, we associate to x a floating point representation fl(x) of a form similar to that of x but with only t digits, so

    fl(x) = sign(x) × (1.d~_1 d~_2 d~_3 ... d~_{t-1} d~_t) × 2^e

for some t that is fixed in advance and binary digits d~_i that relate to d_i in a manner that will soon be specified. Storing this fraction in memory thus requires t bits.
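As a quick check of the arithmetic above, the decimal value of a signed binary mantissa and exponent can be evaluated directly. The sketch below is an illustration of ours in Python rather than the book's MATLAB, and the helper name binary_value is hypothetical:

```python
# Illustrative helper (not from the text): evaluate the decimal value of
# x = sign * (1.d1 d2 d3 ...) * 2^e, with the implied leading bit 1.

def binary_value(sign, mantissa_bits, e):
    """sign is +1 or -1; mantissa_bits lists d1, d2, ...; e is the exponent."""
    value = 1.0                       # the implied leading 1
    for i, d in enumerate(mantissa_bits, start=1):
        value += d / 2.0**i           # bit d_i contributes d_i / 2^i
    return sign * value * 2.0**e

x = binary_value(-1, [1, 0, 1, 0, 0], 1)   # -(1.10100) x 2^1
```

For the text's example, x evaluates to -(1 + 1/2 + 1/8) × 2 = -3.25.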
The exponent e must also be stored in a fixed number of bits and therefore must be bounded, say, between a lower bound L and an upper bound U. Further details are given in Section 2.2, which we hope will provide you with much of what you'd like to know.

Rounding unit and standard floating point system

Without any further details it should already be clear that representing x by fl(x) necessarily causes an error. A central question is how accurate the floating point representation of the real numbers is. Modern floating point systems, such as the celebrated IEEE standard described in detail in Section 2.4, guarantee that this relative error is bounded:

    |fl(x) - x| / |x| <= η.

The quantity η, which will be more generally defined in Section 2.2, has been called machine precision, machine epsilon, and more. We will use the term rounding unit.

The usual floating point word in the standard floating point system, which is what MATLAB uses by default, has 64 bits. Of these, 52 bits are devoted to the mantissa (or fraction), while the rest store the sign and the exponent. For this double precision word the rounding unit is η ≈ 1.1e-16.

Arithmetic operations on floating point numbers also introduce roundoff errors. These errors can be quite large in the relative sense, unless guard digits are used; these are extra digits that are carried in interim calculations. The IEEE standard requires exact rounding, which yields that the relative error in each arithmetic operation is also bounded by η.

⁴It is known from calculus that the set of all rational numbers in a given real interval is dense in that interval. This means that any number in the interval, rational or not, can be approached to arbitrary accuracy by a sequence of rational numbers.
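The bound |fl(x) - x|/|x| <= η can be checked exactly for the standard double precision word, where η = 2^{-53}. Here is a small sketch of ours in Python (the book itself works in MATLAB), using exact rational arithmetic so that no further rounding pollutes the measurement:

```python
# Measure the relative representation error of fl(0.1) exactly.
from fractions import Fraction

eta = Fraction(1, 2**53)      # rounding unit of the 64-bit double format
x_exact = Fraction(1, 10)     # the real number 0.1
x_stored = Fraction(0.1)      # exact rational value of the double nearest 0.1
rel_err = abs(x_stored - x_exact) / x_exact   # |fl(x) - x| / |x|
```

Here rel_err comes out to about 5.5e-17, safely below η ≈ 1.1e-16.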
Given the above soothing words about errors remaining small after representing a number and performing an arithmetic operation, can we really put our minds at ease and count on a long and intense calculation to be as accurate as we want it to be?

Not quite! We have already seen in Example 1.3 that unpleasant surprises may arise. Let us mention a few potential traps. The fuller version is in Section 2.3.

• Careless computations can sometimes result in division by 0 or another form of undefined numerical result. The corresponding variable name is then assigned the infamous designation NaN. This is not necessarily disastrous for one's own calculations, but it allows software to detect problematic situations, such as an attempt to divide 0 by 0, and do something graceful instead of just halting. We have a hunch that you will inadvertently encounter a few NaN's before you finish implementing all the algorithms in this book.

• There is also a potential for an overflow, which occurs when a number is larger than the largest that can be represented in the floating point system. Occasionally this can be avoided, as in Example 2.9 and other examples in Section 2.3.

• We have already mentioned in Section 1.3 that a roundoff error accumulation that grows linearly with the number of operations in a calculation is inevitable. Rarely will a calculation be so long that this sort of error surfaces as a practical issue. Still, there are more pitfalls to watch out for. One painful type of error magnification is a cancellation error, which occurs when two nearly equal numbers are subtracted from one another. There are several examples of this in Section 2.3. Furthermore, the discussion in this chapter suggests that such errors may consistently arise in practice.

The rough appearance of roundoff error

Consider a smooth function g, sampled at some t and at t + h for a small (i.e., near 0) value h.
Continuity then implies that the values g(t) and g(t + h) are close to each other. But the rounding errors in the machine representation of these two numbers are unrelated to each other: they are random for all we know! These rounding errors are both small (that's what η is for), but even their signs may in fact differ. So when subtracting g(t + h) - g(t), for instance, on our way to estimate the derivative of g at t, the relative roundoff error becomes significantly larger as cancellation error naturally arises. This is apparent already in Figure 1.3.

Let us take a further look at the seemingly unstructured behavior of roundoff error as a smooth function is being sampled.

Example 2.2. We evaluate g(t) = e^{-t} (sin(2πt) + 2) at 501 equidistant points between 0 and 1, using the usual double precision as well as single precision, and plot the differences. This essentially gives the rounding error in the less accurate single precision evaluations. The following MATLAB instructions do the job:

    t = 0:.002:1;
    tt = exp(-t) .* (sin(2*pi*t)+2);
    rt = single(tt);
    round_err = (tt - rt) ./ tt;
    plot(t,round_err,'b-');
    title('error in sampling exp(-t)(sin(2\pi t)+2) single precision')
    xlabel('t')
    ylabel('roundoff error')
    % relative error should be about eta = eps('single')/2
    rel_round_err = max(abs(round_err)) / (eps('single')/2)

Thus, the definition of the array values tt is automatically implemented in double precision, while the instruction single when defining the array rt records the corresponding values in single precision.

The resulting plot is depicted in Figure 2.2. Note the disorderly, "high frequency" oscillation of the roundoff error. This is in marked contrast to discretization errors, which are usually "smooth," as we have seen in Example 1.2. (Recall the straight line drop of the error in Figure 1.3 for relatively large h, which is where the discretization error dominates.)
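For readers without MATLAB, essentially the same experiment can be run in pure Python by round-tripping each double precision sample through the 32-bit float format (struct does the conversion); the single precision rounding unit is 2^{-24} ≈ 6e-8. This is our illustrative translation, not the book's code:

```python
import math
import struct

def to_single(x):
    """Round a double to the nearest 32-bit float and back."""
    return struct.unpack('f', struct.pack('f', x))[0]

eta_single = 2.0**-24
rel_errs = []
for k in range(501):                  # 501 equidistant points in [0, 1]
    t = 0.002 * k
    tt = math.exp(-t) * (math.sin(2 * math.pi * t) + 2)   # double precision
    rt = to_single(tt)                                    # single precision
    rel_errs.append(abs(tt - rt) / tt)

ratio = max(rel_errs) / eta_single    # should be close to, and at most, 1
```

As in the text, the largest observed relative error lands just below the single precision rounding unit.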
The output of this MATLAB program indicates that, as expected, the relative error is at about the rounding unit level. The latter is obtained by the (admittedly unappealing) function call eps('single')/2, which yields roughly the value η_single = 6e-8. The last line in the script yields approximately the value 0.97, which is indeed close to 1. ∎

Figure 2.2. The "almost random" nature of roundoff errors.

Note: The rest of this chapter is more detailed and technical than the essentials presented thus far.

2.2 Floating point systems

A floating point system is characterized by four values (β, t, L, U), where

    β = base of the number system;
    t = precision (# of digits);
    L = lower bound on exponent e;
    U = upper bound on exponent e.

Generalizing the binary representation of Section 2.1, we write

    fl(x) = ±(d_0 + d_1/β + d_2/β^2 + ... + d_{t-1}/β^{t-1}) × β^e,

where β is an integer larger than 1 referred to as the base and the d_i are integer digits in the range 0 <= d_i <= β - 1. Thus, fl(x) is an approximation of x. The representation is made unique by normalizing it: the exponent e is adjusted so that leading zeros are dropped and d_0 ≠ 0. Note, however, that unlike the situation in Section 2.1 this does not fix the value of d_0, which therefore must be stored, too. This is why the last stored digit has index t - 1 and not t as in Section 2.1. In addition, e must be in the range L <= e <= U.

Chopping and rounding

To store x = ±(d_0.d_1 d_2 d_3 ... d_{t-1} d_t d_{t+1} ...) × β^e using only t digits, it is possible to use one of two basic strategies:

• chopping: ignore digits d_t, d_{t+1}, d_{t+2}, d_{t+3}, ...,
yielding d~_i = d_i and

    fl(x) = ±d_0.d_1 d_2 d_3 ... d_{t-1} × β^e;

• rounding: consult d_t to determine the approximation

    fl(x) = ±(d_0.d_1 d_2 d_3 ... d_{t-1}) × β^e             if d_t < β/2,
    fl(x) = ±(d_0.d_1 d_2 d_3 ... d_{t-1} + β^{1-t}) × β^e   if d_t > β/2.

In case of a tie (d_t = β/2), round to the nearest even number.

Example 2.3. Here are results of chopping and rounding with β = 10 and t = 3:

    x        Chopped to 3 digits   Rounded to 3 digits
    5.672    5.67                  5.67
    -5.672   -5.67                 -5.67
    5.677    5.67                  5.68
    -5.677   -5.67                 -5.68
    5.692    5.69                  5.69
    5.695    5.69                  5.70

∎

Example 2.4. Since humans are used to decimal arithmetic, let us set β = 10 and consider the number

    8/3 = 2.6666... = (2 + 6/10 + 6/10^2 + 6/10^3 + 6/10^4 + ...) × 10^0.

This number gives an infinite digit series in base 10, although the digit series has finite length in base 3. With t = 4, chopping gives

    fl(8/3) = (2 + 6/10 + 6/10^2 + 6/10^3) × 10^0 = 2.666 × 10^0.

On the other hand, with rounding note that d_t = d_4 = 6 > 5, so β^{1-t} = 10^{-3} is added to the number before chopping, which gives 2.667 × 10^0. ∎

This floating point representation is not unique; for instance, we have

    2.666 × 10^0 = 0.2666 × 10^1.

Therefore, we normalize the representation by insisting that d_0 ≠ 0, so

    1 <= d_0 <= 9,   0 <= d_i <= 9,   i = 1, ..., t - 1,

which eliminates any ambiguity.
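Python's decimal module carries exactly the β = 10 precision-and-rounding semantics described above, so Example 2.3 can be mimicked directly: ROUND_DOWN plays the role of chopping, and ROUND_HALF_EVEN rounds ties to the nearest even digit. This sketch is our illustration, not the book's:

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

chop = Context(prec=3, rounding=ROUND_DOWN)       # t = 3, chopping
rnd = Context(prec=3, rounding=ROUND_HALF_EVEN)   # t = 3, rounding

chopped = str(chop.plus(Decimal('5.677')))  # plus() rounds to context precision
rounded = str(rnd.plus(Decimal('5.677')))
tie = str(rnd.plus(Decimal('5.695')))       # d_t = 5: a tie, round to even
```

These reproduce the table's rows: chopping 5.677 gives 5.67, rounding gives 5.68, and the tie case 5.695 rounds to 5.70.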
The special number 0 cannot be represented in a normalized fashion; it, and the limits of the given floating point range, are represented according to an agreed upon convention.

Example 2.5. Consider a (toy) decimal floating point system with t = 4, U = 1, and L = -2. Thus, 2.666 is precisely representable, because its mantissa has four digits and its exponent satisfies L <= e <= U. The largest number here is 99.99 ≈ 10^2 = 100, the smallest is -99.99 ≈ -100, and the smallest positive number is 10^{-2} = 0.01.

How many different numbers does this system hold? The first digit of the mantissa can take on 9 different values, and the other three digits 10 values each (because they may be zeros, too). Thus, there are 9 × 10 × 10 × 10 = 9,000 different normalized fractions possible. The exponent can be one of U - L + 1 = 4 values, so in total there are 4 × 9,000 = 36,000 different positive numbers possible. There are the same number of different negative numbers, and then there is the number 0. So, there are exactly 72,001 different numbers in this system. ∎

What about the choice of a base? Computer storage is in an integer multiple of bits, hence it is convenient to have as a base a power of 2. In the 1970s there were architectures with bases 16 and 8; today, as we have observed in Section 2.1, the standard floating point system uses base β = 2; see Figure 2.1 and Section 2.4.

Note that the absolute representation error can be large even when the relative error is small: with t = 4 decimal digits, for instance, x = 999,999 is rounded to fl(x) = 1.000 × 10^6, an absolute error of 1, and yet the relative error is only about 10^{-6}.

Let us denote the floating point representation mapping by x ↦ fl(x), and let us bound the relative error it introduces in a general floating point system; this yields the rounding unit η of Section 2.1. Recall that for the double precision word in the standard floating point system, a bound on the relative error is given by |x - fl(x)|/|x| <= η ≈ 1.1e-16.

Theorem: Floating Point Representation Error.
Let x ↦ fl(x) = g × β^e, where x ≠ 0 and g is the normalized, signed mantissa. Then the absolute error committed in using the floating point representation of x is bounded by

    |x - fl(x)| <= β^{1-t} × β^e       for chopping,
    |x - fl(x)| <= (1/2) β^{1-t} × β^e for rounding;

whereas the relative error satisfies
    |x - fl(x)| / |x| <= β^{1-t}       for chopping,
    |x - fl(x)| / |x| <= (1/2) β^{1-t} for rounding.

Floating point arithmetic

As discussed in Section 2.1, the IEEE standard requires exact rounding, which means that the result of a basic arithmetic operation must be identical to the result obtained when the operation is computed exactly and then rounded to the nearest floating point number. With exact rounding, if fl(x) and fl(y) are machine numbers, then

    fl(fl(x) ± fl(y)) = (fl(x) ± fl(y))(1 + ε_1),
    fl(fl(x) × fl(y)) = (fl(x) × fl(y))(1 + ε_2),
    fl(fl(x) / fl(y)) = (fl(x) / fl(y))(1 + ε_3),

where |ε_i| <= η. Thus, the relative errors remain small after each such operation.

Example 2.6. Consider a floating point system with decimal base β = 10 and four digits, i.e., t = 4. Thus, the rounding unit is η = (1/2) × 10^{-3}. Let

    x = .1103 = 1.103 × 10^{-1},   y = 9.963 × 10^{-3}.

Subtracting these two numbers, the exact value is x - y = .100337. Hence, exact rounding yields .1003. Obviously, the relative error is

    |.100337 - .1003| / .100337 ≈ .37 × 10^{-3} < η.

However, if we were to subtract these two numbers without guard digits we would obtain .1103 - .0099 = .1004. Now the obtained relative error is not below η, because

    |.100337 - .1004| / .100337 ≈ .63 × 10^{-3} > η.

Thus, guard digits must be used to produce exact rounding. ∎

Example 2.7. Generally, proper rounding yields fl(1 + α) = 1 for any number α that satisfies |α| <= η. In particular, the MATLAB commands

    eta = .5*2^(-52)
    beta = (1+eta)-1

produce the output eta = 1.1102e-16, beta = 0.

Returning to Example 1.3 and to Figure 1.3, we can now explain why the curve of the error is flat for the very, very small values of h. For such values, fl(f(x_0 + h)) = fl(f(x_0)), so the approximation is precisely zero and the recorded values are those of fl(f'(x_0)), which is independent of h. ∎
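The same experiment can be run in Python, whose floats are standard double precision words; this sketch of ours is the equivalent of the MATLAB commands in Example 2.7:

```python
import sys

eta = sys.float_info.epsilon / 2   # eps = 2^-52, so eta = 2^-53 ~ 1.1102e-16

# 1 + eta lies exactly halfway between 1 and the next double; with exact
# rounding the tie goes to the even neighbor, which is 1 itself:
absorbed = (1.0 + eta) - 1.0       # fl(1 + eta) = 1, so this is exactly 0.0

# Just above the rounding unit the addition is no longer absorbed:
visible = (1.0 + 2 * eta) - 1.0    # equals 2^-52, not 0
```

This is exactly the fl(1 + α) = 1 behavior described above, and the boundary sits precisely at the rounding unit.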
If you look at how a given floating point system represents the real line, you will find that it has a grainy nature: neighboring positive floating point values are not arbitrarily close to each other, and the distance between neighbors varies along the line.

Example 2.8. Let us plot all the numbers of the small system specified by (β, t, L, U) = (2, 3, -2, 3). The smallest possible mantissa in the system is 1.00 and the largest possible is 1.11 in binary, which is equal to 1 + 1/2 + 1/4 = 1.75. So, we will run i from 1 to 1.75 in increments of β^{1-t} = 2^{-2} = 0.25. This is the basis for one loop. The exponents run from -2 to 3, which is what we loop over in the second (nested) loop. The result builds up an array with all the positive elements of the system, to which we add the negative elements and 0. The following MATLAB script does the job and plots the numbers of the system:

    x = [];
    for j = -2:3
      for i = 1:.25:1.75
        x = [x, i*2^j];
      end
    end
    x = [x, -x, 0];          % Add all negative numbers and 0
    x = sort(x);             % Sort
    y = zeros(1, length(x));
    plot(x, y, '+')

The resulting plot is in Figure 2.3. Note that the points are not distributed uniformly along the real line. The rounding unit for this system is

    η = (1/2) β^{1-t} = (1/2) × (1/4) = 1/8,

half the spacing between 1 = β^0 and the next number in the system. (Can you see why?)

Most annoying in Figure 2.3 is the fact that the distance between the value 0 and the smallest positive number is significantly larger than the distance between that same smallest positive number and the next one up! In modern systems this is remedied by introducing additional, subnormal numbers, which are not normalized. For such matters, MATLAB has a parameter called eps that signifies the spacing of floating point numbers; eps(1) gives twice the rounding unit η. Enter help eps to learn more about this. ∎

Figure 2.3. Picture of the floating point system described in Example 2.8.

2.3 Roundoff error accumulation

Because many operations are being performed in the course of carrying out a numerical algorithm, many small errors unavoidably result. We know that each elementary floating point operation may introduce a small relative error, but how do these errors accumulate? In general, as we have already mentioned in Chapter 1, error growth that is linear in the number of operations is unavoidable. Yet, there are a few things to watch out for.
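Before listing them, the linear-growth claim itself is easy to watch in Python (an illustration of ours; the book's discussion needs no code): adding fl(0.1) to itself 100,000 times commits one tiny rounding error per addition, so the final error becomes visible, while a compensated summation such as the standard library's math.fsum removes almost all of it.

```python
import math

n = 100_000
s = 0.0
for _ in range(n):
    s += 0.1           # each += may commit a relative error up to eta

naive_err = abs(s - 10000.0)    # accumulated roundoff: small but nonzero
fsum_err = abs(math.fsum([0.1] * n) - 10000.0)   # compensated summation
```

On a standard double precision system naive_err comes out on the order of 1e-8, while fsum_err is exactly 0 for this input.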
• Error magnification:

  1. If x and y differ widely in magnitude, then x + y has a large absolute error.
  2. If |y| << 1, then x/y may have large relative and absolute errors. Likewise for x*y if |y| >> 1.
  3. If x ≈ y, then x - y has a large relative error (cancellation error).

• Overflow and underflow: An overflow is obtained when a number is too large to fit into the floating point system in use, i.e., when e > U. An underflow is obtained when e < L. When overflow occurs in the course of a calculation, this is generally fatal. But underflow is nonfatal: the system usually sets the number to 0 and continues. (MATLAB does this, quietly.)

It is also worth mentioning again the unfortunate possibility of running into NaN, described in Section 2.1.

Design of library functions

You may wonder what the fuss is all about, if the representation of a number and basic arithmetic operations cause an error as small as about 10^{-16} in a typical floating point system. But taking this