FLOATING-POINT ARITHMETIC

- Floating-point representation and dynamic range
- Normalized/unnormalized formats
- Values represented and their distribution
- Choice of base
- Representation of significand and of exponent
- Rounding modes and error analysis
- IEEE Standard 754
- Algorithms and implementations: addition/subtraction, multiplication and division

Digital Arithmetic - Ercegovac/Lang 2003, 8 - Floating-Point Arithmetic

VALUES REPRESENTED IN FLPT SYSTEM

Number line, left to right: -inf, A, B, b, C, d, D, E, +inf

- [A, B]: negative floating-point numbers (normalized)
- [D, E]: positive floating-point numbers (normalized)
- (B, b] and [d, D): denormals
- C: zero
- > E: positive overflow
- < A: negative overflow
- (B, C): negative underflow (normalized)
- (C, D): positive underflow (normalized)

Figure 8.1: (a) Regions in floating-point representation. (b) Example for $m = f = 3$, $r = 2$, and $-2 \le E \le 1$ (only positive region): significands 0.100, 0.101, 0.110, 0.111; exponents -2, -1, 0, 1; axis marks 0, 1/8, 1/4, 1/2, 1, 2, with the denormals between 0 and 1/8.
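The example of Figure 8.1(b) is small enough to enumerate by hand or by machine. The sketch below lists its positive values under two assumptions that the slide does not spell out: a normalized significand is one whose leading fractional digit is nonzero, and denormals are the remaining nonzero significands taken at the minimum exponent. The names `R`, `F`, `E_MIN`, `E_MAX`, and `positive_values` are illustrative only.

```python
from fractions import Fraction

R, F = 2, 3            # radix and number of fractional digits (m = f = 3)
E_MIN, E_MAX = -2, 1   # exponent range of the Figure 8.1(b) example


def positive_values():
    """Return (denormals, normalized) as sorted lists of exact fractions."""
    normalized, denormals = [], []
    for k in range(1, R ** F):
        significand = Fraction(k, R ** F)
        if significand >= Fraction(1, R):
            # Assumption: leading digit nonzero, so every exponent is allowed.
            normalized += [significand * Fraction(R) ** e
                           for e in range(E_MIN, E_MAX + 1)]
        else:
            # Assumption: leading digit zero is only used at the minimum exponent.
            denormals.append(significand * Fraction(R) ** E_MIN)
    return sorted(denormals), sorted(normalized)


denormals, normalized = positive_values()
print("denormals: ", [str(v) for v in denormals])
print("normalized:", [str(v) for v in normalized])
print("d =", denormals[0], " D =", normalized[0], " E =", normalized[-1])
```

Under these assumptions the smallest normalized value is 1/8 and the largest is 7/4, which agrees with the axis marks of the figure, and the denormals continue below 1/8 with the same spacing (1/32) as the normalized numbers in [1/8, 1/4).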
Floating-point system

Point   Normalized                                   Unnormalized
A       $-(r^{m-f} - r^{-f}) \times b^{E_{\max}}$    $-(r^{m-f} - r^{-f}) \times b^{E_{\max}}$
B       $-r^{m-f-1} \times b^{E_{\min}}$             $-r^{-f} \times b^{E_{\min}}$
C       $0$                                          $0$
D       $r^{m-f-1} \times b^{E_{\min}}$              $r^{-f} \times b^{E_{\min}}$
E       $(r^{m-f} - r^{-f}) \times b^{E_{\max}}$     $(r^{m-f} - r^{-f}) \times b^{E_{\max}}$

DISTRIBUTION FOR b = 2, m = f = 4, and e = 2

               $2^E$
Significand    1        2        4        8
0.1000         1/2      1        2        4
0.1001         9/16     9/8      9/4      9/2
0.1010         10/16    10/8     10/4     5
0.1011         11/16    11/8     11/4     11/2
0.1100         12/16    12/8     3        6
0.1101         13/16    13/8     13/4     13/2
0.1110         14/16    14/8     14/4     7
0.1111         15/16    15/8     15/4     15/2

DISTRIBUTION FOR b = 2, m = f = 3, and e = 3

               $2^E$
Significand    1      2      4      8      16     32     64     128
0.100          1/2    1      2      4      8      16     32     64
0.101          5/8    5/4    5/2    5      10     20     40     80
0.110          6/8    3/2    3      6      12     24     48     96
0.111          7/8    7/4    7/2    7      14     28     56     112

DISTRIBUTION FOR b = 4, m = f = 4, and e = 2

               $4^E$
Significand    1        4        16       64
0.0100         1/4      1        4        16
0.0101         5/16     5/4      5        20
0.0110         6/16     6/4      6        24
0.0111         7/16     7/4      7        28
0.1000         1/2      2        8        32
0.1001         9/16     9/4      9        36
0.1010         10/16    10/4     10       40
0.1011         11/16    11/4     11       44
0.1100         12/16    3        12       48
0.1101         13/16    13/4     13       52
0.1110         14/16    14/4     14       56
0.1111         15/16    15/4     15       60

DISTRIBUTION OF FLPT NUMBERS

Figure 8.2: Examples of distributions of floating-point numbers. (a) b = 2, f = 4, e = 2; (b) b = 2, f = 3, e = 3; (c) b = 4, f = 4, e = 2.
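The three distribution tables can be regenerated mechanically, which makes it easier to compare how the same number of bits trades range against density. The following is a minimal sketch, assuming an f-bit binary fraction as significand, exponents E = 0, ..., 2^e - 1 (as in the column headings), and a normalization rule of significand >= 1/b. The function name `normalized_values` is hypothetical.

```python
from fractions import Fraction


def normalized_values(b, f, e):
    """Positive normalized values for exponent base b, f fraction bits,
    and an e-bit exponent taking the values 0 .. 2**e - 1."""
    values = []
    for exponent in range(2 ** e):
        for k in range(2 ** f):
            significand = Fraction(k, 2 ** f)
            # Assumed normalization rule: leading base-b digit is nonzero.
            if significand >= Fraction(1, b):
                values.append(significand * b ** exponent)
    return sorted(values)


for b, f, e in [(2, 4, 2), (2, 3, 3), (4, 4, 2)]:
    vals = normalized_values(b, f, e)
    print(f"b={b}, f={f}, e={e}: {len(vals)} values, "
          f"min={vals[0]}, max={vals[-1]}")
```

All three formats use six bits for significand and exponent together. With these assumptions the first two yield 32 values each (largest 15/2 and 112 respectively), while the base-4 format yields 48 values with largest value 60, matching the last rows of the tables.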
REPRESENTATION OF SIGNIFICAND AND EXPONENT

- SIGNIFICAND: SM (sign and magnitude) with HIDDEN BIT
- EXPONENT: BIASED, $E_R = E + B$; $\min E_R = 0 \Rightarrow B = -E_{\min}$
- Symmetric range $-B \le E \le B \Rightarrow 0 \le E_R \le 2B \le 2^e - 1$
- For 8-bit exponent: $B = 127$, $-127 \le E \le 128$, $0 \le E_R \le 255$
  - $E_R = 255$ not used
- SIMPLIFIES COMPARISON OF FLOATING-POINT NUMBERS (same as in fixed-point)
- MINIMUM EXPONENT REPRESENTED BY 0 SO THAT FLOATING-POINT VALUE 0 IS ALL ZEROS (0 sign, 0 exponent, 0 significand)

SPECIAL VALUES AND EXCEPTIONS

- Special values: not representable in the FLPT system
  - NAN (Not A Number)
  - Infinity (pos, neg)
  - allow computation in presence of special values
- Exceptions: result produced not representable; set a flag
  - Exponent overflow
  - Underflow

ROUNDOFF MODES AND ERROR ANALYSIS

- Exact results (inf. precision): x, y, etc.
- FLPT number representing x is $R_{mode}(x)$, where mode is the rounding mode
- Basic relations:
  1. If $x \le y$ then $R_{mode}(x) \le R_{mode}(y)$
  2. If x is a FLPT number then $R_{mode}(x) = x$
  3. If F1 and F2 are two consecutive FLPT numbers then for $F1 \le x \le F2$, $R_{mode}(x)$ is either F1 or F2

Figure 8.3: Relation between x, $R_{mode}(x)$, and floating-point numbers F1 and F2.
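The three basic relations for a rounding function can be checked exhaustively on a toy system. The sketch below is an illustration only: the set `FLPT` (significands 0.100 to 0.111 in binary, exponents -2 to 1), the mode names "down", "up", and "nearest", and the helpers `neighbours`, `R`, and `check_relations` are all assumptions made for this example. Ties in "nearest" go to F1 here, a simplification that is not the IEEE round-to-nearest-even rule.

```python
from bisect import bisect_left, bisect_right
from fractions import Fraction

# Hypothetical toy system: 16 positive normalized values from 1/8 to 7/4.
FLPT = sorted(Fraction(k, 8) * Fraction(2) ** e
              for k in range(4, 8) for e in range(-2, 2))


def neighbours(x):
    """Return (F1, F2) with F1 <= x <= F2; F1 == F2 if x is representable."""
    if not FLPT[0] <= x <= FLPT[-1]:
        raise ValueError("x outside the represented range")
    f1 = FLPT[bisect_right(FLPT, x) - 1]   # largest value <= x
    f2 = FLPT[bisect_left(FLPT, x)]        # smallest value >= x
    return f1, f2


def R(x, mode):
    """Round x to a member of FLPT according to the given mode."""
    f1, f2 = neighbours(x)
    if mode == "down":
        return f1
    if mode == "up":
        return f2
    if mode == "nearest":
        return f1 if x - f1 <= f2 - x else f2   # ties go to F1 in this sketch
    raise ValueError(f"unknown mode: {mode}")


def check_relations(mode):
    """Check relations 1-3 on a grid of exact values between 1/8 and 7/4."""
    xs = [Fraction(n, 64) for n in range(8, 113)]
    rounded = [R(x, mode) for x in xs]
    monotonic = all(a <= b for a, b in zip(rounded, rounded[1:]))   # relation 1
    exact = all(R(v, mode) == v for v in FLPT)                      # relation 2
    bracketed = all(R(x, mode) in neighbours(x) for x in xs)        # relation 3
    return monotonic and exact and bracketed


x = Fraction(19, 32)   # lies between F1 = 1/2 and F2 = 5/8
for mode in ("down", "up", "nearest"):
    print(mode, R(x, mode), check_relations(mode))
```

Exact rational arithmetic is used so that the checks are not themselves affected by the rounding of the host machine's floating-point unit.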
Description: