
Extended precision


Extended precision refers to floating-point number formats that provide greater precision than the basic floating-point formats.[1] Extended precision formats support a basic format by minimizing roundoff and overflow errors in intermediate values of expressions on the base format. In contrast to extended precision, arbitrary-precision arithmetic refers to implementations of much larger numeric types (with a storage count that usually is not a power of two) using special software (or, rarely, hardware).

Extended precision implementations

There is a long history of extended floating-point formats, reaching back to the mid-20th century. Various manufacturers have used different formats for extended precision on different machines. In many cases the format of the extended precision is not quite a scale-up of the ordinary single- and double-precision formats it is meant to extend. In a few cases the implementation was merely a software-based change in the floating-point data format, but in most cases extended precision was implemented in hardware, either built into the central processor itself or, more often, built into the hardware of an optional attached processor called a "floating-point unit" (FPU) or "floating-point processor" (FPP), accessible to the CPU as a fast input/output device.

IBM extended precision formats

The IBM 1130, sold in 1965,[2] offered two floating-point formats: a 32-bit "standard precision" format and a 40-bit "extended precision" format. Standard precision format contains a 24-bit two's complement significand while extended precision utilizes a 32-bit two's complement significand. The latter format makes full use of the CPU's 32-bit integer operations. The characteristic in both formats is an 8-bit field containing the power of two biased by 128. Floating-point arithmetic operations are performed by software, and double precision is not supported at all. The extended format occupies three 16-bit words, with the extra space simply ignored.[3]

The IBM System/360 supports a 32-bit "short" floating-point format and a 64-bit "long" floating-point format.[4] The 360/85 and follow-on System/370 add support for a 128-bit "extended" format.[5] These formats are still supported in the current design, where they are now called the "hexadecimal floating-point" (HFP) formats.

Microsoft MBF extended precision format

The Microsoft BASIC port for the 6502 CPU, such as in adaptations like Commodore BASIC, AppleSoft BASIC, KIM-1 BASIC or MicroTAN BASIC, has supported an extended 40-bit variant of the floating-point format Microsoft Binary Format (MBF) since 1977.[6]

IEEE 754 extended precision formats

The IEEE 754 floating-point standard recommends that implementations provide extended precision formats. The standard specifies the minimum requirements for an extended format but does not specify an encoding.[7] The encoding is the implementor's choice.[8]

The IA32, x86-64, and Itanium processors support what is by far the most influential format on this standard, the Intel 80-bit (64-bit significand) "double extended" format, described in the next section.

The Motorola 6888x math coprocessors and the Motorola 68040 and 68060 processors also support a 64-bit significand extended precision format (similar to the Intel format, although padded to a 96-bit format with 16 unused bits inserted between the exponent and significand fields, and values with exponent zero and bit 63 one are normalized values[9]). The follow-on Coldfire processors do not support this 96-bit extended precision format.[10]

The FPA10 math coprocessor for early ARM processors also supports a 64-bit significand extended precision format (similar to the Intel format although padded to a 96-bit format with 16 zero bits inserted between the sign and the exponent fields), but without correct rounding.[11]

The x87 and Motorola 68881 80-bit formats meet the requirements of the IEEE 754-1985 double extended format,[12] as does the IEEE 754 128-bit binary format.

x86 extended precision format

The x86 extended precision format is an 80-bit format first implemented in the Intel 8087 math coprocessor and is supported by all processors that are based on the x86 design that incorporate a floating-point unit (FPU).

The Intel 8087 was the first x86 device which supported floating-point arithmetic in hardware. It was designed to support a 32-bit "single precision" format and a 64-bit "double precision" format for encoding and interchanging floating-point numbers. The extended format was designed not to store data at higher precision, but rather to allow for the computation of temporary double results more reliably and accurately by minimising overflow and roundoff errors in intermediate calculations.[a][14][15] All the floating-point registers in the 8087 hold this format, and the 8087 automatically converts numbers to this format when loading registers from memory and converts results back to the more conventional formats when storing the registers back into memory. To enable intermediate subexpression results to be saved in extended precision scratch variables and continued across programming language statements, and to enable otherwise interrupted calculations to resume where they were interrupted, it provides instructions which transfer values between these internal registers and memory without performing any conversion. This enables direct access to the extended format for calculations[b] – also reviving the issue of the accuracy of functions of such numbers, but at a higher precision.
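This invisible use of extended precision for intermediate results can still be observed from C. A minimal sketch, assuming an x87 build in which double expressions are evaluated in the 80-bit registers (FLT_EVAL_METHOD equal to 2, e.g. gcc -m32 -mfpmath=387); under strict double evaluation (e.g. SSE code generation) the result differs:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    /* 2 means "evaluate all expressions in long double", as x87 code
       generation typically does; 0 means strict single/double. */
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);

    double a = 1e16, b = 1.0;
    /* 1e16 + 1 is not representable in double (it rounds back to 1e16),
       but is exact in the 80-bit registers, so s may be 1 under x87
       excess precision and 0 under strict double evaluation. */
    double s = (a + b) - a;
    printf("s = %g\n", s);
    return 0;
}
```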

The floating-point units (FPUs) on all subsequent x86 processors have supported this format. As a result, software can be developed which takes advantage of the higher precision provided by this format. William Kahan, a primary designer of the x87 arithmetic and initial IEEE 754 standard proposal, notes on the development of the x87 floating point: "An extended format as wide as we dared (80 bits) was included to serve the same support role as the 13 decimal internal format serves in Hewlett-Packard's 10 decimal calculators."[17] Moreover, Kahan notes that 64 bits was the widest significand across which carry propagation could be done without increasing the cycle time on the 8087,[18] and that the x87 extended precision was designed to be extensible to higher precision in future processors:

"For now the10 byte extended formatis a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a16 byte format.... That kind of gradual evolution towards wider precision was already in view whenIEEE Standard 754 for Floating-Point Arithmeticwas framed. "[19]

This 80-bit format uses one bit for the sign of the significand, 15 bits for the exponent field (i.e. the same range as the 128-bit quadruple precision IEEE 754 format) and 64 bits for the significand. The exponent field is biased by 16383, meaning that 16383 has to be subtracted from the value in the exponent field to compute the actual power of 2.[20] An exponent field value of 32767 (all fifteen bits 1) is reserved so as to enable the representation of special states such as infinity and Not a Number. If the exponent field is zero, the value is a denormal number and the exponent of 2 is −16382.[21]
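The field layout can be inspected directly. A minimal C sketch, assuming a little-endian x86 target where long double is this 80-bit format (the value occupying the low 10 bytes of the type's storage):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    long double x = -1.5L;   /* -1.1 (binary) x 2^0 */
    unsigned char b[sizeof(long double)];
    memcpy(b, &x, sizeof x);

    uint64_t significand;    /* bits 63..0: explicit integer bit + fraction */
    uint16_t sign_exp;       /* bits 79..64: sign and biased exponent */
    memcpy(&significand, b, 8);
    memcpy(&sign_exp, b + 8, 2);

    unsigned sign     = sign_exp >> 15;      /* 1: negative */
    unsigned exponent = sign_exp & 0x7FFF;   /* 0x3FFF = bias 16383, power 0 */

    printf("sign=%u exponent=0x%04X significand=0x%016llX\n",
           sign, exponent, (unsigned long long)significand);
    /* Expected: sign=1 exponent=0x3FFF significand=0xC000000000000000 */
    return 0;
}
```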

In the following table, "s" is the value of the sign bit (0 means positive, 1 means negative), "e" is the value of the exponent field interpreted as a positive integer, and "m" is the significand interpreted as a positive binary number, where the binary point is located between bits 63 and 62. The "m" field is the combination of the integer part (bit 63) and the fraction part (bits 62–0) of the significand.

Interpretation of the fields of an x86 Extended Precision value
Exponent (bits 78–64) | Bit 63 | Bits 62–0 | Meaning
all 0 | 0 | 0 | Zero. The sign bit gives the sign of the zero, which usually is meaningless.
all 0 | 0 | non-zero | Denormal. The value is (−1)^s × m × 2^−16382.
all 0 | 1 | anything | Pseudo Denormal. The 80387 and later properly interpret this value but will not generate it. The value is (−1)^s × m × 2^−16382.

Exponent (bits 78–64) | Bits 63–62 | Bits 61–0 | Meaning
all 1 | 00 | 0 | Pseudo-infinity. The sign bit gives the sign of the infinity. The 8087 and 80287 treat this as Infinity. The 80387 and later treat this as an invalid operand.
all 1 | 00 | non-zero | Pseudo 'Not a Number'. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number. The 80387 and later treat this as an invalid operand.
all 1 | 01 | anything | Pseudo 'Not a Number'. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number. The 80387 and later treat this as an invalid operand.
all 1 | 10 | 0 | Infinity. The sign bit gives the sign of the infinity. The 8087 and 80287 treat this as a Signaling Not a Number; those coprocessors used the pseudo-infinity representation for infinities.
all 1 | 10 | non-zero | Signalling 'Not a Number'. The sign bit is meaningless.
all 1 | 11 | 0 | Floating-point Indefinite, the result of invalid calculations such as square root of a negative number, logarithm of a negative number, 0/0, infinity/infinity, infinity times 0, and others, when the processor has been configured to not generate exceptions for invalid operands. The sign bit is meaningless. This is a special case of a Quiet Not a Number.
all 1 | 11 | non-zero | Quiet 'Not a Number'. The sign bit is meaningless. The 8087 and 80287 treat this as a Signaling Not a Number.

Exponent (bits 78–64) | Bit 63 | Bits 62–0 | Meaning
any other value | 0 | anything | Unnormal. Only generated on the 8087 and 80287. The 80387 and later treat this as an invalid operand. The value is (−1)^s × m × 2^(e − 16383).
any other value | 1 | anything | Normalized value. The value is (−1)^s × m × 2^(e − 16383).

In contrast to the single and double-precision formats, this format does not utilize an implicit/hidden bit. Rather, bit 63 contains the integer part of the significand and bits 62–0 hold the fractional part. Bit 63 will be 1 on all normalized numbers. There were several advantages to this design when the 8087 was being developed:

  • Calculations can be completed a little faster if all bits of the significand are present in the register.
  • A 64-bit significand provides sufficient precision to avoid loss of precision when the results are converted back to double-precision format in the vast majority of cases.
  • This format provides a mechanism for indicating precision loss due to underflow which can be carried through further operations. For example, the calculation 2 × 10^−4930 × 3 × 10^−10 × 4 × 10^20 generates the intermediate result 6 × 10^−4940, which is a denormal and also involves precision loss. The product of all of the terms is 24 × 10^−4920, which can be represented as a normalized number. The 80287 could complete this calculation and indicate the loss of precision by returning an "unnormal" result (exponent not 0, bit 63 = 0).[22][23] Processors since the 80387 no longer generate unnormals and do not support unnormal inputs to operations. They will generate a denormal if an underflow occurs but will generate a normalized result if subsequent operations on the denormal can be normalized.[24] The sketch after this list illustrates this behavior.
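A minimal C sketch of the modern (80387 and later) behavior, assuming an x86 target where long double is the 80-bit format:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    long double a = 2e-4930L, b = 3e-10L, c = 4e20L;

    long double intermediate = a * b;       /* ~6e-4940: underflows to a denormal */
    long double result = intermediate * c;  /* ~2.4e-4919: normalized again */

    printf("intermediate is subnormal: %d\n",
           fpclassify(intermediate) == FP_SUBNORMAL);
    printf("result is normal:          %d\n",
           fpclassify(result) == FP_NORMAL);
    printf("result = %Lg\n", result);
    return 0;
}
```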

Introduction to use

The 80-bit floating-point format was widely available by 1984,[25] after the development of C, Fortran and similar computer languages, which initially offered only the common 32- and 64-bit floating-point sizes. On the x86 design most C compilers now support 80-bit extended precision via the long double type, and this was specified in the C99/C11 standards (IEC 60559 floating-point arithmetic (Annex F)). Compilers on x86 for other languages often support extended precision as well, sometimes via nonstandard extensions: for example, Turbo Pascal offers an extended type, and several Fortran compilers have a REAL*10 type (analogous to REAL*4 and REAL*8). Such compilers also typically include extended-precision mathematical subroutines, such as square root and trigonometric functions, in their standard libraries.
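For example, a C program can query the properties of long double through float.h; the values shown in the comments are what a typical x86 compiler (e.g. GCC or Clang) using the 80-bit format reports:

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);   /* 64 significand bits */
    printf("LDBL_DIG      = %d\n", LDBL_DIG);        /* 18 decimal digits   */
    printf("LDBL_MAX_EXP  = %d\n", LDBL_MAX_EXP);    /* 16384               */
    printf("LDBL_MAX      = %Lg\n", LDBL_MAX);       /* ~1.18973e+4932      */
    return 0;
}
```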

Working range

The 80-bit floating-point format has a range (including subnormals) from approximately 3.65 × 10^−4951 to 1.18 × 10^+4932. Although log10(2^64) ≈ 19.266, this format is usually described as giving approximately eighteen significant digits of precision (the floor of log10(2^63), the minimum guaranteed precision). The use of decimal when talking about binary is unfortunate because most decimal fractions are recurring sequences in binary, just as 2/3 is in decimal. Thus, a value such as 10.15 is represented in binary as equivalent to 10.1499996185 etc. in decimal for REAL*4 but 10.15000000000000035527 etc. in REAL*8: inter-conversion will involve approximation, except for those few decimal fractions that represent an exact binary value, such as 0.625. For REAL*10, the decimal string is 10.1499999999999999996530553 etc. The last 9 digit is the eighteenth fractional digit and thus the twentieth significant digit of the string. Bounds on conversion between decimal and binary for the 80-bit format can be given as follows: if a decimal string with at most 18 significant digits is correctly rounded to an 80-bit IEEE 754 binary floating-point value (as on input) then converted back to the same number of significant decimal digits (as for output), then the final string will exactly match the original; while, conversely, if an 80-bit IEEE 754 binary floating-point value is correctly converted and (nearest) rounded to a decimal string with at least 21 significant decimal digits then converted back to binary format it will exactly match the original.[12] These approximations are particularly troublesome when specifying the best value for constants in formulae to high precision, as might be calculated via arbitrary-precision arithmetic.
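A short C sketch of the 21-digit round trip, again assuming an x86 system where long double is the 80-bit format:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long double x = 10.15L;
    /* The nearest representable value, shown to 25 decimal places: */
    printf("%.25Lf\n", x);   /* 10.1499999999999999996530... */

    /* 21 significant decimal digits are enough to recover the exact
       80-bit value after a round trip through a decimal string. */
    char buf[64];
    snprintf(buf, sizeof buf, "%.21Lg", x);
    long double y = strtold(buf, NULL);
    printf("round trip exact: %d\n", x == y);   /* 1 */
    return 0;
}
```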

Need for the 80-bit format

A notable example of the need for a minimum of 64 bits of precision in the significand of the extended precision format is the need to avoid precision loss when performing exponentiation on double-precision values.[26][27][28][c] The x86 floating-point units do not provide an instruction that directly performs exponentiation: instead they provide a set of instructions that a program can use in sequence to perform exponentiation using the equation:

x^y = 2^(y × log2(x))

In order to avoid precision loss, the intermediate results "log2(x)" and "y × log2(x)" must be computed with much higher precision, because effectively both the exponent and the significand fields of x must fit into the significand field of the intermediate result. Subsequently, the significand field of the intermediate result is split between the exponent and significand fields of the final result when 2^(intermediate result) is calculated. The following discussion describes this requirement in more detail.
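A C sketch of why the intermediate precision matters, using the standard math library to mimic the x87 sequence rather than the FYL2X/F2XM1 instructions themselves (on x86, log2l and exp2l evaluate in the 80-bit format):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 3.0, y = 512.0;

    /* Intermediate y*log2(x) held only in double: the rounding error in
       the exponent is amplified by the final 2^(...), costing low bits. */
    double via_double = exp2(y * log2(x));

    /* Intermediate held in 80-bit extended: typically agrees with a
       correctly rounded pow() to full double precision. */
    double via_extended = (double)exp2l((long double)y * log2l((long double)x));

    printf("double intermediate:   %.17g\n", via_double);
    printf("extended intermediate: %.17g\n", via_extended);
    printf("pow(x, y):             %.17g\n", pow(x, y));
    return 0;
}
```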

With a little unpacking, an IEEE 754 double-precision value can be represented as:

2^((−1)^s × E) × M

where s is the sign of the exponent (either 0 or 1), E is the unbiased exponent, which is an integer that ranges from 0 to 1023, and M is the significand, which is a 53-bit value that falls in the range 1 ≤ M < 2. Negative numbers and zero can be ignored because the logarithm of these values is undefined. For purposes of this discussion M does not have 53 bits of precision because it is constrained to be greater than or equal to one, i.e. the hidden bit does not count towards the precision. (Note that in situations where M is less than 1, the value is actually a denormal and therefore may have already suffered precision loss. This situation is beyond the scope of this article.)

Taking the log of this representation of a double-precision number and simplifying results in the following:

log2(2^((−1)^s × E) × M) = (−1)^s × E + log2(M)

This result demonstrates that when taking the base-2 logarithm of a number, the sign of the exponent of the original value becomes the sign of the logarithm, the exponent of the original value becomes the integer part of the significand of the logarithm, and the significand of the original value is transformed into the fractional part of the significand of the logarithm.

Because E is an integer in the range 0 to 1023, up to 10 bits to the left of the radix point are needed to represent the integer part of the logarithm. Because M falls in the range 1 ≤ M < 2, the value of log2(M) will fall in the range 0 ≤ log2(M) < 1, so at least 52 bits are needed to the right of the radix point to represent the fractional part of the logarithm. Combining 10 bits to the left of the radix point with 52 bits to the right of the radix point means that the significand part of the logarithm must be computed to at least 62 bits of precision. In practice values of M less than √2 require 53 bits to the right of the radix point and values of M less than ∜2 require 54 bits to the right of the radix point to avoid precision loss (since log2(M) then has one, respectively two, leading zero bits). Balancing this requirement for added precision to the right of the radix point, exponents less than 512 only require 9 bits to the left of the radix point and exponents less than 256 require only 8 bits to the left of the radix point.
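The decomposition can be checked numerically. A C sketch (frexp splits off the binary exponent; log2l evaluates in extended precision on x86):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double x = 7.25e-300;

    int E;
    double M = frexp(x, &E);  /* x = M * 2^E with 0.5 <= M < 1 */
    M *= 2.0;                 /* renormalize so that 1 <= M < 2 */
    E -= 1;

    /* log2(x) = E + log2(M): integer part from the exponent,
       fractional part from the significand. */
    printf("E           = %d\n", E);
    printf("log2(M)     = %.20Lf\n", log2l((long double)M));
    printf("log2(x)     = %.20Lf\n", log2l((long double)x));
    printf("E + log2(M) = %.20Lf\n", (long double)E + log2l((long double)M));
    return 0;
}
```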

The final part of the exponentiation calculation is computing 2^(intermediate result). The "intermediate result" consists of an integer part "I" added to a fractional part "F". If the intermediate result is negative then a slight adjustment is needed to get a positive fractional part, because both "I" and "F" are negative numbers.

For positive intermediate results:

2^(intermediate result) = 2^(I + F) = 2^I × 2^F, with 0 ≤ F < 1

For negative intermediate results:

2^(intermediate result) = 2^(I + F) = 2^((I − 1) + (F + 1)) = 2^(I − 1) × 2^(F + 1), with F + 1 now positive

Thus the integer part of the intermediate result ("I" or "I − 1") plus a bias becomes the exponent of the final result, and the transformed positive fractional part of the intermediate result (2^F or 2^(F + 1)) becomes the significand of the final result. In order to supply 52 bits of precision to the final result, the positive fractional part must be maintained to at least 52 bits.
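A C sketch of this final scaling step (floorl yields I for positive results and I − 1 for negative ones, which is exactly the adjustment described above, so the fraction is always in [0, 1)):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    long double t = -12.75L;            /* the "intermediate result" */

    long double I = floorl(t);          /* -13: integer part after adjustment */
    long double F = t - I;              /*  0.25: positive fractional part    */

    /* 2^t = 2^F scaled by 2^I; 2^F in [1,2) supplies the significand
       and I (plus the bias) supplies the exponent field. */
    long double r = ldexpl(exp2l(F), (int)I);

    printf("2^%Lg    = %.20Lg\n", t, r);
    printf("exp2l(t) = %.20Lg\n", exp2l(t));  /* should match */
    return 0;
}
```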

In conclusion, the exact number of bits of precision needed in the significand of the intermediate result is somewhat data dependent, but 64 bits is sufficient to avoid precision loss in the vast majority of exponentiation computations involving double-precision numbers.

The number of bits needed for the exponent of the extended precision format follows from the requirement that the product of two double-precision numbers should not overflow when computed using the extended format. The largest possible exponent of a double-precision value is 1023, so the exponent of the largest possible product of two double-precision numbers is 2047 (an 11-bit value). Adding in a bias to account for negative exponents means that the exponent field must be at least 12 bits wide.

Combining these requirements: 1 bit for the sign, 12 bits for the biased exponent, and 64 bits for the significand means that the extended precision format would need at least 77 bits. Engineering considerations resulted in the final definition of the 80-bit format (in particular, the IEEE 754 standard requires the exponent range of an extended precision format to match that of the next largest, quad, precision format, which is 15 bits).[27]

Another example of a calculation that benefits from extended precision arithmetic is iterative refinement schemes, used to indirectly clean out errors accumulated in the direct solution during the typically very large number of calculations made for numerical linear algebra.[30]

Language support

  • Some C/C++ implementations (e.g., GNU Compiler Collection (GCC), Clang, Intel C++) implement long double using 80-bit floating-point numbers on x86 systems. However, this behavior is implementation-defined: it is allowed but not required by the standard, as specified for IEEE 754 hardware in the C99 standard's "Annex F IEC 60559 floating-point arithmetic". GCC also provides __float80 and __float128 types.[31]
  • Some Common Lisp implementations (e.g. CMU Common Lisp, Embeddable Common Lisp) implement long-float using 80-bit floating-point numbers on x86 systems.
  • The D programming language implements real using the largest floating-point size implemented in hardware, for example 80 bits for x86 CPUs. On other machines, this will be the widest floating-point type natively supported by the CPU, or 64-bit double precision, whichever is wider.
  • Turbo Pascal (and Object Pascal or Delphi) has an extended 80-bit type available in addition to real/single (32 bits) and double (64 bits), either natively (when an 80x87 coprocessor is present) or emulated (through the Turbo87 library); this extended type is available on 16-, 32-, and 64-bit platforms, possibly with padding.[32]
  • The Racket run-time system provides the 80-bit extflonum datatype on x86 systems.
  • The Swift standard library provides the Float80 datatype.
  • The PowerBASIC BASIC compiler provides the EXT or EXTENDED 10-byte extended-precision floating-point data type.
  • Zig provides an f80 type since version 0.10.0.


Footnotes

  1. ^"This format is intended mainly to help programmers enhance the integrity of their single and double software, and to attenuate degradation by round-off in double matrix computations of larger dimensions, and can easily be used in such a way that substituting quadruple for extended need never invalidate its use."x87designerW. Kahan[13]
  2. ^"High-level languages will use extended (invisibly) to evaluate intermediate sub-expressions, and later may provide extended as a declarable data type."[16]: 70 
  3. ^"The presence of at least as many extra bits of precision in extended as in the exponent field of the basic format it supports greatly simplifies the accurate computation of the transcendental functions, inner products, and the power functionyx."[29]: 70 

References

  1. ^ IEEE 754 (2008, ¶ 2.1.21) defines extended precision format as "A format that extends a supported basic format by providing wider precision and range."
  2. ^ Francis, C.G. (11 February 1965). "IBM introduces powerful small computer". Director of Information (Press release). White Plains, New York: International Business Machines Corporation (IBM). Archived from the original on 2019-07-05.
  3. ^ Subroutine Library (PDF). IBM 1130 (9th ed.). IBM Corporation. 1974. p. 93.
  4. ^ Principles of Operation. IBM System/360 (9th ed.). IBM Corporation. 1970. p. 41.
  5. ^ IBM System/370 Principles of Operation (7th ed.). IBM Corporation. 1980. pp. 9-2–9-3.
  6. ^ Steil, Michael (2008-10-20). "Create your own version of Microsoft BASIC for 6502". pagetable.com. p. 46. Archived from the original on 2016-05-30. Retrieved 2016-05-30.
  7. ^ IEEE Computer Society (August 29, 2008). IEEE Standard for Floating-Point Arithmetic (Report). IEEE. §3.7. doi:10.1109/IEEESTD.2008.4610935. ISBN 978-0-7381-5752-8. IEEE Std 754-2008.
  8. ^ Brewer, Kevin. "Kevin's Report". IEEE-754 Reference Material. Retrieved 2012-02-19.
  9. ^ Motorola MC68000 Family (PDF). Programmer's Reference Manual. NXP Semiconductors. 1992. pp. 1-16, 1-18, 1-23.
  10. ^ ColdFire Family (PDF). Programmers' Reference Manual. Freescale Semiconductor. 2005. p. 7-7.
  11. ^ "FPA10 Data Sheet" (PDF). chrisacorns.computinghistory.org.uk. GEC Plessey Semiconductors. 11 June 1993. Retrieved 26 November 2020.
  12. ^ a b Kahan, William (1 October 1997). "Lecture notes on the status of IEEE standard 754 for binary floating-point arithmetic" (PDF).
  13. ^ Kahan, William (1 October 1997). "Lecture notes on the status of IEEE standard 754 for binary floating-point arithmetic" (PDF). p. 5.
  14. ^ Einarsson, Bo (2005). Accuracy and Reliability in Scientific Computing. SIAM. pp. 9ff. ISBN 978-0-89871-815-7. Retrieved 3 May 2013.
  15. ^ "Intel 64 and IA-32 Architectures". Software Developer's Manual. Intel Corp. March 2012. §8.2.
  16. ^ Coonen, Jerome T. (January 1980). "An implementation guide to a proposed standard for floating-point arithmetic". IEEE Computer. 13: 68–79. doi:10.1109/MC.1980.1653344. S2CID 206445847.
  17. ^ Kahan, William (22 November 1983). "Mathematics written in sand - the hp-15C, Intel 8087, etc" (PDF).
  18. ^ Goldberg, David (March 1991). "What every computer scientist should know about floating-point arithmetic" (PDF). ACM Computing Surveys. 23 (1): 192. doi:10.1145/103162.103163. S2CID 222008826.
  19. ^ Higham, Nicholas (2002). "Designing stable algorithms". Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics (SIAM). p. 43.
  20. ^ Intel 80C187 datasheet (PDF) (Report). Intel Corporation – via datasheetcatalog.org.
  21. ^ Intel 64 and IA-32 Architectures Developer's Manual (Report). Vol. 1. Intel Corporation. pp. 4-6 through 4-9 and 4-18 through 4-21.
  22. ^ Palmer, John F.; Morse, Stephen P. (1984). The 8087 Primer. Wiley Press. p. 14. ISBN 0-471-87569-4.
  23. ^ Morse, Stephen P.; Albert, Douglas J. (1986). The 80286 Architecture. Wiley Press. pp. 91–111. ISBN 0-471-83185-9.
  24. ^ Intel 64 and IA-32 Architectures Developer's Manual (Report). Vol. 1. Intel Corporation. pp. 8-21 through 8-22.
  25. ^ Severance, Charles (20 February 1998). "An interview with the old man of floating-point". eecs.berkeley.edu. U.C. Berkeley.
  26. ^ Palmer, John F.; Morse, Stephen P. (1984). The 8087 Primer. Wiley Press. p. 16. ISBN 0-471-87569-4.
  27. ^ a b Morse, Stephen P.; Albert, Douglas J. (1986). The 80286 Architecture. Wiley Press. pp. 96–98. ISBN 0-471-83185-9.
  28. ^ Hough, David (March 1981). "Applications of the proposed IEEE 754 standard for floating-point arithmetic". IEEE Computer. 14 (3): 70–74. doi:10.1109/C-M.1981.220381. S2CID 14645749.
  29. ^ Coonen, Jerome T. (January 1980). "An implementation guide to a proposed standard for floating-point arithmetic". IEEE Computer: 68–79. doi:10.1109/MC.1980.1653344. S2CID 206445847.
  30. ^ Demmel, James; Hida, Yozo; Kahan, William; Li, Xiaoye S.; Mukherjee, Sonil; Riedy, E. Jason (June 2006). "Error bounds from extra-precise iterative refinement" (PDF). ACM Transactions on Mathematical Software. 32 (2): 325–351. doi:10.1145/1141885.1141894. S2CID 1340891. Retrieved 2014-04-18.
  31. ^ "Floating Types (Using the GNU Compiler Collection (GCC))".
  32. ^ "The Extended Data Type is different on different platforms".