# A 63-bit floating-point type for 64-bit OCaml

Pascal Cuoq - 9th May 2013## The OCaml runtime

The OCaml runtime allows polymorphism through the uniform representation of types. Every OCaml value is represented as a single word, so that it is possible to have a single implementation for, say, “list of things”, with functions to access (e.g. `List.length`

) and build (e.g. `List.map`

) these lists that work just the same whether they are lists of ints, of floats, or of lists of sets of integers.

Anything that does not fit in in a word is allocated in a block in the heap. The word representing this data is then a pointer to the block. Since the heap contains only blocks of words, all these pointers are aligned: their few least significants bits are always unset.

Argumentless constructors (like this: `type fruit = Apple | Orange | Banana`

) and integers do not represent so much information that they need to be allocated in the heap. Their representation is *unboxed*. The data is directly inside the word that would otherwise have been a pointer. So while a list of lists is actually a list of pointers, a list of ints contains the ints with one less indirection. The functions accessing and building lists do not notice because ints and pointers have the same size.

Still, the Garbage Collector needs to be able to recognize pointers from integers. A pointer points to a well-formed block in the heap that is by definition alive (since it is being visited by the GC) and should be marked so. An integer can have any value and could, if precautions were not taken, accidentally look like a pointer. This could cause dead blocks to look alive, but much worse, it would also cause the GC to change bits in what it thinks is the header of a live block, when it is actually following an integer that looks like a pointer and messing up user data.

This is why unboxed integers provide 31 bits (for 32-bit OCaml) or 63 bits (for 64-bit OCaml) to the OCaml programmer. In the representation, behind the scenes, the least significant bit of a word containing an integer is always set, to distinguish it from a pointer. 31- or 63-bit integers are rather unusual, so anyone who uses OCaml at all knows this. What users of OCaml do not usually know is why there isn't a 63-bit unboxed float type for 64-bit OCaml.

## There is no unboxed 63-bit floating-point type in OCaml

And the answer to this last question is that there is no particular reason one shouldn't have a 63-bit unboxed float type in OCaml. Defining one only requires carefully answering two more intricately related questions:

- What 63-bit floating-point format should be used?
- How will the OCaml interpreter compute values in this format?

In 1990, when 64-bit computers were few, Xavier Leroy decided that in his (then future) Caml-light system the type for floating-point would be 64-bit double precision. The double precision floating-point format did not come close to fitting in the then-prevalent 32-bit word:

Floating-point numbers are allocated in the heap as unstructured blocks of length one two or three words depending on the possibilities of the hardware and on the required precision. An unboxed representation is possible using the 10 suffix for instance but this gives only 30 bits to represent floating-point numbers. Such a format lacks precision and does not correspond to any standard format so it involves fairly brutal truncations. Good old 64-bit IEEE-standard floating point numbers seem more useful even if they have to be allocated.

First a remark: it is not necessary to distinguish floats from ints: that is what the static type-system is for. From the point of view of the GC they are all non-pointers and that's the only important thing. So if we decide to unbox floats we can take advantage of the same representation as for integers a word with the least significant bit set. And nowadays even the proverbial grandmother has a 64-bit computer to read e-mail on hence the temptation to unbox floats.

Second the reticence to truncate the mantissa of any existing format remains well-founded. Suppose that we defined a format with 51 explicit mantissa bits as opposed to double-precision's 52. We could use the double-precision hardware for computations and then round to 51 bits of mantissa but the sizes are so close that this would introduced plenty of *double rounding errors* where the result is less precise than if it had been rounded directly to 51 bits. As someone who has to deal with the consequences of hardware computing 64-bit mantissas that are then rounded a second time to 52-bit I feel dirty just imagining this possibility. If we went for 1 sign bit 11 exponent bits 51 explicit mantissa bits we would have to use software emulation to round directly to the correct result.

This post is about another idea to take advantage of the double-precision hardware to implement a 63-bit floating-point type.

## A truncationless 63-bit floating-point format

### Borrow a bit from the exponent

Taking one of the bits from the 11 reserved for the exponent in the IEEE 754 double-precision format does not have such noticeable consequences. At the top of the scale it is easy to map values above a threshold to infinity. This does not involve double-rouding error. At the bottom of the scale things are more complicated. The very smallest floating-point numbers of a proper floating-point format called subnormals have an effective precision of less than the nominal 52 bits. Computing with full-range double-precision and then rounding to reduced-range 63-bit means that the result of a computation can be computed as a normal double-precision number with 52-bit mantissa say 1.324867e-168 and then rounded to the narrower effective precision of a 63-bit subnormal float.

Incidentally this sort of issue is the sort that remains even after you have configured an x87 to use only the 53 or 24 mantissa bits that make sense to compute with the precision of double- or single-precision. Only the range of the mantissa is reduced not that of the exponent so numbers that would be subnormals in the targeted type are normal when represented in a x87 register. You could hope to fix them after each computation with an option such as GCC's -ffloat-store but then they are double-rounded. The first rounding is at 53 or 24 bits and the second to the effective precision of the subnormal.

### Double-rounding Never!

But since overflows are much easier to handle we can cheat. In order to make sure that subnormal results are rounded directly to the effective precision we can bias the computations so that if the result is going to be a 63-bit subnormal the double-precision operation produces a subnormal result already.

In practice this means that when the OCaml program is adding numbers 1.00000000001e-152 and -1.0e-152 we do not show these numbers to the double-precision hardware. What we show to the hardware instead is these numbers multiplied by 2^-512 so that if the result need to be subnormal in the 63-bit format and in this example it needs then a subnormal double-precision will be computed with the same number of bits of precision.

In fact we can maintain this “store numbers as 2^-512 times their intended value” convention all the time and only come out of it at the time of calling library functions such as `printf()`

.

For multiplication of two operands represented as 2^-512 times their real value one of the arguments needs to be unbiased (or rebiased: if you have a trick to remember which is which please share) before the hardware multiplication by multiplying it by 2^512.

For division the result must be rebiased after it is computed.

The implementation of the correctly-rounded function `sqrt()`

for 63-bit floats is left as an exercise to the reader.

### Implementation

A quick and dirty implementation only tested as much as shown is available from ideone. Now I would love for someone who actually uses floating-point in OCaml to finish integrating this in the OCaml runtime and do some benchmarks. Not that I expect it will be very fast: the 63-bit representation involves a lot of bit-shuffling and OCaml uses its own tricks such as unboxing floats inside arrays so that it will be hard to compete.

## Credits

I should note that I have been reading a report on implementing a perfect emulation of IEEE 754 double-precision using x87 hardware and that the idea presented here was likely to be contained there. Google which is prompt to point to the wrong definition of FLT_EPSILON has been no help in finding this report again.