OCaml's option -compact can optimize for code size and speed

Pascal Cuoq - 11th Oct 2013

The OCaml compiler can generate native code for various architectures. One of the few code generation options is named -compact:

$ ocamlopt -help 
Usage: ocamlopt <options> <files> 
Options are: 
  ... 
  -compact  Optimize code size rather than speed 
  ... 

One of the effects, perhaps the only effect(?), of this option is on the code generated for allocation. In OCaml, a small block typically gets allocated from the minor heap. When all goes well, allocating in the minor heap means simply decrementing the pointer that separates the already-allocated top of the minor heap from the still-available bottom of the minor heap. Eventually the minor heap becomes full; what happens then is complicated. But the fast path is basically a subtraction and a conditional branch (usually not taken) to the complicated stuff.

With option -compact, an allocation simply calls a routine that does all the above. Without it, the subtraction and conditional jump are inlined at the site where the allocation is requested. OCaml being a typical functional language, there are plenty of these. Inlining a couple of instructions makes the code bigger, but when execution stays on the fast path, which is often, a routine call is avoided completely.

A small example

Consider the one-line OCaml function let f i = Some i. This function takes a value i and returns a Some memory cell that points to i. I intended i to be an integer, but I inadvertently wrote code that works for any type of i. This was possible because in OCaml, all types of values share a uniform representation, as discussed in the introduction of a previous post.

A Some cell occupies 16 bytes (yes this is huge. Blame 64-bit computing). The fast code generated by ocamlopt -S t.ml where L102 is the label at which selcom-called code manages the case when the minor heap is full:

	... 
	subq	$16  %r15 
	movq	_caml_young_limit@GOTPCREL(%rip)  %rax 
	cmpq	(%rax)  %r15 
	jb	L102 
	... 

The -compact option causes more compact code to be emitted for the allocation. When our function f is compiled with ocamlopt -S -compact t.ml the sequence generated for the allocation is much shorter indeed. There is even a specialized allocation function for allocating 16-byte blocs so no arguments need to be set up:

	... 
	call	_caml_alloc1 
	... 

Effects of option -compact in the large

When the development version of Frama-C is compiled with option -compact the binary takes 14.4MB.

Without option -compact the Frama-C binary weights in at 15.4MB.

One additional megabyte of code is not going to hurt us at this point right? It is not as if the Frama-C binary had to fit exactly in a box of ten floppy disks. As long as the larger code is faster it is worth it…

$ time  bin/toplevel.lean_and_mean tests/test/adpcm.c -sparecode > /dev/null 
user	0m0.843s 
$ time  bin/toplevel.big_and_slow tests/test/adpcm.c -sparecode > /dev/null 
user	0m0.856s 

… but it turns out that the larger code isn't faster. In the case of Frama-C and of the computer I am typing on at least the code generated with option -compact is smaller and faster.

Conclusion

It is not easy to try to predict performance on modern out-of-order architectures but if I had to hazard an explanation I would say that besides the obvious cache concerns having many scattered predictable conditional branches may be a losing proposition compared to one single conditional branch visited from many places with call and ret. Every conditional branch that isn't in the branch prediction buffer is at risk of being mispredicted. In addition each entry taken by the branches that have recently been visited is an entry that isn't available for branches translated from conditionals in the source code.

In contrast the execution of the call and ret instructions has been optimized over the years. The target of the ret instruction is predicted with a Return Stack Buffer that has a conceptually simpler job than the Branch Predictor (although it is clearly possible to write code for which the ret instructions cause a stall which could re-establish an advantage for the partially inlined version).

The conclusion is that everyone who uses OCaml should try the native compiler's option -compact and that perhaps the default should be for this option to be on.

Pascal Cuoq
11th Oct 2013