Avoid allocating a flonum object for floating-opint calculations
that are consumed only by other floating-point caculations.
For this first cut, unboxing applies only to fl+, fl-, fl*, fl/,
flabs, fl<, fl<=, fl=, fl>, fl>=, bytevector-ieee-double-[native-]ref,
and bytevector-ieee-double-[native-]set!. Local variables can be
unboxed in the same way as implicit temporaries, and loop arguments
can be unboxed, but values in a closure and function-call arguments
are always boxed.
arm32 support is mostly in place, but not yet right. ppc32 support is
not yet implemented.
This commit includes a small change that is incompatible with previous
Chez Scheme versions: `(fl= +nan.0)` (and similar for other
comparisons) produces true instead of false.
original commit: 36459e43f10705aa3e383376ca7d54cf2998b7ee
On x86_64, a POPCNT instruction is usually available, and it can speed
up `fxpopcount` operations by a factor of 2-3.
Since POPCNT isn't always available, code using `fxpopcount` is
compiled to a call to a generic implementation. The linker substitutes
a POPCNT instruction when it determines at runtime that POPCNT is
available.
Some measurements on a 2018 MacBook Pro (2.7 GHz Core i7) using the
program below:
popcnt = this implementation, POPCNT discovered
nocnt = this implementation, POPCNT considered unavailable
optcnt = compile to use POPCNT directly (no linker work)
cpcnt = compile to inlined generic (no linker work, no POPCNT)
Since the generic implementation is always a 64-bit popcount, it's not
as good as an inlined version for `fxpopcount32`, but otherwise the
link-edit approach to POPCNT works well:
fxpopcount fxpopcount32
popcnt: 0.098s
nocnt: 0.284s
optcnt 0.109s [slower means noise?]
cpcnt: 0.279s 0.188s
(optimize-level 3)
(time
(let loop ([v #f] [i 100000000])
(if (fx= i 0)
v
(loop (fxpopcount i) (fx- i 1)))))
original commit: 5f090e509f8fe5edc777ed9f0463b20c2e571336