On x86_64, a POPCNT instruction is usually available, and it can speed
up `fxpopcount` operations by a factor of 2-3.
Since POPCNT isn't always available, code using `fxpopcount` is
compiled to a call to a generic implementation. The linker substitutes
a POPCNT instruction when it determines at runtime that POPCNT is
available.
Some measurements on a 2018 MacBook Pro (2.7 GHz Core i7) using the
program below:
popcnt = this implementation, POPCNT discovered
nocnt = this implementation, POPCNT considered unavailable
optcnt = compile to use POPCNT directly (no linker work)
cpcnt = compile to inlined generic (no linker work, no POPCNT)
Since the generic implementation is always a 64-bit popcount, it's not
as good as an inlined version for `fxpopcount32`, but otherwise the
link-edit approach to POPCNT works well:
fxpopcount fxpopcount32
popcnt: 0.098s
nocnt: 0.284s
optcnt 0.109s [slower means noise?]
cpcnt: 0.279s 0.188s
(optimize-level 3)
(time
(let loop ([v #f] [i 100000000])
(if (fx= i 0)
v
(loop (fxpopcount i) (fx- i 1)))))
original commit: 5f090e509f8fe5edc777ed9f0463b20c2e571336