#lang scribble/doc
@(require scribble/manual
          "guide-utils.ss")

@title[#:tag "performance"]{Performance}

Alan Perlis famously quipped ``Lisp programmers know the value of
everything and the cost of nothing.'' A Scheme programmer knows, for
example, that a @scheme[lambda] anywhere in a program produces a value
that is closed over its lexical environment---but how much does
allocating that value cost? While most programmers have a reasonable
grasp of the cost of various operations and data structures at the
machine level, the gap between the Scheme language model and the
underlying computing machinery can be quite large.

In this chapter, we narrow the gap by explaining details of the PLT
Scheme compiler and run-time system and how they affect the run-time
and memory performance of Scheme code.

@; ----------------------------------------------------------------------

@section{The Bytecode and Just-in-Time (JIT) Compilers}

Every definition or expression to be evaluated by Scheme is compiled
to an internal bytecode format. In interactive mode, this compilation
occurs automatically and on-the-fly. Tools like @exec{mzc} and
@exec{setup-plt} marshal compiled bytecode to a file, so that you do
not have to compile from source every time that you run a
program. (Most of the time required to compile a file is actually in
macro expansion; generating bytecode from fully expanded code is
relatively fast.) See @secref["compile"] for more information on
generating bytecode files.

The bytecode compiler applies all standard optimizations, such as
constant propagation, constant folding, inlining, and dead-code
elimination. For example, in an environment where @scheme[+] has its
usual binding, the expression @scheme[(let ([x 1][y (lambda () 4)]) (+
1 (y)))] is compiled the same as the constant @scheme[5].

On some platforms, bytecode is further compiled to native code via a
@deftech{just-in-time} or @deftech{JIT} compiler.
The @tech{JIT} compiler substantially speeds programs that execute
tight loops, arithmetic on small integers, and arithmetic on inexact
real numbers. Currently, @tech{JIT} compilation is supported for x86,
x86_64 (a.k.a. AMD64), and 32-bit PowerPC processors. The @tech{JIT}
compiler can be disabled via the @scheme[eval-jit-enabled] parameter
or the @DFlag{no-jit}/@Flag{j} command-line flag for @exec{mzscheme}.

The @tech{JIT} compiler works incrementally as functions are applied,
but the @tech{JIT} compiler makes only limited use of run-time
information when compiling procedures, since the code for a given
module body or @scheme[lambda] abstraction is compiled only once. The
@tech{JIT}'s granularity of compilation is a single procedure body,
not counting the bodies of any lexically nested procedures. The
overhead for @tech{JIT} compilation is normally so small that it is
difficult to detect.

@; ----------------------------------------------------------------------

@section{Modules and Performance}

The module system aids optimization by helping to ensure that
identifiers have the usual bindings. That is, the @scheme[+] provided
by @schememodname[scheme/base] can be recognized by the compiler and
inlined, which is especially important for @tech{JIT}-compiled code.
In contrast, in a traditional interactive Scheme system, the top-level
@scheme[+] binding might be redefined, so the compiler cannot assume a
fixed @scheme[+] binding (unless special flags or declarations act as
a poor-man's module system to indicate otherwise).

Even in the top-level environment, importing with @scheme[require]
enables some inlining optimizations. Although a @scheme[+] definition
at the top level might shadow an imported @scheme[+], the shadowing
definition applies only to expressions evaluated later.

Within a module, inlining and constant-propagation optimizations take
additional advantage of the fact that definitions within a module
cannot be mutated when no @scheme[set!]
is visible at compile time. Such optimizations are unavailable in the
top-level environment. Although this optimization within modules is
important for performance, it hinders some forms of interactive
development and exploration. The
@scheme[compile-enforce-module-constants] parameter disables the
@tech{JIT} compiler's assumptions about module definitions when
interactive exploration is more important. See @secref["module-set"]
for more information.

Currently, the compiler does not attempt to inline or propagate
constants across module boundaries, except for exports of the built-in
modules (such as the one that originally provides @scheme[+]).

The later section @secref["letrec-performance"] provides some
additional caveats concerning inlining of module bindings.

@; ----------------------------------------------------------------------

@section[#:tag "func-call-performance"]{Function-Call Optimizations}

When the compiler detects a function call to an immediately visible
function, it generates more efficient code than for a generic call,
especially for tail calls. For example, given the program

@schemeblock[
(letrec ([odd (lambda (x)
                (if (zero? x)
                    #f
                    (even (sub1 x))))]
         [even (lambda (x)
                 (if (zero? x)
                     #t
                     (odd (sub1 x))))])
  (odd 40000000))
]

the compiler can detect the @scheme[odd]--@scheme[even] loop and
produce code that runs much faster via loop unrolling and related
optimizations.

Within a module form, @scheme[define]d variables are lexically scoped
like @scheme[letrec] bindings, and definitions within a module
therefore permit call optimizations, so

@schemeblock[
(define (odd x) ....)
(define (even x) ....)
]

within a module would perform the same as the @scheme[letrec] version.

Primitive operations like @scheme[pair?], @scheme[car], and
@scheme[cdr] are inlined at the machine-code level by the @tech{JIT}
compiler. See also the later section @secref["fixnums+flonums"] for
information about inlined arithmetic operations.
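
As a minimal sketch of the module form described in this section, the
following spells out the two elided definitions in full. The function
bodies are copied from the @scheme[letrec] example above; wrapping the
call in @scheme[time] is an addition for illustration only, and the
reported timings will depend on the platform and on whether the
@tech{JIT} compiler is enabled.

@schememod[
scheme/base

(code:comment @#,t{same bodies as in the letrec version above})
(define (odd x)
  (if (zero? x)
      #f
      (even (sub1 x))))

(define (even x)
  (if (zero? x)
      #t
      (odd (sub1 x))))

(code:comment @#,t{timing wrapper added for illustration only})
(time (odd 40000000))
]

Running the same definitions at the top level instead of in a module
is one way to observe the difference that module-level scoping makes,
since top-level definitions can be redefined later.
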
@; ----------------------------------------------------------------------

@section{Mutation and Performance}

Using @scheme[set!] to mutate a variable can lead to bad
performance. For example, the microbenchmark

@schememod[
scheme/base

(define (subtract-one x)
  (set! x (sub1 x))
  x)

(time
  (let loop ([n 4000000])
    (if (zero? n)
        'done
        (loop (subtract-one n)))))
]

runs much more slowly than the equivalent

@schememod[
scheme/base

(define (subtract-one x)
  (sub1 x))

(time
  (let loop ([n 4000000])
    (if (zero? n)
        'done
        (loop (subtract-one n)))))
]

In the first variant, a new location is allocated for @scheme[x] on
every iteration, leading to poor performance. A more clever compiler
could unravel the use of @scheme[set!] in the first example, but since
mutation is discouraged (see @secref["using-set!"]), the compiler's
effort is spent elsewhere.

More significantly, mutation can obscure bindings where inlining and
constant-propagation might otherwise apply. For example, in

@schemeblock[
(let ([minus1 #f])
  (set! minus1 sub1)
  (let loop ([n 4000000])
    (if (zero? n)
        'done
        (loop (minus1 n)))))
]

the @scheme[set!] obscures the fact that @scheme[minus1] is just
another name for the built-in @scheme[sub1].

@; ----------------------------------------------------------------------

@section[#:tag "letrec-performance"]{@scheme[letrec] Performance}

When @scheme[letrec] is used to bind only procedures and literals,
then the compiler can treat the bindings in an optimal manner,
compiling uses of the bindings efficiently. When other kinds of
bindings are mixed with procedures, the compiler may be less able to
determine the control flow.

For example,

@schemeblock[
(letrec ([loop (lambda (x)
                 (if (zero? x)
                     'done
                     (loop (next x))))]
         [junk (display loop)]
         [next (lambda (x) (sub1 x))])
  (loop 40000000))
]

likely compiles to less efficient code than

@schemeblock[
(letrec ([loop (lambda (x)
                 (if (zero?
                      x)
                     'done
                     (loop (next x))))]
         [next (lambda (x) (sub1 x))])
  (loop 40000000))
]

In the first case, the compiler likely does not know that
@scheme[display] does not call @scheme[loop]. If it did, then
@scheme[loop] might refer to @scheme[next] before the binding is
available.

This caveat about @scheme[letrec] also applies to definitions of
functions and constants within modules. A definition sequence in a
module body is analogous to a sequence of @scheme[letrec] bindings,
and non-constant expressions in a module body can interfere with the
optimization of references to later bindings.

@; ----------------------------------------------------------------------

@section[#:tag "fixnums+flonums"]{Fixnum and Flonum Optimizations}

A @deftech{fixnum} is a small exact integer. In this case, ``small''
depends on the platform. For a 32-bit machine, numbers that can be
expressed in 30 bits plus a sign bit are represented as fixnums. On a
64-bit machine, 62 bits plus a sign bit are available.

A @deftech{flonum} is used to represent any inexact real number.
Flonums correspond to 64-bit IEEE floating-point numbers on all
platforms.

Inlined fixnum and flonum arithmetic operations are among the most
important advantages of the @tech{JIT} compiler. For example, when
@scheme[+] is applied to two arguments, the generated machine code
tests whether the two arguments are fixnums, and if so, it uses the
machine's instruction to add the numbers (and check for overflow). If
the two numbers are not fixnums, then the next check is whether both
are flonums; in that case, the machine's floating-point operations are
used directly. For functions that take any number of arguments, such
as @scheme[+], inlining is applied only for the two-argument case
(except for @scheme[-], whose one-argument case is also inlined).

Flonums are @defterm{boxed}, which means that memory is allocated to
hold every result of a flonum computation.
Fortunately, the generational garbage collector (described later in
@secref["gc-perf"]) makes allocation for short-lived results
reasonably cheap. Fixnums, in contrast, are never boxed, so they are
especially cheap to use.

@; ----------------------------------------------------------------------

@section[#:tag "gc-perf"]{Memory Management}

PLT Scheme is available in two variants: @deftech{3m} and
@deftech{CGC}. The @tech{3m} variant uses a modern,
@deftech{generational garbage collector} that makes allocation
relatively cheap for short-lived objects. The @tech{CGC} variant uses
a @deftech{conservative garbage collector}, which facilitates
interaction with C code at the expense of both precision and speed for
Scheme memory management. The 3m variant is the standard one.

Although memory allocation is reasonably cheap, avoiding allocation
altogether is normally faster. One particular place where allocation
can be avoided sometimes is in @deftech{closures}, which are the
run-time representation of functions that contain free variables.
For example,

@schemeblock[
(let loop ([n 40000000][prev-thunk (lambda () #f)])
  (if (zero? n)
      (prev-thunk)
      (loop (sub1 n)
            (lambda () n))))
]

allocates a closure on every iteration, since @scheme[(lambda () n)]
effectively saves @scheme[n].

The compiler can eliminate many closures automatically. For example,
in

@schemeblock[
(let loop ([n 40000000][prev-val #f])
  (let ([prev-thunk (lambda () n)])
    (if (zero? n)
        prev-val
        (loop (sub1 n) (prev-thunk)))))
]

no closure is ever allocated for @scheme[prev-thunk], because its only
application is visible, and so it is inlined. Similarly, in

@schemeblock[
(let n-loop ([n 400000])
  (if (zero? n)
      'done
      (let m-loop ([m 100])
        (if (zero? m)
            (n-loop (sub1 n))
            (m-loop (sub1 m))))))
]

the expansion of the @scheme[let] form to implement @scheme[m-loop]
involves a closure over @scheme[n], but the compiler automatically
converts the closure to pass itself @scheme[n] as an argument instead.
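
To make that last conversion concrete, the following is a hand-written
sketch of roughly what it amounts to at the source level:
@scheme[m-loop] is lifted out and receives @scheme[n] as an explicit
argument, so no closure over @scheme[n] is needed. This is an
illustration only, under the assumption that the source-level
analogue is a fair picture; the compiler performs the conversion on
its own internal representation, which may differ in detail. The
@scheme[time] wrapper is likewise an addition for experimentation.

@schememod[
scheme/base

(code:comment @#,t{m-loop lifted by hand: n is passed explicitly})
(define (m-loop n m)
  (if (zero? m)
      (n-loop (sub1 n))
      (m-loop n (sub1 m))))

(define (n-loop n)
  (if (zero? n)
      'done
      (m-loop n 100)))

(code:comment @#,t{timing wrapper added for illustration only})
(time (n-loop 400000))
]

Because the compiler already applies this conversion automatically,
writing the lifted form by hand should not be necessary for the nested
loop above; the sketch is meant only to show where the free variable
goes.
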