From 763c7981a481dd367526fffde47617c808f1d1c5 Mon Sep 17 00:00:00 2001
From: Matthew Flatt <mflatt@racket-lang.org>
Date: Thu, 9 Jul 2020 13:32:10 -0600
Subject: [PATCH] add a guide implementation

This first cut is heavy on describing how backends work, because
that's fresh in mind and relevant to ongoing effort.

original commit: 964ce95dc910e6c1825b341bf05492af5903cc82
---
 IMPLEMENTATION | 812 +++++++++++++++++++++++++++++++++++++++++++++++++
 s/cmacros.ss   |  91 +++++-
 2 files changed, 897 insertions(+), 6 deletions(-)
 create mode 100644 IMPLEMENTATION

diff --git a/IMPLEMENTATION b/IMPLEMENTATION
new file mode 100644
index 0000000000..e42e7f9288
--- /dev/null
+++ b/IMPLEMENTATION
@@ -0,0 +1,812 @@
+Getting Started
+---------------
+
+Most of the Chez Scheme implementation is in the "s" directory. The
+C-implemented kernel is in the "c" directory.
+
+Some key files in "s":
+
+ * "cmacro.ss": object layouts and other global constants
+
+ * "syntax.ss": the macro expander
+
+ * "cpnanopass.ss": the main compiler
+
+ * "cp0.ss", "cptypes.ss", "cpletrec.ss", etc.: source-to-source
+   passes that apply before the main compiler
+
+ * "x86_64.ss", "arm64.ss", etc.: backends that are used by
+   "cpnanopass.ss"
+
+ * "ta6os.def", "tarm64le", etc.: one per OS-architecture combination,
+   provides platform-specific constants that feed into "cmacro.ss" and
+   selects the backend used by "cpnanopass.ss"
+
+Scheme Objects
+--------------
+
+A Scheme object is represented at run time by a pointer. The low bits
+of the pointer indicate the general type of the object, such as "pair"
+or "closure". The memory referenced by the pointer may have an
+additional tag word to further refine the pointer-tag type.
+
+See also:
+
+  Don't Stop the BiBOP: Flexible and Efficient Storage Management for
+  Dynamically Typed Languages.
+  R. Kent Dybvig, David Eby, and Carl Bruggeman.
+  Indiana University TR #400, 1994.
+
+For example, if "cmacro.ss" says
+
+  (define-constant type-pair         #b001)
+
+then that means an address with only the lowest bit set among the low
+three bits refers to a pair. To get the address where the pair content
+is stored, round *up* to the nearest word. So, on a 64-bit machine,
+add 7 to get to the `car` and add 15 to get to the `cdr`. Since
+allocation on a 64-byte machine is 16-byte aligned, the hexadecimal
+form of every pair pointer will end in "9".
+
+The `type-typed-object` type,
+
+ (define-constant type-typed-object #b111)
+
+refers to an object whose first word indicates its type. In the case
+of a Scheme record, that first word will be a record-type descriptor
+--- that is, a pointer to a record type, which is itself represented
+as a record. The based record type, `#!base-rtd` has itself as its
+record type. Since the type bits are all ones, on a 64-bit machine,
+every object tagged with an additional type workd will end in "F" in
+hexadecimal, and adding 1 to the pointer produces the address
+containing the record content (which starts with the rrecord type, so
+add 9 instead to get to the first field in the record).
+
+As another example, a vector is represented as `type-typed-object`
+pointer where the first word is a fixnum. That is, a fixnum used a
+type word indicates a vector. The fixnum value is the vector's length
+in wordobjects, but shifted up by 1 bit, and then the low bit is set
+to 1 for an immutable vector.
+
+Most kinds of Scheme values are represented records, so the layout is
+defined by `define-record-type` and similar. For the primitive object
+types that are not records (and even a few that are), the layouts are
+defined in "camcros.ss". For example, an `exactnum` (i.e., a complex
+number with exact real and imaginary components) is defined as
+
+ (define-primitive-structure-disps exactnum type-typed-object
+   ([iptr type]
+    [ptr real]
+    [ptr imag]))
+
+The `type-typed-object` in the first line indicates that an exactnum
+is represented by a pointer that is tagged with `type-typed-object`,
+and so we should expect the first first to be a type word. That's why
+the first field above is `type`, and it turns out that it will always
+contain the value `type-inexactnum`. The `iptr` type for `type` means
+"a pointer-sized signed integer". The `ptr` type for `real` and `imag`
+means "pointer" or "Scheme object".
+
+Functions and Calls
+-------------------
+
+Scheme code does not use the C stack, except to the degree that it
+interacts with C functions. Instead, the Scheme continuation is a
+separate, heap-allocated, linked list of stack segments. Locally, you
+can just view the continuatiton as a stack and assume that overflow
+and continuation operations are handled as needed at the boundaries.
+
+See also:
+ 
+ Representing Control in the Presence of First-Class Continuations.
+ Robert Hieb, R. Kent Dybvig, and Carl Bruggeman.
+ Programming Language Design and Implementation, 1990.
+
+ Compiler and Runtime Support for Continuation Marks.
+ Matthew Flatt and R. Kent Dybvig.
+ Programming Language Design and Implementation, 2020.
+
+To the degree that the runtime system needs global state, that state
+is in the thread context (so, it's thread-local), which we'll
+abbreviate as "TC". Some machine register is desgined as the `%tc`
+register, and it's initialized on entry to Scheme code. For the
+defintion of TC, see `(define-primitive-structure-disps tc ...)` in
+"cmacro.ss".
+
+The first several fields of TC are virtual registers that may be
+assigned to machine registers, in which case the TC and registers are
+synced on entry and exit from Scheme code, including when calling
+kernel functionality from Scheme code. In particular, the SFP (Scheme
+frame pointer) virtual register must be assigned to a real register,
+because it's the Scheme stack pointer. The TC and SFP registers are
+the only two that absolutely must be registers, but AP (allocation
+pointer) and TRAP registers are also good candidates on architectures
+where plenty of registers are available.
+
+The Scheme stack grows up, and SFP points to the beginning (i.e., the
+low address) of the current stack frame. The first word of a stack
+frame is the return address, so a frame looks like this:
+
+                ^
+                |          (higher addresses)
+              future
+              frames
+          |------------|
+          |   var N    |
+          |------------|
+          |     ...    | ....
+          |------------|
+          |   var 1    | SFP[1]
+          |------------|
+          |  ret addr  | SFP[0]
+ SFP ->   |------------|
+             previous
+              frames 
+                |          (lower addresses)
+                v
+
+On entry to a Scheme function, a check ensures that the difference
+between SFP and the end of the current stack segment is big enough to
+accomodate the (spilled) variables of the called function, plus enough
+slop to deal with some primitive operations.
+
+A non-tail call moves SFP past all the live variables of the current
+function, installs the return address as as pointer within the current
+function, and then jumps to the called function. Function calls and
+returns do not use machine "call" and "return" instructions;
+everything is just a "jump". ("Call" and "return" instructions are
+used for for C interactions.) It's the caller's responsibity to reset
+SFP back on return, since the caller knows how much it moved SFP
+before calling.
+
+The compiler can use a register for the return address instead of
+immediately installing it in SFP[0] on a call. That mode is triggered
+by giving one of the regisers the name `%ret` (as described in
+"Machine Registers" below). Currently, however, the called Scheme
+function will immediatelly copy the register into SFP[0], and it will
+always return by jumping to SFP[0]. So, until the compiler improves to
+deal with leaf functions differently, using a return register can help
+only with hand-coded leaf functions that don't immediately move the
+return register into SFP[0].
+
+There are two ways that control transitions from C to Scheme: an
+initial call through `S_generic_invoke` (see "scheme.c") or via a
+foreign callable. Both of those go through `S_call_help` (see
+"schlib.c"). The `S_generic_invoke` function calls `S_call_help`
+directly. A foreign callable is represented by generated code that
+converts arguments and then calls `S_call_help` to run the Scheme
+procedure that is wrapped by the callable.
+
+The `S_call_help` function calls the hand-coded `invoke` code (see
+"cpnanopass.ss"). The `invoke` code sets things up for the Scheme
+world and jumps to the target Scheme function. When control returns
+from the called Scheme function back to `invoke`, `invoke` finishes
+not with a C return, but by calling `S_return` (see "schlib.c"), which
+gets back to `S_call_help` through a longjmp. The indirect return
+through longjmp helps the Scheme stack and C stack be independent,
+which is part of how Scheme continuations interact with foreign
+functions.
+
+For a non-tail call in Scheme, the return address is not right after
+the jump instruction for the call. Instead, the return address is a
+little later, and there's some data just before that return address
+that describes the calling function's stack frame. The GC needs that
+information, for example, to know which part of the current Scheme
+stack is populated by live variables. The data is represented by
+either the `rp-header` or `rp-compact-header` (see "cmacro.ss") shape.
+So, when you disassemble code generated by the Chez Scheme compiler,
+you may see garbage instructions mingled with the well-formed
+instructions, but the garbage will always be jumped over.
+
+Compilation Pipeline
+--------------------
+
+Compilation
+
+ * starts with an S-expression (possibly with annotations for source
+   locations),
+
+ * converts it to a syntax object (see "syntax.ss"),
+
+ * expands macros (see "syntax.ss") and produces an `Lsrc`
+   representation in terms of core forms (see `Lsrc` in
+   "base-lang.ss"),
+
+ * performs front-end optimizations on that representation (see
+   "cp0.ss", "cptypes.ss", etc.),
+
+ * and then compiles to machine code (see "cpnanopass.ss"), which
+   involves many individual passes that convert through many different
+   intermediate forms (see "np-language.ss").
+
+See also:
+
+   Nanopass compiler infrastructure.
+   Dipanwita Sarkar.
+   Indiana University PhD dissertation, 2008
+
+   A Nanopass Framework for Commercial Compiler Development.
+   Andrew W. Keep.
+   Indiana University PhD dissertation, 2013
+
+Note that the core macro expander always converts its input to the
+`Lsrc` intermediate form. That intermediate form can be converted back
+to an S-expression (see "uncprep.ss", whose name you should parse as
+"undo-compilerpass-representation").
+
+In the initial intermediate form, `Lsrc`, all primitive operations are
+represented as calls to functions. In later passes in "cpnanopass.ss",
+some primitive operations get inlined into a combination of core
+forms, some of which are `inline` forms. The `inline` forms eventually
+get delivered to a backend for instruction selection. For example, a
+use of safe `fx+` is inlines as argument checks that guard an `(inline
++ ...)`, and the `(inline + ...)` eventually becomes a machine-level
+addition instruction.
+
+Machine Registers
+-----------------
+
+Each backend file, such as "x86_64.ss" or "arm64.ss", starts with a
+description of the machine's registers. It has three parts in
+`define-registers`:
+
+(define-registers
+  (reserved
+    <reg>
+    ...)
+  (allocable
+    <reg>
+    ...)
+  (machine-dependent
+    <reg>
+    ...))
+
+Each <reg> has the form
+
+    [<name> ... <preserved? / callee-saved?> <num> <type>]
+
+ * The <name>s in one <reg> will all refer to the same register, and
+   the first <name> is used as the canonical name. By convention, each
+   <name> starts with `%`. The compiler gives specific meaning to a
+   few names listed below, and a backend can use any names otherwise.
+
+ * The information on preserved (i.e, callee-saved) registers helps
+   the compiler save registers as needed before some C interactons.
+
+ * The <num> value is for the private use of the backend. Typically,
+   it corresponds to the register's representation within machine
+   instructions.
+
+ * The <type> is either 'uptr or 'fp, indicating whether the register
+   holds a pointer/integer value (i.e., an unsigned integer that is
+   the same size as a pointer) or a floating-point value. For
+   `allocatable` registers, the different types of registers represent
+   different allocation pools.
+
+The `reserved` section describes registers that the compiler needs and
+that will be used only for a designated purpose. The registers will
+never be allocated to Scheme variables in a compiled function. The
+`reserved` section must start with `%tc` and `%sfp`, and it must list
+only registers with a recognized name as the canonical name.
+
+The `machine-dependent` section describes additional registers that
+also will not be allocated. They are also not saved automatically for
+C interactions.
+
+The `allocable` section describes registers that may be mapped to
+specific purposes by using a recognized canonical name, but generally
+these registers are allocated as needed to hold Scheme variables and
+temporaries (including registers with recognized names in situations
+where the recognized purpose is not needed). Registers in this
+category are automatically saved as needed for C interactions.
+
+The main recognized register names, roughly in order of usefulness as
+real machine registers:
+
+ %tc - the first reserved register, must be mapped as reserved
+ %sfp - the second reserved register, must be mapped as reserved
+ %ap - allocation pointer (for fast bump allocation)
+ %trap - counter for when to check signals, including GC signal
+
+ %eap - end of bump-allocatable region
+ %esp - end of current stack segment
+
+ %cp - used for a procedure about to be called
+ %ac0 - used for argument count and call results
+ 
+ %ac1 - various scratch and communication purposes
+ %xp  - ditto
+ %yp  - ditto
+
+Each of the registers maps to a slot in the TC, so they are sometimes
+used to communicate between compiled code and the C-implemented
+kernel. For example, `S_call_help` expects the function to be called
+in AC1 with the argument count in AC0 (as usual).
+
+A few more names are recognized to direct the compiler in different
+ways:
+
+ %ret - use a return register insteda of just SFP[0]
+
+ %reify1, %reify2 - a kind of manual allocation of registers for
+                    certain hand-coded routines, which otherwise could
+                    run out of registers to use
+
+Variables and Register Allocation
+---------------------------------
+
+A variables in Scheme code can be allocated either to a register or to
+a location in the stack frame, and the same goes for temporaries that
+are needed to evaluate subexpressions. Naturally, variables and
+temporaries with non-overlapping extents can be mapped to the same
+register or frame location. Currently, only variables with the same
+type, integer/pointer versus floating-point, can be allocated to the
+same frame location.
+
+An early pass in the compiler converts mutable variables to
+pair-valued immutable variables, but assignment to variables is still
+allowed within the compiler's representation. (The early conversion of
+mutables variables ensures that mutation is properly shared for, say,
+variables in captured continuations.) That is, even though variables
+and temporaries are typically assigned only once, the compiler's
+intermediate representation is not a single-asssignment form like
+SSA.
+
+Each variable or temporary will be allocated to one spot for it's
+whole lifetime. So, from the register-allocation perspective, it's
+better to use
+
+   (set! var1 ...)
+   ... var1 ...
+   ... code that doesn't use var1 ...
+   (set! var2 ...)
+   ... var2 ...
+
+than to reuse var1 like
+
+   (set! var1 ...)
+   ... var1 ...
+   ... code that doesn't use var1 ...
+   (set! var1 ...)
+   ... var1 ...
+
+Intermediate code in later passes of the compiler can also refer to
+registers directly, and those uses are taken into account by the
+register allocator.
+
+Overall, the allocator see several kinds of "variables":
+
+ * real registers;
+
+ * Scheme variables and temporaries as represented by `uvar`s, each of
+   which is eventually allocated to a real register or to a frame
+   location;
+
+ * unspillable varriables, each of which must be allocated to a real
+   register; these are introduced by a backend during the
+   instruction-selection pass, where an instruction may require a
+   register argument; and
+
+ * pre-colored unspillable variables, each which must be allocated to
+   a specific real register; these are introduced by a backend where
+   an instruction may require an argument in a specific registers.
+
+The difference between a pre-colored unspillable and just using the
+real register is that you declare intent to the register allocator,
+and it can sometimes tell you if things go wrong. For example,
+
+        (set! %r1 v1)
+        (set! must-be-r1 v2)
+        ... use %r1 and must-be-r1 ...
+
+has clearly gone wrong. In contrast, the register allocator thinks
+that
+
+        (set! %r1 v1)
+        (set! %r1 v2)
+        ... use %r1, sometimesexpecting v1 and sometimess v2 ...
+
+looks fine, and it may optimize away the first assignment. [Note:
+Optimized-away assignments are one of the most confusing potential
+results of register-use mistakes.]
+
+At the point where the register allocator runs, a Scheme program has
+been simplified to a sequence of assignment forms and expression
+forms, where the latter are either value-producing and sit on the
+right-hand side of an assignment or they are effectful and sit by
+themselves. The register allocator sees the first assignment to a
+variable/register as the beginning of its live range and the last
+reference as the end of its live range. In some cases, an instruction
+is written with "dummy" arguments just to expose the fact that it
+needs those arguments to stay live; for example, a jump instruction
+that implements a function-call return conceptually needs to consume
+the result-value registers (because those values need to stay live
+throgh the jump), even though the machine-level jump instruction
+doens't refer to the result values. The `kill` dummy instruction can
+be used with `set!` to indicate that a variable is trashed, but the
+`kill` is discarded after register allocation. It's also possible for
+an insstruction to produce results in multiple registers. So, besides
+using dummy arguments and `kill`, an instruction form can have a
+`info-kill*-live*` record attached to it, which lists the `kill*`
+variables that the expression effectively assigns and the `live*`
+variables that the expression effectively references. (Note: a `set!`
+form cannot itself have a `info-kill*-live*` record attached to it,
+because the info slot for `set!` be an `info-live` record that records
+computed live-variable information.)
+
+As a first pass, the register allocator can look at an intermediate
+instruction sequence and determine that there are too many live
+variables, so some of them need to be spilled. The register allocator
+does that before consulting the backend. So, some of the variables in
+the intermediate form will stay as `uvar`s, and some will get
+converted to a frame reference of them form SFP[pos]. When the backend
+is then asked to select an instruction for an operation that cosumes
+some variables and delivers a result to some destination variable, it
+may not be able to work with one or more of the arguments or
+destination in SFP[pos] form; in that case, it will create an
+unspillable and assign the SFP[pos] value to the unspillable, then use
+the unspillable in a generated instruction sequence. Of course,
+introducing unspillables may mean that some of the remaining `uvar`s`
+to no longer fit in registers after all; when that happens, the
+register allocator will discard the tentative instruction selection
+and try again after spilling for `uvar`s (which will then create even
+more unspillables locally, but those will have short lifetimes, so
+register allocation will eventually succeed). Long story short, the
+backend can assume that a `uvar` wil be replaced later by a register.
+
+When reading the compiler's implementation, `make-tmp` in most passes
+creates a `uvar` (that may eventually be spilled to a stack-frame
+slot). A `make-tmp` in the instruction-selection pass, however, makes
+an unspillable. In earlies passes of the compiler, new temporaries
+must be bound with a `let` form (i.e., a `let` in the intermediate
+repressentation) before they can be used; in later passes, a `set!`
+initializes a temporary.
+
+In all but the very earliest passes, an `mref` form represents a
+memory reference. Typically, a memory reference consistents of a
+variable and an offset. The general form is two variables and an
+offset, all of which are added to obtain an address, because many
+machine support indexed memory references of that form. The `%zero`
+pseudo-register is used as the second variable in an general `mref`
+when only one variable is needed. A variable or memory reference also
+has a type, 'uptr or 'fp, in the same way as a register. So, a
+variable of a given type may be allocated to a register of that type,
+or it may be spilled to a frame location and then referenced through
+an `%sfp`-based `mref` using that type. In early passes of the
+compiler, `mref`s can be nested and have computed pieces (such as
+calulating the offset), but a later pass will introduce temporaries to
+flatten `mref`s into just variable/register and immediate-integer
+components.
+
+A backend may introduce an unspillable to hold an `mref` value for
+various reasons: because the relevant instruction suports only one
+register plus an offset instead of two registers, because the offset
+is too big, because the offset does not have a required alignment, and
+so on.
+
+Instruction Selection: Compiler <-> Backend
+-------------------------------------------
+
+For each primitive that the compiler will reference via `inline`,
+there must be a `declare-primitive` in "np-language.ss". Each
+primitive is either an `effect`, a `value` that must be used on the
+right-hand side of a `set!` or a `pred` that must be used immediately
+in the test position of an `if` --- where `set!` and `if` here refer
+to forms in the input intermediate language of the
+instruction-selection compiler pass (see `L15c` in "np-languages.ss").
+Most primitives potentially correspond to a single machine
+instruction, but any of them can expand to any number of instructions.
+
+The `declare-primitive` form binds the name formed by adding a `%`
+prefix. So, for example,
+
+  (declare-primitive logand value #t)
+
+binds `%logand`. The `(%inline name ,arg ...)` macro expands to
+`(inline ,null-info ,%name ,arg ...)` macro, so that's why you don't
+usually see the `%` written out.
+
+The backend implementation of a prrimitive is a function that takes as
+many arguments as the `inline` form, plus an additional initial
+argument for the destination in the case of a `value` primitive on the
+right-hand side of a `set!`. The result of the primitive function is a
+list of instructions, where an instruction is either a `set!` or `asm`
+form in the output intermediate representation of the
+instruction-selection pass (see `L15d` in "np-languages.ss"). The
+`asm` form in the output language has a function that represents the
+instruction; that function again takes the arguments of the `asm`
+form, plus an extra leading argument for the destiination if it's on
+the right-hand side of a `set!` (plus an argument before that for the
+machine-code sequence following the instruction, and it returns an
+extended machine-code sequence; that is, a machine-code sequence is
+built end-to-start).
+
+An instruction procedure typically has a name prefixed with `asm-`.
+So, for example, the `%logand` primitive's implementation in the
+backend may produces a result that includes a reference to an
+`asm-logand` instruction procedure. Or maybe the machine instruction
+for logical "and" has a variant that sets condition codes and one that
+doesn't, and they're both useful, so `asm-logand` actually takes a
+curried bboolean to pick; in thatt case, `%logand` returns an
+instruction with `(asm-logand #f)`, which produces a function that
+takes the destination and `asm` arguments. Whether an argument to
+`asm-logand` is suitable for currying or inclusion as an `asm`
+argument depends on whether it makes sense in the `asm` grammar and
+whether it needs to be exposed for register allocation.
+
+The compiler may refer to some instructions directly. Of particular
+importance are `asm-move` and `asm-fpmove`, which are eventually used
+for `set!` forms after the instruction-selection pass. That is, the
+output of instruction selection still uses `set!`, and then those are
+converted to memory and register-moving instructions later. The
+instruction-selecton pass must ensure that any surving `set!`s are
+simple enough, though, to map to instructions without further register
+allocation. In other words, the backend instruction selector should
+only return `set!`s as instructions when they are simple enough, and
+it should generate code to simplify the ones that don't start out
+simple enough. To give the backend control over `set!`s in the *input*
+of instruction selection, those are send to the backend as `%move` and
+`%fpmove` primitives (which may simply turn back into `set!s` using
+the output language, or they may get simplified). When the compiler
+generates additional `set!`s after instruction selection, it generates
+only cnstrainted forms, where target or source `mref`s have a single
+register and a small, aligned offset.
+
+To organize all of this work, a backend implementation like
+"x86_64.ss" or "arm64.ss" must be organized into three parts, which
+are implemented by three S-expressions:
+
+ * `define-registers`
+
+ * a module that implements primitives (that convert to instructions),
+   installing them with `primitive-handler-set!`
+
+ * a module that implements instructions (that convert to machine
+   code), a.k.a. the "assembler", defining the instructions as
+   functions
+
+That last module must also implement a few things that apply earlier
+than assembling (or even instruction selection), notably including
+`asm-foreign-call` and `asm-foreign-callable`. For more on those two,
+see "Foreign Function ABI" below.
+
+To summarize the interface between the compiler and backend is:
+
+ primitive : L15c.Triv ... -> (listof L15d.Effect)
+
+ instruction : (listof code) L16.Triv ... -> (listof code)
+
+A `code` is mostly bytes to be emitted, but it also contains
+relocation entries and human-readable forms that are printed when
+assembly printing is enabled. The `aop-cons*` helper macro (in
+"cpnanopass.ss") is like `cons*`, but it skips its first argument if
+human-readable forms aren't being kept.
+
+Instruction Selection: Backend Structure
+----------------------------------------
+
+To further organize the work of instruction selection and assembly,
+all of the current backends use a particular internal structure:
+
+ * primitives are defined through a `define-instruction` form that
+   helps with pattern matching and automatic conversion/simplification
+   of arguments; and
+
+ * instructions are defined as functions that use an `emit` form,
+   which in turn dispatches to function that represent actual
+   machine-level operations, where the functions for machine-level
+   operations typically have names ending in `-op`.
+
+Consider the "arm64.ss" definition fo `%logand`, which should accept a
+destination (here called "z") and two arguments:
+
+  (define-instruction value (logand)
+    [(op (z ur) (x ur) (y funkymask))
+     `(set! ,(make-live-info) ,z (asm ,info ,(asm-logand #f) ,x ,y))]
+    [(op (z ur) (x funkymask) (y ur))
+     `(set! ,(make-live-info) ,z (asm ,info ,(asm-logand #f) ,y ,x))]
+    [(op (z ur) (x ur) (y ur))
+     `(set! ,(make-live-info) ,z (asm ,info ,(asm-logand #f) ,x ,y))])
+
+The A64 instruction set supports a logical "and" on either two
+registers or a register and an immediate, but the immediate value has
+to be representable with a funky encoding. The pattern forms above
+require that the destination is always a register/variable, and either
+of the arguments can be a literal that fits into the funky encoding or
+a register/variable. The `define-instruction` macro is itself
+implemented in "arm64.ss", so it can support specialized patterns like
+`funkymask`.
+
+If a call to this `%logand` function is triggered by a form
+
+  `(set! ,info (mref ,var1 ,%zero 8) ,var2 ,7)
+
+then the code generated by `define-instruction` will notice that the
+first argument is not a register/variable, while 7 does encode as a
+mask, so it will arrange to produce the same value as
+
+   (let ([u (make-tmp 'u)])
+      (list
+        (%logand u var2 7)
+        `(set! ,(make-live-info) (mref ,var1 ,%zero 8) ,u)))
+
+Then, the first case of `%logand` will match, and the result will be
+the same as
+
+   (let ([u (make-tmp 'u)])
+      (list
+        `(set! ,(make-live-info) ,u (asm,(asm-logand #f) ,var2 ,7)
+        `(set! ,(make-live-info) (mref ,var1 ,%zero 8) ,u))))
+
+If the offset 8 were instead a very large number, then auto-conversion
+would have to generate an `add` into a second temporary variable.
+Otherwise, `asm-move` would not be able to deal with the generated
+`set!` to move `u` into the destination. The implementation of
+`define-instruction` uses a `mem->mem` helper function to simplify
+`mref`s. In the "arm32.ss" backend, there's an additional `fpmem`
+pattern and `fpmem->fpmem` helper, because the constraints on memory
+references for floating-point operations are different than than the
+constraints on memory references to load an integer/pointer.
+
+Note that `%logand` generates a use of the same `(asm-logand #f)`
+instruction for the register--register and the register--immediate
+cases. A more explicit distinction could be made in the output of
+instruction selection, but delaying the choice is anologous to how
+assembly languages often use the same mnemonic for related
+instructions. The `asm-move` and `asm-fpmove` must accomodate
+register--memory, memory--register, and register--register cases,
+because `set!` forms after instruction selection can have those
+variants.
+
+The `asm-logand` instruction for "arm64.ss" is implemented as
+
+   (lambda (set-cc?)
+     (lambda (code* dest src0 src1)
+       (Trivit (dest src0 src1)
+          (record-case src1
+            [(imm) (n) (emit andi set-cc? dest src0 n code*)]
+            [else (emit and set-cc? and src0 src1 code*)]))))
+
+The `set-cc?` argument coresponds to the `#f` in `(asm-logand #f)`.
+The inner lambda reprsents the instruction --- that is, it's the
+function in an `asm` form. The function takes `code*` first, which is
+a list of machine codes for all instructions after the `asm-logand`.
+The `dest` argument corresponds to the result register, and `src0` and
+`src1` are the two arguments.
+
+The `Trivit` form is a bridge between intermediate languages. It takes
+variables that are boudn already and it rebinds them for the body of
+the `Trivit` form. Each rebinding translate the argument from an `L16`
+`Triv` record to a list that starts 'reg, 'disp, 'index, 'literal, or
+'literal@. (Beware of missing this step, and beware of backends that
+sometimes intentionally skip this step because the original is known
+to be, say, a register.)
+
+The `emit` form is defined in the "arm64.ss" backend and others, and
+it's just a kind of function call that cooperates with `define-op`
+declarations. For example, `(define-op andi logical-op arg1 ...)`
+binds `andi-op`, and `(emit andi arg2 ...)` turns into `(logical-op
+'and arg1 ... arg2 ...)`; that is, `andi-op` first receives the symbol
+'andi, then arguments listed at `define-op`, then arguments listed at
+`emit`. The last argument is conventionally `code*`, which is the code
+list to be extended with new code at its beginning (because the
+machine-code list is built end to start). The bounce from `andi-op` to
+`logicial-op` is because many instructions follow a similar encoding,
+such as different bitwise-logicial operations like `and` and `or`.
+Meanwhile, `logical-op` uses an `emit-code` form, which is also in
+"arm64.ss" and other backends, that calls `aop-cons` with a suitable
+human-readable addition.
+
+All of that could be done with just plain functions, but the macros
+help with boilerplate and arrange some helpful compile-time checking.
+
+Foreign Function ABI
+--------------------
+
+Support for foreign procedures and callables in Chez Scheme boils down
+to foriegn calls and callable stubs for the backend. A backend's
+`asm-foreign-call` and `asm-forieng-callbable` function receives an
+`info-foreign` record, which describes the argument and result types
+in relatively primitive forms:
+
+ * double
+ * float
+ * [signed] integer of {8,16,32,64} bits
+ * generic pointer or scheme-object (to treat as a generic pointer)
+ * a "&" form, which is a pointer on the Scheme side and by-value on
+   the C side, and can be a struct/union; layout info is reported
+   by `$ftd-...` helpers
+
+If the result type is a "&" type, then the function expects an extra
+first argument on the Scheme side. That extra argument is reflected by
+an extra pointer type at the statr of the argument list, but the "&"
+type is also left for the result type as an indication about that
+first argument. In other words, the result type is effectively
+duplicated in the result (matching the C view) and an argument
+(mathing the Scheme view) --- so, overall, the given type matches
+neither the C nor Scheme view, but either view can be reconstructed.
+
+The compiler creates wrappers to take care of further conversion
+to/from these primitive shapes.
+
+The `asm-foreign-call` function returns 5 values:
+
+ * allocate : -> L13.Effect
+
+   Any needed setup, such as allocating C stack space for arguments.
+
+ * c-args : (listof (uvar/reg -> L13.Effect))
+
+   Generate code to convert each argument. The generated code will be
+   in reverse order, with the first argument last, because that tends
+   to improve register allocation.
+
+   If the result type is "&", then `c-arg`s must include a function to
+   accept the pointer that receives the function result (i.e., the
+   length of `c-args` should match the length of the agument-type list
+   in the given `info-foreign`). The pointer may need to be stashed
+   somewhere by the generated code for use after the function returns.
+
+   The use of the src variable for an argument depends on its type:
+
+     - double or float: an 'fp-typed variable
+     - integer or pointer: a 'uptr-typed variable that has the integer
+     - "&": a 'uptr-typed variable that has a pointer to the argument
+
+ * c-call : uvar/reg boolean -> L13.Effect
+
+   Generate code to call the C function whose address is in the given
+   register. The boolean if #t if the call can assume that the C
+   function is not a varargs function on platformss where varargs
+   support is the default.
+
+ * c-result : uvar/reg -> L13.Effect
+
+   Similar to the conversions in `c-args`, but for the result, so the
+   given argument is a destination variable. This function will not be
+   used if the foreign call's result type is void. If the result if a
+   floating-point value, the provided destination variable has type
+   'fp.
+
+ * allocate : -> L13.Effect
+
+   Any needd teardown, such as deallocating C stack space.
+
+The `asm-foreign-callable` function returns 4 values:
+
+ * c-init : -> L13.Effect
+
+   Anything that needs to be done just before transitioning into
+   Scheme, such as saving preserved registers that call be used within
+   the callable stub.
+
+ * c-args : (listof (uvar/reg -> L13.Effect))
+
+   Similar to the `asm-foreign-call` result case, but each function
+   should fill a destination variable form platform-specific argument
+   registers and stack locations.
+
+   If the result type is "&", then `c-arg`s must include a function to
+   produce a pointer that receives the function result. Space for this
+   pointer may needed to be allocated (probably on the C stack),
+   possibly in a way that can be found on return.
+
+   The use of the destination variable is different than for the
+   `asm-foreign-call` in the case of floating-point arguments:
+
+     - double or float: pointer to a flonum to be filled with the value
+     - integer or pointer: a 'uptr-typed variable to receive the value
+     - "&": a 'uptr-typed variable to receive the pointer
+
+ * c-result : (uvar/reg -> L13.Effect) or (-> L13.Effect)
+
+   Similar to the `asm-foreign-call` arrgument cases, but for a
+   floating-point result, the given result register holds pointer to a
+   flonum. Also, if the function result is a "&" or void type, then
+   `c-result` takes no argument (because the destination pointer was
+   already produced or there's no result).
+
+ * c-return : (-> L13.Effect)
+
+   Generate the code for a C return, including any teardown needed to
+   balance `c-init`.
diff --git a/s/cmacros.ss b/s/cmacros.ss
index 0561611243..ef99a6e9e1 100644
--- a/s/cmacros.ss
+++ b/s/cmacros.ss
@@ -12,6 +12,9 @@
 ;;; See the License for the specific language governing permissions and
 ;;; limitations under the License.
 
+;; ---------------------------------------------------------------------
+;; Initial helper macros and functions:
+
 (define-syntax disable-unbound-warning
   (syntax-rules ()
     ((_ name ...)
@@ -190,6 +193,13 @@
   (lambda (x)
     (syntax-error x "misplaced aux keyword")))
 
+;; ---------------------------------------------------------------------
+;; Libspec representation:
+
+;; A libspec is a description of a runtime function to be represenced
+;; by machine code, where the linker will find the library funtion and
+;; update code to reference it as code is loaded/linked
+
 ;; layout of our flags field:
 ;; bit 0: needs head space?
 ;; bit 1 - 9: upper 9 bits of index (lower bit is the needs head space index
@@ -290,6 +300,9 @@
          (fxlogand (libspec-flags libspec)
            (fxlognot (fxsll 1 (constant libspec-does-not-expect-headroom-index))))))]))
 
+;; ---------------------------------------------------------------------
+;; More helpers:
+
 (define-syntax return-values
   (syntax-rules ()
     ((_ args ...) (values args ...))))
@@ -328,6 +341,13 @@
                  [(_ foo e1 e2) e1] ...
                  [(_ bar e1 e2) e2]))))])))
 
+(define-syntax log2
+  (syntax-rules ()
+    [(_ n) (integer-length (- n 1))]))
+
+;; ---------------------------------------------------------------------
+;; Version and machine types:
+
 (define-constant scheme-version #x0905031F)
 
 (define-syntax define-machine-types
@@ -370,9 +390,8 @@
 
 (define-constant machine-type-name (cdr (assv (constant machine-type) (constant machine-type-alist))))
 
-(define-syntax log2
-  (syntax-rules ()
-    [(_ n) (integer-length (- n 1))]))
+;; ---------------------------------------------------------------------
+;; Some object-layout constants:
 
 ; a string-char is a 32-bit equivalent of a ptr char: identical to a
 ; ptr char on 32-bit machines and the low-order half of a ptr char on
@@ -425,6 +444,9 @@
 
 (define-constant list-bits-mask (- (expt 2 (constant ptr-alignment)) 1))
 
+;; ---------------------------------------------------------------------
+;; Fasl encoding tags:
+
 ;;; fasl codes---see fasl.c for documentation of representation
 
 (define-constant fasl-type-header 0)
@@ -495,6 +517,12 @@
   (bytevector (constant fasl-type-header) 0 0 0
     (char->integer #\c) (char->integer #\h) (char->integer #\e) (char->integer #\z)))
 
+;; ---------------------------------------------------------------------
+;; Relocation repersentation
+
+;; A recolcation tells the linker where to update machine code to link
+;; in library functions, literal Scheme objects, etc.
+
 (define-syntax define-enumerated-constants
   (lambda (x)
     (syntax-case x ()
@@ -543,6 +571,9 @@
 
 (macro-define-structure (reloc type item-offset code-offset long?))
 
+;; ---------------------------------------------------------------------
+;; Some flags to cooperate with the C-implemented kernel:
+
 (define-constant SERROR    #x0000)
 (define-constant STRVNCATE #x0001) ; V for U to avoid msvc errno.h conflict
 (define-constant SREPLACE  #x0002)
@@ -621,6 +652,9 @@
 (define-constant ERROR_VALUES 7)
 (define-constant ERROR_MVLET 8)
 
+;; ---------------------------------------------------------------------
+;; GC constants
+
 (define-syntax define-alloc-spaces
   (lambda (x)
     (syntax-case x (real swept unswept unreal)
@@ -714,7 +748,10 @@
 (define-constant countof-phantom 28)
 (define-constant countof-types 29)
 
-;;; type-fixnum is assumed to be all zeros by at least by vector, fxvector,
+;; ---------------------------------------------------------------------
+;; Tags that are part of the pointer represeting an object:
+
+;;; type-fixnum is assumed to be all zeros by at least vector, fxvector,
 ;;; and bytevector index checks
 (define-constant type-fixnum           0) ; #b100/#b000 32-bit, #b000 64-bit
 (define-constant type-pair         #b001)
@@ -725,6 +762,9 @@
 (define-constant type-immediate    #b110)
 (define-constant type-typed-object #b111)
 
+;; ---------------------------------------------------------------------
+;; Immediate values; note that these all end with `type-immediate`:
+
 ;;; note: for type-char, leave at least fixnum-offset zeros at top of
 ;;; type byte to simplify char->integer conversion
 (define-constant type-boolean           #b00000110)
@@ -740,6 +780,10 @@
 (define-constant ptr sbwp               #b01001110)
 (define-constant ptr ftype-guardian-rep #b01010110)
 
+;; ---------------------------------------------------------------------
+;; Initial type word in an object that is represented by a
+;; `type-typed-object` pointer:
+
 ;;; on 32-bit machines, vectors get two primary tag bits, including
 ;;; one for the immutable flag, and so do bytevectors, so their maximum
 ;;; lengths are equal to the most-positive fixnum on 32-bit machines.
@@ -781,6 +825,9 @@
 (define-constant type-phantom          #b01111110)
 (define-constant type-record                #b111)
 
+;; ---------------------------------------------------------------------
+;; Bit and byte offsets for different types of objects:
+
 (define-constant code-flag-system           #b0000001)
 (define-constant code-flag-continuation     #b0000010)
 (define-constant code-flag-template         #b0000100)
@@ -911,6 +958,9 @@
            (fxsll (constant code-flag-single-valued)
                   (constant code-flags-offset))))
 
+;; ---------------------------------------------------------------------
+;; Masks and offsets for checking types:
+
 ;; type checks are generally performed by applying the mask to the object
 ;; then comparing against the type code.  a mask equal to
 ;; (constant byte-constant-mask) implies that the object being
@@ -1041,6 +1091,9 @@
 (define-constant stencil-vector-mask-bits    (fx- (constant ptr-bits)
                                                   (constant stencil-vector-mask-offset)))
 
+;; ---------------------------------------------------------------------
+;; Helpers to define object layouts:
+
 ;;; record-datatype must be defined before we include layout.ss
 ;;; (maybe should move into that file??)
 ;;; We allow Scheme inputs for both signed and unsigned integers to range from
@@ -1264,6 +1317,9 @@
                      (define-constant name-field-disp field-disp)
                      ...))))))])))
 
+;; ---------------------------------------------------------------------
+;; Object layouts:
+
 (define-primitive-structure-disps typed-object type-typed-object
   ([iptr type]))
 
@@ -1615,6 +1671,9 @@
     (with-syntax ([type (datum->syntax #'* (filter-scheme-type 'string-char))])
       #''type)))
 
+;; ---------------------------------------------------------------------
+;; Flags and structures for the compiler's internal communcation:
+
 (define-constant annotation-debug   #b0001)
 (define-constant annotation-profile #b0010)
 (define-constant annotation-all     #b0011)
@@ -1907,7 +1966,8 @@
   (syntax-rules ()
     ((_ x) (let ((t x)) (and (pair? t) (symbol? (car t)))))))
 
-;;; heap/stack mangement constants
+;; ---------------------------------------------------------------------
+;; Heap/stack mangement constants:
 
 (define-constant collect-interrupt-index 1)
 (define-constant timer-interrupt-index 2)
@@ -2017,6 +2077,9 @@
          (lambda () (mutex-release $tc-mutex) (enable-interrupts)))])
     (identifier-syntax critical-section)))
 
+;; ---------------------------------------------------------------------
+;; More object-representation flags and offsets:
+
 (define-constant hashtable-default-size 8)
 
 (define-constant eq-hashtable-subtype-normal 0)
@@ -2046,6 +2109,9 @@
 (define-constant time-collector-cpu 5)
 (define-constant time-collector-real 6)
 
+;; ---------------------------------------------------------------------
+;; General helpers for the compiler and runtime implementation:
+
 (define-syntax default-run-cp0
   (lambda (x)
     (syntax-case x ()
@@ -2417,7 +2483,20 @@
        #`(let ([x arg])
            (unless (pred x)
              ($oops who #,(format "~~s is not a ~a" (datum type)) x)))])))
- 
+
+;; ---------------------------------------------------------------------
+;; Library entries and C entries
+
+;; A library entry connects with a libspec to describe a library
+;; function that can be referenced directly by machine code and that
+;; will need to be updated by the linker. The C-implemented kernel may
+;; also refer to these values.
+
+;; A C entry is a pointer communicated from the C-implemented kernel
+;; to the compiler and runtime system. The linker deals with them in a
+;; similar way --- it's just that the refer to C functions and globals
+;; instead of Scheme-implemented functions.
+
 (eval-when (load eval)
 (define-syntax lookup-libspec
   (lambda (x)