

\documentclass[a4paper,12pt]{article}
\usepackage{times}
\usepackage{a4wide}
\usepackage{xspace}
\def\occam{{\sffamily occam}\xspace}
\def\occampi{{\sffamily occam-\Pisymbol{psy}{112}}\xspace}
\begin{document}
\title{Compiling \occam using Haskell}
\author{Adam Sampson}
\maketitle
\section{Introduction}
This is the ongoing story of FCO, a functional compiler for \occam.
FCO is a spike solution (albeit a fairly elaborate one): the aim is to
implement just enough of a compiler in Haskell to tell us whether
using it for a proper compiler would be a sensible idea.

As a result, the goal is fairly modest: FCO translates an \occam 2.1
subset into ANSI C, using CIF for concurrency facilities. It should
support enough of the \occam language to do commstime and q7-ats1.

By design, FCO is a whole-program compiler: it does not support separate
compilation of libraries. The downside is that you need the
AST for the entire program, including the standard library, available at
code-generation time; the major upside for now is that it's much easier
to write. I believe that the whole-program strategy may be worth
pursuing in a production compiler, since it would also allow
whole-program optimisations and specialisations; it need not cause
horrendous performance problems, since libraries can still be parsed and
usage-checked ahead of time.

I'll assume the reader has some knowledge of both \occam and Haskell; if
there's anything that's not clear, please let me know. I'll also assume
the reader has access to the source of FCO while reading this document.

I would thoroughly recommend the Haskell history paper (cite) -- it
explains many of the design decisions behind Haskell, and it's an
excellent overview of the features available in the language.
\section{Why Haskell?}
Why should we consider Haskell as an option for an implementation
language? Like Scheme, it's a popular, mature, well-documented
functional language; it's used heavily by people who're into programming
language research, and it's been used to implement a number of solid
compilers for other languages. As a result, there are a number of
useful libraries that we can take advantage of.

There's lots of Haskell experience in the department already. It's the
only language other than Java that our undergrads are guaranteed to have
experience with, which might be useful for student projects.

Haskell also has some similarities with \occam: it has an
indentation-based syntax, it makes a point of distinguishing between
side-effecting and functional code, it emphasises compile-time safety
checks, and it has excellent support for lightweight concurrency. \occam
may therefore be of interest to some Haskell programmers.
\section{Existing work}
\begin{itemize}
\item 42 -- \occam to ETC, written in Scheme
\item JHC -- Haskell to C, written in Haskell
\item Pugs -- Perl 6 to various targets, written in Haskell
\item GHC -- probably not!
\item MinCaml -- ML subset to assembler, written in ML
\end{itemize}
\section{Technologies}
\subsection{Monads}
\subsection{SYB Generics}
\cite{syb1}

\label{gen-par-prob} Using generics with parametric types confuses the
hell out of the typechecker; you can work around this by giving explicit
instances for the types you want to use, but it's not very nice.
(This is a downside of using a statically-typed language: code that's
obviously correct sometimes needs very non-obvious type declarations, or
can't be statically typed at all.)

There's also Strafunski. (Which differs how?)

And HSXML. Actually looks more useful -- but it would require using DrIFT
to generate instances of the classes.
\verb|http://article.gmane.org/gmane.comp.lang.haskell.general/13589|
\subsection{Parsec}
Parsec is a combinator-based parsing library, which means that you're
essentially writing productions that look like BNF with variable
bindings, and the library takes care of matching and backtracking as
appropriate. Parsec's dead easy to use.

The parsing operations are actually operations in the \verb|Parser t|
monad.
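As a hedged illustration of the style (a toy production of my own, not one from FCO's grammar), a Parsec parser reads much like BNF with variable bindings:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A toy production:  <identifier> ":=" <integer>  -- illustrative only.
data Assign = Assign String Integer deriving (Show, Eq)

identifier :: Parser String
identifier = (:) <$> letter <*> many alphaNum

assignment :: Parser Assign
assignment = do
  name <- identifier                -- bind the matched identifier
  spaces >> string ":=" >> spaces
  num <- many1 digit                -- bind the matched digits
  return (Assign name (read num))
```

Running \verb|parse assignment "" "x := 42"| yields \verb|Right (Assign "x" 42)|; a failed match comes back as a \verb|Left| carrying the source position, and the library handles the backtracking between alternatives.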
\section{Parsing}
The parser is based on the grammar from the \occam 2.1 manual, with a
number of alterations:
\begin{itemize}
\item I took a leaf out of Haskell's book for handling the
indentation-based syntax: a preprocessor analyses the indentation and
adds explicit markers for ``indent'', ``outdent'' and ``end of significant
line'' that the parser can match later. The preprocessor's a bit limited
at the moment; it doesn't handle continuation lines or inline
\verb|VALOF|.
\item The original compiler assumes you're keeping track of what's in
scope while you're parsing, which we don't want to do. This makes some
things ambiguous, and some productions in the grammar turn out to be
identical if you don't know what type things are (for example, you can't
tell the difference between channels, ports and timers at parse time, so
the FCO grammar handles them all with a single set of productions).
(I think it'd be possible to simulate the behaviour of the original
compiler by using the GenParser monad rather than Parser, since that
lets you keep state. I'm pretty sure we wouldn't want to track scope
this way, but it might turn out not to be too painful to handle
indentation directly in the parser.)
\item Left-recursive productions (those that parse subscripts) don't
work; I split each into two productions, one which parses everything
that isn't left-recursive in the original grammar, and one which parses
the first followed by one or more subscripts.
\item The original grammar would parse \verb|x[y]| as a conversion of
the array literal \verb|[y]| to type \verb|x|, which isn't legal \occam.
I split the \verb|operand| production into a version that didn't include
\verb|table| and a version that did, so \verb|conversion| can now
explicitly match an operand that isn't an array literal.
\item Similarly, you can't tell at parse time whether \verb|a| in
\verb|c ! a; b| or \verb|x[a]| is a variable or a tag -- I'll have to
fix this up in a later pass.
\item I rewrote the production for lists of formal arguments, since the
original one's specified as lists of lists of arguments which might be
typed, and that doesn't work correctly in Parsec when written in the
obvious way. (It should be possible to express it more elegantly with a
bit more work.)
\end{itemize}
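The left-recursion split described above can be sketched like this (hypothetical productions of my own, not FCO's actual grammar):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = Var String | Subscript Expr Expr deriving (Show, Eq)

-- The rule  expr ::= expr "[" expr "]" | variable  is left-recursive
-- and would loop forever in Parsec.  Split it: parse the part that
-- isn't left-recursive first, then fold any number of subscripts on.
variable :: Parser Expr
variable = Var <$> many1 letter

subscripted :: Parser Expr
subscripted = do
  base <- variable          -- everything that isn't left-recursive
  subs <- many (between (char '[') (char ']') subscripted)
  return (foldl Subscript base subs)
```

Parsing \verb|a[b][c]| then gives the left-nested tree \verb|Subscript (Subscript (Var "a") (Var "b")) (Var "c")|, which is what the original left-recursive rule described.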
The parser was the first bit of FCO I wrote, and partly as a result my
Haskell coding style in the parser is especially poor; the Pugs parser,
also using Parsec, is a much better example. (But theirs doesn't parse
\occam, obviously.)
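The indentation preprocessor mentioned above can be sketched as follows (the marker spellings and the two-spaces-per-level rule are my illustrative assumptions, not FCO's actual implementation):

```haskell
-- Compare each line's indentation with the previous one and emit
-- explicit markers that the parser can later match as tokens.
markIndents :: String -> [String]
markIndents src = go 0 (filter (not . blank) (lines src))
  where
    blank    = null . dropWhile (== ' ')
    indentOf = length . takeWhile (== ' ')
    go level [] = replicate (level `div` 2) "OUTDENT"
    go level (l:ls) =
        let i = indentOf l
            markers
              | i > level = ["INDENT"]
              | i < level = replicate ((level - i) `div` 2) "OUTDENT"
              | otherwise = []
        in markers ++ [l ++ " EOL"] ++ go i ls
```

For example, a two-line \verb|SEQ| body comes out bracketed by \verb|INDENT| and \verb|OUTDENT|, with each significant line terminated by \verb|EOL|, so the parser itself never has to count spaces.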
\section{Data structures}
I've experimented with two different ways of encoding the abstract
syntax tree in FCO.
\subsection{Parse tree}
\subsection{AST}
My first version of the AST types included a parametric
\verb|Structured t| type used to represent things that could include
replicators and specifications, such as \verb|IF| and \verb|ALT|
processes; I couldn't combine generic operations over these with others,
though (see \ref{gen-par-prob}).
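For illustration, the parametric type described above might have looked roughly like this (a hypothetical reconstruction with stub types, not FCO's actual definitions):

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Data

-- Stub types standing in for FCO's real ones.
data Specification = Specification deriving (Show, Eq, Data, Typeable)
data Replicator    = Replicator    deriving (Show, Eq, Data, Typeable)

-- A structured block: specifications and replicators wrapped around
-- leaves of some type t (say, the choices of an IF or the
-- alternatives of an ALT).
data Structured t
  = Spec Specification (Structured t)
  | Rep Replicator (Structured t)
  | Several [Structured t]
  | Only t
  deriving (Show, Eq, Data, Typeable)
```

Because \verb|Structured Choice| and \verb|Structured Alternative| are distinct instantiations, a single generic traversal has to be told about each instantiation separately, which is the annoyance noted above.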

Some things are simplified in the AST when compared with the grammar:
channels are just variables, for example.

Need to pass metadata through to the AST.
\section{Generic strategies}
Need to walk over the tree, tracking state.

Unique naming comes out nicer in Haskell than in Scheme, since I can
just use monads and generic transformations, and don't need to write out
all the productions again just to add an extra argument.
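As a sketch of the kind of pass this makes possible (a toy AST and naming scheme of my own, not FCO's), unique naming can be a single generic, monadic traversal that threads a counter through the \verb|State| monad:

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
import Data.Generics (Data, Typeable, everywhereM, mkM)
import Control.Monad.State (State, evalState, get, put)

-- A miniature AST, illustrative only.
data Expr = Name String | Seq [Expr] deriving (Show, Eq, Data, Typeable)

-- One bottom-up generic pass: every Name picks up a unique serial
-- number from the State monad; every other constructor is rebuilt
-- untouched by everywhereM, with no per-production boilerplate.
uniquify :: Expr -> Expr
uniquify e = evalState (everywhereM (mkM rename) e) 0
  where
    rename :: Expr -> State Int Expr
    rename (Name n) = do
      i <- get
      put (i + 1)
      return (Name (n ++ "_" ++ show i))
    rename other = return other
```

The state (the counter) is the "extra argument" that, in the Scheme version, had to be threaded by hand through a function for every production.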

Generics appear to work better in GHC 6.6 than GHC 6.4, since some
restrictions on the types of mutually recursive functions have been
lifted. (Check this against release notes.)
\section{C generation}
\section{Future work}
The obvious bit of future work is writing the full compiler that this
was a prototype of.

It turns out I quite like Haskell -- and there are tools provided with
GHC for parsing Haskell. If we wrote a CSP-style Haskell concurrency
library, we should investigate writing an \occam-style usage checker
for it.
\bibliographystyle{unsrt}
\bibliography{the}
\end{document}