\documentclass[a4paper,12pt]{article}

\usepackage{times}
\usepackage{a4wide}
\usepackage{xspace}

\def\occam{{\sffamily occam}\xspace}
\def\occampi{{\sffamily occam-\Pisymbol{psy}{112}}\xspace}

\begin{document}

\title{Compiling \occam using Haskell}
\author{Adam Sampson}
\maketitle

\section{Introduction}

This is the ongoing story of FCO, a functional compiler for \occam.
FCO is a spike solution (albeit a fairly elaborate one): the aim is to
implement just enough of a compiler in Haskell to tell us whether
using it for a proper compiler would be a sensible idea.

As a result, my goal is fairly modest: FCO translates an \occam 2.1
subset into ANSI C, using CIF for concurrency facilities. It should
support enough of the \occam language to do commstime and q7-ats1.

By design, FCO is a whole-program compiler: it does not support separate
compilation of libraries. The downside is that you need to have the
AST for the entire program, including the standard library, available at
code-generation time; the major upside for now is that it's much easier
to write. I believe that the whole-program strategy may be worth
pursuing in a production compiler, since it would also allow
whole-program optimisations and specialisations; it need not cause
horrendous performance problems, since libraries can still be parsed and
usage-checked ahead of time.

I'll assume the reader has some knowledge of both \occam and Haskell; if
there's anything that's not clear, please let me know. I'll also assume
the reader has access to the source of FCO while reading this document.

I would thoroughly recommend the Haskell history paper (cite) -- it
explains many of the design decisions behind Haskell, and it's an
excellent overview of the features available in the language.

\section{Why Haskell?}

Why should we consider Haskell as an option for an implementation
language? Like Scheme, it's a popular, mature, well-documented
functional language; it's used heavily by people who're into programming
language research; and it's been used to implement a number of solid
compilers for other languages. The result is that there are a number of
useful libraries that we can take advantage of.

There's lots of Haskell experience in the department already. It's the
only language other than Java that our undergrads are guaranteed to have
experience with, which might be useful for student projects.

Haskell also has some similarities with \occam: it has an
indentation-based syntax, it makes a point of distinguishing between
side-effecting and functional code, it emphasises compile-time safety
checks, and it has excellent support for lightweight concurrency. \occam
may therefore be of interest to some Haskell programmers.

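\section{Existing work}

Each entry gives the compiler, what it translates, and the language it
is written in.

\begin{itemize}

\item 42 -- \occam to ETC, written in Scheme

\item JHC -- Haskell to C, written in Haskell

\item Pugs -- Perl 6 to various targets, written in Haskell

\item GHC -- probably not!

\item MinCaml -- ML subset to assembler, written in ML

\end{itemize}
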
\section{Technologies}

\subsection{Monads}

\subsection{SYB Generics}

\cite{syb1}

\label{gen-par-prob} Using generics with parametric types confuses the
hell out of the typechecker; you can work around this by giving explicit
instances of the types you want to use, but it's not very nice.
(This is a downside of using a statically-typed language; code that's
obviously correct sometimes needs very non-obvious type declarations, or
can't be statically typed at all.)

There's also Strafunski. (Which differs how?)

And HSXML. Actually looks more useful -- but would require using DrIFT
to generate instances of the classes.
http://article.gmane.org/gmane.comp.lang.haskell.general/13589

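
To give a flavour of the generic style, here is a minimal sketch using
\verb|everywhere| and \verb|mkT| from SYB's \verb|Data.Generics| (the
\verb|syb| package). The \verb|Expr| type and the \verb|rename| function
are invented for this example and are not taken from FCO.

\begin{verbatim}
{-# LANGUAGE DeriveDataTypeable #-}
module Main where

import Data.Data (Data)
import Data.Typeable (Typeable)
import Data.Generics (everywhere, mkT)

-- A toy expression type, invented for this example.
data Expr = Var String
          | Lit Int
          | Add Expr Expr
  deriving (Show, Eq, Data, Typeable)

-- Rewrites a single node; anything that is not a Var is left alone.
rename :: Expr -> Expr
rename (Var n) = Var (n ++ "_0")
rename e       = e

-- everywhere applies the transformation bottom-up at every node;
-- mkT lifts rename so that it is the identity on all other types.
renameAll :: Expr -> Expr
renameAll = everywhere (mkT rename)

main :: IO ()
main = print (renameAll (Add (Var "x") (Add (Lit 1) (Var "y"))))
\end{verbatim}

Note that \verb|mkT| needs a function over one concrete type. If
\verb|Expr| took a type parameter, \verb|rename| would have to be written
for a specific instance of it, which is the workaround mentioned above.
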
\subsection{Parsec}

Parsec is a combinator-based parsing library, which means that you're
essentially writing productions that look like BNF with variable
bindings, and the library takes care of matching and backtracking as
appropriate. Parsec's dead easy to use.

The parsing operations are actually operations in the \verb|Parser t|
monad.

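
As a rough illustration of what ``BNF with variable bindings'' looks
like in practice, here is a small self-contained sketch. The
\verb|assignment| production and the \verb|Assign| type are invented for
this example; they are not part of FCO's grammar.

\begin{verbatim}
module Main where

import Text.ParserCombinators.Parsec

-- A toy production, invented for this example:
--   assignment = name ":=" number
data Assign = Assign String Integer
  deriving (Show, Eq)

-- Runs a parser, then skips any whitespace that follows it.
lexeme :: Parser a -> Parser a
lexeme p = do { x <- p; spaces; return x }

name :: Parser String
name = lexeme (many1 letter)

number :: Parser Integer
number = lexeme (fmap read (many1 digit))

-- Reads much like the BNF: each <- binds the result of a sub-production.
assignment :: Parser Assign
assignment = do
  n <- name
  _ <- lexeme (string ":=")
  v <- number
  return (Assign n v)

main :: IO ()
main = print (parse (do { a <- assignment; eof; return a })
                    "example" "x := 42")
\end{verbatim}

Running \verb|main| should print \verb|Right (Assign "x" 42)|; a
malformed input gives a \verb|Left| carrying a \verb|ParseError|.
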
\section{Parsing}

The parser is based on the grammar from the \occam 2.1 manual, with a
number of alterations:

\begin{itemize}

\item I took a leaf out of Haskell's book for handling the
indentation-based syntax: a preprocessor analyses the indentation and
adds explicit markers for ``indent'', ``outdent'' and ``end of
significant line'' that the parser can match later. The preprocessor's a
bit limited at the moment; it doesn't handle continuation lines or inline
\verb|VALOF|.

\item The original compiler assumes you're keeping track of what's in
scope while you're parsing, which we don't want to do. This makes some
things ambiguous, and some productions in the grammar turn out to be
identical if you don't know what type things are (for example, you can't
tell the difference between channels, ports and timers at parse time, so
the FCO grammar handles them all with a single set of productions).

(I think it'd be possible to simulate the behaviour of the original
compiler by using the GenParser monad rather than Parser, since that
lets you keep state. I'm pretty sure we wouldn't want to track scope
this way, but it might turn out not to be too painful to handle
indentation directly in the parser.)

\item Left-recursive productions (those that parse subscripts) don't
work; I split each into two productions, one which parses everything
that isn't left-recursive in the original grammar, and one which parses
the first followed by one or more subscripts.

\item The original grammar would parse \verb|x[y]| as a conversion of
the array literal \verb|[y]| to type \verb|x|, which isn't legal \occam.
I split the \verb|operand| production into a version that didn't include
\verb|table| and a version that did, so \verb|conversion| can now
explicitly match an operand that isn't an array literal.

\item Similarly, you can't tell at parse time whether \verb|a| in
\verb|c ! a; b| or \verb|x[a]| is a variable or a tag -- I'll have to
fix this up in a later pass.

\item I rewrote the production for lists of formal arguments, since the
original one's specified as lists of lists of arguments which might be
typed, and that doesn't work correctly in Parsec when written in the
obvious way. (It should be possible to express it more elegantly with a
bit more work.)

\end{itemize}

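
The indentation preprocessing described in the first item above can be
sketched as follows. This is only an illustration of the idea, not FCO's
actual preprocessor: the \verb|Marked| type and the \verb|mark| function
are invented here, and the sketch assumes that each indentation level is
exactly two spaces.

\begin{verbatim}
module Main where

-- Markers and text produced by the sketch; names invented for this example.
data Marked = Indent | Outdent | EndOfLine | Line String
  deriving (Show, Eq)

-- Indentation level of a line, assuming two spaces per level.
level :: String -> Int
level s = length (takeWhile (== ' ') s) `div` 2

-- Walks the lines, remembering the previous indentation level and
-- emitting markers whenever it changes.  Blank lines are dropped.
-- Continuation lines and inline VALOF are not handled.
mark :: [String] -> [Marked]
mark = go 0 . filter (any (/= ' '))
  where
    go prev []     = replicate prev Outdent
    go prev (l:ls) =
      let cur = level l
          change
            | cur > prev = replicate (cur - prev) Indent
            | otherwise  = replicate (prev - cur) Outdent
      in  change ++ [Line (dropWhile (== ' ') l), EndOfLine] ++ go cur ls

main :: IO ()
main = mapM_ print
         (mark ["PROC p ()", "  SEQ", "    x := 1", "    y := 2", ":"])
\end{verbatim}

The parser can then treat \verb|Indent|, \verb|Outdent| and
\verb|EndOfLine| as ordinary tokens, so the grammar itself never has to
count spaces.
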
The parser was the first bit of FCO I wrote, and partly as a result my
Haskell coding style in the parser is especially poor; the Pugs parser,
also using Parsec, is a much better example. (But theirs doesn't parse
\occam, obviously.)

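
To make the left-recursion workaround from the list above concrete, here
is a minimal Parsec sketch. The \verb|Variable| type and the production
names are invented for this example, and for brevity it folds the two
productions into one by accepting zero or more subscripts.

\begin{verbatim}
module Main where

import Text.ParserCombinators.Parsec

-- Toy variable syntax, invented for this example:
--   variable = name | variable "[" number "]"
data Variable = Name String
              | Subscript Variable Integer
  deriving (Show, Eq)

-- The part of the production that is not left-recursive.
baseVariable :: Parser Variable
baseVariable = fmap Name (many1 letter)

subscript :: Parser Integer
subscript = between (char '[') (char ']') (fmap read (many1 digit))

-- A base variable followed by any number of subscripts, folded up
-- left-associatively so that x[1][2] means (x[1])[2].
variable :: Parser Variable
variable = do
  v  <- baseVariable
  ss <- many subscript
  return (foldl Subscript v ss)

main :: IO ()
main = print (parse variable "example" "x[1][2]")
\end{verbatim}

Running \verb|main| should print
\verb|Right (Subscript (Subscript (Name "x") 1) 2)|. Writing
\verb|variable| so that it calls itself first would instead loop forever,
since Parsec would recurse without consuming any input.
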
\section{Data structures}

I've experimented with two different ways of encoding the abstract
syntax tree in FCO.

\subsection{Parse tree}

\subsection{AST}

My first version of the AST types included a parametric
\verb|Structured t| type used to represent things that could include
replicators and specifications, such as \verb|IF| and \verb|ALT|
processes; I couldn't combine generic operations over these with others,
though (see Section~\ref{gen-par-prob}).

Some things are simplified in the AST when compared with the grammar:
channels are just variables, for example.

Need to pass metadata through to the AST.

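
For readers who haven't seen this kind of type before, here is a sketch
of what a parametric \verb|Structured t| might look like. The
constructors and the helper types are invented for illustration and are
not taken from FCO's source.

\begin{verbatim}
module Main where

-- All names below are invented for this example.
data Replicator = For String Int Int   -- name, base, count
  deriving (Show, Eq)

data Spec = Spec String                -- stands in for a specification
  deriving (Show, Eq)

-- Something of type t, possibly wrapped in replicators and
-- specifications, or grouped with other structured things.
data Structured t
  = Rep Replicator (Structured t)
  | Specified Spec (Structured t)
  | Several [Structured t]
  | Only t
  deriving (Show, Eq)

-- Collects the leaves, ignoring the wrapping structure.
leaves :: Structured t -> [t]
leaves (Rep _ s)       = leaves s
leaves (Specified _ s) = leaves s
leaves (Several ss)    = concatMap leaves ss
leaves (Only x)        = [x]

main :: IO ()
main = print (leaves (Several [Only "a", Rep (For "i" 0 4) (Only "b")]))
\end{verbatim}

This prints \verb|["a","b"]|. Hand-written functions like \verb|leaves|
are fine with the type parameter; the trouble comes with generic
traversals, which need to be told which concrete \verb|t| to match.
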
\section{Generic strategies}

Need to walk over the tree, tracking state.

Unique naming comes out nicer in Haskell than in Scheme, since I can
just use monads and generic transformations, and don't need to write out
all the productions again just to add an extra argument.

Generics appear to work better in GHC 6.6 than GHC 6.4, since some
restrictions on the types of mutually recursive functions have been
lifted. (Check this against release notes.)

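
Here is a minimal sketch of combining a monad with a generic traversal,
using \verb|everywhereM| and \verb|mkM| from the \verb|syb| package and
\verb|State| from \verb|mtl|. The \verb|Name| and \verb|Proc| types are
invented for this example. It is also much cruder than a real unique
naming pass: it numbers every occurrence of a name separately, whereas a
real pass would number declarations and look up their uses.

\begin{verbatim}
{-# LANGUAGE DeriveDataTypeable #-}
module Main where

import Control.Monad.State (State, evalState, get, put)
import Data.Data (Data)
import Data.Typeable (Typeable)
import Data.Generics (everywhereM, mkM)

-- Toy AST, invented for this example.
newtype Name = Name String
  deriving (Show, Eq, Data, Typeable)

data Proc = Assign Name Int
          | Seq [Proc]
  deriving (Show, Eq, Data, Typeable)

-- Appends a fresh number to a name, using a counter held in State.
uniquify :: Name -> State Int Name
uniquify (Name n) = do
  i <- get
  put (i + 1)
  return (Name (n ++ "_" ++ show i))

-- everywhereM visits every node in the tree; mkM makes uniquify apply
-- only to Names and do nothing at every other type.
uniquifyAll :: Proc -> State Int Proc
uniquifyAll = everywhereM (mkM uniquify)

main :: IO ()
main = print (evalState (uniquifyAll prog) 0)
  where prog = Seq [Assign (Name "x") 1, Assign (Name "x") 2]
\end{verbatim}

The counter is threaded through the whole traversal by the monad, so
none of the constructors of \verb|Proc| need to be mentioned in the
traversal code.
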
\section{C generation}

\section{Future work}

The obvious bit of future work is writing the full compiler that this
was a prototype of.

Turns out I quite like Haskell -- and there are tools provided with GHC
to parse Haskell. If we wrote a Haskell concurrency library (CSP-style),
we should investigate writing an \occam-style usage checker for it.

\bibliographystyle{unsrt}
\bibliography{the}
\end{document}