dccc

Introduction

dccc stands for DooMeeR's Compiler Compiler Compiler. Tools like yacc are compiler compilers; dccc compiles some files into ocamlyacc files, which is yacc for OCaml.

dccc is a tool I made because I didn't like yacc.

So I made a tool that could fulfill my very specific needs, and now I can write a parser much faster and much more easily. It might not help you at all, especially if you like to tweak the parser with semantic actions before returning an AST.

Basically, dccc is a language that combines powerful grammar specification with AST definition in a very concise and readable way.

Download, installation and use

Linux binaries
Source files

To install, just put the file "dccc" in a location where your PATH environment variable can find it.

Type "dccc --help" to get a description of the command line parameters.

Example: calculator

Here is how you describe the grammar for a calculator, with +, -, * and / operations, negation (unary -) and parenthesis.

token INT: int

prec
  PLUS MINUS: left,
  TIMES DIV: left

expr ::=
  Int: INT
| Plus: expr, PLUS, expr
| Neg: MINUS, expr
| Minus: expr, MINUS, expr
| Times: expr, TIMES, expr
| Div: expr, DIV, expr
| Expr: LEFTPAR, expr, RIGHTPAR
Save it as "expr.dg" and compile it with the following command:

dccc expr.dg --yacc expr.mly --tree ast.ml
This creates a file "expr.mly", which can be compiled with ocamlyacc, and a file "ast.ml", which contains the AST declaration. The yacc file doesn't look very pretty, but the AST file is interesting:
type loc = Lexing.position * Lexing.position
type 'a located = { loc : loc; node : 'a }

and expr = expr_node located
and expr_node = [
  `Expr of expr
| `Div of (expr * expr)
| `Times of (expr * expr)
| `Minus of (expr * expr)
| `Neg of expr
| `Plus of (expr * expr)
| `Int of int ]
As you see, every expressions are located. Polymorphic variants are used instead of variants. This allows the name space to be more flexible, but it might be a bit harder to read type errors. just write type annotations for your functions using the definitions of this AST file to avoid this problem.

The token system

In yacc, a token can only be a terminal (a token from the lexer) or a non-terminal. If you want a token to be repeatable, you have to write something like:
my_rule: ... auxilliary_rule ...

auxilliary_rule:
  TOKEN auxilliary_rule { $1::$2 }
|                       { [] }
The semantic actions might be different, but in OCaml, using list here is very natural. With dccc, you can just write:
my_rule: ... TOKEN* ...
This is much less verbose and tedious.

In fact, a token in dccc can be a complex expression. Here is a table describing the possible tokens. The type "token" refers to the associated type of the token, which is built recursively from this table (see later for remarks on types).

Token Description Type
identifier A terminal or a non-terminal. (associated type)
(token) token
token1, token2 Sequence of tokens. token1 * token2
{token} Token group. token
#token Ignore the value of this token, even if it has an associated type different than unit. unit
token1 | token2 Non-determinism. token1 and token2 must have compatible types. token1
Label: token Labeled token. The token is labeled with the polymorphic variant label `Label. [> `Label of token]
token? Optional token: it appears 0 or 1 time. token option
token* Repeat token an arbitrary number of times, 0 included. token list
token+ Repeat token an arbitrary number of times, at least 1 time. token list
token1!token2 Repeat token1 an arbitrary number of times, separating the occurences using token2. token1 list
prec identifier The token indicates a precedence and is inserted as a "%prec" in the yacc file. unit

Terminals

Terminals are tokens that are provided by the lexer. dccc automatically detects every identifier which does not define a new rule (i.e. at the left of a "::=") and treats them as terminals.

The default type associated to a terminal is unit, but you can change it with the construction "token":

token token_name: token_type
This changes the type of token_name to token_type. Moreover, the semantic actions and the AST will be modified so that the AST contain the value associated to the terminal.

If you don't want the value of a terminal to be stored into the AST, you can use the operator #.

Sequences

The type shown in the previous table is misleading. Indeed, the type of the sequence "a, b, c", which is parsed as "a, (b, c)", is not "a * (b * c)" but "a * b * c": the type of sequences is flattened.

If you really need a type of "a * (b * c)", you can use the construction {}:

my_rule ::= a, { b, c }
bc ::= b, c
The type "unit" is discarded in tuples:
token INT: int
my_rule ::= INT, COMMA, INT
The type of the non-terminal "my_rule" will be "int * int", because the non-terminal "COMMA", with no associated type, has the type "unit".

Non-determinism

In yacc, a non-terminal can have several rules. For instance:
rule ::= A { ... } | B { ... } | C { ... }
In dccc, this example is written like this:
rule ::= A | B | C
It looks pretty much the same, doesn't it? But actually, "A | B | C" is just one token, which is the non-deterministic token "A | (B | C)" which can reduces either to "A" or "B | C", which itself can reduce to "B" or "C".

In dcc, one can use this | construction anywhere:

rule ::= A, (B | C), D
In yacc, this example must be written using an auxilliary non-terminal:
rule ::= A bc D { ... }
bc ::= B C { ... }
In dccc, because every token returns a value thanks to the automatically generated semantic actions, the tokens in a | construction must have the same type. Labels are used to turn a token into a polymorphic variant. This is very useful to use two tokens of different types in a | construction. Here are some examples:
token INT: int
rule ::= INT, DOT | INT, SEMICOLON
Here, the grammar allows both "15." and "15;". They are treated equally: only the value 15 is seen in the AST, which doesn't make the difference between the two choices.
token INT: int
rule ::= Dot: INT, DOT | Semi: INT, SEMICOLON
This is almost the same grammar, but the choices are labelled. Thus, in the AST, the non-terminal "rule" has type "[> `Dot of int | `Semi of int]", and one can make a difference between "15." (which is represented in the AST as "`Dot 15") and "15;" (which is represented as "`Semi 15").
token INT: int
token STRING: string
rule ::= INT | STRING
This is not valid, because tokens INT and STRING have different types, namely int and string. To make this work, we use labels:
token INT: int
token STRING: string
rule ::= Int: INT | String: STRING

Precedences and associativities

You can specify the precedence and the associativity of a terminal using the construction "prec". For example:
prec
  STAR: nonassoc,
  VERT COLON: left,
  COMMA: right,
  SHARP PLUS: nonassoc
Tokens STAR and PLUS are non associative. Tokens VERT and COLON are left-associative. Token COMMA is right-associative. Token STAR has the lowest precedence, while PLUS has the highest.

Comments

You can put comments in your dccc file using (* and *), as in OCaml source files.

Another example: dccc grammar

Here is the dccc grammar for dccc files:
token INT: int
token ID: string

prec
  VERT COLON COMMA BANG: left,
  SHARP STAR PLUS QUESTIONMARK: nonassoc

grammar ::=
  ( GRule: ID, COLONCOLONEQUAL, tok
  | GToken: TOKEN, ID, COLON, ID
  | GPrec: PREC, (ID+, COLON, ID)!COMMA)*, EOF

tok ::=
  TIdent: ID
| TIgnore: SHARP, tok
| TSeq: tok, COMMA, tok
| TBranch: tok, VERT, tok
| TStar: tok, STAR
| TPlus: tok, PLUS
| TOption: tok, QUESTIONMARK
| TBang: tok, BANG, tok
| TLabeled: ID, COLON, tok
| TSub: LEFTACC, tok, RIGHTACC
| TPrec: PREC, ID
| TTok: LEFTPAR, tok, RIGHTPAR
The tokens (non-terminal tok) are easy to understand. The non-terminal grammar, which describes an entire file, makes good use of the constructions *, !, and +.

Save this file as "dccc.dg" and compile it using:

dccc dccc.dg --yacc parser.mly --tree ast.ml
Now you can compile the file "parser.mly" using ocamlyacc:
ocamlyacc parser.mly
Your parser can now be compiled using ocamlc:
ocamlc -c ast.ml parser.mli parser.ml
Now you just have to write a lexer and you'll be able to parse grammars using Parser.grammar, or tokens using Parser.tok. For more information about writing lexers and an example of a main file that uses a parser, see the OCaml documentation on the INRIA's website.

Limitations and future works

Beside being a small, barely maintained tool, dccc has some limitations. Some features that could be interesting to implement could be: With all these features, dccc would really become the parser generator of my dream: describe a grammar in 2 minutes, compile, and automatically get all the functions (parse, pretty-print, documentation) needed to create and manipulate an AST.