dccc

Introduction

dccc stands for DooMeeR's Compiler Compiler Compiler. Tools like yacc are compiler compilers; dccc compiles some files into ocamlyacc files, which is yacc for OCaml.

dccc is a tool I made because I didn't like yacc.

yacc doesn't handle basic stuff like Kleene's star *, +, optional tokens, or sequences of tokens separated by a symbol.
yacc uses semantic actions, and I always use the same semantic actions, so it gets repetitive.
yacc needs the abstract syntax tree (AST) to be written by the programmer, and this is very repetitive too.
Because you write the semantic actions yourself, if you want to store the locations of your expressions in the AST (and you should), you can't just ask yacc to do it for you.

So I made a tool that could fulfill my very specific needs, and now I can write a parser much faster and much more easily. It might not help you at all, especially if you like to tweak the parser with semantic actions before returning an AST.

Basically, dccc is a language that combines powerful grammar specification with AST definition in a very concise and readable way.

Download, installation and use

Linux binaries
Source files

To install, just put the file "dccc" in a location where your PATH environment variable can find it.

Type "dccc --help" to get a description of the command line parameters.

Example: calculator

Here is how you describe the grammar for a calculator, with +, -, * and / operations, negation (unary -) and parenthesis.

token INT: int

prec
  PLUS MINUS: left,
  TIMES DIV: left

expr ::=
  Int: INT
| Plus: expr, PLUS, expr
| Neg: MINUS, expr
| Minus: expr, MINUS, expr
| Times: expr, TIMES, expr
| Div: expr, DIV, expr
| Expr: LEFTPAR, expr, RIGHTPAR

Save it as "expr.dg" and compile it with the following command:

dccc expr.dg --yacc expr.mly --tree ast.ml

This creates a file "expr.mly", which can be compiled with ocamlyacc, and a file "ast.ml", which contains the AST declaration. The yacc file doesn't look very pretty, but the AST file is interesting:

type loc = Lexing.position * Lexing.position
type 'a located = { loc : loc; node : 'a }

and expr = expr_node located
and expr_node = [
  `Expr of expr
| `Div of (expr * expr)
| `Times of (expr * expr)
| `Minus of (expr * expr)
| `Neg of expr
| `Plus of (expr * expr)
| `Int of int ]

As you see, every expressions are located. Polymorphic variants are used instead of variants. This allows the name space to be more flexible, but it might be a bit harder to read type errors. just write type annotations for your functions using the definitions of this AST file to avoid this problem.

The token system

In yacc, a token can only be a terminal (a token from the lexer) or a non-terminal. If you want a token to be repeatable, you have to write something like:

my_rule: ... auxilliary_rule ...

auxilliary_rule:
  TOKEN auxilliary_rule { $1::$2 }
|                       { [] }

The semantic actions might be different, but in OCaml, using list here is very natural. With dccc, you can just write:

my_rule: ... TOKEN* ...

This is much less verbose and tedious.

In fact, a token in dccc can be a complex expression. Here is a table describing the possible tokens. The type "token" refers to the associated type of the token, which is built recursively from this table (see later for remarks on types).

Token Description Type

identifier A terminal or a non-terminal. (associated type)

(token) token

token1, token2 Sequence of tokens. token1 * token2

{token} Token group. token

#token Ignore the value of this token, even if it has an associated type different than unit. unit

token1 | token2 Non-determinism. token1 and token2 must have compatible types. token1

Label: token Labeled token. The token is labeled with the polymorphic variant label `Label. [> `Label of token]

token? Optional token: it appears 0 or 1 time. token option

token* Repeat token an arbitrary number of times, 0 included. token list

token+ Repeat token an arbitrary number of times, at least 1 time. token list

token1!token2 Repeat token1 an arbitrary number of times, separating the occurences using token2. token1 list

prec identifier The token indicates a precedence and is inserted as a "%prec" in the yacc file. unit

Token	Description	Type
identifier	A terminal or a non-terminal.	(associated type)
(token)		token
token1, token2	Sequence of tokens.	token1 * token2
{token}	Token group.	token
#token	Ignore the value of this token, even if it has an associated type different than unit.	unit
token1 \| token2	Non-determinism. token1 and token2 must have compatible types.	token1
Label: token	Labeled token. The token is labeled with the polymorphic variant label `Label.	[> `Label of token]
token?	Optional token: it appears 0 or 1 time.	token option
token*	Repeat token an arbitrary number of times, 0 included.	token list
token+	Repeat token an arbitrary number of times, at least 1 time.	token list
token1!token2	Repeat token1 an arbitrary number of times, separating the occurences using token2.	token1 list
prec identifier	The token indicates a precedence and is inserted as a "%prec" in the yacc file.	unit

Terminals

Terminals are tokens that are provided by the lexer. dccc automatically detects every identifier which does not define a new rule (i.e. at the left of a "::=") and treats them as terminals.

The default type associated to a terminal is unit, but you can change it with the construction "token":

token token_name: token_type

This changes the type of token_name to token_type. Moreover, the semantic actions and the AST will be modified so that the AST contain the value associated to the terminal.

If you don't want the value of a terminal to be stored into the AST, you can use the operator #.

Sequences

The type shown in the previous table is misleading. Indeed, the type of the sequence "a, b, c", which is parsed as "a, (b, c)", is not "a * (b * c)" but "a * b * c": the type of sequences is flattened.

If you really need a type of "a * (b * c)", you can use the construction {}:

my_rule ::= a, { b, c }
bc ::= b, c

The type "unit" is discarded in tuples:

token INT: int
my_rule ::= INT, COMMA, INT

The type of the non-terminal "my_rule" will be "int * int", because the non-terminal "COMMA", with no associated type, has the type "unit".

Non-determinism

In yacc, a non-terminal can have several rules. For instance:

rule ::= A { ... } | B { ... } | C { ... }

In dccc, this example is written like this:

rule ::= A | B | C

In dcc, one can use this | construction anywhere:

rule ::= A, (B | C), D

In yacc, this example must be written using an auxilliary non-terminal:

rule ::= A bc D { ... }
bc ::= B C { ... }

In dccc, because every token returns a value thanks to the automatically generated semantic actions, the tokens in a | construction must have the same type. Labels are used to turn a token into a polymorphic variant. This is very useful to use two tokens of different types in a | construction. Here are some examples:

token INT: int
rule ::= INT, DOT | INT, SEMICOLON

Here, the grammar allows both "15." and "15;". They are treated equally: only the value 15 is seen in the AST, which doesn't make the difference between the two choices.

token INT: int
rule ::= Dot: INT, DOT | Semi: INT, SEMICOLON

This is almost the same grammar, but the choices are labelled. Thus, in the AST, the non-terminal "rule" has type "[> `Dot of int | `Semi of int]", and one can make a difference between "15." (which is represented in the AST as "`Dot 15") and "15;" (which is represented as "`Semi 15").

token INT: int
token STRING: string
rule ::= INT | STRING

This is not valid, because tokens INT and STRING have different types, namely int and string. To make this work, we use labels:

token INT: int
token STRING: string
rule ::= Int: INT | String: STRING

Precedences and associativities

You can specify the precedence and the associativity of a terminal using the construction "prec". For example:

prec
  STAR: nonassoc,
  VERT COLON: left,
  COMMA: right,
  SHARP PLUS: nonassoc

Tokens STAR and PLUS are non associative. Tokens VERT and COLON are left-associative. Token COMMA is right-associative. Token STAR has the lowest precedence, while PLUS has the highest.

Comments

You can put comments in your dccc file using (* and *), as in OCaml source files.

Another example: dccc grammar

Here is the dccc grammar for dccc files:

token INT: int
token ID: string

prec
  VERT COLON COMMA BANG: left,
  SHARP STAR PLUS QUESTIONMARK: nonassoc

grammar ::=
  ( GRule: ID, COLONCOLONEQUAL, tok
  | GToken: TOKEN, ID, COLON, ID
  | GPrec: PREC, (ID+, COLON, ID)!COMMA)*, EOF

tok ::=
  TIdent: ID
| TIgnore: SHARP, tok
| TSeq: tok, COMMA, tok
| TBranch: tok, VERT, tok
| TStar: tok, STAR
| TPlus: tok, PLUS
| TOption: tok, QUESTIONMARK
| TBang: tok, BANG, tok
| TLabeled: ID, COLON, tok
| TSub: LEFTACC, tok, RIGHTACC
| TPrec: PREC, ID
| TTok: LEFTPAR, tok, RIGHTPAR

The tokens (non-terminal tok) are easy to understand. The non-terminal grammar, which describes an entire file, makes good use of the constructions *, !, and +.

Save this file as "dccc.dg" and compile it using:

dccc dccc.dg --yacc parser.mly --tree ast.ml

Now you can compile the file "parser.mly" using ocamlyacc:

ocamlyacc parser.mly

Your parser can now be compiled using ocamlc:

ocamlc -c ast.ml parser.mli parser.ml

Now you just have to write a lexer and you'll be able to parse grammars using Parser.grammar, or tokens using Parser.tok. For more information about writing lexers and an example of a main file that uses a parser, see the OCaml documentation on the INRIA's website.

Limitations and future works

Beside being a small, barely maintained tool, dccc has some limitations.

Its main advantage (no semantic actions) can be an inconvenient for grammars which are not writable without some semantic action hacks. If you want to write a parser for an existing language which suffers from this problem, you can't use dccc. On the other hand, if you are inventing a new language, it forces you not to use such hacks, which is likely to result in a clearer language.
The ocamlyacc file produced has a lot of non-terminals, because no optimisation is done. As a result, this file is bigger and less readable than if you had written it by hand.
Because special tokens are compiled using auxilliary non-terminals, and because of the way ocamlyacc handles precedences, the resulting grammar might not behave exactly like expected.

Some features that could be interesting to implement could be:

Bootstrap dccc using the dccc grammar in the last example.
Integrated lexer specification, especially for keywords and symbols. And if the lexer is automatically generated, location would be done once for all. But it might need to impose some style for comments and strings, because they need special semantic actions.
AST pretty-printing functions in the ast file. It should be easy to print the AST with the OCaml syntax; but it might even be possible to actually reproduce a valid source code in the language you're defining.
Add automatic ocamldoc specification to the AST file.
An AST interface which contains the functions to parse a string, a buffer or a file, without having to handle the lexing.
A Menhir version if it proves to be better than ocamlyacc for dccc. Actually, I know nothing about Menhir.
dccc is designed to be language-independant. So it might be interesting to implement a dccc compiler for other languages; then we can use the same dccc file for each of these languages. This is possible only because dccc doesn't have semantic actions. Semantic actions are bad for modularity!

With all these features, dccc would really become the parser generator of my dream: describe a grammar in 2 minutes, compile, and automatically get all the functions (parse, pretty-print, documentation) needed to create and manipulate an AST.