Thursday, November 20, 2014

Menhir



This blog post about  -  Menhir.


According to Wikipedia:

A menhir (French, from Middle Breton: men, "stone" and hir, "long"[1]), standing stone, orthostat, or lith is a large upright standing stone.
Coincidently, Menhir, is also the name for LR(1) parser generator for OCaml.
 
I followed the recommendation in the Real World OCaml to use it, rather than ocamlyacc:
"Menhir is an alternative parser generator that is generally superior to the venerable ocamlyacc, which dates back quite a few years. Menhir is mostly compatible with ocamlyacc grammars, and so you can usually just switch to Menhir and expect older code to work (with some minor differences described in the Menhir manual)

I found a few sources online that help you understand Menhir but it took me some time to get my head around it.

This blog (and this post in particular) is mainly for me to record my activities and a way to understand things better. Nevertheless, I hope that by going through and discussing the code I've written, will shorten the learning curve for some of you -  or the very least entertain you :)

On y vas! For my purposes, I started by parsing a simple n-tuple string, for the MoanaML code.
(?x,hasColor,?y,context)
Following the instructions in the Real World OCaml I knew that I had to create two files, a namely, Parser.mly and Lexer.mll

The parser file is used to construct and parse the grammar. You can define tokens and describe their required sequences.

For example, for the Moana tuple, I defined the following tokens:
%token STRING
%token VAR
%token LEFT_BRACE
%token RIGHT_BRACE
%token START
%token END
%token COMMA
%token EOF
 I the used them to define the required parsing sequence:
LEFT_BRACE; s = elem; COMMA; p = elem; COMMA; o = elem; COMMA; c  = elem; RIGHT_BRACE 
elem:
 | v = VAR {Variable v}
 | c = STRING {Constant c} 

The elem is there to differentiate constants and variables and consequently pass the parsed value into a relevant type constructor.

You also define the parsing function and its return type
 start < config .tuple > parse
Now once we have the parser, we can move to the lexer file.  Here we define rules using regular expressions in order to match, capture and convert strings into the previously defined tokens.

In my case:

rule lex = parse
  | [' ' '\t' '\n']      { lex lexbuf }
  | newline         { next_line lexbuf; lex lexbuf }
  | ","                 { COMMA }
  | "("                 { LEFT_BRACE }
  | "{"                {START}
  | eof                {EOF }

To do the actual parsing, we use:
Parser.parse_tuple Lexer.lex (Lexing.from_string s)
the Real World OCaml: helps us understand what's happening
[Lexing.from_string] function is used to construct a lexbuf [from a string], which is passed with the lexing function [Lexer.lex] to the [Parser.parse] functions.
That's all, folks!

Pitfalls.

I was expecting Menhir to provide me with nice exceptions to debug my code, as was promised in the Real World OCaml:
The biggest advantage of Menhir is that its error messages are generally more human-comprehensible, ...
but it didn't. At least, I couldn't find the way to invoke it.
In any case, the book does provide you with some code to make the debugging a bit easier.

Anyway, online regular expression editor is your friend - I used http://www.regexr.com. Use it to test your regular expression and adjust the lexer accordingly.

One more thing, the order of your lexer rules matters!

Finally, in addition to the the Real World OCaml book chapter, I also found very useful this example and this Mini Ocaml tutorial.

That's about it, very simple and elegant once you get your head around it.

Disclaimer: As I mentioned in the beginning, I am a newbie, so please let me know if I got some of the stuff wrong or there is a more efficient way of doing things. I am more than happy to hear from you guys!

No comments:

Post a Comment