Content

Parsers

Every programmer has to deal with input data. The input are commands for the program to process. These commands can be very basic, but often they become more complex. They become what is called a domain specific language. Then coding the processing of the input, which ultimate goal is to call the program functions becomes a nightmare. Syntax directed parsing is a technique for addressing these difficulties. In syntax directed parsing, the coding of the input processing is constructed automatically, by means of a standard algorithm and based on a high level description of the input data structure. The program functions required to apply to the input data are directly attached to this description in a convenient way. This description is called a grammar for the domain specific language and the algorithm that creates the code is called a parser generator. The input for the compiler that defines the grammar is called a syntax file. The ouptut of compiler is a code file and called a parser. The parser becomes part of the program to process the input. The functions attached to the grammer are called reduction procedures. They are linked to the structures of the input, so that they parser will call them at precisely the right times with precisely the right data. When you write reduction procedures, you can concentrate entirely on what you have to do with the data. You don't have to encumber your code with switches and tests to determine the structure of your input. Often the process of a parser is split into two phases: lexical and syntax analysis. A lexical scanner is a program or procedure used as a preprocessor for a parser. It scans input characters and lumps them together into tokens. The tokens are defined in the grammer.

Anagram

The best way to learn how to specify a grammaer and using a parser generator is to experiment with one. There are a number of available tools:

We choose to work with Anagram as AnaGram does not need a lexical scanner for most practical problems. AnaGram avoids the need for a lexical scanner by using character sets and disregard and lexeme statements. If a problem does in fact need a lexical scanner, AnaGram itself can define such scanner.

Command Interpreter

We will develop here a command interpreter. The goal of a command interpreter is to map input to valid commands the program can execute. The input for a command interpreter is a string of characters. This string of characters should be split up in a sequence of commands. It is the purpose of grammer to specify the structure of the input string of characters, so the parser can use this grammer to parse the string into a sequence of valid commands.

An input stream always contains characters, which are terminal tokens. We need to define the different sets of characters we need to recognize. We choose for digit,letter, delimiters:end-of-line, end-of-file and white characters, operators, blocks and text char. We define whitespace as a white character or a comment. A comment is anything appearing after // till the end of the line.

/*
-: range of chars
+: union of sets
~: complement of set
*/

// CHARACTER SETS
digit = '0-9'   										          
letter = 'a-z'+'A-Z'+ '_'   							         
eof = 0 + ^Z + ^D + 1	// 0=c-string / 1=input stream / ^Z eof windows / ^D eof  unix		
eol = '\n'									          
white = ' ' + '\t' + '\v' + '\f' + '\r' // white space: space or tab or vertical tab or formfeed or return
operator = '#' + '=' + '<' + '>' + '|'	+ '@'   		          
block = '(' + ')' + '{' + '}' + '[' + ']' + '"' 
text char= ~(eof + eol + white + operator + block)   	

// WHITESPACE 
ws
 -> white | comment

comment
 -> ["//",~(eof+'\n')?...],eol	 

We now introduce the following nonterminal tokens in the grammar definition: Script File, Command Line, Command, Identifier and Parameters. The script file contains zero or more command lines with commands. The commands consists of an identifier and zero or more parameters. We define that a parameter starts with / has a two letter/digit code and optional a parameter value. A parameter value can be any text optionally between double qoutes.

/*
$: defines the top level grammar token 
->: produces
[]: optional rule 
|: or
...: repitition
,:delimiter for tokens
?: optional token
*/
script file $								
 -> [execution block | eol]..., eof			
command line:
 -> eol?...,[command, eol?...]...
command
  -> identifier, parameters
identifier
 -> literal
parameters
 -> parameter?... ,eol
parameter
 -> "/",parameter code,parameter value?
parameter code
 -> [digit|letter],[digit|letter]
parameter value
 -> '"'?,literal,'"'?

Now the parser must do something usefull during the parsing phase. We must let the parser store the pieces of information its finds in some data structure we can later use to perform the command. For a command to run we must know its identifier and zero or more parameters, a parameter consist of a code and optionally a parameter value. We define two structures, param and command.

struct param {
	char* pname;
	char* pvalue;
};
struct command {
	char* cname;   			// command name
	vector<param>vp;     //a vector with parameters
}
When the parser finds an command identifier it should create a command object and assign the identifier to the cname field. Once it finds a parameter a new parameter object is created and added to the parameter vector vp. Once the whole command line has been parsed the parser initiates an exec() command. The exec funtion tries to locate the identifier in a symbol table, it checks all the parameters on exisitence and convert the parameter values to the correct type and the calls the program function with the parameters. The parser reads characters by character and we need to store these characters whenever we need to use them later. We use a stack accumulator, sa to store individual characters, and once the characters form a token, we store a reference to these tokens in the baove defined structure. Our grammar above still lacks the definition of literal.


Reference

  • [1]    A Lex & Yacc Tutorial
  • [2]    The gritty details of a lexical analyzer in c using re2c
  • [3]    RE2C
  • [4]    Anagram
  • [5]    Lex and Yacc page
  • [6]    Lexer and Parser Generators
  • [7]    Lemon Tutorial

  • Copyright 2012 Jacq Krol. All rights reserved. Created July 2012; last updated July 2012.