Structure of a Generic Source-Code Formatter
Introduction
In order to avoid repetition please refer to the summary document which contains a
description of the problem.
The Components
This section does not attempt to justify the decisions made,
as these will eb made elsewhere, but simply to explain teh
arrangement of the system.
In order to function there are several tasks which have to be
achieved:
- The lexical analysis of the provided source code using
some given rules, e.g. what the keywords are
- The syntax analysis of the tokens using rules given as to
the syntax of the language
- The execution of some small language for controlling what
is sent to output and keeping track of such things as
indentation levels
- Some output dependant translator which given an
instruction such as SPACE can turn that into the required
language's representation of a space. This is a simple
language and this section may have to deal with more
complicated constructs, e.g. the use of tables in HTML to
lay the document out correctly
- Some methoid of parseing the rules as to the structure
and syntax of the language so as they can be passed to
the first four sections
in addition the following have to be catered for:
- The prescence of elements, such as comments or
preprocessor declaratives which can't be parsed
- The prescence of irrelevant elements in the source code,
such as extraneous spaces or tabs
To achieve these goals it is best to consider the project as
five seperate projects (these directly relate to numbered tasks
above):
- A generic lexical analyser which accepts it's definitions
at run time
- A run-time syntax analyser generator
- An implementation of a small interpreted programming
language
- An Output section which contains basic translation and
conversion
- A controlling sections
Formally
Before considering each of these in term I would like to
provide a more formal definition of each. I relaise the notation
is a little imprecise and strange but I don't want to provide a
rigerous description at this point so I have chopsen to adopt
some flexibility of natural language while applying some common
notational techniques in places.
- Lexcial Analyser
- lex ( source code, list of nfa representing identifiers,
list of keywords or static elements ) -> list of
tokens
- Syntax Analyser
- analyser ( list of productions in language - with
embedded peices of code, list of all names of
productions, tokens and identifiers ) -> parse ( list
of tokens ) -> list of embedded code in order
representing the code matched with each production
represented by each elemewnt in the source code
- Small interpreted programming language
- execute ( item of embedded code) -> a call to the
Output section with te contents of any print() or send()
command from the code
- Output section
- print() or send() -> some series of ASCII characters
directly related to the arguments, i.e. either direct
copies or output found in a simple lokup table of
replacements, e.g. "SPACE" could translate as
" " but 5 could not translate as five
occurances of some symbol (unless every possibility had
been entered in the lookup table)
- Controller
- Given some language rules:
- creates a list of nfa representing
identifier (nfa)
- creates a st of keywords or static
elements (st)
- creates a list of productions with
embedded code(grammar)
- creates a list of st of all names of
productions, tokens and identifiers
(grammarSymbols)
- execute ( analyser( grammar, grammarSymbols
).parse( lex( source code, nfa, st ) ) ) produces
a series of calls to the Output Section
The Different Packages
The Lexical Analyser
The task of the lexical analyser is to take :
- some list of tokens represented as strings as some flag
as to whether they are case sensitive or not
- some list of tokens represented as non-determenistic
finitte automata (allows regular expressions to be
matched)
- some list of nfa's to ignore
- some list of nfa's to pass as a special case when found,
e.g. comments
and allow the user to access the list of tokens and special
strings through a stack. It was decided that instead of creating
a stack the lexical analyser would in fact implement a stack,
which would be initialised with this list.
The primary components which had to be created for this
section were:
- an implementation of a nfa
- an implementation of a list of strings and a flag
indicating wether they are case sensitive, together with
a checking mechanism, which, given a string will decide
if (taking into account all issues of case sensitivity):
- the string won't match any string in the list
whatever you concatenate to the end of the string
- the string exactly matches an element in the list
- the string could match some string in the list
whatever you concatenate to the end of the string
After this is was simply a matter of finding the longest
string from input which matches one of the above.
The implementation of the nfa class is complex and deserves a
whole section to itself. This will be added later.
The Syntax Analyser
This section involved taking a grammar and a list of all
grammar elements and constructing a parse table from them. The
algorithm for this is taken from the Aho, Sethi & Ullman
book. Despite the algorithm being fairly simple to write it
required a great deal of supplementary routines and classes to
emulate classes and varoius other data structures.
Once this has been done it becomes possible to analyse a list
of tokens using the parse table with the rules defined in the
Aho, Sethi & Ullman book. By applying these rules we end up
with a list of products and tokens which the source code uses.
These can either be used to build up a tree or execute the code
attached to the productions.
This section is by the far the most complex part of the
assignment, and can be compared to other compilier generators
such as yacc, bison and JavaCC. The primary difference is that
instead of creating source code in some language it creates data
structures which contain the same information but have to be
accessed by routines which also have to be written, rather than
using the built in acapabilities of a lnaguage, such as accessing
the elements of an array.
The Other Sections
There otehr sections are not fell enough designed to comment
on them here yet.
Last updated on 09 January 1998 by Leo Crawford