Structure of a Generic Source-Code Formatter

Introduction

In order to avoid repetition please refer to the summary document which contains a description of the problem.

The Components

This section does not attempt to justify the decisions made, as these will eb made elsewhere, but simply to explain teh arrangement of the system.

In order to function there are several tasks which have to be achieved:

  1. The lexical analysis of the provided source code using some given rules, e.g. what the keywords are
  2. The syntax analysis of the tokens using rules given as to the syntax of the language
  3. The execution of some small language for controlling what is sent to output and keeping track of such things as indentation levels
  4. Some output dependant translator which given an instruction such as SPACE can turn that into the required language's representation of a space. This is a simple language and this section may have to deal with more complicated constructs, e.g. the use of tables in HTML to lay the document out correctly
  5. Some methoid of parseing the rules as to the structure and syntax of the language so as they can be passed to the first four sections

in addition the following have to be catered for:

To achieve these goals it is best to consider the project as five seperate projects (these directly relate to numbered tasks above):

Formally

Before considering each of these in term I would like to provide a more formal definition of each. I relaise the notation is a little imprecise and strange but I don't want to provide a rigerous description at this point so I have chopsen to adopt some flexibility of natural language while applying some common notational techniques in places.

Lexcial Analyser
lex ( source code, list of nfa representing identifiers, list of keywords or static elements ) -> list of tokens
Syntax Analyser
analyser ( list of productions in language - with embedded peices of code, list of all names of productions, tokens and identifiers ) -> parse ( list of tokens ) -> list of embedded code in order representing the code matched with each production represented by each elemewnt in the source code
Small interpreted programming language
execute ( item of embedded code) -> a call to the Output section with te contents of any print() or send() command from the code
Output section
print() or send() -> some series of ASCII characters directly related to the arguments, i.e. either direct copies or output found in a simple lokup table of replacements, e.g. "SPACE" could translate as " " but 5 could not translate as five occurances of some symbol (unless every possibility had been entered in the lookup table)
Controller
  1. Given some language rules:
    1. creates a list of nfa representing identifier (nfa)
    2. creates a st of keywords or static elements (st)
    3. creates a list of productions with embedded code(grammar)
    4. creates a list of st of all names of productions, tokens and identifiers (grammarSymbols)
  2. execute ( analyser( grammar, grammarSymbols ).parse( lex( source code, nfa, st ) ) ) produces a series of calls to the Output Section

The Different Packages

The Lexical Analyser

The task of the lexical analyser is to take :

and allow the user to access the list of tokens and special strings through a stack. It was decided that instead of creating a stack the lexical analyser would in fact implement a stack, which would be initialised with this list.

The primary components which had to be created for this section were:

After this is was simply a matter of finding the longest string from input which matches one of the above.

The implementation of the nfa class is complex and deserves a whole section to itself. This will be added later.

The Syntax Analyser

This section involved taking a grammar and a list of all grammar elements and constructing a parse table from them. The algorithm for this is taken from the Aho, Sethi & Ullman book. Despite the algorithm being fairly simple to write it required a great deal of supplementary routines and classes to emulate classes and varoius other data structures.

Once this has been done it becomes possible to analyse a list of tokens using the parse table with the rules defined in the Aho, Sethi & Ullman book. By applying these rules we end up with a list of products and tokens which the source code uses. These can either be used to build up a tree or execute the code attached to the productions.

This section is by the far the most complex part of the assignment, and can be compared to other compilier generators such as yacc, bison and JavaCC. The primary difference is that instead of creating source code in some language it creates data structures which contain the same information but have to be accessed by routines which also have to be written, rather than using the built in acapabilities of a lnaguage, such as accessing the elements of an array.

The Other Sections

There otehr sections are not fell enough designed to comment on them here yet.

 


Last updated on 09 January 1998 by Leo Crawford