There are a number of code formatters available, some in the public domain and some as part of integrated development packages. These programs range from the very basic to the slightly more complex: some generate true code that can be compiled, while others turn the document into HTML or a similar format.
All have one major limitation, however: each works for one language and one language only. A source code formatter written to format C code will not work on Pascal or SR code, nor will it work very well with C++ code. This means that anyone who programs in several languages needs several programs, one for each language, each potentially with a different syntax and producing results with slightly different formatting.
I intend to design and write a completely generic source code formatter, which will be able to format any specified language into any given format.
As an implementation detail, it is expected that processing a new source code definition will be slow, so there may be some method of saving the generated analyser so that it can be reused without being recreated from scratch. It is also expected that the lexical and syntax analysis of a new piece of source code in a normal-sized language, e.g. Pascal, will be done in real time. This is despite the fact that the program is being implemented in an interpreted language and the syntax analyser is also, effectively, interpreted.
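One way the save-and-reuse idea could work is sketched below in Python. This is only an illustration under stated assumptions: `build_analyser` is a hypothetical stand-in for the slow step that turns a definition into an analyser, and the cache layout (one pickle file named after a hash of the definition text) is an assumption, not part of the design.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def build_analyser(definition):
    """Hypothetical stand-in for the slow step that builds an analyser."""
    return {"words": sorted(set(definition.split()))}


def load_analyser(definition, cache_dir):
    """Return (analyser, came_from_cache) for the given definition text."""
    # The file name depends only on the definition text, so an edited
    # definition is rebuilt rather than served from a stale cache entry.
    key = hashlib.sha256(definition.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.pickle"
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh), True
    analyser = build_analyser(definition)
    with path.open("wb") as fh:
        pickle.dump(analyser, fh)
    return analyser, False


with tempfile.TemporaryDirectory() as cache_dir:
    first, first_cached = load_analyser("begin end", cache_dir)
    second, second_cached = load_analyser("begin end", cache_dir)
```

In this sketch the first call builds and stores the analyser, and the second call reads it back from disk without rebuilding it.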
The language of the code to be formatted must be specified to the program. To do this an adaptation of BNF will be used. The format will be:
Keywords
##
Terms
##
Special
##
Ignores
##
Structure
Throughout this, Identifier is any sequence of characters and numerals, with no special characters.
In addition to this, a simple programming language has been developed to pass information to the display section of the program. Its constructs can be embedded between identifiers in the Structure section of the definition, enclosed in braces { and }. This programming language is defined later in this document.
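As a rough illustration of the layout above, the following Python sketch splits a definition into its five sections at the `##` separator lines and checks whether a word is an Identifier. The function names, the regular expression (letters and digits only) and the contents of the example definition are all assumptions made for the sketch; the format itself only fixes the section order and the separator.

```python
import re

SECTION_NAMES = ["Keywords", "Terms", "Special", "Ignores", "Structure"]
# Assumption: "characters and numerals" means ASCII letters and digits.
IDENTIFIER = re.compile(r"[A-Za-z0-9]+")


def split_definition(text):
    """Split a definition into its five sections, in the order listed."""
    # A line consisting solely of "##" separates one section from the next.
    parts = re.split(r"^##[ \t]*$", text, flags=re.MULTILINE)
    if len(parts) != len(SECTION_NAMES):
        raise ValueError(
            f"expected {len(SECTION_NAMES)} sections, found {len(parts)}"
        )
    return {name: part.strip() for name, part in zip(SECTION_NAMES, parts)}


def is_identifier(word):
    """True if the word contains only letters and digits."""
    return IDENTIFIER.fullmatch(word) is not None


# Hypothetical section contents, for illustration only.
example = """begin end
##
ident number
##
:= ;
##
space newline
##
Program -> begin end
"""
sections = split_definition(example)
```

Here `sections["Keywords"]` holds the text of the first section and `sections["Structure"]` the text of the last.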
I propose that this definition language be named Source Code Definition Language (SCDL).
This is the definition of the language that can be embedded into a source code definition to keep track of the current state of the source code and to instruct the program's display component to output parts of the source code.
This language implements sequence, repetition and selection in addition to integer only variables and some predefined functions.
One or more commands can be embedded before or after each term in a production defining the structure of the language.
If the action is immediately after a token then token contains a textual representation of the token. Only very limited functionality is available for this string: it cannot be compared with another string, but the string functions described below can be used.
operator or keyword, and the content of the given token. This allows the output section to format different types of tokens in different ways.
and are sent to the output section as such:
A semicolon separates each statement.
The following functions are included:
The following binary operator will be available for use with integers:
The following binary comparison operators will be available for use with integers:
Currently there is no negation function (for example, not).
If no instruction is given after a token then it will default to print(token).
The above will have gone some way to providing an informal idea of the EAL. Below it is defined in a slightly more formal way, though it is still a little imprecise by intention:
InstructionList -> Instruction ";" | Instruction ";" InstructionList
Instruction -> Selection | Iteration | Assignment | Command
Selection -> "if" Condition "then"
Iteration -> "while" Condition "{" InstructionList "}"
Condition -> Expression binary operator Expression
Expression -> identifier | "length(token)" | Expression binary operator Expression
Assignment -> identifier ":=" Expression
Command -> print(String) | send(String) | send(String,String)
String -> """ string """ | "token" | string function "(" String ")"
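To make the grammar above more concrete, here is a small Python sketch of an evaluator for already-parsed instruction lists. It is not the project's implementation. Several things are assumptions added for illustration: the tuple representation of instructions, the operator sets `+`, `-`, `<` and `=`, integer literals, unset variables reading as 0, and a selection body written like the body of a while loop. Commands are collected in a list rather than sent to a real output section.

```python
import operator

# Assumed operator sets; the actual operators are not specified here.
BINARY = {"+": operator.add, "-": operator.sub}
COMPARE = {"<": operator.lt, "=": operator.eq}
TOKEN = object()  # stands for the predefined name `token`


def evaluate(expr, env, token):
    """Evaluate an integer Expression."""
    if isinstance(expr, int):      # integer literal (an assumption)
        return expr
    if expr == "length(token)":
        return len(token)
    if isinstance(expr, str):      # identifier; unset variables read as 0
        return env.get(expr, 0)
    left, op, right = expr         # Expression binary-operator Expression
    return BINARY[op](evaluate(left, env, token), evaluate(right, env, token))


def holds(cond, env, token):
    """Evaluate a Condition: Expression comparison-operator Expression."""
    left, op, right = cond
    return COMPARE[op](evaluate(left, env, token), evaluate(right, env, token))


def text_of(string, token):
    """Resolve a String argument: either `token` or a quoted literal."""
    return token if string is TOKEN else string


def run(instructions, env, token, out):
    """Execute an InstructionList, appending commands to `out`."""
    for kind, *args in instructions:
        if kind == "assign":
            name, expr = args
            env[name] = evaluate(expr, env, token)
        elif kind == "while":
            cond, body = args
            while holds(cond, env, token):
                run(body, env, token, out)
        elif kind == "if":
            cond, body = args
            if holds(cond, env, token):
                run(body, env, token, out)
        elif kind in ("print", "send"):
            out.append((kind, *(text_of(a, token) for a in args)))
        else:
            raise ValueError(f"unknown instruction: {kind}")


# Hypothetical action: increase depth, indent by depth, then emit the token.
program = [
    ("assign", "depth", ("depth", "+", 1)),
    ("assign", "i", 0),
    ("while", ("i", "<", "depth"),
     [("print", "  "), ("assign", "i", ("i", "+", 1))]),
    ("print", TOKEN),
    ("send", "keyword", TOKEN),
]
env, out = {"depth": 1}, []
run(program, env, "begin", out)
```

After the run, `out` holds two indentation prints, a print of the token text and one `send` carrying a token type and the token content, which is the kind of information the output section would need to format different token types differently.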
The second main part of the project will be to convert the parsed text into some final form, which could be any of a number of defined formats, e.g. pure source code, HTML, PostScript, LaTeX, etc. This part of the project requires further investigation before a solution is reached, but the idea is that the output format will be specified using a fairly complex language.
The idea is that a language and an output format can be defined independently, allowing any language to be converted to any form. This could introduce the interesting concept of converting one language to be output in the same language; HTML is a brilliant example of this. It should also be possible to extend the definition of the output to take account of new structures in a language. This means that there will not be a predefined set of constructs (e.g. procedures) which can be applied to a source document, but a set of constructs that are developed or modified by the user. This would allow completely new constructs to be supported (e.g. the comparatively recent introduction of classes into programming languages) without the need for a program rewrite.
In addition to this, the program will be modular so that it can be easily developed in the future. One prime example of possible development is the front end. Initially a text-only front end for either UNIX or PC systems will be developed, but this could easily be replaced by a graphical user interface.
There are also a number of features that could be added either during initial development or at a later date, depending on a later assessment of what the project should entail. These include ordering blocks to reflect the order in which they are used, so that a series of procedures is only backward referencing (some languages and compilers that make only one pass at the compilation stage don't allow forward references to procedures), and some global table of variables, etc., which would allow a hypertext document to be created with links between uses of the same variable or between calls and declarations of the same procedure.
The output format language has not yet been defined.
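The separation between language and output format can be illustrated with a short Python sketch. Since the output format language is not yet defined, plain Python functions stand in for output definitions here. The token stream of (type, content) pairs, the type names and the choice of bold for keywords are all assumptions made for the example.

```python
import html


def to_plain(tokens):
    """Emit pure source code: token contents only, types ignored."""
    return "".join(text for _, text in tokens)


def to_html(tokens):
    """Emit HTML, marking keywords in bold and escaping special characters."""
    parts = []
    for kind, text in tokens:
        escaped = html.escape(text)
        parts.append(f"<b>{escaped}</b>" if kind == "keyword" else escaped)
    return "<pre>" + "".join(parts) + "</pre>"


# New output formats can be registered without touching the parser side.
OUTPUT_FORMATS = {"plain": to_plain, "html": to_html}

# Hypothetical token stream, as the parser side might deliver it.
tokens = [
    ("keyword", "if"), ("space", " "), ("identifier", "a"), ("space", " "),
    ("operator", "<"), ("space", " "), ("identifier", "b"),
]
plain_text = OUTPUT_FORMATS["plain"](tokens)
html_text = OUTPUT_FORMATS["html"](tokens)
```

The same token stream feeds both formatters, so adding a language or adding an output format are independent pieces of work in this arrangement.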
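The block reordering mentioned above amounts to a topological sort of the call graph. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The procedure names and the call graph are hypothetical, and how the graph would be extracted from parsed source is left open.

```python
from graphlib import TopologicalSorter


def backward_referencing_order(calls):
    """Order procedures so that each appears after everything it calls.

    `calls` maps a procedure name to the set of procedures it calls.
    Mutually recursive procedures have no such order, and graphlib
    raises CycleError for them.
    """
    # TopologicalSorter treats the mapped values as predecessors, so
    # callees are emitted before their callers.
    return list(TopologicalSorter(calls).static_order())


# Hypothetical call graph: main calls helper, helper calls util.
calls = {"main": {"helper"}, "helper": {"util"}, "util": set()}
order = backward_referencing_order(calls)
print(order)  # ['util', 'helper', 'main']
```

Writing the procedures out in this order means no procedure refers to one declared later, which is what a single-pass compiler without forward references requires.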
The previous version of this document is also available.