crs Lexical Analysis

Lexical Specification

Deterministic Finite Automaton

The following regular expressions define our language's token patterns for lexical analysis. For brevity's sake, a variant of the POSIX regular expression syntax is shown.

Pattern Label   Simplified Expression                  Final Regular Expression
letter          a..z | A..Z                            [a-zA-Z]
digit           0..9                                   [0-9]
nonzero         1..9                                   [1-9]
alphanumeric    letter | digit | _                     [a-zA-Z]|[0-9]|_
identifier      letter (alphanumeric)*                 [a-zA-Z]([a-zA-Z]|[0-9]|_)*
fraction        .digit* nonzero | .0                   \.(([0-9]*[1-9])|0)
integer         nonzero digit* | 0                     (([1-9][0-9]*)|0)
float           integer fraction (e (+|-)? integer)?   (([1-9][0-9]*)|0)\.(([0-9]*[1-9])|0)(e(+|-)?(([1-9][0-9]*)|0))?
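
These composed expressions can be checked against sample lexemes in any POSIX-style engine. As a quick illustration, here is a sketch using Python's re module (the pattern names are ours and this snippet is not part of the compiler):

```python
import re

# The final regular expressions from the table above, transcribed into
# Python syntax. Illustrative sketch only; not part of the compiler.
INTEGER = r"(([1-9][0-9]*)|0)"
FRACTION = r"\.(([0-9]*[1-9])|0)"
FLOAT = INTEGER + FRACTION + r"(e(\+|-)?" + INTEGER + r")?"
IDENTIFIER = r"[a-zA-Z]([a-zA-Z]|[0-9]|_)*"

assert re.fullmatch(INTEGER, "0")
assert not re.fullmatch(INTEGER, "012")    # no leading zeros
assert re.fullmatch(FLOAT, "120.34e+10")
assert not re.fullmatch(FLOAT, "1.250")    # fraction cannot end in a trailing zero
assert re.fullmatch(IDENTIFIER, "abc_123")
```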

From the regular expressions for tokens defined above, as well as additional simpler language features, a deterministic finite automaton (DFA) was constructed. Tools such as AtoCC and FSM Simulator were used to confirm the validity of the regular expressions as finite state machines.

The FSM diagram was created using draw.io and can be modified by opening the lexer_dfa.xml file.

[Figure: DFA of the lexer]

State Transition Table

The automaton implementation uses a static state transition table. Any change to the lexical specification above requires rebuilding this table, but should not significantly impact the rest of the compiler implementation.
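
Concretely, a table-driven scanner reduces to a small loop that repeatedly indexes this table. The following sketch (with hypothetical row and helper names; this is not the project's actual code) shows the shape of such a loop, including the final-state and backtracking behaviour described below:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of one row of the state transition table.
# Illustrative sketch only; not the project's actual implementation.
@dataclass
class Row:
    transitions: Dict[str, int]           # transition label -> next state
    is_final: bool = False
    backtrack: bool = False
    token_class: Optional[str] = None

def scan(source: str, table: Dict[int, Row], classify) -> List[Tuple[str, str]]:
    tokens, state, lexeme, pos = [], 1, "", 0     # state 1 is the start state
    while pos < len(source):
        state = table[state].transitions.get(classify(source[pos]), 0)
        if table[state].is_final:
            if not table[state].backtrack:        # consume the current character
                lexeme += source[pos]
                pos += 1
            tokens.append((table[state].token_class, lexeme))
            state, lexeme = 1, ""                 # return to the start state
        else:
            lexeme += source[pos]
            pos += 1
    return tokens
```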

The following legend defines some shortcuts for transition characters:

Transition Label   List of Valid Characters                Regular Expression
whitespace         (space), \t, \r                         (\ +\t+\r)
newline            \n                                      \n
letter             a..z and A..Z, except for lowercase e   (a+b+c+d+f+g+h+i+j+k+l+m+n+o+p+q+r+s+t+u+v+w+x+y+z)+(A+B+C+D+E+F+G+H+I+J+K+L+M+N+O+P+Q+R+S+T+U+V+W+X+Y+Z)
digit              0..9                                    (0+1+2+3+4+5+6+7+8+9)
nonzero            1..9                                    (1+2+3+4+5+6+7+8+9)
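
In code, these labels usually come from a small character-classification helper that maps each input character to its table column. A possible sketch (hypothetical; the project may classify characters differently):

```python
# Hypothetical helper mapping an input character to its transition label
# (table column). Sketch only; not the project's actual implementation.
def classify(ch: str) -> str:
    if ch in (" ", "\t", "\r"):
        return "whitespace"
    if ch == "\n":
        return "newline"
    if ch == "e":                 # lowercase e gets its own column (exponent marker)
        return "e"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch == "0":
        return "0"
    if ch.isdigit():
        return "nonzero"
    return ch                     # punctuation characters label their own columns
```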

This state transition table was constructed from the above DFA. The first column indicates the current state, and the last few columns record that state's peculiarities: whether it is final, whether backtracking is required, whether it is an error state, and which token class it emits. The remaining columns correspond to the possible input characters, and each entry gives the transition function, i.e. the next state for that input.

Current State   whitespace  newline  letter  e  nonzero  0  ( ) { } + - * / % ! & | ; > < = _ . : [ ]   Final State  Backtrack Required  Error State  Token Class
0000000000000000000000000000001
118224535121314152021942227232519283116051464950000
2332222333333333033333323333000
3111111111111111111111111111110<identifier,> / <keyword,>
4555555500005670000000000000000
5111111111111111111111111111110<math_operator,/>
6111111111111111111111111111100<open_multi_line_comment>
7111111111111111111111111111100<single_line_comment>
8111111111111111111111111111100<newline>
9111111111111110000110100000000000000000
10111111111111111111111111111100<close_multi_line_comment>
11111111111111111111111111111110<math_operator,*>
12111111111111111111111111111100<open_parens>
13111111111111111111111111111100<close_parens>
14111111111111111111111111111100<open_curly_brace>
15111111111111111111111111111100<close_curly_brace>
161818181818181800001801800000001700000000
17111111111111111111111111111100<relational_operator,==>
18111111111111111111111111111110<assignment_operator>
19111111111111111111111111111100<semicolon>
20111111111111111111111111111100<math_operator,+>
21111111111111111111111111111100<math_operator,->
22111111111111111111111111111100<math_operator,%>
230000000000000000240000000000000
24111111111111111111111111111100<logical_operator,and>
250000000000000000026000000000000
26111111111111111111111111111100<logical_operator,or>
27111111111111111111111111111100<logical_operator,not>
282929292929292900002902900000003000000000
29111111111111111111111111111110<relational_operator,>>
30111111111111111111111111111100<relational_operator,>=>
3134343434343434000034034000003203300000000
32111111111111111111111111111100<relational_operator,<>>
33111111111111111111111111111100<relational_operator,<=>
34111111111111111111111111111110<relational_operator,<>
35444400000444444444444444404444444444440360044000
3600003737000000000000000000000000
3743430393738043434343434343430434343434343000043000
3800003738000000000000000000000000
390000424100004040000000000000000000
4000004241000000000000000000000000
4143430000043434343434343430434343434343000043000
424343004242043434343434343430434343434343000043000
43111111111111111111111111111110<float,>
44111111111111111111111111111110<int,>
4544440045450444444444444444404444444444440360044000
4648484800000000004800000000004700000
47111111111111111111111111111100<scope_resolution_operator>
48111111111111111111111111111110<inheritance_operator>
49111111111111111111111111111100<open_square_bracket>
50111111111111111111111111111100<close_square_bracket>
51111111111111111111111111111100<accessor_operator,.>

Note that for rows which correspond to final states, the next transition function is never computed, since arriving at a final state results in a return to the initial state (after backtracking, if applicable). For simplicity's sake, and to indicate this return to the start state, these rows are filled with 1.

Language Keywords

The current set of keywords available in the language are:

  • if
  • then
  • else
  • for
  • class
  • get
  • put
  • return
  • program
  • int
  • float
  • bool
  • and
  • or
  • not

If an <identifier,> token is found, it must first be compared against the above list; a match results in a keyword token class instead, with the exception of and, or and not, which result in a token class of logical_operator.
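
A possible shape for that lookup (a hypothetical sketch whose table contents mirror the list above):

```python
# Hypothetical keyword lookup applied after an identifier is recognized.
# Sketch only; the names here are illustrative.
KEYWORDS = {"if", "then", "else", "for", "class", "get", "put",
            "return", "program", "int", "float", "bool"}
LOGICAL_OPERATORS = {"and", "or", "not"}

def reclassify(lexeme: str) -> tuple:
    if lexeme in LOGICAL_OPERATORS:
        return ("logical_operator", lexeme)
    if lexeme in KEYWORDS:
        return ("keyword", lexeme)
    return ("identifier", lexeme)
```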

Lexical Errors

The three types of errors generated by the lexer are:

  • Unrecoverable Error (1): The input file path provided does not exist on the filesystem.
  • Recoverable Error (2): Input character is unrecognized.
  • Recoverable Error (3): Invalid syntax: error state reached. Resetting FSM to initial state.

Every undefined transition in the above DFA results in one of the two recoverable error types, as demonstrated by the wide variety of integration tests.

All recoverable errors display the offending line and its line number while the source code is tokenized.

[Figure: example of a recoverable error and the displayed line number]

The panic mode technique is used for error recovery: invalid input characters are skipped until a valid program segment is encountered. An error is reported to the user, but lexical analysis and the identification of tokens continue afterwards.
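
In terms of the transition table, panic mode can be as simple as discarding characters until one of them has a defined (non-error) transition out of the start state. A hypothetical sketch, reusing the Row structure from earlier:

```python
# Hypothetical panic-mode recovery: discard characters until one has a
# defined (non-error) transition out of the start state. Sketch only.
def recover(source: str, pos: int, table, classify, report) -> int:
    start = pos
    while (pos < len(source)
           and table[1].transitions.get(classify(source[pos]), 0) == 0):
        pos += 1                                  # skip the invalid character
    report(f"skipped invalid input {source[start:pos]!r}; resuming scan")
    return pos
```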

All error output is logged both to the console and to the error.log file at the root of the repository whenever lexical analysis occurs.
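
Mirroring output this way is straightforward with most logging libraries. For instance, in Python (one possible setup, not necessarily the compiler's own configuration):

```python
import logging

# One way to send lexer errors to both the console and error.log.
# Illustrative configuration only.
logger = logging.getLogger("lexer")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler())            # console output
logger.addHandler(logging.FileHandler("error.log"))   # file at repository root

logger.error("line 12: Invalid syntax: error state reached.")
```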