Problem Set 1 Harvard Extension School CSCI E-95: Compiler Design and Implementation Fall 2023 Due: September 17, 2023 at Midnight ET (Eastern Time) 1. (100 Points) Using the lexical-analyzer generator Flex, write a lexer to return tokens from an input stream. The tokens to be recognized include identifiers, reserved words, operators, separators, and constants. Our tokens are taken from a simplified version of the ISO C89 language. Do not attempt to match tokens that are not acceptable -- this includes tokens that are explicitly mentioned as not acceptable. You may utilize a single default rule in your lexer to emit errors and skip over characters that are not in valid tokens. Keep in mind that the stream of tokens that are produced by your lexer is the only information that is fed to future components of your compiler. Therefore, that stream of tokens must include all information that is needed by those later compiler components. All code for your lexer -- and for your entire compiler as you add components to your lexer in future problem sets -- should be placed in the src/compiler directory. For the code that you write to implement your compiler, you are welcome to use all of the syntax and features of the latest ISO/ANSI C Programming Language. The accepted input Source Character Set is composed of the 52 upper case and lower case alphabetic characters; the 10 decimal digits; the space, horizontal tab, vertical tab, and form feed characters; the 29 graphic characters below; the new line character; the 3 extra graphic characters dollar sign, at-sign, and accent grave. Note that these extra graphic characters are only acceptable as part of a string literal, character literal, or comment. This Source Character Set is the complete set of characters that are allowed to appear in your source code input file. As mentioned in Harbison & Steele, Section 2.1.1 on Page 12, there is also an Execution Character Set. In our implementation, the Execution Character Set is the ASCII character set extended to include all characters that can be represented in an unsigned char. The Execution Character Set is the complete set of characters that can be read or written by a C program during a program's execution. Any characters in the Execution Character Set can be represented in character and string constants by using escape sequences that begin with the backslash ( \ ) character (see below). An implementation that is targeted towards Windows in addition to Unix/Linux may decide to accept carriage return as an additional acceptable character in the Source Character Set. A Windows implementation should accept carriage return followed by new line as an acceptable end-of-line character. In any case, all implementations must accept a new line character by itself as a valid end-of-line character. If a carriage return is encountered without a following new line character, an error should be emitted. Graphic characters: exclamation point plus double quote number sign equal left brace percent tilde right brace circumflex left bracket comma ampersand right bracket period asterisk backslash less than left parenthesis vertical bar greater than right parenthesis semicolon slash hyphen or minus colon question mark underscore apostrophe Whitespace is defined as any sequence of the blank (space), new line, vertical tab, form feed, and horizontal tab characters. Implementation of continuation of source lines (i.e., the ability to end a line with a backslash character immediately before the end-of-line marker so that it is combined with the next source line as one logical source line) as described in Section 2.1.2 in Harbison & Steele is not a required feature of your compiler. Comments are delimited by /* and */. They are ignored by the lexer and are not returned as tokens. See Section 2.2 in Harbison & Steele: comments are to be replaced by a single space character. A token returned by the lexer is always the longest possible recognizable valid sequence of characters. An identifier is a sequence of letters, digits, and underscores. An identifier must not begin with a digit, and it must not have the same spelling as a reserved word. As stated in Harbison & Steele, Section 2.5, your implementation must permit a minimum of 31 significant characters in identifiers. The reserved words are: do for return break short else goto signed unsigned char if void int continue long while The operators and separators are: simple operators: ! % ^ & * - + = ~ | < > / ? compound assignment operators: += -= *= /= %= <<= >>= &= ^= |= other compound operators: ++ -- << >> <= >= == != && || other separator characters: ( ) [ ] { } , ; : The constants (also known as literals) are integers, characters, and strings. We will not accept floating-point constants. The integers may be only decimal (we will not accept octal or hexadecimal integer constants). Note that a constant composed of a single '0' digit will be deemed to be an acceptable integer constant; a single '0' digit would otherwise be an octal integer constant. We will not accept any integer suffixes (l, L, ll, LL, u, or U). Even though the C Standard allows character constants to be composed of multiple characters, in our implementation, a character constant is a single character enclosed within single quotes. Additionally, we are not allowing wide character constants. A string constant is a sequence of characters enclosed within double quotes. We are not requiring the implementation of string concatenation when there are adjacent string constants. Your implementation must accept the C89 requirement of at least 509 characters within a string literal (not including the appended \0 character), but may accept longer string literals (see the extra credit for unlimited string literal length). If you place a limit on the length of string literals, an error should be emitted for string literals that exceed that length. Within character and string constants, we will accept characters in backslash notation -- so called escape codes. In particular, we will allow only octal and character escape codes in our language. Octal escape codes may be one, two, or three octal digits in length. Character escape code may be n, t, b, r, f, v, \, ', ", a, and ?. Our subset implementation of the C Programming Language will most closely follow the Standard C 1989 (also known as "C89"). Please refer to Harbison and Steele, C: A Reference Manual, Fifth Edition for additional details of the language. In order to correctly determine the type of decimal integer constants, it is necessary to know the size of the types of integer constants. This is required to be able to determine whether an integer constant is of type int, long, or unsigned long (see Harbison and Steele, Section 2.7.1 and Table 2-5). For our implementation, we will use the following sizes for types: Type Size in bytes ---- ------------- signed char 1 unsigned char 1 signed short 2 unsigned short 2 signed int 4 unsigned int 4 signed long 4 unsigned long 4 In order to determine if character constants need to potentially have sign-extention (see Harbison & Steele, Section 2.7.3 on the top of page 31), it is necessary to know what is the type of a "char" without a "signed" or "unsigned" modifier. For our C subset, in all cases, a type without a "signed" or "unsigned" modifier will be considered to be signed. Your compiler should not have separate internal representations for character constants and integer constants. As mentioned in Harbison and Steele, Section 2.7.3, "Character constants not preceded by the letter L have type int." During lexical analysis, you should make sure to validate and convert character constants into their appropriate numeric value based on the ASCII encoding value of a single character or the interpretation of an escape code as described in Harbison and Steele, Sections 2.7.5 through 2.7.7. You will need to have a separate internal representation for string constants. Because SPIM, our target assembler, accepts C-style escapes in its .ascii and .asciiz directives, you may decide to represent strings internally by the same text they contained in the source program. You must still validate the string to ensure that any escape characters within it are legitimate. You may earn extra credit for this assignment by properly interpreting escapes in a string constant, as described below. Your lexical analyzer will use the generated yylex function as your interface to the lexer. The lexer should make available to the caller an enumerated type indicating the type of token found (reserved word, operator, separator, identifier, or literal) and the value of the token, as appropriate. For tokens that are either a reserved word, an operator, or a separator, the lexer should return an enumerated type that indicates which specific reserved word, operator, or separator was found. If the token is an identifier, the lexer should return the name of the identifier as a dynamically-allocated string pointed to by yylval. If the token is a literal, the kind of literal (integer or string) should be identified by the return value and the value of the literal should be returned through yylval. If the token is an integer literal (including character literals -- see Harbison and Steele, Section 2.7.3), yylval should point to a struct that contains: (1) an enumerated type that indicates the type of the integer literal (see Harbison and Steele, Section 2.7.1 and Table 2-5) and (2) the value of the integer literal. If the token is a string literal, yylval should point to the value of the string. As metioned in the preceding paragraph, all these values will be made available by using facilities provided by the lexical-analyzer generator. For example, in Lex, the code in the "rules" section should return the token name/value that will be defined in the input file to Yacc as a %token. Before integrating with Yacc, the values of the token names can be defined in an appropriate header (i.e., .h file). The token name/value is the value that Yacc will use to match against the token name specified in the body of a Yacc translation rule. All other information should be returned via yylval which is accessible in Yacc via $, the attribute associated with the th grammar symbol in the body of a translation rule. Your lexer must be able to keep track of the line number and the column numbers from which input is being read. This is important in order to identify the line and column numbers where errors are detected. Any errors must be identified with the approximate line and column numbers in the input stream where the error was found. Line and column numbers must be associated with each token produced by your lexer. As noted above, the end-of-line character sequence is either a new line or a carriage return followed by a new line (in implementations that are compatible with Windows line termination characters). The work that you produce for Problem Sets 1 through 6 and for the Final Project should be cumulative and all source code should be in the src/compiler directory in your repository. Please do *not* create new directories for each of the Problem Sets. The executable for your developing compiler should be named "compiler" (in all PSes and the Final Project) and should accept the same command line interface as does the sample skeleton compiler in the class repository on GitHub, as follows: Usage: compiler [-s ] [-o ] [] ::= scanner | parser | symbol | type | ir | mips if the -s switch is not specified, defaults to mips if the -o switch is not specified, output goes to stdout if is not specified, input comes from stdin Each value for refers to a final processing stage that the compiler progresses through. "scanner" refers to PS1, "parser" refers to PS2, "symbol" refers to PS3, "type" refers to PS4, "ir" refers to PS5, and "mips" refers to PS6. You should output an error if a is selected that is not yet implemented. You are welcome to add stages for various levels of optimization that you add in the Final Term Project. Keep in mind that, in addition to correctness, your grade for this assignment will include points for style. You must include a makefile to build your application and a test program that invokes your function and displays the stream of tokens that are returned. You should include sample test program(s) and output that shows that you have tested your function with a wide variety of possible input. The output that you produce for this test program must be formatted to be easy to read. See the course syllabus for more information. As in Problem Set 0, for all problem sets including this one, remember to commit *all* of the code in your compiler to the src/compiler directory, push the branch, create a pull request, add a comment with the text: @frankelharvard @dwillens @massfords and accept the pull request. For this problem set, the branch should be named problem-set-1. Extra credit: Allow any number of significant characters in identifiers. Extra credit: Allow any number of characters in a string literal. Extra credit: Implement source line continuation. Extra credit: When creating your internal representation for string literals, build an array of characters that corresponds to the value of the string, rather than its appearance in the source program. To do so, you will have to interpret any escape characters within the string, converting them to their proper numeric value as you did for character constants. You will also need to represent the length of the string alongside the value, since you cannot rely on NUL-termination. Last revised 20-Sep-23