Problem Set 2 Harvard Extension School CSCI E-95: Compiler Design and Implementation Spring 2025 Due: February 23, 2025 at Midnight ET (Eastern Time) 1. (300 Points) Using your lexer from Problem Set 1, write a simple parser for the grammar for our subset of the C Programming Language that is available on the class web site. Your parser must be written using a YACC or Bison parser generator. As your compiler reads the input program being compiled, it should generate an abstract syntax tree (AST) in memory which represents that program. The AST should abstract away all syntactic structures. Other than outputting error messages for ill-formed input, your program should build an internal tree data structure which represents the parsed structure of the input file. For this assignment, there is no need to build a symbol table for any identifiers declared in the input program. Instead, tokens which represent identifier strings should simply appear at the appropriate points in the parse tree. The AST should be a simplified version of the parse tree that corresponds to the grammar. As discussed in class, the AST should not contain unnecessary nodes. For example, do not create a node simply to pass through another node as would happen with a parenthesized expression or for many productions where only a single non-terminal would result (such as in top_level_decl where it would be either a decl or a function_definition, or as in unary_expr where it would be either a postfix_expr or a unary_minus_expr or a unary_plus_expr or ...). Do not create a node for a terminal where the terminal is always some single symbol (such as in unary_minus_expr where the minus_t is followed by a cast_expr; in this case the minus_t always appears before the cast_expr and, therefore, the minus_t does not need to be in the AST). Do not create a different node type for each of the binary operators; instead, use a single binary operator node and include an enumerated type that indicates which binary operator is specified for that node. This simplifies later steps in your compiler by having fewer node types with which to deal. Similarly, do not create a different node type for each of the unary prefix operators. And, a node type should be shared for the unary postfix operators ++ and --. A new node type should be created for the function call operator because the printed representation for that operator is very different from postfix ++ and --. No node needs to be created for the subscripting operator because, as described below, it will be converted into addition and dereference operators and, therefore, will not exist in the AST. The AST will already represent precedence and associativity in the way the tree is created. There may be examples of nodes in the grammar that *do* need to be maintained in the AST even though they have a single child. For example, the compound_statement production has a single child of declaration_or_statement_list, but includes a left_brace and right_brace terminal symbol surrounding the declaration_or_statement_list. This node in the AST will serve as a marker for the pretty printer (to be described later in this problem set) so that the left_brace and right_brace are emitted. Because the subscript operator in C is syntactic sugar for adding an offset and then dereferencing, any appearance of the subscript operator should be replaced by addition followed by indirection/dereference. To be precise, A[i] should be stored in the AST as if it were written as *(A+i). In our language, functions cannot be declared as arguments nor passed as arguments to a function. Identifiers may be declared at: file scope (top-level identifiers) procedure scope (formal parameters in function definitions) prototype scope (formal parameters in function prototype declarations) block scope (local identifiers) function scope (statement labels only) (We are not supporting preprocessor macros) Functions may be declared only at file scope. Identifiers may identify: a variable a function a parameter to a function The allowable types are: char signed char unsigned char short signed short unsigned short short int signed short int unsigned short int int signed int unsigned int signed unsigned long signed long unsigned long long int signed long int unsigned long int In all cases, a type without a "signed" or "unsigned" modifier will be considered to be signed. Where several different names are allowed to be specified for the same type (e.g., "short", "signed short", "short int", and "signed short int" all refer to the same type), they should all be represented by a single canonical type name, in this case, "signed short int". Here are the required type renamings: char signed char both of the above renamed to: signed char short signed short short int signed short int all the above renamed to: signed short int unsigned short unsigned short int both of the above renamed to: unsigned short int int signed int signed all the above renamed to: signed int unsigned int unsigned both of the above renamed to: unsigned int long signed long long int signed long int all the above renamed to: signed long int unsigned long unsigned long int both of the above renamed to: unsigned long int In our language, the reserved word "void" may be used only as a substitute for the return type of a function -- thus indicating that the function does not return a value -- or as a substitute for the parameters of a function -- thus indicating that the function takes no parameters. When "void" is specified as a substitute for the parameters of a function, no parameters may be specified (i.e., void f(void, int i) is not legal). However, we don't expect you will check for this restriction until PS3 and we will not be enforcing this restriction until PS3. Your implementation must require a return type or "void" to be specified for each function (in both declarations and definitions). A parameter list must be specified for each function (of course, it is possible to specify "void"). In other words, we are not allowing the nonprototype, or traditional, form for function parameters. To simplify your implementation, we are also not allowing pointers to void. In addition to the base types defined above, your parser should accept pointers (any depth of pointers to pointers) and arrays (any number of dimensions). You do not need to implement pointers to functions (usually referred to as function pointers). You should maintain the same structure found in your input program regarding more than one initialized_declarator in an initialized_declarator_list. For example, int a, b, c; should be parsed in the AST as three initialized_declarator's in a single initialized_declarator_list. Whereas, int a; int b; int c; should be parsed in the AST as three separate decl's. Our grammar does not allow initializers in any object definitions. Your program should be written in C. It should read a program written in our C-subset language from the specified file (using the same command line interface as does the skeleton code in the class repository) and emit error messages to standard error. After reaching the end of the input, your program should output to standard out a pretty print of the parse tree you created from the input file. Your pretty print output should be a complete C program with identical semantics to the program input to your compiler. That pretty print output should be able to be fed back into your compiler with no errors (that is, the action of your compiler should be idempotent). And, when you do so, the pretty print output from feeding the first pretty print output into your compiler should be identical. In addition, if the original input to your compiler compiled without any errors or warnings under gcc with the -pedantic switch then you should verify that the pretty print output from your compiler compiles without errors or warnings under gcc with the -pedantic switch specified. In the pretty print output, parentheses should be emitted around each sub-expression. For example, the expression "a = b+c/d++*e" should be pretty printed as "(a=(b+((c/(d++))*e)))". In addition, in the pretty print output, parentheses should be emitted around each component in a declarator. For example, the declaration "char fcn(unsigned *a[10+2])" should be pretty printed as "signed char (fcn(unsigned int (*(a[(10+2)]))))". Because there is no type difference between a character literal and an integer literal of the same value, you should output either one as a decimal integer literal. This is a good time to start to develop a reasonable methodology to recover from errors in the compiler's input stream. It is not sufficent to abort the compilation process when an error is detected. You should be able to recover from errors and continue to process as much of the remainder of the input stream as possible. This is one area in the compilation process that will distinguish a better compiler from one that does not recover from errors as well. You should use "git" to submit completed work. All of the code in your compiler should be in the src/compiler directory. And, remember to add a comment with the text: @frankelharvard @massfords @stbenjam The branch for this problem set should be named problem-set-2. Last revised 18-Feb-25