Theta Parser

The Theta parser is responsible for converting tokenized input from the lexer into an Abstract Syntax Tree (AST). This process ensures that valid syntactical constructs are represented as nodes in the AST, which can then be used by the compiler for code generation and other phases of the compilation pipeline.

This document outlines the architecture and flow of the parser, providing detailed insights into the major components and how they work together.

Overview
Core Components
Detailed Parsing Process
Design Considerations
Example Workflow

Overview

The Theta parser works by taking a deque of tokens generated by the lexer and converting it into an Abstract Syntax Tree (AST). The AST represents the structural hierarchy of the code and its syntactical components, which are critical for the compiler’s later stages.

The parsing process involves matching specific token types, verifying syntax rules, and constructing different AST node types based on the tokens encountered.

Core Components

Token Matching

Token matching is a crucial part of the parsing process. The parser frequently checks whether the next token in the deque matches an expected type and lexeme before proceeding. If the token matches, it is consumed; otherwise, the parser attempts alternative parsing strategies or throws an error if no valid pattern is found.

The matching mechanism involves the following methods:

match(Token::Type type, std::string lexeme): Checks if the next token matches the expected type and lexeme.
check(Token::Type type, std::string lexeme): Similar to match(), but only checks the next token without consuming it.

Parsing Functionality

The parser is built around several parse functions that handle different syntactical constructs, such as expressions, assignments, control flow, structs, and more. Each function attempts to match the expected tokens for its corresponding construct and builds the appropriate AST nodes.

parseSource(): Begins parsing the entire source code, processing any link statements, followed by the core body of the capsule.
parseExpression(): Handles general expressions and determines whether an assignment, binary operation, or another construct is being parsed.
parseBlock(): Processes a block of code (usually enclosed in braces), collecting statements and expressions into a sequence.

Error Handling

The parser provides robust error handling capabilities, generating meaningful exceptions when parsing fails due to syntax errors. These exceptions are reported back to the Theta::Compiler with detailed information such as the file, source line, and the token that caused the error.

Common exceptions include:

ParseError: Thrown when the parser encounters an unexpected token or invalid syntax during parsing.
CompilationError: Used for broader compilation issues, such as missing linked files or invalid syntax constructs like missing braces.

Detailed Parsing Process

Source Parsing

The entry point for the parser is the parseSource() function, which handles the top-level structure of the source file.

Link Parsing: If the source file contains any link statements, they are processed first. The parser matches the link keyword and attempts to resolve the linked capsule by finding the appropriate file using the filesByCapsule map.
- If the linked capsule is found, its AST is retrieved or built.
- If the capsule cannot be found, a LinkageError is raised.
Capsule Parsing: Once links are handled, the parser processes the core capsule structure using the parseCapsule() function, which parses the main body of the source file.

Link Parsing

When a link statement is encountered, the parser calls parseLink(), which handles resolving external capsule dependencies and generating corresponding AST nodes.

If the capsule has already been parsed, its AST is retrieved.
If the capsule has not yet been parsed, the file is located using filesByCapsule, and a new AST is generated by the compiler’s buildAST function.
If the capsule file cannot be found, a LinkageError is raised, terminating the parse.

Control Flow

The parser handles control flow statements like if, else, and else if using the parseControlFlow() function.

The function matches the if keyword, then parses the condition and the associated code block.
It handles else if clauses and optional else blocks, building a sequence of condition-expression pairs.
The resulting AST represents the entire control flow construct, with each condition paired with its corresponding block of code.

Expression Parsing

The parser processes expressions of varying complexity, including assignments, binary operations, comparisons, and literals. These expressions are parsed through a sequence of functions that recursively break down the expression and construct the appropriate AST nodes.

Binary Operations: The parser handles binary operations such as addition, subtraction, and comparison using functions like parseTerm(), parseFactor(), and parseBooleanComparison(). These functions work by matching operators and building binary operation nodes.
Unary Operations: Unary operations (such as negation) are handled by parseUnary(), which matches unary operators and builds the appropriate AST node.

AST Node Construction

For each syntactical construct, the parser creates corresponding AST nodes. The node types are derived from base ASTNode classes, with specialized subclasses for different constructs.

AssignmentNode: Represents an assignment operation.
BinaryOperationNode: Represents binary operations like +, -, *, and comparisons.
FunctionInvocationNode: Represents function calls and their arguments.
ControlFlowNode: Represents control flow structures like if and else.
BlockNode: Represents a block of statements or expressions.

Each AST node is linked to its parent node, ensuring that the tree structure correctly reflects the hierarchy of the parsed code.

Design Considerations

The Theta parser is designed to handle complex language constructs while maintaining flexibility and robustness. Key design considerations include:

Separation of Concerns: Parsing logic is broken down into distinct functions for different constructs (e.g., expressions, control flow, links, etc.), simplifying code maintenance and scalability.
Error Reporting: The parser generates detailed errors that include the token, source file, and line number, helping developers quickly identify and fix issues.
Efficient AST Construction: By using a shared pointer system, the AST nodes can be efficiently shared between different parts of the parsing process, reducing memory overhead.
Link Handling: The parser’s ability to resolve and include linked capsules during parsing allows Theta programs to scale across multiple source files.

Example Workflow

Parsing a Source File

Here’s an example of how the Theta parser processes a source file:

Token Input: The lexer provides a deque of tokens representing the source code.
Source Parsing: The parser begins with parseSource(), handling any link statements and parsing the core body of the capsule.
Expression Handling: As the parser encounters expressions, it recursively breaks them down using functions like parseExpression(), parseAssignment(), and parseBooleanComparison().
AST Node Creation: For each syntactical construct, an appropriate AST node is created, linked to its parent, and added to the overall tree.
Error Handling: If an unexpected token is encountered or a syntax error is detected (e.g., missing brace), a ParseError or CompilationError is raised, providing detailed information about the issue.
Final AST: After processing all tokens, the final AST representing the source file is returned, ready for use by the compiler’s subsequent stages.