Theta Parser
The Theta parser is responsible for converting tokenized input from the lexer into an Abstract Syntax Tree (AST). This process ensures that valid syntactical constructs are represented as nodes in the AST, which can then be used by the compiler for code generation and other phases of the compilation pipeline.
This document outlines the architecture and flow of the parser, providing detailed insights into the major components and how they work together.
Table of Contents
Overview
The Theta parser works by taking a deque of tokens generated by the lexer and converting it into an Abstract Syntax Tree (AST). The AST represents the structural hierarchy of the code and its syntactical components, which are critical for the compiler’s later stages.
The parsing process involves matching specific token types, verifying syntax rules, and constructing different AST node types based on the tokens encountered.
Core Components
Token Matching
Token matching is a crucial part of the parsing process. The parser frequently checks whether the next token in the deque matches an expected type and lexeme before proceeding. If the token matches, it is consumed; otherwise, the parser attempts alternative parsing strategies or throws an error if no valid pattern is found.
The matching mechanism involves the following methods:
match(Token::Type type, std::string lexeme)
: Checks if the next token matches the expected type and lexeme.check(Token::Type type, std::string lexeme)
: Similar tomatch()
, but only checks the next token without consuming it.
Parsing Functionality
The parser is built around several parse functions that handle different syntactical constructs, such as expressions, assignments, control flow, structs, and more. Each function attempts to match the expected tokens for its corresponding construct and builds the appropriate AST nodes.
parseSource()
: Begins parsing the entire source code, processing anylink
statements, followed by the core body of the capsule.parseExpression()
: Handles general expressions and determines whether an assignment, binary operation, or another construct is being parsed.parseBlock()
: Processes a block of code (usually enclosed in braces), collecting statements and expressions into a sequence.
Error Handling
The parser provides robust error handling capabilities, generating meaningful exceptions when parsing fails due to syntax errors. These exceptions are reported back to the Theta::Compiler with detailed information such as the file, source line, and the token that caused the error.
Common exceptions include:
ParseError
: Thrown when the parser encounters an unexpected token or invalid syntax during parsing.CompilationError
: Used for broader compilation issues, such as missing linked files or invalid syntax constructs like missing braces.
Detailed Parsing Process
Source Parsing
The entry point for the parser is the parseSource()
function, which handles the top-level structure of the source file.
- Link Parsing: If the source file contains any
link
statements, they are processed first. The parser matches thelink
keyword and attempts to resolve the linked capsule by finding the appropriate file using thefilesByCapsule
map.- If the linked capsule is found, its AST is retrieved or built.
- If the capsule cannot be found, a
LinkageError
is raised.
- Capsule Parsing: Once links are handled, the parser processes the core capsule structure using the
parseCapsule()
function, which parses the main body of the source file.
Link Parsing
When a link
statement is encountered, the parser calls parseLink()
, which handles resolving external capsule dependencies and generating corresponding AST nodes.
- If the capsule has already been parsed, its AST is retrieved.
- If the capsule has not yet been parsed, the file is located using
filesByCapsule
, and a new AST is generated by the compiler’sbuildAST
function. - If the capsule file cannot be found, a
LinkageError
is raised, terminating the parse.
Control Flow
The parser handles control flow statements like if
, else
, and else if
using the parseControlFlow()
function.
- The function matches the
if
keyword, then parses the condition and the associated code block. - It handles
else if
clauses and optionalelse
blocks, building a sequence of condition-expression pairs. - The resulting AST represents the entire control flow construct, with each condition paired with its corresponding block of code.
Expression Parsing
The parser processes expressions of varying complexity, including assignments, binary operations, comparisons, and literals. These expressions are parsed through a sequence of functions that recursively break down the expression and construct the appropriate AST nodes.
- Binary Operations: The parser handles binary operations such as addition, subtraction, and comparison using functions like
parseTerm()
,parseFactor()
, andparseBooleanComparison()
. These functions work by matching operators and building binary operation nodes. - Unary Operations: Unary operations (such as negation) are handled by
parseUnary()
, which matches unary operators and builds the appropriate AST node.
AST Node Construction
For each syntactical construct, the parser creates corresponding AST nodes. The node types are derived from base ASTNode classes, with specialized subclasses for different constructs.
AssignmentNode
: Represents an assignment operation.BinaryOperationNode
: Represents binary operations like+
,-
,*
, and comparisons.FunctionInvocationNode
: Represents function calls and their arguments.ControlFlowNode
: Represents control flow structures likeif
andelse
.BlockNode
: Represents a block of statements or expressions.
Each AST node is linked to its parent node, ensuring that the tree structure correctly reflects the hierarchy of the parsed code.
Design Considerations
The Theta parser is designed to handle complex language constructs while maintaining flexibility and robustness. Key design considerations include:
- Separation of Concerns: Parsing logic is broken down into distinct functions for different constructs (e.g., expressions, control flow, links, etc.), simplifying code maintenance and scalability.
- Error Reporting: The parser generates detailed errors that include the token, source file, and line number, helping developers quickly identify and fix issues.
- Efficient AST Construction: By using a shared pointer system, the AST nodes can be efficiently shared between different parts of the parsing process, reducing memory overhead.
- Link Handling: The parser’s ability to resolve and include linked capsules during parsing allows Theta programs to scale across multiple source files.
Example Workflow
Parsing a Source File
Here’s an example of how the Theta parser processes a source file:
- Token Input: The lexer provides a deque of tokens representing the source code.
- Source Parsing: The parser begins with
parseSource()
, handling anylink
statements and parsing the core body of the capsule. - Expression Handling: As the parser encounters expressions, it recursively breaks them down using functions like
parseExpression()
,parseAssignment()
, andparseBooleanComparison()
. - AST Node Creation: For each syntactical construct, an appropriate AST node is created, linked to its parent, and added to the overall tree.
- Error Handling: If an unexpected token is encountered or a syntax error is detected (e.g., missing brace), a
ParseError
orCompilationError
is raised, providing detailed information about the issue. - Final AST: After processing all tokens, the final AST representing the source file is returned, ready for use by the compiler’s subsequent stages.