+++
title = "§9 Parsing S-Expressions and Special Forms"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++
## §9 Parsing S-Expressions and Special Forms — Stub to fill
File: `edu/src/lisp-compiler.md`, section `### 9. Parsing S-Expressions and Special Forms`
Replace the stub line with full content. Target 1000 to 1300 words. This is the hardest parsing section: it covers recursive parsers, special-form recognition, and the top-level `parse` entry point.
## Learning objectives
- Write a recursive parser in nom (handling the recursion challenge)
- Distinguish special forms from generic calls during parsing and produce typed AST variants
- Parse `define`, `lambda`, `if`, `let`, `begin` into the correct `Expr` variants
- Implement the top-level `parse` function
- Understand when to use `cut` to commit to a parse branch
## Content to write
### The Recursion Problem in nom
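nom parsers must have known types at compile time, but a parser for S-expressions is recursive: an expression is either an atom or a list of expressions. If you build such a parser as a combinator value (a closure, or a function returning `impl Fn`), its type would have to contain itself, and the compiler rejects it with a recursive or infinitely sized type error.
Solution: write each recursive parser as a named function with an explicit signature rather than as a closure. Rust resolves function items regardless of declaration order, so no forward declaration is needed, and the recursion happens when the function is called instead of inside its type.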
```rust
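// Imports for this section's parsers (assuming nom 7-style module paths).
use nom::{branch::alt, bytes::complete::{is_not, tag}, character::complete::{char, multispace1}};
use nom::{combinator::{cut, opt, value}, multi::{many0, many1}, sequence::{delimited, pair}, IResult};

pub fn parse_expr(input: &str) -> IResult<&str, Expr> {
    ws(alt((
        parse_list,
        parse_atom,
    )))(input)
}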
```
`parse_list` calls `parse_expr` recursively. Because `parse_expr` is a named function (not a closure), it has the fixed signature `fn(&str) -> IResult<&str, Expr>`, so referring to it from `parse_list` adds nothing to the shape of any parser type and the recursion compiles.
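To see the named-function pattern in isolation, here is a minimal sketch that uses only the standard library, not nom. The `Sexp` type and the `sexp`, `list`, and `atom` functions are hypothetical stand-ins for `Expr`, `parse_expr`, `parse_list`, and `parse_atom`; the point is only that mutually recursive named functions type-check because each one has a fixed signature.

```rust
/// Hypothetical stand-in for the real `Expr` type.
#[derive(Debug, PartialEq)]
enum Sexp {
    Atom(String),
    List(Vec<Sexp>),
}

/// Parse one expression; returns the remaining input and the parsed value.
fn sexp(input: &str) -> Option<(&str, Sexp)> {
    let input = input.trim_start();
    // Try a list first, then fall back to an atom (mirrors `alt`).
    list(input).or_else(|| atom(input))
}

/// Parse `( expr* )`. Calls `sexp`, which in turn may call `list` again.
fn list(input: &str) -> Option<(&str, Sexp)> {
    let mut rest = input.strip_prefix('(')?;
    let mut items = Vec::new();
    loop {
        rest = rest.trim_start();
        if let Some(after) = rest.strip_prefix(')') {
            return Some((after, Sexp::List(items)));
        }
        // The recursion happens here, at call time, not in any type.
        let (next, item) = sexp(rest)?;
        items.push(item);
        rest = next;
    }
}

/// Parse a run of characters up to whitespace or a parenthesis.
fn atom(input: &str) -> Option<(&str, Sexp)> {
    let end = input
        .find(|c: char| c.is_whitespace() || c == '(' || c == ')')
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((&input[end..], Sexp::Atom(input[..end].to_string())))
}
```

Calling `sexp("(a (b c))")` walks `sexp`, `list`, `sexp`, `list` and so on to any depth, and an unclosed list such as `(a (b` yields `None`.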
### Parsing Generic Lists → Calls
A generic list `(func arg1 arg2 ...)` is parsed into `Expr::Call`:
```rust
fn parse_call(input: &str) -> IResult<&str, Expr> {
    let (input, exprs) = delimited(
        ws(char('(')),
        many1(ws(parse_expr)),
        ws(char(')')),
    )(input)?;
    let mut iter = exprs.into_iter();
    let func = iter.next().unwrap(); // safe: many1 guarantees >= 1
    let args = iter.collect();
    Ok((input, Expr::Call { func: Box::new(func), args }))
}
```
### Recognizing Special Forms
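Special forms are lists that begin with a specific keyword. Recognize them *inside* the list parser by checking the first token after the opening parenthesis. The cleanest approach is to try each special-form parser in an `alt` before falling back to `parse_call`. Order matters: `parse_call` accepts any non-empty list, so it would swallow every special form if it came first.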
```rust
fn parse_list(input: &str) -> IResult<&str, Expr> {
    alt((
        parse_define,
        parse_lambda,
        parse_if,
        parse_let,
        parse_begin,
        parse_call,
    ))(input)
}
```
### Parsing `define`
Two shapes: `(define name expr)` and `(define (name params...) body...)`. Parse both; the second desugars into a `Define` wrapping a `Lambda`, so `(define (add a b) (+ a b))` produces the same AST as `(define add (lambda (a b) (+ a b)))`.
```rust
fn parse_define(input: &str) -> IResult<&str, Expr> {
    let (input, _) = ws(char('('))(input)?;
    let (input, _) = ws(tag("define"))(input)?;
    // Use cut here: we've seen "(define", so commit to this branch
    cut(|input| {
        alt((
            // Function shorthand: (define (name params...) body...)
            |input| {
                let (input, _) = ws(char('('))(input)?;
                let (input, name) = ws(parse_symbol_str)(input)?;
                let (input, params) = many0(ws(parse_symbol_str))(input)?;
                let (input, _) = ws(char(')'))(input)?;
                let (input, body) = many1(ws(parse_expr))(input)?;
                let (input, _) = ws(char(')'))(input)?;
                // Assumes `Expr::Lambda.params` is `Vec<String>`: own the parsed symbols.
                let params = params.into_iter().map(String::from).collect();
                let lambda = Expr::Lambda { params, body };
                Ok((input, Expr::Define { name: name.to_string(), value: Box::new(lambda) }))
            },
            // Variable binding: (define name expr)
            |input| {
                let (input, name) = ws(parse_symbol_str)(input)?;
                let (input, value) = ws(parse_expr)(input)?;
                let (input, _) = ws(char(')'))(input)?;
                Ok((input, Expr::Define { name: name.to_string(), value: Box::new(value) }))
            },
        ))(input)
    })(input)
}
```
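Explain `cut`: after matching `(define`, we are committed to this branch. If the rest of the form is malformed, `cut` converts recoverable errors (`Err::Error`) into failures (`Err::Failure`), which stops `alt` from backtracking to `parse_call` and reports the error at the point where the `define` form actually went wrong. One caveat: `tag("define")` also matches the first six characters of a longer symbol such as `define-record`, so check that a delimiter follows the keyword before committing.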
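The effect of `cut` on backtracking can be modelled without nom. The sketch below is a simplified, standard-library-only model, not nom's implementation: `ParseErr`, `PResult`, `Branch`, and the two-branch `alt` are hypothetical names chosen to mirror nom's `Err::Error` / `Err::Failure` distinction.

```rust
/// Simplified model of nom's two error levels (names are illustrative).
#[derive(Debug, PartialEq)]
enum ParseErr {
    /// Recoverable: `alt` may try the next branch.
    Error(String),
    /// Unrecoverable: `alt` must stop and report it.
    Failure(String),
}

/// Remaining input plus a label naming the branch that matched.
type PResult<'a> = Result<(&'a str, &'static str), ParseErr>;
type Branch = for<'a> fn(&'a str) -> PResult<'a>;

/// Two-branch `alt`: only a recoverable error lets the second branch run.
fn alt<'a>(input: &'a str, first: Branch, second: Branch) -> PResult<'a> {
    match first(input) {
        Err(ParseErr::Error(_)) => second(input),
        // Success and Failure both end the search immediately.
        other => other,
    }
}

/// Upgrade a recoverable error to a failure, as `cut` does.
fn cut<'a>(result: PResult<'a>) -> PResult<'a> {
    result.map_err(|e| match e {
        ParseErr::Error(msg) => ParseErr::Failure(msg),
        failure => failure,
    })
}

fn define_form(input: &str) -> PResult<'_> {
    // Before the keyword is seen, a mismatch stays recoverable.
    let rest = input
        .strip_prefix("(define ")
        .ok_or_else(|| ParseErr::Error("not a define form".to_string()))?;
    // Past the keyword: any problem from here on is a hard failure.
    cut(match rest.strip_suffix(')') {
        Some(body) if !body.trim().is_empty() => Ok(("", "define")),
        _ => Err(ParseErr::Error("malformed define body".to_string())),
    })
}

fn call_form(input: &str) -> PResult<'_> {
    match input.strip_prefix('(') {
        Some(_) => Ok(("", "call")),
        None => Err(ParseErr::Error("expected '('".to_string())),
    }
}

fn classify(input: &str) -> PResult<'_> {
    alt(input, define_form, call_form)
}
```

In this model `classify("(+ 1 2)")` falls through to the call branch, while `classify("(define )")` stops with `Failure("malformed define body")`. Without the `cut` call, the malformed `define` would be silently accepted as a generic call.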
### Parsing `lambda`, `if`, `let`, `begin`
Show each parser in similar style. Key details:
**`lambda`**: `(lambda (params...) body...)` — use `many0` for params (zero-parameter functions are valid), `many1` for body.
**`if`**: `(if cond then else)` — exactly three sub-expressions; the third (`else`) is required in MiniLisp.
**`let`**: `(let ((name expr)...) body...)` — parse a list of `(name expr)` pairs, collect into `Vec<(String, Expr)>`.
**`begin`**: `(begin expr...)` — one or more expressions.
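The shape rules above can be restated as a small arity table. This is only a sketch for checking the rules, not part of the nom parsers: `check_arity` is a hypothetical helper, and the one-or-more body rule for `let` is an assumption carried over from `lambda`.

```rust
/// Check how many sub-expressions follow the keyword of a special form.
/// `arg_count` excludes the keyword itself. Hypothetical helper.
fn check_arity(keyword: &str, arg_count: usize) -> Result<(), String> {
    let ok = match keyword {
        // (lambda (params...) body...): parameter list plus >= 1 body expression
        "lambda" => arg_count >= 2,
        // (if cond then else): exactly three, `else` is mandatory
        "if" => arg_count == 3,
        // (let ((name expr)...) body...): binding list plus >= 1 body expression
        // (assumption: `let` bodies follow the same rule as `lambda` bodies)
        "let" => arg_count >= 2,
        // (begin expr...): one or more expressions
        "begin" => arg_count >= 1,
        // Anything else is a generic call with no fixed arity.
        _ => true,
    };
    if ok {
        Ok(())
    } else {
        Err(format!(
            "wrong number of sub-expressions for `{}`: {}",
            keyword, arg_count
        ))
    }
}
```

For example, `check_arity("if", 2)` returns an error because the `else` branch is missing, while `check_arity("lambda", 2)` accepts a parameter list followed by a single body expression.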
### Comments in the expression parser
Comments must be silently consumed wherever whitespace is allowed. Update `ws` (or create a separate `skip` combinator) to skip both whitespace and comments:
```rust
fn skip(input: &str) -> IResult<&str, ()> {
    value((), many0(alt((
        value((), multispace1),
        value((), pair(char(';'), opt(is_not("\n\r")))),
    ))))(input)
}
```
Then use `skip` in place of `multispace0` in the `ws` wrapper.
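To make it concrete what `skip` is expected to consume, here is a standard-library-only sketch of the same behaviour. `skip_trivia` is a hypothetical name, and unlike nom's `multispace1` (spaces, tabs, and line breaks) it uses `trim_start`, which strips any Unicode whitespace.

```rust
/// Return `input` with leading whitespace and `;` line comments removed.
fn skip_trivia(mut input: &str) -> &str {
    loop {
        // Note: `trim_start` strips all Unicode whitespace.
        let trimmed = input.trim_start();
        match trimmed.strip_prefix(';') {
            Some(comment) => {
                // Drop everything up to (not including) the line break;
                // the next iteration's `trim_start` removes the break itself.
                let end = comment
                    .find(|c: char| c == '\n' || c == '\r')
                    .unwrap_or(comment.len());
                input = &comment[end..];
            }
            // No comment at the front: nothing more to skip.
            None => return trimmed,
        }
    }
}
```

Only leading trivia is removed: `skip_trivia("x ; trailing")` returns its input unchanged, which is why the `ws` wrapper needs to apply the skip on both sides of the inner parser.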
### The top-level `parse` function
```rust
/// Parse a complete MiniLisp program (zero or more top-level expressions).
pub fn parse(source: &str) -> Result<Vec<Expr>, crate::error::CompileError> {
    let (remaining, exprs) = many0(ws(parse_expr))(source)
        .map_err(|e| crate::error::CompileError::ParseError(e.to_string()))?;
    if !remaining.trim().is_empty() {
        // Truncate by characters, not bytes: slicing at a fixed byte index
        // panics if it lands inside a multi-byte UTF-8 character.
        let preview: String = remaining.chars().take(20).collect();
        return Err(crate::error::CompileError::ParseError(
            format!("unexpected input: {:?}", preview)
        ));
    }
    Ok(exprs)
}
```
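When the unconsumed input is truncated for an error message, the cut has to fall on a UTF-8 character boundary, because slicing a `&str` at a byte index inside a multi-byte character panics. A minimal sketch of a boundary-safe check follows; `preview` and `check_fully_consumed` are hypothetical helpers, and the error type is simplified to `String` in place of `CompileError`.

```rust
/// Return at most `max_chars` characters from the start of `text`.
fn preview(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // Byte offset of the first character past the limit: a valid boundary.
        Some((byte_idx, _)) => &text[..byte_idx],
        // The text has `max_chars` characters or fewer: keep all of it.
        None => text,
    }
}

/// Report leftover input after parsing, if any.
fn check_fully_consumed(remaining: &str) -> Result<(), String> {
    let rest = remaining.trim();
    if rest.is_empty() {
        Ok(())
    } else {
        Err(format!("unexpected input: {:?}", preview(rest, 20)))
    }
}
```

A stray closing parenthesis left after the last expression is then reported as `unexpected input: ")"` instead of being silently ignored.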
### Unit tests
```rust
#[test]
fn test_parse_if() {
    let src = "(if #t 1 2)";
    let result = parse(src).unwrap();
    assert_eq!(result.len(), 1);
    assert!(matches!(result[0], Expr::If { .. }));
}

#[test]
fn test_parse_define_fn() {
    let src = "(define (add a b) (+ a b))";
    let result = parse(src).unwrap();
    assert!(matches!(&result[0], Expr::Define { name, .. } if name == "add"));
}

#[test]
fn test_nested_calls() {
    let src = "(display (* 2 (+ 3 4)))";
    assert!(parse(src).is_ok());
}

#[test]
fn test_comments_skipped() {
    let src = "; this is a comment\n(define x 42)";
    assert!(parse(src).is_ok());
}
```
## Style notes
- The recursion problem is the hardest conceptual moment — explain it thoroughly before showing the solution
- `cut` is essential for good error messages; explain why each use of `cut` is there
- The top-level `parse` function must check for unconsumed input — show why (trailing garbage would otherwise be silently ignored)
- End with a checkpoint: parse the complete factorial example and print the AST using the `Display` impl from §7