---
# edu-16fy
title: '§6 Recognizing Atoms: Integers, Booleans, Strings, Symbols'
status: completed
type: task
priority: normal
created_at: 2026-03-10T23:30:01Z
updated_at: 2026-03-10T23:30:01Z
---
## §6 Recognizing Atoms: Integers, Booleans, Strings, Symbols — Stub to fill
File: `edu/src/lisp-compiler.md`, section `### 6. Recognizing Atoms: Integers, Booleans, Strings, Symbols`
Replace the stub line with full content. Target 800-1100 words. This is a hands-on section that builds one atom parser at a time. Each parser is developed in isolation before being combined in §8.
## Learning objectives
- Write a nom parser for each MiniLisp atom type
- Use `map_res`, `recognize`, `opt`, `alt`, `tag`, `char`, `take_while1`, `is_not`, `escaped_transform`
- Understand how to test parsers with `assert_eq!` on the full `IResult`
- Know the tricky cases: negative integers vs symbol `-`, `#t`/`#f` ambiguity, string escapes
## Content to write
Work through each atom parser in a subsection with: explanation, full code, tricky cases, and a test block.
### Integer parser
A signed decimal integer: optional `-`, then one or more digits, converted to `i64`.
```rust
use nom::{IResult, combinator::{map_res, recognize, opt}, character::complete::{char, digit1}, sequence::pair};

pub fn parse_integer(input: &str) -> IResult<&str, i64> {
    map_res(
        // `recognize` returns the matched slice (e.g. "-7") instead of the pair.
        recognize(pair(opt(char('-')), digit1)),
        |s: &str| s.parse::<i64>(),
    )(input)
}
```
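Tricky case: the symbol `-` and negative integers. `opt(char('-'))` accepts a lone `-`, but `digit1` then requires at least one digit, so `parse_integer("-")` fails at `digit1` and never reaches `map_res`. (`map_res` fails only when the recognized text does not fit in an `i64`, for example on overflow.) Either failure is a recoverable `Err::Error`, so `alt` in the atom parser falls through to the symbol parser. Ordering matters in the other direction: `-` is a valid symbol start and digits are valid symbol continuations, so the symbol parser would accept `-7` as a symbol. The integer parser must therefore be tried *before* the symbol parser in the `alt`.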
Tests:
```rust
assert_eq!(parse_integer("42 rest"), Ok((" rest", 42)));
assert_eq!(parse_integer("-7"), Ok(("", -7)));
assert!(parse_integer("abc").is_err());
```
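The two failure points of the integer parser can be seen without nom. The following is a std-only sketch that models `recognize(pair(opt(char('-')), digit1))` followed by the `map_res` conversion; the function names `recognize_integer` and `model_parse_integer` are hypothetical and exist only for illustration.

```rust
/// Std-only model of `recognize(pair(opt(char('-')), digit1))`.
/// Returns `(rest, matched)` in the same order nom uses.
fn recognize_integer(input: &str) -> Option<(&str, &str)> {
    // Optional leading minus sign.
    let unsigned = input.strip_prefix('-').unwrap_or(input);
    // One or more ASCII digits, like `digit1`.
    let digit_count = unsigned.bytes().take_while(u8::is_ascii_digit).count();
    if digit_count == 0 {
        return None; // a lone "-" stops here, before any conversion
    }
    let end = (input.len() - unsigned.len()) + digit_count;
    Some((&input[end..], &input[..end]))
}

/// Model of the `map_res` step: the conversion can still fail on overflow.
fn model_parse_integer(input: &str) -> Option<(&str, i64)> {
    let (rest, text) = recognize_integer(input)?;
    text.parse::<i64>().ok().map(|n| (rest, n))
}
```

A lone `-` is rejected by the recognizer, while a digit string one past `i64::MAX` is recognized but rejected by the conversion.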
### Boolean parser
```rust
use nom::{IResult, branch::alt, bytes::complete::tag, combinator::value};

pub fn parse_bool(input: &str) -> IResult<&str, bool> {
    alt((
        value(true, tag("#t")),
        value(false, tag("#f")),
    ))(input)
}
```
Explain `value(output, parser)` — discards the parser's output and returns a fixed value instead. This avoids a `map` that ignores its argument.
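Tricky case: `#` must not be a valid symbol character, otherwise `#t` and `#f` would also parse as symbols and the atom `alt` would be ambiguous. Confirm that `#` is not in the symbol character set (per §2). Note also that `tag` matches only a prefix, so `parse_bool("#true")` succeeds and leaves `rue` as the remaining input.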
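The boolean subsection has no test block yet. As a starting point, here is a std-only model of the prefix matching that `tag("#t")` and `tag("#f")` perform. The name `model_parse_bool` is hypothetical and this is not nom code; it only illustrates which inputs match and what is left over.

```rust
/// Std-only model of `alt((value(true, tag("#t")), value(false, tag("#f"))))`.
/// Returns `(rest, value)` in the same order nom uses.
fn model_parse_bool(input: &str) -> Option<(&str, bool)> {
    if let Some(rest) = input.strip_prefix("#t") {
        Some((rest, true))
    } else if let Some(rest) = input.strip_prefix("#f") {
        Some((rest, false))
    } else {
        None
    }
}
```

Because only a prefix is matched, an input such as `#true` is accepted with `rue` left over; whether that should be an error is a decision for the caller that combines the atom parsers.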
### Symbol parser
Symbols start with a `sym_start` character and continue with zero or more `sym_cont` characters. Use `recognize` to return the input slice.
```rust
use nom::{
    IResult,
    bytes::complete::{take_while, take_while_m_n},
    combinator::recognize,
    sequence::pair,
};

// Characters allowed at the start of a symbol (no digits).
fn is_sym_start(c: char) -> bool {
    c.is_alphabetic() || "-_?!+*/=<>".contains(c)
}

// Characters allowed after the first one (digits included).
fn is_sym_cont(c: char) -> bool {
    c.is_alphanumeric() || "-_?!+*/=<>".contains(c)
}

pub fn parse_symbol(input: &str) -> IResult<&str, &str> {
    recognize(pair(
        take_while_m_n(1, 1, is_sym_start),
        take_while(is_sym_cont),
    ))(input)
}
```
Tricky case: `+`, `*`, `/`, `=`, `<`, `>` are valid single-character symbols (used as operator names). The parser must handle them.
Tests:
```rust
assert_eq!(parse_symbol("my-var rest"), Ok((" rest", "my-var")));
assert_eq!(parse_symbol("+"), Ok(("", "+")));
assert!(parse_symbol("42").is_err());
```
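The interaction between the symbol character set and negative integers can be checked with the predicates alone. The sketch below is a std-only scanner that applies the same two predicates by hand; `scan_symbol` is a hypothetical name, not part of nom or of the section's code.

```rust
fn is_sym_start(c: char) -> bool {
    c.is_alphabetic() || "-_?!+*/=<>".contains(c)
}

fn is_sym_cont(c: char) -> bool {
    c.is_alphanumeric() || "-_?!+*/=<>".contains(c)
}

/// Std-only model of the symbol parser: one start character, then zero or
/// more continuation characters. Returns `(rest, symbol)`.
fn scan_symbol(input: &str) -> Option<(&str, &str)> {
    let mut chars = input.char_indices();
    let (_, first) = chars.next()?;
    if !is_sym_start(first) {
        return None;
    }
    // Byte offset of the first character that cannot continue the symbol.
    let end = chars
        .find(|&(_, c)| !is_sym_cont(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    Some((&input[end..], &input[..end]))
}
```

With these predicates `-7` scans as a symbol, which is why the integer parser has to run first in the `alt`.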
### String parser
Double-quoted strings with escape sequences `\"`, `\\`, `\n`, `\t`.
```rust
use nom::{
    IResult,
    branch::alt,
    bytes::complete::{escaped_transform, is_not},
    character::complete::char,
    combinator::{map, opt},
    sequence::delimited,
};

pub fn parse_string(input: &str) -> IResult<&str, String> {
    delimited(
        char('"'),
        // `opt` covers the empty string `""`: `is_not` needs at least one
        // character, so `escaped_transform` alone cannot match an empty body.
        map(
            opt(escaped_transform(
                is_not("\\\""),
                '\\',
                alt((
                    map(char('"'), |_| "\""),
                    map(char('\\'), |_| "\\"),
                    map(char('n'), |_| "\n"),
                    map(char('t'), |_| "\t"),
                )),
            )),
            |body: Option<String>| body.unwrap_or_default(),
        ),
        char('"'),
    )(input)
}
```
Note: `escaped_transform` returns `String` (owned), not `&str`, because it must allocate when escape sequences are expanded.
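Tricky case: the empty string `""`. `is_not` requires at least one character, so the body parser cannot match an empty body on its own; the string parser has to treat the body as optional (for example with `opt`) and default to an empty `String`. Test it explicitly.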
Tests:
```rust
assert_eq!(parse_string(r#""hello""#), Ok(("", "hello".to_string())));
assert_eq!(parse_string(r#""a\nb""#), Ok(("", "a\nb".to_string())));
assert_eq!(parse_string(r#""""#), Ok(("", "".to_string())));
```
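To make the escape table concrete, here is a std-only model of the transformation applied to the string body (the text between the quotes). The name `unescape` is hypothetical and this is not nom code; it mirrors only the four escapes listed above and rejects any other escape.

```rust
/// Std-only model of the escape handling for a string body.
/// Returns `None` for an unknown escape or a trailing backslash.
fn unescape(body: &str) -> Option<String> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // The character after the backslash selects the replacement.
        match chars.next()? {
            '"' => out.push('"'),
            '\\' => out.push('\\'),
            'n' => out.push('\n'),
            't' => out.push('\t'),
            _ => return None,
        }
    }
    Some(out)
}
```

The model allocates a new `String` for the same reason the real parser does: the output can differ from the input slice.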
### Comment parser
Comments are consumed and discarded — they produce no AST node.
```rust
use nom::{
    IResult,
    bytes::complete::is_not,
    character::complete::char,
    combinator::{opt, value},
    sequence::pair,
};

pub fn parse_comment(input: &str) -> IResult<&str, ()> {
    // Consumes `;` and the rest of the line; the line ending itself stays in the input.
    value((), pair(char(';'), opt(is_not("\n\r"))))(input)
}
```
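The comment subsection has no test block. One behavior worth pinning down is that the parser stops *before* the line ending, so whatever skips whitespace afterwards must consume it. The std-only sketch below models that behavior; `skip_comment` is a hypothetical name and not nom code.

```rust
/// Std-only model of `pair(char(';'), opt(is_not("\n\r")))`.
/// Returns the remaining input, which starts at the line ending (if any).
fn skip_comment(input: &str) -> Option<&str> {
    let body = input.strip_prefix(';')?;
    // Stop at the first line-ending character, or at end of input.
    let end = body
        .find(|c: char| c == '\n' || c == '\r')
        .unwrap_or(body.len());
    Some(&body[end..])
}
```

A bare `;` with nothing after it is still a valid comment, which is the case the `opt` in the nom version covers.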
## Exercises
1. Extend the integer parser to also recognise hexadecimal literals prefixed with `0x` — use `alt` and `map_res` with `i64::from_str_radix`.
2. Extend the symbol parser to reject the single character `-` followed immediately by a digit (since that should be parsed as a negative integer).
Both exercises should have collapsible reference solutions.
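For exercise 1, the conversion step can be sketched without nom. This is only the part that would sit inside `map_res`, under the assumption that the literal has already been recognized; `hex_to_i64` is a hypothetical helper name.

```rust
/// Converts a `0x`-prefixed literal to `i64`.
/// Returns `None` if the prefix is missing or the digits do not parse.
fn hex_to_i64(literal: &str) -> Option<i64> {
    let digits = literal.strip_prefix("0x")?;
    i64::from_str_radix(digits, 16).ok()
}
```

One pitfall for the reference solution: `i64::from_str_radix` accepts a leading `+` or `-`, so `0x-1` converts successfully here. The recognizer in front of the conversion should therefore admit only hex digits after the prefix.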
## Style notes
- One subsection per atom type, in the order they will appear in the `alt` in §8
- Every code block must be self-contained with `use` statements
- Show tricky cases and why they are tricky before showing the solution — the reader should understand the pitfall, not just copy the fix
- nom version note: use `nom::bytes::complete` (not `nom::bytes::streaming`) throughout