+++
title = "§4 Introduction to nom: Parser Combinators"
priority = 5
status = "done"
ticket_type = "task"
dependencies = []
+++

§4 Introduction to nom: Parser Combinators — Stub to fill

File: edu/src/lisp-compiler.md, section ### 4. Introduction to nom: Parser Combinators

Replace the stub line with full content. Target 900 to 1200 words. This is the conceptual and practical foundation for all parsing in the course. The reader needs to understand nom well enough to write parsers without hand-holding by §8.

Learning objectives

  • Understand what a parser combinator is and why it is better than hand-rolling a recursive descent parser for our purposes
  • Understand IResult<I, O, E> and what its three outcomes mean (success, recoverable error, unrecoverable failure)
  • Know and be able to use: tag, char, alpha1, digit1, multispace0, alt, many0, map, map_res, tuple, delimited, preceded, terminated, opt, recognize, verify, cut
  • Know how to write a parser function, call it, and test it
  • Know how to use the ws whitespace-wrapper pattern

Content to write

What is a parser combinator?

A parser combinator is a function that takes one or more parsers and returns a new parser. Individual parsers handle small fragments of input; combinators compose them into larger parsers. The result is a parser written entirely in the host language (Rust), with no grammar files, no code generation, and no build-time magic.

Contrast with traditional parser generators (ANTLR, yacc): those require a separate grammar file, a code-generation step, and often a bespoke DSL for semantic actions. nom parsers are plain Rust functions.
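To make the definition concrete before nom enters the picture, here is a minimal hand-rolled sketch using only the standard library. The names `literal` and `then` are hypothetical, and the `Option`-based return type is a simplification chosen for illustration; nom's real parsers return `IResult`.

```rust
// Hand-rolled illustration only: nom's real parsers return IResult, not Option.

/// Builds a parser that matches the literal `expected` at the start of the input.
fn literal<'a>(expected: &'static str) -> impl Fn(&'a str) -> Option<(&'a str, &'a str)> {
    move |input: &'a str| {
        let rest = input.strip_prefix(expected)?;
        Some((rest, &input[..expected.len()]))
    }
}

/// A combinator: takes two parsers and returns a new parser that runs them in sequence.
fn then<'a, A, B>(
    first: impl Fn(&'a str) -> Option<(&'a str, A)>,
    second: impl Fn(&'a str) -> Option<(&'a str, B)>,
) -> impl Fn(&'a str) -> Option<(&'a str, (A, B))> {
    move |input: &'a str| {
        let (rest, a) = first(input)?;
        // The remainder left by the first parser is the input of the second.
        let (rest, b) = second(rest)?;
        Some((rest, (a, b)))
    }
}

#[test]
fn sequence_of_two_literals() {
    let open_define = then(literal("("), literal("define"));
    assert_eq!(open_define("(define x"), Some((" x", ("(", "define"))));
    assert_eq!(open_define("(lambda x"), None);
}
```

Both `literal` and `then` are ordinary Rust functions, which is the point of the contrast with grammar-file generators.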

The IResult Type

type IResult<I, O, E = nom::error::Error<I>> = Result<(I, O), nom::Err<E>>;

On success: Ok((remaining_input, output)). The parser consumed some input and produced a value; remaining_input is whatever was left.

On failure (recoverable): Err(nom::Err::Error(e)). The parser tried and failed; the caller can try an alternative.

On failure (unrecoverable): Err(nom::Err::Failure(e)). The parser is committed — no alternatives should be tried. Triggered by cut.

The key insight: parsers return the remaining input. This is what makes composition work: one parser's remaining input is the next parser's input.

Writing a Parser

Show the anatomy of a parser function:

use nom::{IResult, bytes::complete::tag};

fn parse_hello(input: &str) -> IResult<&str, &str> {
    tag("hello")(input)
}

#[test]
fn test_parse_hello() {
    assert_eq!(parse_hello("hello world"), Ok((" world", "hello")));
    assert!(parse_hello("goodbye").is_err());
}

Essential Combinators

Work through each combinator with a small standalone example:

tag(s) — match a literal string.

tag("(")(input)  // matches the literal "("

char(c) — match a single character.

char('(')(input)

alpha1, digit1, alphanumeric1 — match one or more letters/digits/alphanumerics.

multispace0, multispace1 — match zero/one or more whitespace characters.

alt((p1, p2, ...)) — try each parser in order; return the first success.

alt((tag("true"), tag("false"))).parse(input)  // needs `use nom::Parser;` in scope

many0(p) — apply p zero or more times; return Vec<O>.

map(p, f) — transform a parser's output.

map(digit1, |s: &str| s.parse::<i64>().unwrap())

map_res(p, f) — like map but f returns Result; propagates errors.

map_res(digit1, |s: &str| s.parse::<i64>())

tuple((p1, p2, ...)) — run parsers in sequence; collect outputs as a tuple.
Note for nom 8: the tuple function is deprecated there, because a plain tuple of parsers such as (p1, p2) implements Parser directly; confirm this against the changelog for the version in use.

delimited(open, inner, close) — parse open, inner, close; return only inner's output. Perfect for parenthesised expressions.

delimited(char('('), inner_parser, char(')')).parse(input)  // needs `use nom::Parser;` in scope

preceded(prefix, inner) — parse prefix then inner; return only inner.

terminated(inner, suffix) — parse inner then suffix; return only inner.

opt(p) — make p optional; returns Option<O>.

recognize(p) — run p but return the input slice it consumed rather than its output. Useful for building string slices from composed parsers.

verify(p, pred) — run p, then apply predicate pred; fail if predicate returns false.

cut(p) — mark this branch as committed; convert recoverable errors into unrecoverable ones. Use after a discriminating tag (e.g., after matching (define, commit to parsing a define form).

The ws Combinator Pattern

Whitespace appears between any two tokens in Lisp. Define a helper that strips whitespace before and after any parser:

use nom::{character::complete::multispace0, error::ParseError, sequence::delimited, Parser};

/// Wraps `inner` so that leading and trailing whitespace is consumed and discarded.
pub fn ws<'a, O, E, F>(inner: F) -> impl Parser<&'a str, Output = O, Error = E>
where
    E: ParseError<&'a str>,
    F: Parser<&'a str, Output = O, Error = E>,
{
    delimited(multispace0, inner, multispace0)
}

Testing parsers

Show the pattern: use assert_eq! on Ok((remaining, output)) for success cases, assert!(result.is_err()) for failure cases. Note that remaining input is part of the assertion — it is easy to accidentally under-consume.

nom 8 API note

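nom 8 changed the parser API: most combinators (alt, map, many0, delimited, and so on) now return types that implement the Parser trait rather than closures, so they are run with .parse(input). The direct-call form combinator(args)(input) works only for parsers that are plain functions or still return closures, such as tag. Bring the trait into scope with use nom::Parser;. Reference: nom changelog.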

Key references

Exercises to include

  1. Write a parser for #t and #f booleans using alt and tag
  2. Write a parser for a C-style identifier (starts with letter or _, then alphanumeric or _)
  3. Write a parser for a decimal integer using recognize, opt(char('-')), and digit1
  4. Compose the above three into an alt that returns a string slice matching any of them

Each exercise should have a collapsible reference solution.

Style notes

  • Introduce IResult before showing any combinator — readers need to understand the return type to understand what combinators are doing
  • Show every combinator with a working code snippet, not just a description
  • Make the ws wrapper a "save this — you will use it throughout" moment