Skip to content

Latest commit

 

History

History
167 lines (133 loc) · 6.4 KB

day10.md

File metadata and controls

167 lines (133 loc) · 6.4 KB

Day 10 - nom, part 1

It's entirely possible that you're walking a happy road of a programmer who never had to write a parser by hand. That's not my case unfortunately. I remember incomprehensible indexing of hideous arrays of characters, a maddening cascade of unmaintainable if-else statements, and futile, indescribable attempts to abstract away parts of this unspeakable monstrosity.

If the above paragraph was hard to parse, good! Putting some Lovecraftian adjectives into a description of events can be a way of coping with terrible experiences. Those dark, eldritch (sorry, couldn't resist one more) days are fortunately over. With parser combinators we can write composable and fast parsers. Rust adds another adjective here: safe. nom is a parser combinator library that works by generating parsing code at compile time with a bunch of macros. It also tries to avoid allocation and work through input bytes without copying.

I decided to split nom article into two parts. Today we're focused on parsing text (well, bytes that contain text), while the next article in the series will cover binary parsing.

This is my first hands-on experience with parser combinators - I'm learning nom as I write this. Feel free to let me know if the examples here could be more nom-idiomatic.

Toy HTTP parser

We'll start our adventure with nom by writing a simple parser for HTTP headers. Even though HTTP itself is a plain text protocol, it's a complex beast.

Note: this is by no means a complete, correct HTTP parser. Its purpose is only illustrative. In fact, we're only parsing the first line of the HTTP header. Parsing the rest of header fields is left as an exercise for the reader.

Let's take a look at the first line of HTTP header.

let first_line = b"GET /home/ HTTP/1.1\r\n";

According to RFC 2616, which is the definition of the HTTP protocol, the first line of the request starts with method name. It is followed by an URI (/ in our example) and finally protocol version. Finally it's terminated by CRLF characters.

named!(parse_method_v1, alt!(tag!("GET") | tag!("POST")));
println!("{:?}", parse_method_v1(&first_line[..]));

We can create a new parser using named! macro. It will create a function that takes a byte slice (&[u8]) and returns an IResult.

The tag! macro matches the exact characters given. alt! tries to match any of sub-parsers separated by |, returning the result of the first one that matches. (Note: we're leaving out other HTTP methods for simplicity.)

$ cargo run
Done([32, 47, 32, 72, 84, 84, 80, 47, 49, 46, 49, 13, 10], [71, 69, 84])

Done is a variant of IResult returned when the parser succeeds. We're interested in the second field, which contains the byte sequence matched by the parser. (We can check that the byte sequence [71, 69, 84] is equivalent to GET.) But we'd like to see some text, not raw bytes. We can do this in nom using map! or map_res! combinators.

named!(parse_method_v2<&[u8], &str>,
       alt!(map_res!(tag!("GET"), str::from_utf8) | map_res!(tag!("POST"), str::from_utf8)));
println!("{:?}", parse_method_v2(&first_line[..]));

map_res! takes an existing parser (like tag!("GET")) and applies a function to its result. This function must return a Result, like from_utf8 in our example. There's also a version for functions returning Option - map_opt!, and one for functions returning "plain" values - map!. If you need your parser to produce something else than byte slices, you'll probably use these macros a lot.

Note that we had to add type parameters to our parser inside angle brackets. First is the input type - a byte array slice, followed by output type. If you omit these types as we did in parse_method_v1, the default output type is also a byte slice.

#[derive(Debug)]
enum Method {
    GET,
    POST,
}

named!(parse_method<&[u8], Method>,
       alt!(map!(tag!("GET"), |_| Method::GET) | map!(tag!("POST"), |_| Method::POST)));

Here we're using a custom enum for HTTP methods. The parser uses map! to discard matched bytes (we already know we matched either GET or POST) and return a respective Method variant. We could also implement FromStr for our enum.

The power of parser combinator libraries such as nom lies in the combinator part of the name. We can build very complex parsing machinery from small pieces. Let's reuse our parse_method function to actually parse entire first line of HTTP header.

#[derive(Debug)]
struct Request {
    method: Method,
    url: String,
    version: String,
}

named!(parse_request<&[u8], Request>, ws!(do_parse!(
    method: parse_method >>
    url: map_res!(take_until!(" "), str::from_utf8) >>
    tag!("HTTP/") >>
    version: map_res!(take_until!("\r"), str::from_utf8) >>
    (Request { method: method, url: url.into(), version: version.into() })
)));
println!("{:?}", parse_request(&first_line[..]));
$ cargo run
Done([], Request { method: GET, url: "/home/", version: "1.1" })

The do_parse! macro is new in nom 2.0. It's used to chain parsers one after another. If an intermediate value is required, we can give it a name, for example in method: parse_method the method name will hold the result of parse_method parser. Subsequent parsers are put together with >> and the final result can refer to named intermediate results. When we run the code, we notice it consumed all bytes and succesfully parsed a Request.

What if our input doesn't conform to the specification?

let bad_line = b"GT /home/ HTTP/1.1\r\n";
println!("{:?}", parse_request(&bad_line[..]));
$ cargo run
Error(Alt)

This time our parser returns an Error variant, indicating which of the subparsers failed (Alt points to alt! macro). There is an extensive guide to error management in nom if we wanted to improve the experience.

See you tomorrow for part 2 on binary parsing!

Further reading