
Rust Dust: 5. Tokenizer Redux with Error Handling

There’s a clear problem with the current implementation of the tokenizer library: it only covers the happy code path. We’ve skirted the possibility of an error with all the calls to unwrap(). This is perfectly fine in unit tests, where failing fast is appropriate, but production code must handle errors gracefully — the subject of this post. For additional perspective, I start with an overview of error handling before Rust. If you’d rather continue reading about Rust, skip to section 2.

1. Evolution of Error Handling before Rust

Rust’s error handling is, in a way, a throwback to the 1960s, when languages had no support for exceptions. Back then, each fallible operation, e.g. a file read or integer arithmetic, returned a status code that had to be acted upon locally. For example, in Fortran IV (IBM, 1962) an I/O error handler was a line label to which the program jumped in case of an error:

READ (unit, format, ERR=100) variable
100 CONTINUE
! Handle error here

Things were iffier when it came to errors arising in the user’s own computations, like division by zero or integer overflow. In response, PL/I (IBM, 1964) was the first language to offer exceptions that could be used to handle such conditions declaratively:

ON ZERODIVIDE BEGIN;
    PUT SKIP LIST('Error: Division by zero detected!');
    /* Handle error */
END;

This advance was made possible by an important development in compiler design: from being a mechanical translator of high-level code into machine instructions, the compiler began inserting logic the programmer never wrote, e.g. runtime checks that a denominator is not zero. Moreover, what compilers had to do to implement such features depended on the instruction set and the OS, so compilers began specializing for target architectures.

By the time the C language was released (Bell Labs, 1972), the concept of exceptions was well understood by language designers. And yet, Dennis Ritchie left it out entirely, for two reasons:

  • Simplicity and performance. The original C compiler did not insert any logic that was not written by the programmer.
  • Interoperability. This was the time of many new operating systems and processors, and in order for a C program to behave the same way in different environments, the language had to be minimalistic.

C++ (AT&T Bell Labs, 1985) later added exceptions — to the chagrin of many programmers. Without the support of a VM, exceptions proved expensive and, more importantly, unsafe. Many teams banned or severely restricted the use of exceptions in C++.

Most newer languages, like Java, C#, or Erlang, run on a VM, which enables safe and efficient exception handling. (Not to mention the interpreted languages like Ruby and Python, whose interpreters are effectively VMs of their own.) The price of this high-level convenience is low-level inefficiency. Many use cases, like systems programming, cannot afford a VM, and some, like embedded systems, cannot have one at all. Until recently, these use cases relied squarely on C or C++.

The big change came in 2003 with the release of LLVM (Low Level Virtual Machine), an intermediate representation and compiler infrastructure that permitted rapid development of new low-level languages running directly on the processor, without a runtime VM. Two such languages soon followed: Go (Google, 2009) and Rust (Mozilla, 2010) — both ditching exceptions in favor of error propagation via return values.

2. Living without Exceptions

To reiterate, exceptions have these drawbacks:

  • Runtime overhead. Note, for example, that Scala (EPFL, 2004), an advanced dual-paradigm language that compiles to Java bytecode, uses Java’s try/catch/finally blocks and the throw statement, just like Java, because these concepts are built into the JVM. (Most Scala programmers prefer the Try type, which hides these imperative verbs behind a more functional flow.)
  • Potential resource leaks due to early termination.
  • Undermining the compiler’s ability to reason about the source code.

For these reasons, Rust does not support exceptions — at least not the kind that can be caught. Instead, Rust offers two error handling mechanisms: panic for non-recoverable errors, which typically terminate the panicking thread, and Result for recoverable errors.

2.1. Panic

Panics are meant for systemic, unrecoverable errors. A panic can be triggered explicitly with the panic! macro, or implicitly by one of the following:

  • Calling unwrap() on Result if it is Err. (Result is the subject of the next section.)
  • Calling unwrap() on Option if it is None.
  • Calling expect() on either Result or Option, which is just a variation of unwrap() that allows the caller to attach last words to the panic.
  • Various arithmetic errors, such as division by zero and (in debug builds) integer overflow.
  • Out-of-bounds array index.
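As an aside, each of the implicit arithmetic and indexing panics above has a non-panicking counterpart in the standard library that returns Option instead. A minimal sketch:

```rust
fn main() {
    // checked_div returns None instead of panicking on division by zero
    assert_eq!(10i32.checked_div(0), None);
    assert_eq!(10i32.checked_div(2), Some(5));

    // checked_add returns None on overflow (a plain `+` would panic in debug builds)
    assert_eq!(i32::MAX.checked_add(1), None);

    // get() on a slice returns None instead of panicking on an out-of-bounds index
    let v = [1, 2, 3];
    assert_eq!(v.get(10), None);

    println!("all checked operations returned None instead of panicking");
}
```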

Panic is thread-local; a panic in a non-main thread will terminate the thread, but not the process. There are, however, errors more disruptive than panics, such as the out-of-memory error. At the time of this writing it does not cause a panic but rather aborts the process, regardless of which thread received it.
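Thread-locality is easy to demonstrate: a panic in a spawned thread surfaces to the parent as an Err from join(), while the process keeps running. A minimal sketch:

```rust
use std::thread;

fn main() {
    let handle = thread::spawn(|| {
        panic!("worker thread panicked");
    });

    // The panic terminates only the spawned thread; join() reports it as Err.
    let result = handle.join();
    assert!(result.is_err());

    // The main thread is unaffected and continues normally.
    println!("main thread is still alive");
}
```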

On panic, the Rust runtime attempts to unwind the call stack from the point of panic to the entry point of the current thread and clean up all heap allocations owned by stack values. This is not guaranteed to succeed, because there’s no requirement that each struct override the default implementation of Drop. Consequently, repeatedly panicking threads may end up leaking memory.

Even though panic is reserved for non-recoverable errors, the standard library does provide a way to recover from one with std::panic::catch_unwind(), and even to trigger a custom panic with std::panic::panic_any(), which attaches a value of an arbitrary type to the panic that can later be accessed at the point of recovery. This mechanism, however, is not meant to mimic exception handling à la Java; rather, it lets libraries localize their panics instead of making their users deal with unexpected panics coming from 3rd-party crates.
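A minimal sketch of this recovery mechanism, with an illustrative payload type LibError: catch_unwind() captures the panic, and the payload attached by panic_any() is recovered with a downcast:

```rust
use std::panic;

// An illustrative payload type a library might attach to its panics.
#[derive(Debug, PartialEq)]
struct LibError(u32);

fn main() {
    let outcome = panic::catch_unwind(|| {
        // panic_any attaches an arbitrary value instead of a &str message
        panic::panic_any(LibError(42));
    });

    // At the point of recovery, downcast the payload back to its concrete type.
    let payload = outcome.expect_err("closure should have panicked");
    match payload.downcast::<LibError>() {
        Ok(err) => assert_eq!(*err, LibError(42)),
        Err(_) => unreachable!("payload was not a LibError"),
    }
    println!("recovered from panic with a typed payload");
}
```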

2.2. The Result Type

All user errors, like trying to read a file that doesn’t exist, and recoverable system errors, like timing out on a network call, are meant to be handled with the Result type. It’s the type returned by fallible functions in any library, standard or not, so my task as a consumer of those libraries is to handle the Result they return correctly, by either recovering from the error, like retrying the failed operation, or propagating it up the call stack to be handled by a caller.

2.2.1. Implicit Error Propagation

In a well organized codebase, each fallible function returns an object of type Result<T,E>, where T is the type of the successful result and E is the error type. Result is an enum with two variants: Ok(T) wraps the successful return value, while Err(E) wraps the error value. Both T and E can be of any type; there’s absolutely no constraint on what the success value or the error can be.
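A minimal sketch of this convention, with illustrative names (parse_positive, ParseError): the fallible function returns Result, and the caller matches on the two variants:

```rust
// An illustrative error type; any type can serve as E.
#[derive(Debug, PartialEq)]
enum ParseError {
    Empty,
    NotANumber,
}

// A fallible function returns Result<T, E> instead of throwing.
fn parse_positive(input: &str) -> Result<u32, ParseError> {
    if input.is_empty() {
        return Err(ParseError::Empty);
    }
    input.parse::<u32>().map_err(|_| ParseError::NotANumber)
}

fn main() {
    // The caller handles both variants explicitly.
    match parse_positive("42") {
        Ok(n) => assert_eq!(n, 42),
        Err(e) => panic!("unexpected error: {:?}", e),
    }
    assert_eq!(parse_positive(""), Err(ParseError::Empty));
    assert_eq!(parse_positive("abc"), Err(ParseError::NotANumber));
}
```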

There is commonly used syntactic sugar in the form of the ? operator, which makes error propagation less verbose: some_result_value? desugars to

match some_result_value {
    Ok(val) => val,
    Err(err) => return Err(From::from(err))
}

Which is to say that if some value of type Result<T,E> is a success, ? unwraps the T, but if it’s a failure, ? short-circuits out of the function with the (possibly converted) value of E, wrapped in Err. If you do nothing, From::from(err) returns the err value itself, thanks to the reflexive identity implementation of From<T>:

impl<T> From<T> for T {
    fn from(t: T) -> T { t }
}

This nuance enables implicit conversion from one error type to another as an error is propagated up the stack. This is important because most crates define their own error types, exposing data pertinent to the kinds of errors the library may encounter. Thus, programmers typically have to deal with several error types by converting them into a new error type of their own. We will see how this automatic conversion works in the next section.

2.2.2. Explicit Error Propagation (V1)

Source

I start by defining our custom tokenizer error type as a sum of all possible error types we can get (only one in our simple example):

#[derive(Debug)]
pub enum TokenizerError {
    Io(io::Error),
}

Let’s start with from_buf_reader(), whose original implementation was as follows:

/// Read tokens from a reader
pub fn from_buf_reader<R: Read>(&self, reader: R) -> impl Iterator<Item=String> {
    BufReader::new(reader).lines()
        .map(|res| res.unwrap())
        .map(|str| str.chars().filter(|c| (self.validator)(c)).collect::<String>())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}

The only fallible call here is BufReader::lines(), which returns an iterator over results, each containing either the parsed line as a string or an error if the bytes were not valid UTF-8. We will let the caller process the errors by returning impl Iterator<Item=Result<String, TokenizerError>>.

The name of the game here is to replace the call to unwrap() with something that propagates the error up the call stack instead of panicking. Because the call to unwrap() is inside a closure, we cannot use the ? syntax to return from the containing function. Instead, we map the successful result to its filtered version. The flat map also receives a Result as its argument and maps it to an iterator of Results to be flattened into the invoking iterator.

/// Read tokens from a reader
pub fn from_buf_reader<R: io::Read>(&self, reader: R) -> impl 
    Iterator<Item=Result<String, TokenizerError>> 
{
    io::BufReader::new(reader).lines()
        .map(|res_line|
            res_line.map(|line|
                line.chars().filter(|c| (self.validator)(c)).collect::<String>()
            )
        )
        .flat_map(|res_line|
            match res_line {
                Err(err) =>
                    vec![Err(TokenizerError::from(err))],
                Ok(line) =>
                    line.split_whitespace()
                        .map(|str| Ok(String::from(str)))
                        .collect::<Vec<Result<String, _>>>()
            }
        )
}
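If a caller wants to stop at the first error, the standard library’s FromIterator impl for Result lets it collect an iterator of Results into a single Result. A minimal, self-contained sketch with a stand-in String error type (the full Tokenizer isn’t reproduced here):

```rust
// Collecting an iterator of Results into Result<Vec<_>, _> short-circuits
// at the first Err, mirroring how a caller might consume from_buf_reader().
fn main() {
    let all_ok: Vec<Result<String, String>> =
        vec![Ok("alpha".to_string()), Ok("beta".to_string())];
    let collected: Result<Vec<String>, String> = all_ok.into_iter().collect();
    assert_eq!(collected, Ok(vec!["alpha".to_string(), "beta".to_string()]));

    let with_err: Vec<Result<String, String>> =
        vec![Ok("alpha".to_string()), Err("bad utf-8".to_string())];
    let collected: Result<Vec<String>, String> = with_err.into_iter().collect();
    assert_eq!(collected, Err("bad utf-8".to_string()));

    println!("collect() short-circuited at the first Err");
}
```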

Now let’s fix the from_file() method, whose original implementation was as follows:

pub fn from_file(&self, filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    self.from_buf_reader(file)
}

Here, the call to unwrap() is not inside a closure so we can take advantage of the ? syntax:

/// Read tokens from a file
pub fn from_file(&self, filename: &str)
    -> Result<impl Iterator<Item=Result<String, TokenizerError>>, TokenizerError>
{
    Ok(self.from_buf_reader(fs::File::open(filename)?))
}

Note the implicit conversion from io::Error, returned by fs::File::open(), to TokenizerError. This is possible because we provided an implementation of the From trait that covers exactly this use case:

impl From<io::Error> for TokenizerError {
    fn from(error: io::Error) -> Self { TokenizerError::Io(error) }
}
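The same conversion can be sketched in isolation, mirroring the TokenizerError pattern with illustrative names (AppError, read_settings):

```rust
use std::fs;
use std::io;

// An illustrative custom error, mirroring the TokenizerError pattern above.
#[derive(Debug)]
enum AppError {
    Io(io::Error),
}

impl From<io::Error> for AppError {
    fn from(e: io::Error) -> Self {
        AppError::Io(e)
    }
}

// `?` converts the io::Error from fs::read_to_string into AppError via From.
fn read_settings(path: &str) -> Result<String, AppError> {
    let contents = fs::read_to_string(path)?;
    Ok(contents)
}

fn main() {
    let err = read_settings("/no/such/file.txt").unwrap_err();
    assert!(matches!(err, AppError::Io(ref e) if e.kind() == io::ErrorKind::NotFound));
    println!("io::Error was converted into AppError::Io by `?`");
}
```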

We can now add a new test case for the file not found error:

#[test]    
fn test_io_error() {
    let tokenizer = Tokenizer::new_with_validator(validator);
    match tokenizer.from_file("./bad.txt") {
        Ok(_) => assert!(false),
        Err(err) => assert!(
            matches!(err, TokenizerError::Io(ioerr) if ioerr.kind() == io::ErrorKind::NotFound)
        ),
    }
}

3. Further Discussion (V2)

Source.

3.1. The Problem

The solution we developed in V1 is already much better than the original tokenizer, because we’ve replaced panics with orderly and statically typed error handling. The one last wrinkle is the unsightly return type Result<impl Iterator<Item=Result<String, TokenizerError>>, TokenizerError> returned by from_file(). If the caller is to be able to tell apart the two TokenizerErrors, we’d have to expose implementation details that need not be exposed.

Rather, I want to expose only one error type, while keeping the return type an iterator. This means that the error returned by fs::File::open() must be repackaged as a single-element iterator. Something like this:

/// Read tokens from a file -- DOES NOT COMPILE
pub fn from_file(&self, filename: &str)
    -> impl Iterator<Item=Result<String, TokenizerError>>
{
    match fs::File::open(filename) {
        Ok(file) => self.from_buf_reader(file),
        Err(error) => vec![Err(TokenizerError::from(error))].into_iter()
    }
}

This would work in an OO language, like Scala, where the actual implementation would be determined at runtime. But Rust won’t compile this:

= note: expected opaque type `impl Iterator<Item = Result<String, token_with_result_v2::TokenizerError>>`
                   found struct `std::vec::IntoIter<Result<_, token_with_result_v2::TokenizerError>>`
help: you could change the return type to be a boxed trait object
   |
34 -         -> impl Iterator<Item=Result<String, TokenizerError>>
34 +         -> Box<dyn Iterator<Item=Result<String, TokenizerError>>>
   |
help: if you change the return type to expect trait objects, box the returned expressions
   |
37 ~             Ok(file) => Box::new(self.from_buf_reader(file)),
38 ~             Err(error) => Box::new(vec![Err(TokenizerError::from(error))].into_iter())
   |

The hint suggests that we could solve this problem with the familiar technique of boxing the return value. We’ve already encountered it when we implemented the recursive Stack type. The difference here is that the compiler can’t determine the actual type because the function resolves at runtime to one of two different opaque types. However, I don’t want to change the return type. Rather, I’d like to solve what is likely a general problem: how to return one of several opaque types implementing Iterator.

So far, I’ve found two ways to make the Rust compiler do the work for us: by using enums or by chaining the two iterators with Iterator::chain().

3.2. Abstracting Over Opaque Types with enums

Enums are sum types, which unite arbitrary types in a single type. To unite the two different iterator types, we create a new enum, TokenizerIter, which wraps one of the two possible arms of the match statement above:

pub enum TokenizerIter<I1,I2> {
    Iter1(I1),
    Iter2(I2),
}

In order to use TokenizerIter in place of impl Iterator<Item=Result<String, TokenizerError>>, it needs to implement Iterator with the same item type:

impl<I1: Iterator<Item=Result<String,TokenizerError>>, I2: Iterator<Item=Result<String,TokenizerError>>>
Iterator for TokenizerIter<I1, I2> {
    type Item = Result<String, TokenizerError>;
    fn next(&mut self) -> Option<Result<String, TokenizerError>> {
        match self {
            Self::Iter1(iter1) => iter1.next(),
            Self::Iter2(iter2) => iter2.next(),
        }
    }
}
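To see the trick in isolation, here is a self-contained sketch with stand-in names (EitherIter, pick): two different concrete iterators are unified behind one nameable enum type:

```rust
// A stand-in for TokenizerIter: one nameable type over two iterator types.
enum EitherIter<I1, I2> {
    Iter1(I1),
    Iter2(I2),
}

impl<T, I1: Iterator<Item = T>, I2: Iterator<Item = T>> Iterator for EitherIter<I1, I2> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        match self {
            Self::Iter1(i) => i.next(),
            Self::Iter2(i) => i.next(),
        }
    }
}

// The two branches produce different concrete iterators, unified by the enum.
fn pick(flag: bool) -> impl Iterator<Item = i32> {
    if flag {
        EitherIter::Iter1((1..=3).map(|x| x * 10))
    } else {
        EitherIter::Iter2(vec![7].into_iter())
    }
}

fn main() {
    assert_eq!(pick(true).collect::<Vec<_>>(), vec![10, 20, 30]);
    assert_eq!(pick(false).collect::<Vec<_>>(), vec![7]);
}
```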

This is it, really!

3.3. Abstracting Over Opaque Types with Either

Except, we just reinvented the Either enum, available from the either crate. (Will it make it into the standard library, like in Scala and Haskell?)

pub enum Either<L, R> {
    /// A value of type `L`.
    Left(L),
    /// A value of type `R`.
    Right(R),
}

It’s symmetric, has lots of useful methods, and, in particular, implements Iterator. We can just use it without worrying about implementing anything ourselves:

pub fn from_file_either(&self, filename: &str)
                 -> impl Iterator<Item=Result<String, TokenizerError>>
{
    match fs::File::open(filename) {
        Ok(file) => Either::Left(self.from_buf_reader(file)),
        Err(error) => Either::Right(vec![Err(TokenizerError::from(error))].into_iter())
    }
}

Note that, since Either is symmetric, I can nest it to create more branches. For example, to get three branches x, y, z, I could use Left(x), Right(Left(y)), and Right(Right(z)).

3.4. Chaining Optional Iterators

It turns out, we don’t even need an explicit sum type to unify the two different opaque implementors of Iterator. We can let the compiler do that for us as well:

pub fn from_file_chain(&self, filename: &str)
    -> impl Iterator<Item=Result<String, TokenizerError>>
{
    let (iter1_opt, iter2_opt) =
        match fs::File::open(filename) {
            Ok(file) => (Some(self.from_buf_reader(file)), None),
            Err(error) => (None, Some(vec![Err(TokenizerError::from(error))]))
        };
    iter1_opt.into_iter().flatten().chain(iter2_opt.into_iter().flatten())
}

Here, I’ve delegated the job of Either to the two calls to flatten(), one of which produces an empty iterator and the other the iterator to be returned by the function, while chain() stitches them together.
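Here is the same Option/flatten/chain trick in isolation, as a self-contained sketch with stand-in names (numbers is hypothetical):

```rust
// Exactly one of the two Options is Some, so one flattened iterator is empty
// and the other yields the actual elements; chain() stitches them together.
fn numbers(fail: bool) -> impl Iterator<Item = Result<i32, String>> {
    let (ok_iter, err_iter) = if fail {
        (None, Some(vec![Err("open failed".to_string())]))
    } else {
        (Some((1..=3).map(Ok)), None)
    };
    // Option::into_iter() yields zero or one element; flatten() then yields
    // either nothing or the inner iterator's elements.
    ok_iter.into_iter().flatten().chain(err_iter.into_iter().flatten())
}

fn main() {
    let ok: Vec<_> = numbers(false).collect();
    assert_eq!(ok, vec![Ok(1), Ok(2), Ok(3)]);

    let err: Vec<_> = numbers(true).collect();
    assert_eq!(err, vec![Err("open failed".to_string())]);
}
```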
