
Rust Dust: 5. Tokenizer Redux with Error Handling

There’s a clear problem with the current implementation of the tokenizer library: it covers only the happy path. We’ve skirted the possibility of an error with all those calls to unwrap(). This is perfectly fine in unit tests, where failing fast is appropriate, but production code must handle errors gracefully — the subject of this post. For additional perspective, I start with an overview of error handling before Rust. If you’d rather continue reading about Rust, skip to section 2.

1. Evolution of Error Handling before Rust

Rust’s error handling is, in a way, a throwback to the 1960s, when languages had no support for exceptions. Back then, each fallible operation, e.g. a file read, returned a status code that had to be acted upon locally. For example, in Fortran IV (IBM, 1962) an I/O error handler was a line label to which the program jumped in case of an error:

READ (unit, format, ERR=100) variable
100 CONTINUE
C Handle error here

Things were iffier when it came to errors raised in the middle of a computation, like division by zero or integer overflow. In response, PL/I (IBM, 1964) became the first language to offer exceptions, which could handle such conditions declaratively:

ON ZERODIVIDE BEGIN;
    PUT SKIP LIST('Error: Division by zero detected!');
    /* Handle error */
END;

This advance was made possible by an important development in compiler design: instead of mechanically translating high-level code into machine instructions, compilers began inserting logic the programmer never wrote, e.g. a runtime check that a denominator is not zero. Moreover, what a compiler had to do to implement such features depended on the instruction set and the OS, so compilers began specializing for target architectures.

By the time the C language was released (Bell Labs, 1974) the concept of exceptions was well understood by language designers. And yet, Dennis Ritchie left it out entirely for two reasons:

  • Simplicity and performance. The original C compiler did not insert any logic that was not written by the programmer.
  • Interoperability. This was a time of many new operating systems and processors, and for a C program to behave the same way in different environments, the language had to be minimalistic.

C++ (AT&T Bell Labs, 1985) added exceptions — to the chagrin of many programmers. Without the support of a VM, exceptions proved expensive and, more importantly, unsafe. Many teams banned or severely restricted their use in C++.

A newer generation of languages, like Java, C#, and Erlang, runs on a VM, enabling safe and efficient exception handling. The price of this high-level convenience is low-level inefficiency. Many use cases, like systems programming, cannot afford a VM, and some, like embedded systems, cannot have one at all. Until recently, these use cases relied squarely on C or C++.

The big change came in 2003 with the release of LLVM (Low Level Virtual Machine), a compiler infrastructure with its own intermediate representation that permitted rapid development of new low-level languages running directly on the processor, without a runtime VM. Within a few years, two such languages appeared: Go (Google, announced 2009) and Rust (Mozilla, announced 2010) — Rust built directly on LLVM, Go using its own compiler but sharing the same philosophy. Both ditched Java-style exceptions in favor of error propagation via rich return types.

2. Living without Exceptions

To reiterate, exceptions have these drawbacks:

  • Runtime overhead. Note, for example, that Scala (EPFL, 2004), an advanced dual-paradigm language that compiles to Java bytecode, uses Java’s try/catch/finally blocks and the throw statement, just like Java, because these constructs are built into the JVM. (Most Scala programmers prefer the Try type, which hides these imperative verbs behind a more functional flow.)
  • Potential resource leaks due to early termination.
  • Undermining the compiler’s ability to reason about the source code.

For these reasons, Rust does not support exceptions — at least not the kind that can be caught. Instead, Rust offers two error-handling mechanisms: panic for non-recoverable errors, which typically terminate the panicking thread, and Result for recoverable errors.

2.1. Panic

Panics are, essentially, exceptions reserved for systemic errors that are not meant to be handled; generally speaking, panics cannot be caught. A panic can be triggered explicitly with the panic! macro, or implicitly by one of the following:

  • Calling unwrap() on Result or Option, if it is Err or None, respectively.
  • Calling expect() on Result or Option, if it is Err or None, respectively. This is just a variation of unwrap(), which allows the caller to attach last words to the panic.
  • Various arithmetic errors, such as division by zero or, in debug builds, integer overflow.
  • Out-of-bounds array index.
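A couple of these implicit panics can be observed directly. The sketch below uses std::panic::catch_unwind (discussed at the end of this section) merely to witness each panic without crashing; the panics helper is mine, not part of the standard library:

```rust
use std::panic;

// Helper (not std): returns true if the given closure panics.
fn panics<F: FnOnce() -> i32 + panic::UnwindSafe>(f: F) -> bool {
    panic::catch_unwind(f).is_err()
}

fn main() {
    // Silence the default panic hook so the output stays clean.
    panic::set_hook(Box::new(|_| {}));

    // unwrap() on a None panics:
    assert!(panics(|| None::<i32>.unwrap()));

    // Out-of-bounds indexing panics:
    assert!(panics(|| vec![1, 2, 3][10]));

    // Division by zero panics; the denominator is parsed at runtime so
    // the compiler cannot reject the expression as an unconditional panic.
    let d: i32 = "0".parse().unwrap();
    assert!(panics(move || 1 / d));
}
```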

Similarly to exceptions in VM-backed functional languages, a panicking block is considered to never return, which is signified by the never type, written !, which cannot be instantiated. This is particularly apparent in the let/else construct, where the else block must be divergent, i.e. never return.

let Some(x) = maybe_value else {
    panic!("No value found");
};

Panic is thread-local; a panic in a non-main thread terminates that thread, but not the process. There are, however, errors more disruptive than panics, such as the out-of-memory error. At the time of this writing, it does not cause a panic but rather terminates the whole process, regardless of which thread received it.
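The thread-locality is easy to verify: a panic in a spawned thread surfaces to the spawning thread as an Err from join(), while the process carries on. A minimal sketch:

```rust
use std::thread;

// Spawns a thread that panics and reports whether join() observed the panic.
fn worker_panicked() -> bool {
    let handle = thread::spawn(|| {
        panic!("worker thread panicked");
    });
    // join() returns Err carrying the panic payload, not a success value.
    handle.join().is_err()
}

fn main() {
    assert!(worker_panicked());
    // The main thread is unaffected by the worker's panic.
    println!("process survived");
}
```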

On panic, the Rust runtime attempts to unwind the call stack from the point of panic to the entry point of the current thread and clean up all heap allocations owned by stack values. This is not guaranteed to succeed, because there’s no requirement that every struct override the default implementation of Drop. Consequently, repeatedly panicking threads may end up leaking memory.

Even though panic is reserved for non-recoverable errors, the standard library does provide a way to recover from it with std::panic::catch_unwind(), and even to trigger a custom panic with std::panic::panic_any(), which attaches an arbitrary value to the panic that can later be retrieved at the point of recovery. This mechanism, however, is not meant for mimicking exception handling à la Java, but rather for libraries to keep their panics from leaking across the abstraction boundary.
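A quick illustration of both functions together; the LibError payload type is hypothetical, standing in for whatever a library might want to smuggle out of a panic:

```rust
use std::panic;

// Hypothetical payload a library might attach to its panics.
#[derive(Debug, PartialEq)]
struct LibError {
    code: i32,
}

// Panics with a typed payload, catches it, and recovers the payload.
fn recovered_code() -> i32 {
    let result = panic::catch_unwind(|| {
        panic::panic_any(LibError { code: 42 });
    });
    match result {
        Ok(()) => 0,
        // The payload arrives as Box<dyn Any + Send>; downcast restores its type.
        Err(payload) => match payload.downcast::<LibError>() {
            Ok(lib_err) => lib_err.code,
            Err(_) => -1,
        },
    }
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // keep stderr quiet
    assert_eq!(recovered_code(), 42);
}
```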

2.2. The Result Type

All user errors, like trying to read a file that doesn’t exist, and recoverable system errors, like timing out on a network call, are meant to be handled with the Result type. It’s the type returned by any well-behaved library, standard or not, so my task as a consumer of those libraries is to handle the Result correctly: either recover from the error, e.g. by retrying the failed operation, or propagate it up the call stack to be handled by a caller.

In a well-organized codebase, each fallible function returns a value of type Result<T,E>, where T is the type of the good result if the function succeeded, and E is the error type otherwise. Result is an enum with two variants: Ok(T) wraps the successful return value, while Err(E) wraps the error value. Both T and E can be of any type; there’s no restriction on what the success value or the error can be.

In a typical flow, the caller examines the returned Result by pattern matching and explicitly handles both possibilities. In the vast majority of cases, the error is handled by simply propagating it up the call stack. This path is so common that Rust offers syntactic sugar for it in the form of the ? operator, which makes error propagation less verbose: some_result_value? desugars roughly to

match some_result_value {
    Ok(val) => val,
    Err(err) => return Err(From::from(err))
}

Which is to say that if a value of type Result<T,E> is a success, ? unboxes the T, but if it’s a failure, ? short-circuits out of the function with the possibly converted value of E, boxed in Err. If no conversion is needed, From::from(err) returns the err value itself, thanks to the blanket identity implementation of From<T>:

impl<T> From<T> for T {
    fn from(t: T) -> T { t }
}

This nuance enables implicit conversion from one error type to another as the error propagates up the stack. This is important because it frees developers to use custom error types encapsulating the data pertinent to the error. The downside is that programmers often have to deal with several error types coming out of third-party libraries. We will see how this automatic conversion works in the next section.
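Here is the conversion in miniature, outside the tokenizer; AppError is a hypothetical application error type wrapping the standard library’s ParseIntError:

```rust
use std::num::ParseIntError;

// Hypothetical application error type with a variant per underlying cause.
#[derive(Debug)]
enum AppError {
    Parse(ParseIntError),
}

impl From<ParseIntError> for AppError {
    fn from(e: ParseIntError) -> Self {
        AppError::Parse(e)
    }
}

// The ? below returns Err(AppError::Parse(..)) on failure, even though
// str::parse() yields a ParseIntError: From::from does the conversion.
fn parse_doubled(s: &str) -> Result<i32, AppError> {
    let n: i32 = s.parse()?;
    Ok(n * 2)
}

fn main() {
    assert_eq!(parse_doubled("21").unwrap(), 42);
    assert!(matches!(parse_doubled("x"), Err(AppError::Parse(_))));
}
```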

3. Adding Error Handling to Tokenizer

Source

I start by defining our custom tokenizer error type as a sum of all possible error types we can get (only one in our simple example):

#[derive(Debug)]
pub enum TokenizerError {
    Io(io::Error),
}
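As an aside, a production-grade error type would usually also implement Display and std::error::Error, so that it composes with the wider ecosystem. A sketch of what that could look like (not required for the rest of this post):

```rust
use std::{error, fmt, io};

#[derive(Debug)]
pub enum TokenizerError {
    Io(io::Error),
}

impl fmt::Display for TokenizerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TokenizerError::Io(e) => write!(f, "I/O error while tokenizing: {}", e),
        }
    }
}

impl error::Error for TokenizerError {
    // source() lets callers walk back to the underlying io::Error.
    fn source(&self) -> Option<&(dyn error::Error + 'static)> {
        match self {
            TokenizerError::Io(e) => Some(e),
        }
    }
}

fn main() {
    let err = TokenizerError::Io(io::Error::new(io::ErrorKind::NotFound, "no such file"));
    assert!(err.to_string().contains("no such file"));
}
```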

Let’s start with from_buf_reader(), whose original implementation was as follows:

/// Read tokens from a reader
pub fn from_buf_reader<R: Read>(&self, reader: R) -> impl Iterator<Item=String> {
    BufReader::new(reader).lines()
        .map(|res| res.unwrap())
        .map(|str| str.chars().filter(|c| (self.validator)(c)).collect::<String>())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}

The only fallible call here is BufReader::lines(), which returns an iterator over Results, each containing either the parsed line as a String or an error if the bytes were not valid UTF-8. We will follow the same pattern — attempt to parse as many lines as possible and let the caller process the errors — by returning impl Iterator<Item=Result<String, TokenizerError>>.

The goal is to replace the call to unwrap() with something that propagates an error up the call stack instead of panicking. We can throw away the first map completely, so that the second map receives the Result returned by lines() and, if it is Ok, applies the filter to the string inside, or, if it’s Err, returns it as is. Likewise, flat_map receives the Result returned by map and, if it’s Ok, splits the string and wraps each token in its own Ok.

/// Read tokens from a reader
pub fn from_buf_reader<R: io::Read>(&self, reader: R) -> impl 
    Iterator<Item=Result<String, TokenizerError>> 
{
    io::BufReader::new(reader).lines()
        .map(|res_line|
            res_line.map(|line|
                line.chars().filter(|c| (self.filter)(c)).collect::<String>()
            )
        )
        .flat_map(|res_line|
            match res_line {
                Err(err) =>
                    vec![Err(TokenizerError::from(err))],
                Ok(line) =>
                    line.split_whitespace()
                        .map(|str| Ok(String::from(str)))
                        .collect::<Vec<Result<String, _>>>()
            }
        )
}
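To check the new contract end to end, here is a runnable sketch with a minimal stand-in for the Tokenizer struct (its filter field is assumed from the earlier posts); the caller collects the per-token Results into a single Result, stopping at the first error:

```rust
use std::io::{self, BufRead};

#[derive(Debug)]
pub enum TokenizerError {
    Io(io::Error),
}

impl From<io::Error> for TokenizerError {
    fn from(e: io::Error) -> Self {
        TokenizerError::Io(e)
    }
}

// Minimal stand-in for the Tokenizer from the earlier posts.
struct Tokenizer {
    filter: fn(&char) -> bool,
}

impl Tokenizer {
    pub fn from_buf_reader<R: io::Read>(&self, reader: R)
        -> impl Iterator<Item = Result<String, TokenizerError>>
    {
        let filter = self.filter; // fn pointer is Copy, so no borrow of self escapes
        io::BufReader::new(reader).lines()
            .map(move |res_line| res_line
                .map(|line| line.chars().filter(|c| filter(c)).collect::<String>()))
            .flat_map(|res_line| match res_line {
                Err(err) => vec![Err(TokenizerError::from(err))],
                Ok(line) => line.split_whitespace()
                    .map(|s| Ok(String::from(s)))
                    .collect::<Vec<_>>(),
            })
    }
}

fn main() {
    let tok = Tokenizer { filter: |c: &char| c.is_alphanumeric() || c.is_whitespace() };
    // &[u8] implements Read, so a string slice works as a test reader.
    let tokens: Result<Vec<String>, TokenizerError> =
        tok.from_buf_reader("hello, world!".as_bytes()).collect();
    assert_eq!(tokens.unwrap(), vec!["hello", "world"]);
}
```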

We have one more unwrap() to remove, inside the from_file() method, whose original implementation was as follows:

pub fn from_file(&self, filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    self.from_buf_reader(file)
}

Here, the call to unwrap() is not inside a closure, so we can simply take advantage of the ? syntax:

/// Read tokens from a file
pub fn from_file(&self, filename: &str)
    -> Result<impl Iterator<Item=Result<String, TokenizerError>>, TokenizerError>
{
    Ok(self.from_buf_reader(fs::File::open(filename)?))
}

Note the implicit conversion from io::Error, returned by fs::File::open(), to TokenizerError. This is possible because we provided an implementation of the From trait that covers exactly this use case:

impl From<io::Error> for TokenizerError {
    fn from(error: io::Error) -> Self { TokenizerError::Io(error) }
}

4. Simplifying Return Type

The solution we have developed so far has one big problem: the unsightly return type of from_file():

Result<impl Iterator<Item=Result<String, TokenizerError>>, TokenizerError>

It looks like there may be two different tokenizer errors, and if the caller were to tell the two TokenizerErrors apart, we’d have to expose implementation details that need not be exposed. A much better solution would be to keep the return type a plain iterator:

impl Iterator<Item=Result<String, TokenizerError>>

For that, the error returned by fs::File::open() must be repackaged in an iterator, containing a single Err element, so this iterator can be prepended to the one returned by from_buf_reader(). Something like this:

/// Read tokens from a file -- DOES NOT COMPILE
pub fn from_file(&self, filename: &str)
    -> impl Iterator<Item=Result<String, TokenizerError>>
{
    match fs::File::open(filename) {
        Ok(file) => self.from_buf_reader(file),
        Err(error) => vec![Err(TokenizerError::from(error))].into_iter()
    }
}

This would work in an OO language, like Scala, where the actual implementation would be determined at runtime. But Rust won’t compile this:

= note: expected opaque type `impl Iterator<Item = Result<String, token_with_result_v2::TokenizerError>>`
                   found struct `std::vec::IntoIter<Result<_, token_with_result_v2::TokenizerError>>`
help: you could change the return type to be a boxed trait object
   |
34 -         -> impl Iterator<Item=Result<String, TokenizerError>>
34 +         -> Box<dyn Iterator<Item=Result<String, TokenizerError>>>
   |
help: if you change the return type to expect trait objects, box the returned expressions
   |
37 ~             Ok(file) => Box::new(self.from_buf_reader(file)),
38 ~             Err(error) => Box::new(vec![Err(TokenizerError::from(error))].into_iter())
   |

The hint suggests that we could solve this problem with the familiar technique of boxing the return value. We’ve already encountered it when we implemented the recursive Stack type. The difference here is that the compiler can’t determine the actual type because it resolves at runtime to one of two different opaque types. However, I don’t want to change the return type. Rather, I’d like to solve what is likely a general problem: how to return one of several opaque types implementing Iterator.

So far, I’ve found three ways to make the Rust compiler abstract over opaque types implementing an iterator.

4.1. Abstracting Over Opaque Types with enums

Enums are sum types, which unite arbitrary types in a single type. To unite the two different iterator types, we create a new enum TokenizerIter, whose variants wrap the iterators returned by the two arms of the match statement above:

pub enum TokenizerIter<I1,I2> {
    Iter1(I1),
    Iter2(I2),
}

In order to use TokenizerIter in place of impl Iterator<Item=Result<String, TokenizerError>>, it needs to implement Iterator with the same item type:

impl<I1: Iterator<Item=Result<String,TokenizerError>>, I2: Iterator<Item=Result<String,TokenizerError>>>
Iterator for TokenizerIter<I1, I2> {
    type Item = Result<String, TokenizerError>;
    fn next(&mut self) -> Option<Result<String, TokenizerError>> {
        match self {
            Self::Iter1(iter1) => iter1.next(),
            Self::Iter2(iter2) => iter2.next(),
        }
    }
}
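To see the enum do its job, here is a self-contained version of the technique, generalized over the item type (the numbers function is a toy stand-in for from_file, choosing between two opaque iterator types at runtime):

```rust
pub enum TokenizerIter<I1, I2> {
    Iter1(I1),
    Iter2(I2),
}

// Generalized over the item type; the post's version fixes it to
// Result<String, TokenizerError>.
impl<T, I1: Iterator<Item = T>, I2: Iterator<Item = T>> Iterator for TokenizerIter<I1, I2> {
    type Item = T;
    fn next(&mut self) -> Option<T> {
        match self {
            Self::Iter1(iter1) => iter1.next(),
            Self::Iter2(iter2) => iter2.next(),
        }
    }
}

// Toy stand-in for from_file: the two arms produce different opaque
// iterator types, unified under the single concrete type TokenizerIter<_, _>.
fn numbers(double: bool) -> impl Iterator<Item = i32> {
    if double {
        TokenizerIter::Iter1((1..=3).map(|n| n * 2))
    } else {
        TokenizerIter::Iter2(vec![7, 8].into_iter())
    }
}

fn main() {
    assert_eq!(numbers(true).collect::<Vec<_>>(), vec![2, 4, 6]);
    assert_eq!(numbers(false).collect::<Vec<_>>(), vec![7, 8]);
}
```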

This solution works for our example but is, generally speaking, problematic. Although the Iterator trait has only one required method, next(), there are many methods whose default implementations are inefficient and are expected to be overridden. For example, the iterator over a Vec overrides nth(), last(), and others. Whenever we implement Iterator for a custom type, we typically need to worry about those methods too.

4.2. Abstracting Over Opaque Types with Either

It turns out that all that extra work has already been done. The Either enum, available from the either crate, already implements Iterator if both of its constituent types are iterators with the same item type.

pub enum Either<L, R> {
    /// A value of type `L`.
    Left(L),
    /// A value of type `R`.
    Right(R),
}

We can just use it without worrying about implementing anything ourselves:

pub fn from_file_either(&self, filename: &str)
                 -> impl Iterator<Item=Result<String, TokenizerError>>
{
    match fs::File::open(filename) {
        Ok(file) => Either::Left(self.from_buf_reader(file)),
        Err(error) => Either::Right(vec![Err(TokenizerError::from(error))].into_iter())
    }
}

Note that, since Either can be nested, I can combine Eithers to create more branches. For example, to get three branches x, y, and z, I could use Left(x), Right(Left(y)), and Right(Right(z)).

4.3. Chaining Optional Iterators

The solution in the previous section is pretty slick. The only problem is that it brings in a transitive dependency. It turns out we don’t even need an explicit sum type to unify the two different opaque implementors of Iterator. We can let the compiler do that for us as well:

pub fn from_file_chain(&self, filename: &str)
    -> impl Iterator<Item=Result<String, TokenizerError>>
{
    let (iter1_opt, iter2_opt) =
        match fs::File::open(filename) {
            Ok(file) => (Some(self.from_buf_reader(file)), None),
            Err(error) => (None, Some(vec![Err(TokenizerError::from(error))]))
        };
    iter1_opt.into_iter().flatten().chain(iter2_opt.into_iter().flatten())
}

Here, I’ve delegated the job of Either to two calls to flatten(), one of which will produce an empty iterator and the other the iterator to be returned by the function, while chain() will stitch them together.
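The same trick in miniature, detached from the tokenizer (choose is a toy stand-in for from_file_chain):

```rust
// Unify two opaque iterator types without any enum: put each in an Option,
// flatten (a None flattens to an empty iterator), then chain the halves.
fn choose(double: bool) -> impl Iterator<Item = i32> {
    let (iter1_opt, iter2_opt) = if double {
        (Some((1..=3).map(|n| n * 2)), None)
    } else {
        (None, Some(vec![7, 8]))
    };
    iter1_opt.into_iter().flatten().chain(iter2_opt.into_iter().flatten())
}

fn main() {
    assert_eq!(choose(true).collect::<Vec<_>>(), vec![2, 4, 6]);
    assert_eq!(choose(false).collect::<Vec<_>>(), vec![7, 8]);
}
```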
