Press "Enter" to skip to content

Rust Dust: 3. Tokenizer Library

Source.

Let’s organize some exercises into a library, with the goal of learning how to build one. We start with a stream tokenizer, which I’ll develop over subsequent posts. For the first go, we’ll learn how to build a library (as opposed to a binary executable) and develop a very basic tokenizer.

1. How to Build a Simple Rust Library

Note that this section applies to Rust 2018 or later.

A Rust library’s entry point is lib.rs at the top level of the src hierarchy. Just as with main.rs in a binary crate, lib.rs declares modules with the mod declaration, optionally with the pub qualifier if the module is to be visible to the library’s users. And just as in a binary crate, the identically named file must be found at the top level. Our lib.rs declares pub mod io, and the contents of the io module live in io.rs.
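Concretely, the layout looks like this (the hello function is just a made-up placeholder to show where module contents go):

// src/lib.rs
pub mod io;    // module body lives in src/io.rs

// src/io.rs
pub fn hello() {    // hypothetical placeholder item
    println!("hello from the io module");
}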

This is the simplest possible way to define a Rust library, relying entirely on rustc’s defaults. In general, Rust’s module system is complex and flexible, supporting a complete decoupling of the logical structure as seen by the library’s users from the physical source file organization inside the crate.
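For example, one common way to exploit this decoupling is to keep a module private and re-export selected items at the crate root (a sketch; the internal module name is made up):

// src/lib.rs — the public API need not mirror the file tree
mod internal;                 // implementation lives in src/internal.rs
pub use internal::Tokenizer;  // users see it as rust_dust_lib::Tokenizer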

Local projects can now use this library by adding it as a local dependency in their Cargo.toml files:

rust-dust-lib = { path = "path/to/lib/project" }

where path/to/lib/project is the path to the library project’s top-level directory containing its Cargo.toml file. It may be absolute or relative to the client project’s Cargo.toml file.
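With the dependency in place, client code refers to the library by its crate name, with dashes mapped to underscores. A usage sketch, assuming the Tokenizer type we build later in this post and a placeholder input file:

// main.rs in the client crate
use rust_dust_lib::io::Tokenizer;

fn main() {
    let tokenizer = Tokenizer::new();
    for token in tokenizer.from_file("input.txt") {
        println!("{token}");
    }
}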

2. File Tokenizer

Source.

Problem: iterate over individual words in a file without loading the entire file into memory.

The initial intuition is to compose two standard library methods: BufReader::lines(), which returns the file’s contents as an iterator of io::Result<String> lines, followed by str::split_whitespace(), which returns an iterator over sub-slices of the current string:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace())
}

This fails with

error[E0271]: expected `FlatMap<Map<Lines<BufReader<File>>, {closure@read.rs:7:14}>, SplitWhitespace<'_>, {closure@read.rs:8:19}>` to be an iterator that yields `String`, but it yields `&str`
 --> src/read.rs:4:35
  |
4 | fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
  |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `String`, found `&str`

The reason for the error is that we’re trying to flatten a borrowing iterator of &str, returned by str::split_whitespace(), into an owning iterator of String, returned by our function. Changing the function’s return type to Iterator<Item=&str> is a bad idea, because each line is a String owned by this function, so we can’t return references into it. Instead, we need to convert the string slices returned by split_whitespace() into Strings owned by the iterator we’re building to return:

fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}

Here’s why this works:

  • line.split_whitespace() returns an iterator over the tokens in a single line as &str slices;
  • String::from copies each of them into a String owned by the enclosing mapping closure;
  • collect() gathers them into the Vec<String> returned by the closure passed to flat_map();
  • flat_map() implicitly calls into_iter() on that Vec<String>, turning it into an owning iterator whose items are flattened into the iterator we return (see the standalone sketch below).
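The same pattern works on any sequence of lines, independent of file I/O; here is a minimal standalone sketch:

fn main() {
    // flat_map() accepts anything implementing IntoIterator, so the
    // Vec<String> built in the closure is flattened in place.
    let words: Vec<String> = ["alpha beta", "gamma"]
        .iter()
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
        .collect();
    assert_eq!(words, ["alpha", "beta", "gamma"]);
}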

3. Abstracting over Source

Another improvement we want to make is to expose a more general interface, abstracting over the input source. A good candidate for the abstraction is the trait std::io::Read. Any Read implementor can be wrapped in a BufReader, which provides an efficient way of reading a stream of bytes line by line, without materializing the entire stream in memory:

/// Read tokens from any `Read`er
pub fn from_buf_reader<R: Read>(reader: R) -> impl Iterator<Item=String> {
    BufReader::new(reader).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}
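Since &[u8] implements Read, the same function now tokenizes in-memory buffers too, which is handy for tests. A sketch, assuming from_buf_reader above is in scope:

fn main() {
    // A byte slice is a perfectly valid `Read` source.
    let tokens: Vec<String> = from_buf_reader("one two\nthree".as_bytes()).collect();
    assert_eq!(tokens, ["one", "two", "three"]);
}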

4. Character Pre-Filtering

So far, our tokenizer does not care what characters make up the tokens. However, many use cases call for character filtering prior to tokenization. The natural way to do this is to let the client code provide a custom filtering function fn(&char) -> bool:

struct Tokenizer {
    filter: fn(&char) -> bool,
}

impl Tokenizer {

    fn default_filter(_: &char) -> bool { true }

    pub fn new() -> Tokenizer {
        Tokenizer { filter: Self::default_filter }
    }

    pub fn new_with_filter(filter: fn(&char) -> bool) -> Tokenizer {
        Tokenizer { filter }
    }
    ...
    pub fn from_buf_reader<R: Read>(&self, reader: R) -> impl Iterator<Item=String> + use<'_, R> {
        BufReader::new(reader).lines()
            .map(|res| res.unwrap())
            .map(|str| str.chars().filter(|c| (self.filter)(c)).collect::<String>())
            .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
    }
}

The type fn is a function pointer whose size is known at compile time. Only named functions and non-capturing closures have this type: a capturing closure carries its captured environment with it, so it cannot coerce to a plain function pointer. fn is a perfectly usable approach in an interface that does not cross an abstraction boundary. That is not our case: we’re designing a library to be used by others, so we don’t want to disallow the use of capturing closures.
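A quick sketch of the distinction (is_alpha and banned are made-up names):

fn is_alpha(c: &char) -> bool { c.is_alphabetic() }

fn main() {
    // A named function and a non-capturing closure both coerce to fn:
    let f: fn(&char) -> bool = is_alpha;
    let g: fn(&char) -> bool = |c| c.is_ascii();

    // A capturing closure cannot coerce to fn; it carries its environment,
    // so it must go behind a trait object instead:
    let banned = '#';
    // let h: fn(&char) -> bool = move |c| *c != banned; // does not compile
    let h: Box<dyn Fn(&char) -> bool> = Box::new(move |c| *c != banned);
    assert!(f(&'a') && g(&'a') && h(&'a'));
}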

The general function type that covers both named and anonymous functions is defined by the trait std::ops::Fn. With it, we can use a trait object that dispatches the actual function call dynamically:

type Filter = dyn Fn(&char) -> bool;

Trait objects are unsized, so they must be boxed:

pub struct Tokenizer {
    filter:  Box<Filter>,
}

We can now use a closure to define the default filter:

pub fn new() -> Tokenizer {
    Self { filter: Box::new(|_| true) }
}

Finally, the complete Tokenizer listing:

impl Tokenizer {

    /// New Tokenizer instance with no filtering.
    pub fn new() -> Tokenizer {
         Self { filter: Box::new(|_| true) }
    }

    /// New Tokenizer instance with custom filtering
    pub fn new_with_filter<F>(filter: F) -> Tokenizer
    where F: Fn(&char) -> bool + 'static {
        Self { filter: Box::new(filter)}
    }

    /// Read tokens from a file
    pub fn from_file<'a>(&'a self, filename: &str) -> impl Iterator<Item=String> + use<'a> {
        let file = File::open(filename).unwrap();
        self.from_buf_reader(file)
    }

    /// Read tokens from a reader
    pub fn from_buf_reader<R: Read>(&self, reader: R) -> impl Iterator<Item=String> + use<'_, R> {
        BufReader::new(reader).lines()
            .map(|res| res.unwrap())
            .map(|str| str.chars().filter(|c| (self.filter)(c)).collect::<String>())
            .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
    }
}
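A quick usage sketch with a custom filter (note that the filter must keep whitespace, since filtering happens before split_whitespace(); input.txt is a placeholder):

fn main() {
    // Keep letters and whitespace; drop digits, punctuation, etc.
    let tokenizer = Tokenizer::new_with_filter(|c| c.is_alphabetic() || c.is_whitespace());
    for token in tokenizer.from_file("input.txt") {
        println!("{token}");
    }
}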
