Let’s organize some exercises into a library, with the goal of learning how to build one. We start with a stream tokenizer, which I’ll develop over subsequent posts. For the first go, we’ll learn how to build a library (as opposed to a binary executable) and develop a very basic tokenizer.
1. How to Build a Simple Rust Library
Note that this section applies to Rust 2018 or later.
A Rust library’s entry point is `lib.rs` at the top level of the `src` hierarchy. Just as is the case with `main.rs` for binary distributions, `lib.rs` declares modules with the `mod` declaration, optionally with the `pub` qualifier if the module is to be visible to the library’s users. And just like in a binary distribution, the like-named file must be found at the top level. Our `lib.rs` declares `pub mod io`, and the contents of the `io` module live in `io.rs`.
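Concretely, the crate root is a one-liner (a minimal sketch matching the layout just described):

```rust
// src/lib.rs: the library's crate root
pub mod io;   // rustc finds the module body in src/io.rs
```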
This is the simplest possible way to define a Rust library, relying entirely on `rustc`’s defaults. In general, Rust’s module system is complex and flexible, supporting a complete decoupling of the logical structure seen by the library’s users from the physical organization of source files inside the crate.
Local projects can now use this library by adding it as a local dependency in their `Cargo.toml` files:

```toml
[dependencies]
rust-dust-lib = { path = "path/to/lib/project" }
```

where `path/to/lib/project` is the path to the library project’s top-level directory containing its `Cargo.toml` file. It may be absolute or relative to the client project’s `Cargo.toml` file.
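Once the dependency is declared, client code refers to the library by its crate name, with the hyphens mapped to underscores (a sketch, using the `io` module and the `Tokenizer` type developed below):

```rust
use rust_dust_lib::io::Tokenizer;
```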
2. File Tokenizer
Problem: iterate over the individual words in a file without loading the entire file into memory.
The initial intuition is to compose two standard library methods: `BufReader::lines()`, which returns the file’s contents as an iterator of `String` lines (each wrapped in an `io::Result`), followed by `str::split_whitespace()`, which returns an iterator over the current string’s sub-slices:
```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_tokens(filename: &str) -> impl Iterator<Item = String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace())
}
```
This fails with:

```text
error[E0271]: expected `FlatMap<Map<Lines<BufReader<File>>, {closure@read.rs:7:14}>, SplitWhitespace<'_>, {closure@read.rs:8:19}>` to be an iterator that yields `String`, but it yields `&str`
 --> src/read.rs:4:35
  |
4 | fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
  |                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `String`, found `&str`
```
The reason for the error is that we’re trying to flatten a borrowing iterator of `&str`, returned by `split_whitespace()`, into an owning iterator of `String`, returned by our function. Changing the function’s return type to `Iterator<Item = &str>` is a bad idea, because each line is a `String` owned by this function, so we can’t return references into it. Rather, we need to convert the string slices returned by `split_whitespace()` into `String`s owned by the iterator we’re building to return:
```rust
fn read_tokens(filename: &str) -> impl Iterator<Item = String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}
```
Here’s why this works:

- `line.split_whitespace()` returns an iterator over the tokens in a single line as `&str` slices;
- `String::from` copies each of them into a `String` owned by the enclosing mapping closure;
- `collect()` gathers them into the `Vec<String>` returned by the closure passed to `flat_map()`;
- the Rust compiler implicitly calls `into_iter()` on that `Vec<String>` inside `flat_map()` to turn it into an iterator whose items can be flattened into the iterator our function returns.
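A quick usage sketch (assuming a hypothetical `input.txt` in the working directory):

```rust
fn main() {
    // Prints one whitespace-separated token per output line.
    for token in read_tokens("input.txt") {
        println!("{token}");
    }
}
```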
3. Abstracting over Source
Another improvement we want to make is to expose a more general interface, abstracting over the input source. A good candidate for the abstraction is the trait `std::io::Read`. Any `Read`er can be wrapped in a `BufReader`, which provides an efficient way of reading a stream of bytes line by line, without materializing the entire stream in memory:
```rust
use std::io::{BufRead, BufReader, Read};

/// Read tokens from any `Read`er
pub fn from_buf_reader<R: Read>(reader: R) -> impl Iterator<Item = String> {
    BufReader::new(reader).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}
```
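Because anything implementing `Read` now works, we can, for example, tokenize an in-memory buffer in a test without touching the file system (a sketch; `&[u8]` implements `Read`):

```rust
#[test]
fn tokenizes_in_memory_input() {
    let tokens: Vec<String> = from_buf_reader("hello   tokenizer\nworld".as_bytes()).collect();
    assert_eq!(tokens, vec!["hello", "tokenizer", "world"]);
}
```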
4. Character Pre-Filtering
So far, our tokenizer does not care what characters make up the tokens. However, many use cases call for character filtering prior to tokenization. The natural way to do this is to let the client code provide a custom filtering function `fn(&char) -> bool`:
```rust
struct Tokenizer {
    filter: fn(&char) -> bool,
}

impl Tokenizer {
    fn default_filter(_: &char) -> bool { true }

    pub fn new() -> Tokenizer {
        Tokenizer { filter: Self::default_filter }
    }

    pub fn new_with_filter(filter: fn(&char) -> bool) -> Tokenizer {
        Tokenizer { filter }
    }

    // ...

    pub fn from_buf_reader<R: Read>(&self, reader: R) -> impl Iterator<Item = String> + use<'_, R> {
        BufReader::new(reader).lines()
            .map(|res| res.unwrap())
            .map(|line| line.chars().filter(|c| (self.filter)(c)).collect::<String>())
            .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
    }
}
```

Note the `use<'_, R>` bound: the closure passed to the filtering `map()` now borrows `self`, so the returned iterator’s opaque type must capture the lifetime of `&self` (precise capturing syntax, available since Rust 1.82; on edition 2024 in-scope lifetimes are captured by default).
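With a plain `fn` pointer, a named function works fine as a filter (a usage sketch with a hypothetical `letters_and_spaces` function; note that a useful filter must keep whitespace, or token boundaries disappear before `split_whitespace()` runs):

```rust
fn letters_and_spaces(c: &char) -> bool {
    c.is_alphabetic() || c.is_whitespace()
}

let tokenizer = Tokenizer::new_with_filter(letters_and_spaces);
```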
The type `fn` is a function pointer whose size is known at compile time. Named functions (and closures that capture nothing) coerce to it; a closure that captures variables from its enclosing scope has its own anonymous type and does not. A plain `fn` is a perfectly usable approach in an interface that does not cross an abstraction boundary. But that is not our case: we’re designing a library to be used by others, so we don’t want to disallow the use of closures.
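For example (a sketch), the first coercion below compiles, while the second does not:

```rust
let threshold = 'm';
let ok: fn(&char) -> bool = |c| c.is_alphabetic();   // captures nothing: coerces to `fn`
// let bad: fn(&char) -> bool = |c| *c > threshold;  // error[E0308]: captures `threshold`,
//                                                   // so it cannot coerce to `fn`
```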
The general function type that covers both named and anonymous functions is described by the trait `std::ops::Fn`. With it, we can use a trait object that dispatches the actual function call dynamically:
```rust
type Filter = dyn Fn(&char) -> bool;
```
Trait objects are unsized, so they must live behind a pointer, here a `Box`:
```rust
pub struct Tokenizer {
    filter: Box<Filter>,
}
```
We can now use a closure to define the default filter:
```rust
pub fn new() -> Tokenizer {
    Self { filter: Box::new(|_| true) }
}
```
Finally, the complete `Tokenizer` listing:
```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Read};

type Filter = dyn Fn(&char) -> bool;

pub struct Tokenizer {
    filter: Box<Filter>,
}

impl Tokenizer {
    /// New Tokenizer instance with no filtering.
    pub fn new() -> Tokenizer {
        Self { filter: Box::new(|_| true) }
    }

    /// New Tokenizer instance with custom filtering.
    pub fn new_with_filter<F>(filter: F) -> Tokenizer
    where F: Fn(&char) -> bool + 'static {
        Self { filter: Box::new(filter) }
    }

    /// Read tokens from a file; the returned iterator borrows `self`.
    pub fn from_file<'a>(&'a self, filename: &str) -> impl Iterator<Item = String> + use<'a> {
        let file = File::open(filename).unwrap();
        self.from_buf_reader(file)
    }

    /// Read tokens from any `Read`er; the returned iterator borrows `self`.
    pub fn from_buf_reader<R: Read>(&self, reader: R) -> impl Iterator<Item = String> + use<'_, R> {
        BufReader::new(reader).lines()
            .map(|res| res.unwrap())
            .map(|line| line.chars().filter(|c| (self.filter)(c)).collect::<String>())
            .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
    }
}
```
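A closing usage sketch, with a hypothetical input file; since the filter now accepts any closure, we can pass one inline:

```rust
fn main() {
    // Keep letters and whitespace: filtering runs before tokenization,
    // so dropping whitespace would merge all tokens on a line.
    let tokenizer = Tokenizer::new_with_filter(|c| c.is_alphabetic() || c.is_whitespace());
    for token in tokenizer.from_file("input.txt") {
        println!("{token}");
    }
}
```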