Problem: iterate over the individual words in a file without loading the entire file into memory.
The initial intuition is to compose two standard library methods: BufReader::lines(), which yields the file’s contents line by line as Strings (each wrapped in an io::Result), and str::split_whitespace(), which returns an iterator of string sub-slices:
use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace())
}
This fails with
error[E0271]: expected `FlatMap<Map<Lines<BufReader<File>>, {closure@read.rs:7:14}>, SplitWhitespace<'_>, {closure@read.rs:8:19}>` to be an iterator that yields `String`, but it yields `&str`
--> src/read.rs:4:35
|
4 | fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `String`, found `&str`
The error, as I understand it, is saying that we’re trying to flatten a borrowing iterator of &str into an owning iterator of String. Changing the function’s return type to an iterator of &str is a bad idea, because each line is a String owned inside this function, so we can’t return references into it. Instead, we need to convert the string slices returned by split_whitespace() into Strings owned by the iterator we’re building to return:
fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
}
Here’s why this works:
- line.split_whitespace() returns an iterator over the tokens of a single line as &str slices;
- String::from copies each of them into a String owned by the closure passed to flat_map();
- collect() gathers those Strings into the Vec<String> that the closure returns;
- flat_map() implicitly calls IntoIterator::into_iter() on each returned Vec<String> and flattens the resulting items into the single iterator our function returns (see the sketch below).
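To see the flat_map()/IntoIterator interaction on its own, here’s a minimal, self-contained sketch of the same mechanism; the input strings are invented for illustration:

fn main() {
    // Stand-in for the lines produced by BufReader::lines(): owned Strings.
    let lines = vec!["le ciel est".to_string(), "par-dessus le toit".to_string()];
    let tokens: Vec<String> = lines
        .into_iter()
        // The closure returns a Vec<String>; flat_map() accepts it because
        // Vec implements IntoIterator, and flattens its items into one stream.
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
        .collect();
    assert_eq!(tokens, ["le", "ciel", "est", "par-dessus", "le", "toit"]);
}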
Note, however, that some of the tokens contain punctuation — likely not what a caller would want. In the final version below we add a filter to each token that removes non-alphanumeric chars.
use std::fs::File;
use std::io::{BufRead, BufReader};

fn read_tokens(filename: &str) -> impl Iterator<Item=String> {
    let file = File::open(filename).unwrap();
    BufReader::new(file).lines()
        .map(|res| res.unwrap())
        .flat_map(|line| line.split_whitespace().map(String::from).collect::<Vec<String>>())
        .map(|token| token.chars().filter(|c| c.is_alphanumeric()).collect::<String>())
}
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_read_tokens() {
        for token in read_tokens("./verlaine.txt") {
            println!("{}", token);
        }
    }
}
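As a usage sketch (the file path here is a made-up placeholder), the returned impl Iterator<Item=String> composes with the usual iterator adapters, for example to count distinct tokens:

use std::collections::HashSet;

fn main() {
    // "poem.txt" is a placeholder; point it at any text file.
    let distinct: HashSet<String> = read_tokens("poem.txt").collect();
    println!("{} distinct tokens", distinct.len());
}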