Rust - Chunks, thread spawn Example


Chunks, thread spawn. Suppose in Rust we have a directory full of many files, and we want to divide them into groups (chunks) and process them on threads. This can be done with thread::spawn.

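When dividing the objects into chunks, each thread must own the data it uses. The closure passed to thread::spawn must be 'static, so it cannot hold references to local variables in main. Giving each thread owned copies avoids the resulting borrow checker errors.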
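As a minimal sketch of this ownership rule, the function below clones borrowed data once per thread and moves each clone into a closure. The name sums_per_thread and its parameters are hypothetical and are not part of the example program.

```rust
use std::thread;

// Hypothetical helper: each thread receives its own copy of `shared` and
// returns the sum of that copy plus its thread id.
fn sums_per_thread(shared: &[u32], thread_count: u32) -> Vec<u32> {
    let mut handles = vec![];
    for id in 0..thread_count {
        // Clone the borrowed slice so the closure owns everything it uses.
        let owned: Vec<u32> = shared.to_owned();
        // The move keyword transfers `owned` and `id` into the new thread.
        handles.push(thread::spawn(move || owned.iter().sum::<u32>() + id));
    }
    // Join in spawn order so the result order is deterministic.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let shared = vec![10, 20, 30];
    // Prints [60, 61, 62]
    println!("{:?}", sums_per_thread(&shared, 3));
}
```

Removing the to_owned() call and capturing shared directly should make the compiler reject the closure, because the borrowed slice is not guaranteed to outlive the threads.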

Example program. Please change the directory path to a folder full of many files. These files will be processed in parallel. We place each file name in a Page object.

Start We must give each thread its own copy of the global data, because a spawned thread cannot borrow from main. Calling to_owned() clones the data, and the clone is then moved into the thread.

Next We call chunks() to divide the pages vector into groups to place on threads. Each chunk is a borrowed slice, so we call to_owned() to copy it into a new Vec.

Finally In the thread::spawn call, we process the files in the group sequentially on that thread. Each thread can also read its own copy of the global data.

use std::{fs, io};
use std::thread;

#[derive(Clone)]
struct Page {
    file_name: String,
    index: usize,
}

struct PageGroup {
    pages: Vec<Page>,
    global: Vec<u8>,
    id: usize
}

fn main() -> io::Result<()> {

    // Get all files in directory.
    let mut pages = vec![];
    for entry in fs::read_dir("/Users/sam/perls/t/")? {
        let path = entry?.path();
        let path_str = path.to_str().unwrap();
        // Add to vector of objects.
        pages.push(Page{file_name: path_str.to_owned(), index: pages.len()});
    }

    // Global data.
    let mut global = vec![];
    global.push(100);

    // Divide the page objects into at most 8 groups.
    let processors = 8;
    // Add 1 so chunk_len is never 0, which would make chunks() panic.
    let chunk_len = (pages.len() / processors) + 1;
    let mut groups = vec![];
    for chunk in pages.chunks(chunk_len) {
        // Add a new group for this chunk.
        // Make sure to own the chunk and the global data.
        groups.push(PageGroup{pages: chunk.to_owned(), global: global.to_owned(), id: groups.len()});
    }

    // Number of groups of pages.
    println!("GROUPS: {}", groups.len());

    // Place threads in this vector.
    let mut children = vec![];

    // Loop over groups.
    for group in groups {
        // Add spawned thread to children vector.
        children.push(thread::spawn(move || {

            // On each group, get the global data and id.
            let global_data = group.global;
            println!("{} GLOBAL LENGTH: {}, DATA: {}", group.id, global_data.len(), global_data[0]);

            // Process pages in group.
            for page in group.pages {
                println!("{} PAGE: {} {}", group.id, page.index, page.file_name);
            }
        }));
    }

    // Join all threads.
    for child in children {
        let _ = child.join();
    }
    Ok(())
}

GROUPS: 8
0 GLOBAL LENGTH: 1, DATA: 100
0 PAGE: 0 /Users/sam/perls/t/suffixarray-go
0 PAGE: 1 /Users/sam/perls/t/string-constructor
...
3 PAGE: 609 /Users/sam/perls/t/for-scala
7 GLOBAL LENGTH: 1, DATA: 100
7 PAGE: 1414 /Users/sam/perls/t/encapsulate-field
...
4 PAGE: 1008 /Users/sam/perls/t/unsafe
4 PAGE: 1009 /Users/sam/perls/t/response-writefile
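
The chunk length used in the program, (len / processors) + 1, can be checked in isolation. The sketch below uses a hypothetical helper named chunk_sizes that applies the same formula and reports how many items land in each chunk. It suggests that the number of chunks can be smaller than the requested number of groups for some input sizes.

```rust
// Hypothetical helper: applies the program's chunk length formula and
// returns the number of items in each resulting chunk.
fn chunk_sizes(item_count: usize, groups: usize) -> Vec<usize> {
    let items: Vec<usize> = (0..item_count).collect();
    // Same formula as the example program; the + 1 keeps the length above 0.
    let chunk_len = (item_count / groups) + 1;
    items.chunks(chunk_len).map(|chunk| chunk.len()).collect()
}

fn main() {
    // Prints [3, 3, 3, 1]
    println!("{:?}", chunk_sizes(10, 4));
    // Prints [3, 3, 3, 3, 3, 1] (6 chunks, not 8)
    println!("{:?}", chunk_sizes(16, 8));
}
```

The last chunk holds whatever remains, so it may be shorter than the others.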

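Consider the Rust error "borrowed value does not live long enough." It can appear when a spawned thread captures a reference to data that may be dropped before the thread finishes. We can fix this by giving the thread an owned copy of the data, created with to_owned().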

Tip Try calling to_owned() on borrowed slices and str references before moving them into the thread. This often fixes the error with few other changes.

Also Use String instead of a str reference to make ownership simpler. Copying some data with to_owned() may be needed.
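
A short sketch of the String case: the hypothetical function shout_on_thread receives a borrowed str, copies it into a String, and moves that String into the thread. The function name and its behavior are illustrative only.

```rust
use std::thread;

// Hypothetical helper: uppercases a name on a separate thread.
fn shout_on_thread(name: &str) -> String {
    // `name` is borrowed and not 'static, so the thread cannot capture it.
    // to_owned() creates a String that the closure can own.
    let owned: String = name.to_owned();
    thread::spawn(move || owned.to_uppercase())
        .join()
        .unwrap()
}

fn main() {
    let file_name = String::from("page");
    // Prints PAGE
    println!("{}", shout_on_thread(&file_name));
}
```

The copy costs one allocation, which is usually small next to the work done on the thread.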

Clone trait. Consider the Page struct in the program. It needs a "derive Clone" attribute because calling to_owned() on a chunk (a slice of Page values) clones each element into a new vector, and that requires Page to implement Clone.

Tip To call to_owned() on a slice of structs, derive Clone on the struct. The derive attribute generates the implementation, so no other code is needed.
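
The following sketch shows the derive in use. The Item struct and the own_items function are hypothetical stand-ins for Page and the chunk copy in the program; Debug and PartialEq are derived here only so the values can be printed and compared.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Item {
    name: String,
}

// Hypothetical helper: copies a borrowed slice of Item values into a new Vec.
fn own_items(items: &[Item]) -> Vec<Item> {
    // to_owned() on a slice clones each element, so Item must implement Clone.
    items.to_owned()
}

fn main() {
    let items = vec![
        Item { name: "a".to_owned() },
        Item { name: "b".to_owned() },
    ];
    // Copy only the first element; the original vector is left unchanged.
    let copy = own_items(&items[..1]);
    // Prints [Item { name: "a" }]
    println!("{:?}", copy);
}
```

Without Clone in the derive list, the to_owned() call on the slice should fail to compile.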

Summary. By using to_owned(), we give each thread its own copy of the data it needs, so no thread borrows from another. We can chunk files together in groups and process each group on its own thread in Rust programs.