Auto-vectorization

In auto-vectorization, a compiler uses SIMD instructions to process data more efficiently. When certain code patterns are recognized, these instructions can be used.

For Rust programs, we can modify code to allow hot loops to be vectorized. This can lead to significant speedups in important functions.

Example

This program creates a vector of many elements, and then in each version, it loops over the vector and counts elements with the value 999.

Version 1 In this version, we use a simple for-loop and test each element for 999 with an if-statement.

Version 2 Here we separate the task into groups of 16 elements at a time. We then can test each 16-byte slice with a SIMD instruction.

Important The vectorized version tests each 16-byte slice with a SIMD instruction, but it then does another pass to count the elements.

Finally The elements at the end of the vector (which are 15 or fewer) are tested in a separate loop.

Result The vectorized version, when compiled at opt-level 3 on a modern Intel chip (released in 2022) is 5-6 times faster.

fn main() {
    let mut x = vec![];
    for i in 0..100033 {
        x.push(0);
        // Insert 999 every 2000 items.
        if i % 2000 == 0 {
            x.push(999);
        }
    }
    // Make last item be 999 as well to test.
    let last_index = x.len() - 1;
    x[last_index] = 999;

    // Version 1: use simple for-loop to count 999 elements.
    let t = std::time::Instant::now();
    let mut count = 0;
    for _ in 0..500 {
        for &i in x.iter() {
            if i == 999 {
                count += 1;
            }
        }
    }
    println!("Count: {count}, {:?}", t.elapsed());

    // Version 2: divide the search into units of 16 to allow vectorization.
    let t = std::time::Instant::now();
    let mut count = 0;
    for _ in 0..500 {
        const STEP: usize = 16;
        let parts = x.len() / STEP;
        let mut start = 0;
        // Iterate overall all parts of 16 that can be searched.
        for _ in 0..parts {
            // Take a slice of 16 elements.
            let slice_here = &x[start..start + STEP];

            // Initial loop that can be vectorized (do not exit early, avoid data dependences).
            // Should see the vpcmpeqd instruction.
            let mut found = false;
            for &i in slice_here {
                if i == 999 {
                    found = true;
                }
            }

            // Do a slow counting loop if a value was found.
            if found {
                for &i in slice_here {
                    if i == 999 {
                        count += 1;
                    }
                }
            }

            // Skip past 16 elements.
            start += STEP;
        }

        // Slow count over last elements (at most 15).
        for &i in x.iter().skip(start) {
            if i == 999 {
                count += 1;
            }
        }
    }
    println!("Count: {count}, {:?}", t.elapsed());
}Count: 26000, 12.043ms
Count: 26000, 2.3254ms
opt-level = 3

SIMD notes

The instruction vpcmpeqd is used to search for a value in 16 elements. This instruction is why the massive speedup occurs.

Warning If any data dependencies (even incrementing a variable) are present in the loop, the vectorization may not be applied by the compiler.

Important It is important to always loop over a slice of a fixed number of elements (like 16), and not exit the loop early with break or return.

Sometimes, compilers can apply auto-vectorization to code without any refactoring needed. But in many cases, dividing up a loop into segments can facilitate this optimization.