In auto-vectorization, a compiler uses SIMD instructions to process data more efficiently. When certain code patterns are recognized, these instructions can be used.
For Rust programs, we can modify code to allow hot loops to be vectorized. This can lead to significant speedups in important functions.
This program creates a vector of many elements, and then in each version, it loops over the vector and counts elements with the value 999.
for
-loop and test each element for 999 with an if
-statement.fn main() { let mut x = vec![]; for i in 0..100033 { x.push(0); // Insert 999 every 2000 items. if i % 2000 == 0 { x.push(999); } } // Make last item be 999 as well to test. let last_index = x.len() - 1; x[last_index] = 999; // Version 1: use simple for-loop to count 999 elements. let t = std::time::Instant::now(); let mut count = 0; for _ in 0..500 { for &i in x.iter() { if i == 999 { count += 1; } } } println!("Count: {count}, {:?}", t.elapsed()); // Version 2: divide the search into units of 16 to allow vectorization. let t = std::time::Instant::now(); let mut count = 0; for _ in 0..500 { const STEP: usize = 16; let parts = x.len() / STEP; let mut start = 0; // Iterate overall all parts of 16 that can be searched. for _ in 0..parts { // Take a slice of 16 elements. let slice_here = &x[start..start + STEP]; // Initial loop that can be vectorized (do not exit early, avoid data dependences). // Should see the vpcmpeqd instruction. let mut found = false; for &i in slice_here { if i == 999 { found = true; } } // Do a slow counting loop if a value was found. if found { for &i in slice_here { if i == 999 { count += 1; } } } // Skip past 16 elements. start += STEP; } // Slow count over last elements (at most 15). for &i in x.iter().skip(start) { if i == 999 { count += 1; } } } println!("Count: {count}, {:?}", t.elapsed()); }Count: 26000, 12.043ms Count: 26000, 2.3254msopt-level = 3
The instruction vpcmpeqd
is used to search for a value in 16 elements. This instruction is why the massive speedup occurs.
break
or return.Sometimes, compilers can apply auto-vectorization to code without any refactoring needed. But in many cases, dividing up a loop into segments can facilitate this optimization.