Dot Net Perls

Split String Improvement - C#

by Sam Allen

Problem

Correct a common string Split mistake when parsing text. Observe examples and benchmarks of the mistake, see how it can be fixed, and note the resulting increase in performance. Discuss the benefits and drawbacks of each approach.

Solution: C#

Improving the clarity and performance of splitting strings on delimiters is important because the operation is so common. Here we look closely at 'nesting' or compounding Split calls to parse text, and at the drawbacks of doing so. [C# - Split String Examples - dotnetperls.com]

Precise goals

For the example, we have lines in a file that have more than one delimiter. We have a collection of paired keys and values. The pairs themselves are separated by commas, and the two items in each pair are separated by a colon (:).

Figure A: Split once

As I found out, the fastest approach is to reduce the number of Split calls. This version splits the line into all of its 'tokens' at once, producing a single array. Here's how we can do that.

string[] logLines = new string[]
{
    "something:1,something:2,more:3,bloviate:5,alpaca:65,spaniels:3",
    "elementary:4,string:4,miserable:6,reprimands:3,eats:6,trustworthy:5"
};

foreach (string line in logLines)
{
    //
    // Split on multiple delimiters
    //
    string[] tokens = line.Split(new char[] { ':', ',' },
        StringSplitOptions.RemoveEmptyEntries);

    // Step two tokens at a time; stop early if the last token has no partner.
    for (int a = 0; a + 1 < tokens.Length; a += 2)
    {
        string s1 = tokens[a];
        string s2 = tokens[a + 1];
        // Example values:
        // s1 = 'something', s2 = '1'
        // s1 = 'elementary', s2 = '4'
    }
}
  1. Reduce string to array
    This version transforms the string of keys and values into a single array.
  2. Advance 2 places
    In the for loop, we use the a += 2 expression to advance two places each iteration. In the body of the loop, we assign the two adjacent elements (the key and its value) to s1 and s2.
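The two steps above can be sketched as a small standalone program. This is only an illustration: the ParsePairs helper and the SingleSplitExample class are hypothetical names, not part of the original example, and the choice to ignore a trailing token that has no partner is an assumption.

```csharp
using System;
using System.Collections.Generic;

static class SingleSplitExample
{
    // Hypothetical helper (not from the original article): splits once on
    // both delimiters, then pairs up adjacent tokens.
    public static List<KeyValuePair<string, string>> ParsePairs(string line)
    {
        // Step 1: reduce the whole line to a single array of tokens.
        string[] tokens = line.Split(new char[] { ':', ',' },
            StringSplitOptions.RemoveEmptyEntries);

        var pairs = new List<KeyValuePair<string, string>>();

        // Step 2: advance two places per iteration. The a + 1 bound means a
        // trailing token without a partner is skipped (an assumption here).
        for (int a = 0; a + 1 < tokens.Length; a += 2)
        {
            pairs.Add(new KeyValuePair<string, string>(tokens[a], tokens[a + 1]));
        }
        return pairs;
    }

    static void Main()
    {
        foreach (var pair in ParsePairs("something:1,more:3,alpaca:65"))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
        // Prints:
        // something = 1
        // more = 3
        // alpaca = 65
    }
}
```

Returning the pairs in a list keeps the parsing separate from whatever the caller does with each key and value.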

Figure B: nested Split

In a string of key-value pairs, you could split on each comma and then, for each of the resulting strings, split again on the colon. This works, but it is slower and may be somewhat harder to manage.

foreach (string line in logLines)
{
    string[] pairs = line.Split(',');
    foreach (string pair in pairs)
    {
        string[] parts = pair.Split(':');
        string s1 = parts[0];
        string s2 = parts[1];
        // s1 = 'something', s2 = '1', ...
    }
}
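One point in favor of the nested form is that each pair is examined on its own, so a malformed pair can be noticed. The sketch below is an illustration only: ParseStrict and NestedSplitValidation are hypothetical names, and treating a pair as malformed unless it has exactly one non-empty key and one non-empty value is an assumption. With a single Split on both delimiters, the same input would tokenize to something, 1, more, alpaca, 65, and "more" would be paired with "alpaca".

```csharp
using System;
using System.Collections.Generic;

static class NestedSplitValidation
{
    // Hypothetical helper: keeps pairs that have exactly one key and one
    // value, and records everything else in the 'malformed' list.
    public static List<KeyValuePair<string, string>> ParseStrict(
        string line, List<string> malformed)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (string pair in line.Split(','))
        {
            string[] parts = pair.Split(':');
            if (parts.Length == 2 && parts[0].Length > 0 && parts[1].Length > 0)
            {
                result.Add(new KeyValuePair<string, string>(parts[0], parts[1]));
            }
            else
            {
                malformed.Add(pair);
            }
        }
        return result;
    }

    static void Main()
    {
        var malformed = new List<string>();
        var pairs = ParseStrict("something:1,more,alpaca:65", malformed);
        Console.WriteLine(pairs.Count);   // 2
        Console.WriteLine(malformed[0]);  // more
    }
}
```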

Speed comparison

The second version is slower because calling Split is comparatively expensive, and it calls Split once per pair in addition to once per line. The concrete measurements come first; my explanation of them follows. [Why Benchmark C#? - dotnetperls.com]

              A - 1 Split call    B - 2 Split calls
Time in ms    2340                4150
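The harness behind these numbers is not shown, so here is a minimal sketch of how such a comparison could be timed with Stopwatch. The iteration count, the SplitBenchmark class and its method names are all assumptions for illustration; absolute times will differ by machine and runtime, and only the ratio between the two versions is interesting.

```csharp
using System;
using System.Diagnostics;

static class SplitBenchmark
{
    // Version A: one Split call per line. Returns the number of paired tokens
    // so the result is actually used.
    public static int SplitOnce(string line)
    {
        string[] tokens = line.Split(new char[] { ':', ',' },
            StringSplitOptions.RemoveEmptyEntries);
        int count = 0;
        for (int a = 0; a + 1 < tokens.Length; a += 2)
        {
            count += 2;
        }
        return count;
    }

    // Version B: one Split per line plus one Split per pair.
    public static int SplitNested(string line)
    {
        int count = 0;
        foreach (string pair in line.Split(','))
        {
            string[] parts = pair.Split(':');
            count += parts.Length;
        }
        return count;
    }

    static void Main()
    {
        const int iterations = 1000000; // assumed; the original count is not stated
        string line = "something:1,something:2,more:3,bloviate:5,alpaca:65,spaniels:3";
        int sink = 0;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) sink += SplitOnce(line);
        sw.Stop();
        Console.WriteLine("A - 1 Split call:  " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) sink += SplitNested(line);
        sw.Stop();
        Console.WriteLine("B - 2 Split calls: " + sw.ElapsedMilliseconds + " ms");

        // Print the accumulated count so the loops cannot be optimized away.
        Console.WriteLine("Tokens counted: " + sink);
    }
}
```

Run it in a Release build for meaningful numbers.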

Is this important?

Yes and no. The performance difference alone is not critical for 99% of applications. However, this experiment gives me insight into how computers process data. The fast version simply breaks the string into a single array. It then works sequentially on that array.

Second version. The version that splits each pair again is less sequential. It calls Split once per pair and gets back a new small array each time, so the application has less locality of reference.
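To make "more frequently" concrete, the sketch below counts the Split calls the nested approach makes for one line. SplitCallCounter and CountNestedSplitCalls are hypothetical names used only for this illustration. Each call returns a new array, so the count is also the number of arrays created, compared with exactly one for the single-split version.

```csharp
using System;

static class SplitCallCounter
{
    // Hypothetical helper: counts the Split calls (and so the result arrays)
    // that the nested approach needs for one line.
    public static int CountNestedSplitCalls(string line)
    {
        int calls = 1; // the outer Split on ','
        string[] pairs = line.Split(',');
        foreach (string pair in pairs)
        {
            pair.Split(':'); // one more call, and one more array, per pair
            calls++;
        }
        return calls;
    }

    static void Main()
    {
        string line = "something:1,something:2,more:3,bloviate:5,alpaca:65,spaniels:3";
        Console.WriteLine("Nested: " + CountNestedSplitCalls(line) + " Split calls, single: 1");
        // Prints: Nested: 7 Split calls, single: 1
    }
}
```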

Conclusion

This experiment shows how "thinking sequentially" like a computer can improve your C# code. It is often better to tokenize the data all at once than to split repeatedly on a line. The sequential version is not just faster; it is also closer to how the underlying hardware processes data.

© 2008 Sam Allen. All rights reserved.