Correct a common string Split mistake when parsing text. Observe examples and benchmarks of the mistake and how it can be fixed, with a corresponding increase in performance. Discuss the trade-offs of each approach.
Improving the clarity and performance of splitting strings on delimiters is important, as the operation is done so frequently. Here we look closely at 'nesting' or compounding Split calls to parse text, and the negatives of doing so. [C# - Split String Examples - dotnetperls.com]
For the example, we have lines in a file that have more than one delimiter. We have a collection of paired keys and values. The pairs themselves are separated by commas, and the two items in each pair are separated by a colon (:).
As I found out, reducing the number of splits that occur is fastest. This approach simply splits all the 'tokens' at once, reducing the collection into a single array. Here's how we can do that.
string[] logLines = new string[]
{
    "something:1,something:2,more:3,bloviate:5,alpaca:65,spaniels:3",
    "elementary:4,string:4,miserable:6,reprimands:3,eats:6,trustworthy:5"
};
foreach (string line in logLines)
{
    //
    // Split on multiple delimiters
    //
    string[] tokens = line.Split(new char[] { ':', ',' },
        StringSplitOptions.RemoveEmptyEntries);
    // Step by two; the bound guards against an odd token count.
    for (int a = 0; a + 1 < tokens.Length; a += 2)
    {
        string s1 = tokens[a];
        string s2 = tokens[a + 1];
        // Example values:
        // s1 = "something", s2 = "1"
        // s1 = "elementary", s2 = "4"
    }
}
In a string of key-value pairs, you could split on each comma and then, for each of those strings, split again on the colon. This works well, but it is slower and may be somewhat harder to manage.
foreach (string line in logLines)
{
    // First split: one string per key:value pair.
    string[] pairs = line.Split(',');
    foreach (string pair in pairs)
    {
        // Second split: runs once for every pair on the line.
        string[] parts = pair.Split(':');
        string s1 = parts[0];
        string s2 = parts[1];
        // s1 = "something", s2 = "1", ...
    }
}
The second version is slower because calling Split is comparatively expensive, and it calls Split once per pair in addition to once per line. The benchmark measurements come first, followed by some discussion of why the difference arises. [Why Benchmark C#? - dotnetperls.com]
| | A - 1 Split call | B - 2 Split calls |
| Time in ms | 2340 | 4150 |
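The harness behind these numbers is not shown, so the following is only a minimal sketch of how such a comparison could be timed, assuming System.Diagnostics.Stopwatch and an iteration count chosen by the reader. The class and method names (SplitBenchmark, CountSingleSplit, CountNestedSplit, TimeMs) are hypothetical and do not come from the original benchmark.

```csharp
using System;
using System.Diagnostics;

static class SplitBenchmark
{
    // Delimiters for version A, allocated once and reused.
    static readonly char[] Both = { ':', ',' };

    // Version A: one Split call per line. Returns the number of pairs found.
    public static int CountSingleSplit(string line)
    {
        string[] tokens = line.Split(Both, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length / 2;
    }

    // Version B: one Split per line, plus one more Split per pair.
    public static int CountNestedSplit(string line)
    {
        int count = 0;
        foreach (string pair in line.Split(','))
        {
            string[] parts = pair.Split(':');
            if (parts.Length == 2)
            {
                count++;
            }
        }
        return count;
    }

    // Hypothetical timing helper: runs the parser repeatedly and
    // returns the elapsed wall-clock time in milliseconds.
    public static long TimeMs(Func<string, int> parse, string line, int iterations)
    {
        parse(line); // Warm-up call so JIT compilation is not timed.
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            parse(line);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

A call such as SplitBenchmark.TimeMs(SplitBenchmark.CountSingleSplit, line, 1000000) times version A, and the same call with CountNestedSplit times version B. Absolute timings depend on the machine, the runtime, and the build configuration, so they should not be expected to match the table.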
Does the difference matter? Yes and no. The performance difference alone is not critical for 99% of applications. However, this experiment gives me insight into how computers process data. The fast version simply breaks the string into one flat array. It then works sequentially on that array.
Second version. The second version, which splits each pair again, is not sequential to the same degree. My theory is that the application has less locality of reference because it must compute splits far more often, once for every pair rather than once per line.
This experiment shows how "thinking sequentially" like a computer can improve your C# code. It is often better to tokenize the data all at once than to split repeatedly within a line. The sequential version is not just faster; it is also closer to how the underlying hardware processes data.
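As a rough sketch of the "tokenize all at once" idea packaged for reuse, the helper below splits a line once and pairs up adjacent tokens. PairParser and ParsePairs are hypothetical names introduced here for illustration, and throwing a FormatException on an odd token count is an assumption about how malformed lines should be handled, not something the original code does.

```csharp
using System;
using System.Collections.Generic;

static class PairParser
{
    // Both delimiters, allocated once rather than on every call.
    static readonly char[] Delimiters = { ':', ',' };

    // Hypothetical helper: one Split call per line, then adjacent
    // tokens are read as key, value, key, value, ...
    public static List<KeyValuePair<string, string>> ParsePairs(string line)
    {
        string[] tokens = line.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        // Assumption: an odd token count means the line is malformed,
        // for example a key with no value after it.
        if (tokens.Length % 2 != 0)
        {
            throw new FormatException("Expected key:value pairs in line: " + line);
        }

        var pairs = new List<KeyValuePair<string, string>>(tokens.Length / 2);
        for (int i = 0; i < tokens.Length; i += 2)
        {
            pairs.Add(new KeyValuePair<string, string>(tokens[i], tokens[i + 1]));
        }
        return pairs;
    }
}
```

One design caveat: because empty entries are removed, a pair with a missing value (such as "alpaca:") shifts every later token by one position. The even-count check catches a single missing token but not two, so input that may contain empty values is safer to parse with the nested approach.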