C#Dot Net Perls

C#
Directed Acyclic Word Graph

by Sam Allen
DAG image from Wikipedia.

Problem

Implement a directed acyclic word graph (DAWG). This data structure must store thousands of words or strings. It must be faster to use than a regular hashtable, even a highly tuned one. Our goal is to implement the directed acyclic word graph, which has been studied and proven more efficient than its alternatives.

C# Solution

Let's take a step back and think of some definitions. According to Wikipedia, a directed acyclic word graph is a data structure, a tree, that can store a list of words in an extremely efficient way. Here are some parts of the DAWG that you will encounter.

Which nodes can be shared in memory?

Here is an example of some substrings that may occupy the same memory locations. The first two rows are fairly obvious, but the last one shares both the prefixes and the suffixes. (The actual implementation is more complex and words are not always shared exactly like this.)

If you have these words Then you could share
analogy
pathology
-logy
animal
animated
anima-
fast
feast
f-
-ast

Introducing DAWG

DAWGs are being used to efficiently store word lists and some scientific data such as DNA character strings. They have algorithmic advantages over hashtables. A DAWG can be used to find any prefix without needing to know the whole key. It shares thousands of memory locations and references itself.

How can I build a DAWG?

There is a specific algorithm you need to construct a DAWG. It is hard to explain and not really useful to go over here too closely. I want to share some more broad-picture points involved in a common implementation. Here are the basic DAWG construction steps.

  1. First tree
    Read in a file containing all the strings and put them in a prefix-shared tree. This is simple and easy to do, and requires only some logic.
  2. Sort nodes by depth
    Each node in our prefix-shared tree is a certain number of steps from itself to the very furthest node (terminal node). Make an array indexed by that depth, and store a reference to each node in the appropriate array.
  3. Search nodes
    We must search all nodes of each depth for equivalent subnode trees. We can implement a method like NodesEqual that receives two nodes, and then recurses through all the child nodes of each one. If a different tree is found, it returns false.
  4. Combine node references
    If we have two equivalent subtrees, we can simply make the second starting node point to the first. This is how the suffixes are shared.
  5. Write to file
    We can use a method like the StreamWriter to write to a file so we don't have to do all this again.

How can I store it in a file?

By dividing the data required into two different files. Let's look at how the DAWG files are laid out. This is what the reader will need to contend with. The DAWG in my program is stored in two files. Here are the important things about the file layout.

File number Name Lines Description
1 supers ~59,000 Contains bitmasks.
These bitmasks have a bit set for each letter that the node contains. A special bit is also set if the node can be used as the end of a word--a full word bit.
2 words ~59,000
Same
Each line contains up to 26 numbers.
These numbers correspond to the bits in the first file. They contain the line number of the next node (stored in file 1).

How can I initialize it?

By parsing the DAWG files (using int.Parse or an unsafe method) and then putting the two files in data arrays. The first array stores the first file's integers. It is a List holding every integer. The second array is a jagged array of the same number int[] arrays. This code takes only a few milliseconds each time.

Bitcounts

One way to efficiently use the tree when it is in memory is with bitcounting. We can use precomputed bitcounts because they are very fast. Then we need to define bit mask arrays that will remove the last specific number of bits from an integer. This is for the search function.

How can I look up a word?

We need to scan the word trie and find if a key is contained in a path in the tree. First we need to loop through each character in the key. Then, we look for the letter and get the corresponding line number that it points to. After that, we can keep jumping through the tree using the two arrays together.

How efficient is it?

You won't get the full picture from this write-up, but I wanted to emphasize the technique of using two arrays together. Because of that, we share most memory locations, and then the algorithm has better locality of reference. Here are some more characteristics.

Characteristic Dictionary DAWG
Initialization and loading time in ms 78 93
Program memory used in MB 17.268 4.61
Execution benchmark
Time elapsed in ms
1148.2 855.0
Disk reads at startup
Includes Windows app
473 reads 309 reads
Difficulty of word list modification Low High
Finds prefixes quickly No Yes
DAWG performance comparison.

Lookup Method

The code I show here is an example of how we can walk through the DAWG and keep accessing the correct memory locations. This code is how a lookup is performed, and it will return true or false depending on whether the string is in the DAWG and is therefore a full path.

/// <summary>
/// Use the directed acyclic word graph to lookup a word. We keep looking up
/// the correct node until no node is found, or until the entire key path is
/// found and is also a full path.
/// </summary>
/// <param name="wordIn">The you want to look up.</param>
/// <returns>Whether the word is in the DAWG.</returns>
public bool ContainsWord(string wordIn)
{
    int nodePosition = 0;
    int superHere = _supers[0];

    int length = wordIn.Length;
    int lastIndex = length - 1;

    // Iterate through the letters in the word.
    for (int i = 0; i <= lastIndex; i++)
    {
        // Get the letter's char code (0 for A, 25 for Z, etc.)
        int letterHere = _translatorChars[wordIn[i]];

        // See if this node contains a bit for this letter. If it does,
        // do some calculations to find the location of the node that
        // letter points to.
        if ((superHere & (1 << letterHere)) != 0)
        {
            int newInt = superHere & _bitmasks[letterHere];
            int count = Bitcount(newInt) - 1;

            int indexNum = _nodes[nodePosition][count];
            nodePosition = indexNum;

            superHere = _supers[nodePosition];
        }
        else
        {
            return false;
        }
    }
    // We have finished iterating through the word. See if the
    // final bit we got has the special bit for full words set.
    return (superHere & (1 << _fullWordBit)) != 0;
}

Discussion

DAWG structures can be used in scientific applications and word games. It has been used for DNA strand processing, and you may find references to it in journals such as Science and Nature. It also makes for a blazingly fast anagram finder or Scrabble program. See the adjunctive MSDN project.

C# Is Lightning Fast

C# code sometimes performs a thousand times faster than C and uses less memory. The only catch is that the C# program must be well-thought-out and precisely implemented. If you iterate through 160,000 items, your performance will be 1,000 times worse than looking through 160 nodes in a DAWG.

Dot Net Perls is dedicated to sharing code and knowledge. It has
© 2007-2008 Sam Allen. All rights reserved.

Ads by The Lounge