Problem. You want to implement a directed acyclic word graph. Your DAWG must store thousands of words. It must be faster to use than a regular hashtable, even a highly tuned one with custom hashcodes. Solution. Here we cover several important points related to DAWGs and introduce an example implementation in the C# programming language.
=== Dictionary Benchmark ===
Loading time: 78 ms
Program memory used: 17.27 MB
Execution time: 1148.2 ms
Disk reads: 473 reads
Modification: It is easy to modify.
Prefixes: It does not find prefixes.
=== Directed Acyclic Word Graph Benchmark [DAWG] ===
Loading time: 93 ms
Program memory used: 4.61 ms
Execution time: 855.0 ms
Disk reads: 309 reads
Modification: It is hard to modify.
Prefixes: It finds prefixes.According to Wikipedia, a directed acyclic word graph is a data structure, a tree, that can store a list of words in an extremely efficient way. Here are some parts of the DAWG that you will encounter.
It has a root node. Basically, the tree starts with one root node. This node doesn't correspond to any particular letter, but it is where the fun begins. There are 26 child nodes on the root node—one for each letter.
It has thousands of nodes. Each node contains child nodes with each letter that continues a word in the path from the root node to itself. There are likely thousands of these, and it is very difficult to visualize them.
Its paths converge. Suffixes of words are shared. For example, there are only 26 terminal nodes, one for each possible ending letter. An example of the shared prefixes and suffixes is shown below.
In the DAWG data structure, you can share all notes that form subtrees that are identical. Here is an example of some substrings that may occupy the same memory locations. The first two rows are fairly obvious, but the last one shares both the prefixes and the suffixes.
Words: analogy
pathology
Share: -logy
Words: animal
animated
Share: anima-
Words: fast
feast
Share: f-
-astDAWGs are being used to efficiently store word lists and some scientific data such as DNA character strings. They have algorithmic advantages over hashtables. A DAWG can be used to find any prefix without needing to know the whole key. It shares thousands of memory locations and references itself.
There is a specific algorithm you need to construct a DAWG. It is hard to explain and not really useful to go over here too closely. I want to share some more broad-picture points involved in a common implementation. Here are the basic DAWG construction steps.
First steps. Build the first tree. Read in a file containing all the strings and put them in a prefix-shared tree. This is simple and easy to do, and requires only some logic.
Sorting and searching. Sort nodes by depth. Each node in our prefix-shared tree is a certain number of steps from itself to the very furthest node (terminal node). Make an array indexed by that depth, and store a reference to each node in the appropriate array. We must search all nodes of each depth for equivalent subnode trees. We can implement a method like NodesEqual that receives two nodes, and then recurses through all the child nodes of each one. If a different tree is found, it returns false.
Combining nodes and writing. If we have two equivalent subtrees, we can simply make the second starting node point to the first. This is how the suffixes are shared. Write to file. We can use a method like the StreamWriter to write to a file so we don't have to do all this again.
You can use either plain text files or binary files to store the directed acyclic word graph on the disk. Initially, the author divided the data required into two different files. Let's look at how the DAWG files are laid out. Here are the important things about my file layout.
File number: 1 Name: supers Lines: ~59,000 Description: Contains bitmasks. These bitmasks have a bit set for each letter that the node contains. Special bit is set if the node can be used as the end of a word. File number: 2 Name: words Lines: ~59,000 Description: Contains node data. Each line contains up to 26 numbers. These numbers correspond to the bits in the first file. They contain the line number of the next node (from file 1).
You can initialize the word graph By parsing the DAWG files, using int.Parse or an unsafe method, and then putting the two files in data arrays. The first array stores the first file's integers. The second array is a jagged array of the same number int[] arrays.
Because they improve memory use and efficiency. We can use precomputed bitcounts because they are very fast. Then we need to define bit mask arrays that will remove the last specific number of bits from an integer. This is for the search function. [C# Bit Counting Algorithms - dotnetperls.com]
You can look up words in the DAWG by scanning the word trie for a path in the tree. First we need to loop through each character in the key. Then, we look for the letter and get the corresponding line number that it points to. After that, we can keep jumping through the tree using the two arrays together.
Lookup details. Two arrays are used. The first array contains bitmasks that indicate the arrangement of numbers in the second, jagged array. This saves memory over using a 2D array. In the algorithm, we can mask integer bits out, and then count the remaining ones. This will allow us to find the proper location of integers in the jagged array.
Final bit usage. The 28th bit is the final bit, which designates a complete word path. We can test it with the code "x & (1 << 28)". This says to test whether a 1 shifted 28 places to the right is present.
Very efficient, as described by programmers from the 1980's. You won't get the full picture from this write-up, but I wanted to emphasize the technique of using two arrays together. Because of that, we share most memory locations, and then the algorithm has better locality of reference. [See table above]
The code I show here is an example of how we can walk through the DAWG and keep accessing the correct memory locations. This code is how a lookup is performed, and it will return true or false depending on whether the string is in the DAWG and is therefore a full path.
/// <summary>
/// Use the directed acyclic word graph to lookup a word. We keep looking up
/// the correct node until no node is found, or until the entire key path is
/// found and is also a full path.
/// </summary>
/// <param name="wordIn">The you want to look up.</param>
/// <returns>Whether the word is in the DAWG.</returns>
public bool ContainsWord(string wordIn)
{
int nodePosition = 0;
int superHere = _supers[0];
int length = wordIn.Length;
int lastIndex = length - 1;
//
// Iterate through the letters in the word.
//
for (int i = 0; i <= lastIndex; i++)
{
//
// Get the letter's char code (0 for A, 25 for Z, etc.)
//
int letterHere = _translatorChars[wordIn[i]];
//
// See if this node contains a bit for this letter. If it does,
// do some calculations to find the location of the node that
// letter points to.
//
if ((superHere & (1 << letterHere)) != 0)
{
int newInt = superHere & _bitmasks[letterHere];
int count = Bitcount(newInt) - 1;
int indexNum = _nodes[nodePosition][count];
nodePosition = indexNum;
superHere = _supers[nodePosition];
}
else
{
return false;
}
}
//
// We have finished iterating through the word. See if the
// final bit we got has the special bit for full words set.
//
return (superHere & (1 << _fullWordBit)) != 0;
}Many developers, including myself, initially approach word lists with an array. However, any program that is well-thought-out and precisely implemented vastly outperforms those that are not.
Performance in any language. If you iterate through 160,000 items in any program, your performance will be 1,000 times worse than looking through 160 nodes in a DAWG. This can yield C# programs that are a thousand times faster than C or assembly language programs.
Here we looked at an overview of some aspects of implementing a DAWG structure in the C# programming language. DAWG structures can be used in scientific applications and word games. They have been used for DNA strand processing, and you may find references in Science and Nature. DAWGs also make fast anagram finders and Scrabble programs. [C# Anagram Open Source in Windows - dotnetperls.com]