
Deterministic acyclic finite state automaton

From Wikipedia, the free encyclopedia
The strings "tap", "taps", "top", and "tops" stored in a trie (left) and a DAFSA (right). EOW stands for End-of-word.

In computer science, a deterministic acyclic finite state automaton (DAFSA),[1] also called a directed acyclic word graph (DAWG; though that name also refers to a related data structure that functions as a suffix index[2]), is a data structure that represents a set of strings, and allows for a query operation that tests whether a given string belongs to the set in time proportional to its length. Algorithms exist to construct and maintain such automata,[1] while keeping them minimal.

A DAFSA is a special case of a finite state recognizer that takes the form of a directed acyclic graph with a single source vertex (a vertex with no incoming edges), in which each edge of the graph is labeled by a letter or symbol, and in which each vertex has at most one outgoing edge for each possible letter or symbol. The strings represented by the DAFSA are formed by the symbols on paths in the graph from the source vertex to any sink vertex (a vertex with no outgoing edges). In fact, a deterministic finite state automaton is acyclic if and only if it recognizes a finite set of strings.[1]

Comparison to tries

By allowing the same vertices to be reached by multiple paths, a DAFSA may use significantly fewer vertices than the strongly related trie data structure. Consider, for example, the four English words "tap", "taps", "top", and "tops". A trie for those four words would have 12 vertices, one for each of the strings formed as a prefix of one of these words, or for one of the words followed by the end-of-string marker. However, a DAFSA can represent these same four words using only six vertices vi for 0 ≤ i ≤ 5, and the following edges: an edge from v0 to v1 labeled "t", two edges from v1 to v2 labeled "a" and "o", an edge from v2 to v3 labeled "p", an edge from v3 to v4 labeled "s", and edges from v3 and v4 to v5 labeled with the end-of-string marker. There is a tradeoff between memory and functionality, because a standard DAFSA can tell you if a word exists within it, but it cannot point you to auxiliary information about that word, whereas a trie can.
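The vertex counts in this example can be checked with a short sketch that builds the trie and then merges vertices whose outgoing structure is identical. Note that this is a simple offline minimization of an already-built trie, not the incremental construction algorithm cited in the article; the function names, the nested-dictionary encoding, and the "$" end-of-string marker are assumptions made for illustration.

```python
EOW = "$"  # hypothetical end-of-string marker

def build_trie(words):
    """Build a trie as nested dicts; every dict is one vertex."""
    root = {}
    for word in words:
        node = root
        for symbol in word + EOW:
            node = node.setdefault(symbol, {})
    return root

def count_trie_vertices(node):
    """Count this vertex plus all vertices below it."""
    return 1 + sum(count_trie_vertices(child) for child in node.values())

def minimize(node, registry):
    """Assign one id per distinct subtree, working bottom-up.

    Two vertices get the same id exactly when they have the same labeled
    edges leading to the same (already merged) targets.
    """
    signature = tuple(sorted(
        (symbol, minimize(child, registry)) for symbol, child in node.items()
    ))
    if signature not in registry:
        registry[signature] = len(registry)
    return registry[signature]

words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
registry = {}
minimize(trie, registry)

print(count_trie_vertices(trie))  # 12 vertices in the trie
print(len(registry))              # 6 vertices after merging
```

The merge collapses the "ta" and "to" branches into one vertex because both lead to exactly the same set of continuations ("p", then optionally "s", then the end marker).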

The primary difference between DAFSA and trie is the elimination of suffix and infix redundancy in storing strings. The trie eliminates prefix redundancy since all common prefixes are shared between strings, such as between "doctors" and "doctorate", where the "doctor" prefix is shared. In a DAFSA common suffixes are also shared, for words that have the same set of possible suffixes as each other. For dictionary sets of common English words, this translates into major memory usage reduction.

Because the terminal nodes of a DAFSA can be reached by multiple paths, a DAFSA cannot directly store auxiliary information relating to each path, e.g. a word's frequency in the English language. However, if for each node we store the number of unique paths through that point in the structure, we can use it to retrieve the index of a word, or a word given its index.[3] The auxiliary information can then be stored in an array.
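One way to realize this indexing idea is sketched below, under stated assumptions: each vertex stores the number of paths from it to a sink (one common reading of "paths through that point"), and edges are ordered by their symbol. The encoding, the "$" marker, and the names `count_paths`, `word_to_index`, and `index_to_word` are hypothetical and chosen only for this example.

```python
EOW = "$"  # hypothetical end-of-word marker; it sorts before the letters

DAFSA = {
    0: {"t": 1},
    1: {"a": 2, "o": 2},
    2: {"p": 3},
    3: {"s": 4, EOW: 5},
    4: {EOW: 5},
    5: {},
}

def count_paths(dafsa):
    """counts[v] = number of distinct paths from v to a sink."""
    counts = {}
    def visit(v):
        if v not in counts:
            edges = dafsa[v]
            counts[v] = 1 if not edges else sum(visit(t) for t in edges.values())
        return counts[v]
    for v in dafsa:
        visit(v)
    return counts

def word_to_index(dafsa, counts, word, source=0):
    """Return the rank of word among all stored words, or None if absent."""
    index, state = 0, source
    for symbol in word + EOW:
        edges = dafsa[state]
        if symbol not in edges:
            return None
        # Skip every word that branches off here on a smaller symbol.
        index += sum(counts[edges[s]] for s in edges if s < symbol)
        state = edges[symbol]
    return index

def index_to_word(dafsa, counts, index, source=0):
    """Inverse of word_to_index; returns None for an out-of-range index."""
    if not 0 <= index < counts[source]:
        return None
    state, symbols = source, []
    while dafsa[state]:
        for symbol in sorted(dafsa[state]):
            target = dafsa[state][symbol]
            if index < counts[target]:
                symbols.append(symbol)
                state = target
                break
            index -= counts[target]
    return "".join(symbols[:-1])  # drop the trailing end-of-word marker

counts = count_paths(DAFSA)
frequencies = [10, 20, 30, 40]  # made-up auxiliary data, one slot per word
print(word_to_index(DAFSA, counts, "top"))                # 2
print(index_to_word(DAFSA, counts, 3))                    # tops
print(frequencies[word_to_index(DAFSA, counts, "taps")])  # 20
```

The index acts as a key into an ordinary array, so the per-word data lives outside the automaton and the shared vertices do not need to distinguish between the paths that reach them.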

References

  1. Jan Daciuk, Stoyan Mihov, Bruce Watson and Richard Watson (2000). Incremental construction of minimal acyclic finite state automata. Computational Linguistics 26(1):3–16.
  2. This article incorporates public domain material from Paul E. Black. "directed acyclic word graph". Dictionary of Algorithms and Data Structures. NIST.
  3. Kowaltowski, T.; CL Lucchesi (1993). "Applications of finite automata representing large vocabularies". Software-Practice and Experience. 1993: 15–30. CiteSeerX 10.1.1.56.5272.
