Suffix tree

Idea: store all suffixes of a string in a (compressed) trie
Advantages:
  • a very strong hammer for solving almost any stringology problem
  • string matching: after a linear time precomputation, we can search pattern \(p\) in \(O(|p|)\) time
Disadvantages: very space-consuming; using a suffix array may be a better alternative
[You can download the application for offline viewing.]

Structure

Suffix tree is a compressed trie containing all the suffixes of a given string \(S\). For example, MISSISSIPPI$ has 12 (nonempty) suffixes: $, I$, PI$, PPI$, IPPI$, … ISSISSIPPI$, MISSISSIPPI$. If we store all these words in a trie, we get:

This is our conceptual view of a suffix tree (not a real suffix tree). The problem is that such a trie would have \(\Theta(n^2)\) nodes – it would take \(\Theta(n^2)\) space and couldn’t be constructed in time better than \(\Omega(n^2)\). This is too much to be of any practical use.

We show, however, that it is possible to represent such a trie in \(O(n)\) memory and construct it in \(O(n)\) time.

We merge each node that is the only child with its parent:

This way, each internal node is at least binary and therefore there are only \(O(n)\) nodes. (The number of internal nodes is less than the number of leaves, which is \(n\) – the number of suffixes and the length of the string.) However, the edge labels still consist of \(\Theta(n^2)\) symbols in total.

What shall we do? Instead of making copies of different substrings as edge labels, we just use indices into string \(S\). For example, instead of label SSI, which is the substring \(S[3\ldots 5)\), we just store the indices \([3,5)\). Instead of label SSIPPI$\(=S[5,12)\), we store just \([5,12)\).

This way, each edge takes \(O(1)\) space and there are \(O(n)\) nodes and edges, so the trie of all suffixes can be represented in \(O(n)\) memory. Each leaf corresponds to a suffix and is labeled with the position where the suffix starts.

References

  • E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995. [bib] [pdf]
  • P. Weiner. Linear pattern matching algorithms. In IEEE 14th Ann. Symp. on Switching and Automata Theory, 1973, pages 1-11. [bib] [pdf]
  • E.M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262-272, 1976. [bib] [pdf]
  • R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19(3):331-353, 1997. [bib] [pdf]