Union find

Invented by Galler and Fischer (1964); path compression is attributed to Tritter, McIlroy, and Morris; amortized analysis by Tarjan (1975)
Idea:
  • disjoint sets are represented as separate trees, roots of the trees are representatives of the sets
  • to find the set where \(x\) belongs (i.e. its representative), we find the root of the tree containing \(x\)
  • to join two sets, we find the two representatives and link one root under the other
  • for the data structure to be efficient, we need to keep the paths from nodes to the roots short
  • heuristic 1: during union, always link a “smaller” tree to the “larger” one
  • heuristic 2: compress the paths
Advantages:
  • easy to implement
  • very fast (just a bit slower than constant time per operation)
Disadvantages: the time bounds are amortized, not worst case
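
To make the idea concrete, here is a minimal sketch of the plain structure in Python, with no heuristics yet; the names make_set, find and union are ours, and parent is a map in which each root is its own parent:

    # plain union-find, no heuristics yet
    parent = {}

    def make_set(x):
        parent[x] = x           # a new singleton set: x is its own root

    def find(x):
        while parent[x] != x:   # walk up until we reach the root
            x = parent[x]
        return x                # the root is the representative

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry     # link one root under the other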

Heuristics

The plain linking algorithm can be improved considerably by using two kinds of heuristics. One concerns the way we link trees (it is preferable to link smaller trees under bigger ones); the other takes extra effort to compact the paths we traverse.

The time complexity of the union and find operations when using these heuristics is as follows (let \(n\) be the number of elements and \(m\) the number of operations; assume \(m\geq n\); all times are amortized per operation):

  heuristics            complexity of union and find
  none                  \(O(\mathrm{height})\), which can be as bad as \(O(n)\)
  union by size/rank    \(O(\log n)\)
  path compression      \(O(\log_{\lfloor 1+m/n\rfloor} n)\)
  both                  \(O(1)\) for all practical purposes*

*the \(O(1)\) claim is actually a white lie. The truth is: the complexity is bounded by a so-called inverse Ackermann function, which grows to infinity, but so slooowly that it is still less than 5 even for \(n\) as large as the number of atoms in the observable universe.

Union by size or by rank

For union by size, we keep a node count for every tree, and when linking two trees, we always link the smaller tree (the one with fewer nodes) under the bigger one.

When doing union by rank, we keep a rank for every tree. A new node, representing a singleton set, starts with rank 0. When linking two trees, we always link the root with the smaller rank under the root with the higher rank. After linking two trees of the same rank, the rank of the resulting root increases by 1.
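
A sketch of union by rank following the description above (Python, reusing the parent map and find from the sketch above; union by size is identical except that it compares and adds node counts instead of ranks):

    rank = {}                       # rank of each root, starts at 0

    def make_set(x):
        parent[x] = x
        rank[x] = 0

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return                  # already in the same set
        if rank[rx] < rank[ry]:
            rx, ry = ry, rx         # make rx the root of higher rank
        parent[ry] = rx             # link the lower-rank root under it
        if rank[rx] == rank[ry]:    # equal ranks: the surviving root's
            rank[rx] += 1           # rank increases by 1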

It is easy to prove by induction that:

  • A tree with rank \(r\) has height at most \(r\).
  • A tree with rank \(r\) contains at least \(2^r\) nodes.

Consequently,

  • If there are \(n\) nodes, the maximum rank is at most \(\log n\).
  • Every tree has at most logarithmic height, so all the operations take \(O(\log n)\) time.

Path compaction

Path compression

Path compression needs two passes:

  1. In the first pass, we find the root \(r\).
  2. In the second pass, we link each node directly under \(r\).
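
A sketch of the two-pass find with full path compression (Python, again reusing the parent map from the sketches above):

    def find(x):
        r = x
        while parent[r] != r:             # pass 1: find the root r
            r = parent[r]
        while x != r:                     # pass 2: relink every node on
            parent[x], x = r, parent[x]   # the path directly under r
        return r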

Path splitting

While traversing the path to the root, we link every node to its grandparent, i.e. \(a\to c\), \(b\to d\), \(c\to e\), etc. The net effect is that the path to the root is split into two interleaved paths of roughly half the length (±1), so the distance to the root is approximately halved for every node on the path.
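
In code, path splitting needs only one pass (a Python sketch with the same parent map as before): every visited node is relinked to its grandparent while we step to the old parent:

    def find(x):
        while parent[x] != x:
            # relink x to its grandparent, then step to the old parent
            parent[x], x = parent[parent[x]], parent[x]
        return x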

Path halving

In the path halving heuristic, we only link every other node to its grandparent, i.e. we link \(a\to c\), leave \(b\), link \(c\to e\), leave \(d\), etc. Again, each node on the path ends up closer to the root by approximately half the distance.
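
Path halving is also a one-pass find (Python sketch, same parent map); the only difference from path splitting is that we step to the new parent, thereby skipping every other node:

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # link x to its grandparent…
            x = parent[x]                   # …and hop over the old parent
        return x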

Analysis

Note that each union operation consists of two finds (finding the two roots) plus some additional computation that takes only constant time, so we will focus on analysing the time taken by all the finds. Moreover, note that the cost of each find(x) operation is proportional to the length of the path from \(x\) to the root, i.e. the number of “hops” we need to make to get to the root. We will prove that:

Theorem: If there are \(n<2^{65536}\) elements, any sequence of \(m\) find operations makes at most \(6m+2n\) hops. In general, \(m\) union and find operations take at most \(O((m+n)\log^* n)\) time, where \(\log^*\) is the iterated logarithm function.

For comparison, the number of atoms in the observable universe is estimated to be much less than \(2^{300}\) (somewhere between \(10^{78}\) and \(10^{82}\)).

What is \(\log^* n\)? Enter a number \(n\) into your calculator. Keep pressing the “log” button until you get a number \(\leq 1\). Then \(\log^* n\) is the number of times you needed to press the button.

E.g., \(\log^* 65536=4\), because \(65536 \stackrel\log\to 16 \stackrel\log\to 4 \stackrel\log\to 2 \stackrel\log\to 1\) so we need to press “log” 4 times, and \(\log^* (2^{65536}) = 5\) because it only takes one extra logarithm to get from \(2^{65536}\) down to \(65536\).
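
A tiny Python sketch matching the “keep pressing log” description (using the base-2 logarithm, as in the example above):

    from math import log2

    def log_star(n):
        presses = 0
        while n > 1:        # keep pressing “log”…
            n = log2(n)
            presses += 1
        return presses      # …until the display shows a number <= 1

    # log_star(65536) == 4 and log_star(2 ** 65536) == 5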

Proof: First, a few simple observations:

  • Nodes with high rank are rare. Recall that a tree with rank \(r\) contains at least \(2^r\) nodes. That means that at most \(n/2^r\) nodes ever reach rank \(r\) (and thus height \(r\)). Only \(\frac{1}{16}\) of nodes may reach rank 4 and only \(\frac{1}{65536}\) of nodes may reach rank 16.
  • Rank of a parent is always at least 1 higher than the ranks of its children. So if we walk along a path from any node to the root, the ranks are strictly increasing.
  • The rank of every node starts at 0 and may only increase while the node is a root. Once the node is linked under some other root, its rank doesn’t change anymore…
  • … but its parent’s rank may increase. For every node except the roots, define \(gap(x) = rank(parent(x))-rank(x)\). Each time we do path compression, every node on the path except the root and the root’s child gets a new parent with a higher rank, so \(gap(x)\) increases for these nodes.

Let us divide ranks into 3 levels:

  • level 0: ranks \(0\ldots 3\),
  • level 1: ranks \(4\ldots 2^4-1 = 15\),
  • level 2: ranks \(16\ldots 2^{16}-1 = 65535\),
  • …and we could go on, but if \(n<2^{65536}\) (quite likely in practice), no node will ever reach rank 65536.

Now let’s count the total number of hops for all the \(m\) finds. There are

  • at most 3 hops per find within level 0 – \(3m\) hops total
  • for each find, we count the final hop to the root separately – that’s \(m\) hops total
  • each time we hop from one level to a higher one, we also count it separately – with 3 levels, that’s at most \(2m\) hops total
  • it remains to count the hops within levels 1 and 2

So let’s count all the hops at level 1. More specifically, we need to count the hops from \(x\) to \(y=parent(x)\) such that both \(x\) and \(y\) are at level 1. The highest possible rank gap within level 1 is \(15-4=11\). Moreover, we may assume \(x\) also has a grandparent \(z\), otherwise the hop is a final hop and we have already counted those separately. But every time such a hop happens, the subsequent path compression links \(x\) to \(z\) or higher, so \(gap(x)\) increases. After at most 11 such hops, \(x\) is linked to a node with rank 16 or higher.

To sum up: there are at most 11 hops per node within level 1. But only \(\leq n/16\) nodes ever get to level 1, so this is less than \(n\) hops in total for level 1.

The same holds for level 2: there are at most 65519 (less than 65536) hops per node within level 2, but only \(\leq n/65536\) nodes ever get to level 2, so this is again less than \(n\) hops in total.

Altogether, this is at most \(3m+m+2m+n+n=6m+2n\) hops when \(n<2^{65536}\).

In general, level 3 would consist of nodes with ranks \(65536\ldots 2^{65536}-1\); the range of ranks for each level is of the form \([k, 2^k)\). The \(i\)-th level starts with rank \(\underbrace{2^{2^{\cdot^{\cdot^{\cdot^2}}}}}_i\) and ends with rank \(\underbrace{2^{2^{\cdot^{\cdot^{\cdot^2}}}}}_{i+1} - 1\), so there will be at most \(\log^* n\) levels and at most \((\log^* n)\cdot m\) hops between levels. Within each level, there will be at most \(n\) hops, which makes \((\log^* n)\cdot n\) hops for all levels and a grand total of \(O((m+n)\log^* n)\).

Note: The true complexity of \(m\) union/find operations (with union by rank and path compression) is \(O(m\alpha(m,n))\) where \(\alpha\) is related to the inverse Ackermann function. For all practical purposes \(\alpha(m,n)\leq 3\) (but still, in theory, \(\alpha\) goes to infinity).

References

  • B.A. Galler and M.J. Fischer. An improved equivalence algorithm. Communications of the ACM, 7(5):301-303, 1964.
  • J.R. Gilbert, E.G. Ng, and B.W. Peyton. An efficient algorithm to compute row and column counts for sparse Cholesky factorization. SIAM Journal on Matrix Analysis and Applications, 15:1075, 1994.
  • R.E. Tarjan. Efficiency of a good but not linear set union algorithm. Journal of the ACM, 22(2):215-225, 1975.
  • M. Fredman and M. Saks. The cell probe complexity of dynamic data structures. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 345-354. ACM, 1989.
  • M.M.A. Patwary, J. Blair, and F. Manne. Experiments on union-find algorithms for the disjoint-set data structure. In International Symposium on Experimental Algorithms, pages 411-423. Springer, Berlin, Heidelberg, 2010.
  • R.E. Tarjan. A class of algorithms which require nonlinear time to maintain disjoint sets. Journal of Computer and System Sciences, 18(2):110-127, 1979.
  • R.E. Tarjan and J. van Leeuwen. Worst-case analysis of set union algorithms. Journal of the ACM, 31(2):245-281, 1984.
  • R. Seidel and M. Sharir. Top-down analysis of path compression. SIAM Journal on Computing, 34:515, 2005.
  • H.N. Gabow and R.E. Tarjan. A linear-time algorithm for a special case of disjoint set union. Journal of Computer and System Sciences, 30(2):209-221, 1985.