Working through Chris Okasaki's
"Purely
Functional Data Structures", I found that I couldn't find an Erlang
version of the red-black tree implementation Chris shows on page 28. The
code, also available as a pastie and
download (with more comments of mine)
.erl and
.hrl:
%% redblack.hrl
-record(node, {color, data, left, right}).
%% redblack.erl
%% Adapted from Okasaki "Purely Functional Data Structures" p. 28
member(_X, undefined) ->
false;
member(X, #node{left=A, data=Y}) when X < Y ->
member(X, A);
member(X, #node{data=Y, right=B}) when X > Y ->
member(X, B);
member(_X, _S) ->
true.
insert(X, undefined) ->
#node{color=black, data=X};
insert(X, S) ->
% to do recursive anonymous functions, pass the created fun in as 2nd parameter
Ins = fun(undefined, _F) ->
#node{color=red, data=Data}; % insert the new data as a red node
(#node{color=Color, left=A, data=Y, right=B}, F) when X < Y->
balance(Color, F(A, F), Y, B);
(#node{color=Color, left=A, data=Y, right=B}, F) when X > Y->
balance(Color, A, Y, F(B, F));
(Node, _F) ->
Node
end,
#node{left=A, data=Y, right=B} = Ins(S, Ins),
#node{color=black, left=A, data=Y, right=B}.
%% detect Black->Red->Red Patterns and balance the situation
balance(black, #node{color=red, left=#node{color=red, left=A, data=X, right=B}, data=Y, right=C}, Z, D) ->
#node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, #node{color=red, left=A, data=X, right=#node{color=red, left=B, data=Y, right=C}}, Z, D) ->
#node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=#node{color=red, left=B, data=Y, right=C}, data=Z, right=D}) ->
#node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=B, data=Y, right=#node{color=red, left=C, data=Z, right=D}}) ->
#node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(Color, Left, Data, Right) ->
#node{color=Color, left=Left, data=Data, right=Right}.
So, I'm going to go through this, mainly to prove to myself that I kinda
get it, and because Ben will complain I didn't
dumb it down enough for him. Line 2 defines the node structure in an
Erlang header file. Comparable to a struct in C. Member Lines 6-13
define the member function. Each member() -> … is a function clause,
essentially a case statement on steroids. The first clause says if the
tree is undefined, then whatever X is, it is not a member of undefined,
so return false. The second clause unpacks the current node (
#node{left=A, data=Y} ) and assigns the values of left and data to A and
Y, respectively (I have tried to maintain Chris's variable names as much
as possible, though I think Erlang's record syntax complements the
Standard ML syntax found in the book). The clause then compares Y, the
node's data, to the query X, and if X < Y, checks the left branch (the
variable A) of the node. The third clause is nearly identical, except it
handles when X > Y, and checks the right branch (B) for X's existence.
The fourth and final clause handles a match (because when X is not
greater than or less than Y, it must be equal), returning true. Simple
enough to follow and grok. Insertion When I read several different
articles online concerning red-black trees, the insertion routine
sounded quite complex to me. I was quite pleased when I read the
algorithm in the book, as it crystallized my thoughts and allowed me to
grok the action. The first clause of the insert function handles the
empty (or undefined) tree, creating and returning a black node (since
the root of the tree must be black). The second clause takes the new
data X and a tree S and creates a new function, stored in the var Ins.
Now, calling an anonymous function recursively is tricky, and I will
explain why this part varies slightly from the text. Since there's no
way to refer to the function (as it isn't named), we have to pass the
anonymous function as a second parameter to the anonymous function,
which is what happens on line 28. So when you see F(A, F), mentally
think Ins(A). The first parameter to Ins() is our current node in the
tree. If it is undefined, we insert our new data as a red node. Why?
That is discussed at length elsewhere, but it boils down to being easier
to fix a red violation than a black violation (look up the
constraints
a red-black tree has on it, and the violations will make more sense).
The second clause unpacks the node and makes our less than/greater than
comparison like a normal BST. What's different is that we call the
appropriate recursive insert call (remember F(A, F)? This inserts X
somewhere into the left side of Node, and F(B, F) inserts X somewhere
into the right side), and pass the result to the balance function. The
final clause handles a match between X and Y and returns the Node and
its branches unmodified. Balance Why do we need a balance function? In
case our data being inserted into the tree is more ordered than we may
have expected. A vanilla BST with inputs [3,2,1] will be a glorified
list. We force the situation by inserting a red node in the tree and
causing a situation where the parent of the newly inserted red node is
also a red node. This violates the constraint that a red node's children
must be black nodes. There are four possible combinations of this
violation:
- (B)-Left->(Red)-Left->(Red)
- (B)-Left->(Red)-Right->(Red)
- (B)-Right->(Red)-Left->(Red)
- (B)-Right->(Red)-Right->(Red)
Each of those possibilities maps to the first four clauses of the
balance function. The fifth clause passes a node back unchanged. So
let's look at the first clause. This clause is activated when a black
node's left child is red, and that red node's left child is also red.
B3
/ \
R2 *
/ \
R1 *
So we unpack the three nodes and return the three nodes rotated to
remove the red violation.
Variable |
Meaning |
Value |
A |
R1's Left node |
undefined |
X |
R1's Data |
1 |
B |
R1's Right node |
undefined |
Y |
R2's Data |
R2 |
C |
R2's Right node |
undefined |
Z |
B3's Data |
3 |
D |
B3's Right node |
undefined |
Unpacking from the pattern match, we then pack up a new tree structure
according to the clause's definition:
#node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
This essentially replaces the black node having a red child and red
grandchild with a red node that has two black nodes for children. The
next illustration shows the result of the balance call:
R2
/ \
B1 B3
Now, the bubbling up of the red node may cause a red violation with the
parent and grandparent of R2, which will be handled as the recursion of
Ins() unwinds. When the initial Ins() ends (line 28), the resulting node
is the root of the new tree, which we then unpack into the data, left,
and right fields. We then force the root node to be black by creating
the root node with the unpacked parts, forcing color=black. So there you
have it. Simplified, with a lot of things glossed over/ignored, but this
is my port of the red-black code listing on page 28. Looking forward to
progressing through the book more and increasing my data structure- and
Erlang-fu. Further fun: Read Chris's
reflections
10 years after publishing his book.