Red-black Trees in Erlang

Working through Chris Okasaki's "Purely Functional Data Structures", I found that I couldn't find an Erlang version of the red-black tree implementation Chris shows on page 28. The code, also available as a pastie and download (with more comments of mine) .erl and .hrl:

%% redblack.hrl
-record(node, {color, data, left, right}).

%% redblack.erl
%% Adapted from Okasaki "Purely Functional Data Structures" p. 28
member(_X, undefined) ->
    false;
member(X, #node{left=A, data=Y}) when X < Y ->
    member(X, A);
member(X, #node{data=Y, right=B}) when X > Y ->
    member(X, B);
member(_X, _S) ->
    true.

insert(X, undefined) ->
    #node{color=black, data=X};
insert(X, S) ->
    % to do recursive anonymous functions, pass the created fun in as 2nd parameter
    Ins = fun(undefined, _F) ->
                  #node{color=red, data=Data}; % insert the new data as a red node
             (#node{color=Color, left=A, data=Y, right=B}, F) when X < Y->
                  balance(Color, F(A, F), Y, B);
             (#node{color=Color, left=A, data=Y, right=B}, F) when X > Y->
                  balance(Color, A, Y, F(B, F));
             (Node, _F) ->
                  Node
          end,
    #node{left=A, data=Y, right=B} = Ins(S, Ins),
    #node{color=black, left=A, data=Y, right=B}.

%% detect Black->Red->Red Patterns and balance the situation
balance(black, #node{color=red, left=#node{color=red, left=A, data=X, right=B}, data=Y, right=C}, Z, D) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, #node{color=red, left=A, data=X, right=#node{color=red, left=B, data=Y, right=C}}, Z, D) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=#node{color=red, left=B, data=Y, right=C}, data=Z, right=D}) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=B, data=Y, right=#node{color=red, left=C, data=Z, right=D}}) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(Color, Left, Data, Right) ->
    #node{color=Color, left=Left, data=Data, right=Right}.

So, I'm going to go through this, mainly to prove to myself that I kinda get it, and because Ben will complain I didn't dumb it down enough for him. Line 2 defines the node structure in an Erlang header file. Comparable to a struct in C. Member Lines 6-13 define the member function. Each member() -> … is a function clause, essentially a case statement on steroids. The first clause says if the tree is undefined, then whatever X is, it is not a member of undefined, so return false. The second clause unpacks the current node ( #node{left=A, data=Y} ) and assigns the values of left and data to A and Y, respectively (I have tried to maintain Chris's variable names as much as possible, though I think Erlang's record syntax complements the Standard ML syntax found in the book). The clause then compares Y, the node's data, to the query X, and if X < Y, checks the left branch (the variable A) of the node. The third clause is nearly identical, except it handles when X > Y, and checks the right branch (B) for X's existence. The fourth and final clause handles a match (because when X is not greater than or less than Y, it must be equal), returning true. Simple enough to follow and grok. Insertion When I read several different articles online concerning red-black trees, the insertion routine sounded quite complex to me. I was quite pleased when I read the algorithm in the book, as it crystallized my thoughts and allowed me to grok the action. The first clause of the insert function handles the empty (or undefined) tree, creating and returning a black node (since the root of the tree must be black). The second clause takes the new data X and a tree S and creates a new function, stored in the var Ins. Now, calling an anonymous function recursively is tricky, and I will explain why this part varies slightly from the text. Since there's no way to refer to the function (as it isn't named), we have to pass the anonymous function as a second parameter to the anonymous function, which is what happens on line 28. So when you see F(A, F), mentally think Ins(A). The first parameter to Ins() is our current node in the tree. If it is undefined, we insert our new data as a red node. Why? That is discussed at length elsewhere, but it boils down to being easier to fix a red violation than a black violation (look up the constraints a red-black tree has on it, and the violations will make more sense). The second clause unpacks the node and makes our less than/greater than comparison like a normal BST. What's different is that we call the appropriate recursive insert call (remember F(A, F)? This inserts X somewhere into the left side of Node, and F(B, F) inserts X somewhere into the right side), and pass the result to the balance function. The final clause handles a match between X and Y and returns the Node and its branches unmodified. Balance Why do we need a balance function? In case our data being inserted into the tree is more ordered than we may have expected. A vanilla BST with inputs [3,2,1] will be a glorified list. We force the situation by inserting a red node in the tree and causing a situation where the parent of the newly inserted red node is also a red node. This violates the constraint that a red node's children must be black nodes. There are four possible combinations of this violation:

  1. (B)-Left->(Red)-Left->(Red)
  2. (B)-Left->(Red)-Right->(Red)
  3. (B)-Right->(Red)-Left->(Red)
  4. (B)-Right->(Red)-Right->(Red)

Each of those possibilities maps to the first four clauses of the balance function. The fifth clause passes a node back unchanged. So let's look at the first clause. This clause is activated when a black node's left child is red, and that red node's left child is also red.

      B3
     /   \
   R2     *
  /   \
R1     *

So we unpack the three nodes and return the three nodes rotated to remove the red violation.

Variable Meaning Value
A R1's Left node undefined
X R1's Data 1
B R1's Right node undefined
Y R2's Data R2
C R2's Right node undefined
Z B3's Data 3
D B3's Right node undefined

Unpacking from the pattern match, we then pack up a new tree structure according to the clause's definition:

#node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};

This essentially replaces the black node having a red child and red grandchild with a red node that has two black nodes for children. The next illustration shows the result of the balance call:

   R2
  /   \
B1    B3

Now, the bubbling up of the red node may cause a red violation with the parent and grandparent of R2, which will be handled as the recursion of Ins() unwinds. When the initial Ins() ends (line 28), the resulting node is the root of the new tree, which we then unpack into the data, left, and right fields. We then force the root node to be black by creating the root node with the unpacked parts, forcing color=black. So there you have it. Simplified, with a lot of things glossed over/ignored, but this is my port of the red-black code listing on page 28. Looking forward to progressing through the book more and increasing my data structure- and Erlang-fu. Further fun: Read Chris's reflections 10 years after publishing his book.