Archive for the ‘Research’ Category

Erlang and Webmachine

Friday, April 23rd, 2010

I’m currently working on a small startup project, for one to meet a need of some acquaintances, but more importantly to learn me some Erlang with regards to the web.

While I’m further along than I actually expected to be, I thought I’d begin documenting the steps I’ve taken towards building this app.

The current nerdities I’m using:

Installation of all of these on a GNU/Linux system is pretty straightforward, so I won’t cover that here. Defaults were used for Erlang. I installed the other libraries/applications in ~/dev/erlang/lib and pointed $ERL_LIBS there in my .bashrc.

I did follow this guide for setting up Tsung. The BeeBole site has several other pages worth reading for developing web applications in Erlang.

Once installed, build the webmachine project:

$WEBMACHINE_HOME/scripts/new_webmachine.erl wm_app /path/to/root
cd /path/to/roow/wm_app
make
./start.sh

You now have a working project! Of course, I like to have my Erlang shell inside of emacs while I’m developing, so I added a comment to the start.sh script that contained the shell parameters. My start.sh looks like this:

#!/bin/sh
 
# for emacs C-c C-z flags:
# -pa ./ebin -pa ./priv/templates/ebin -boot start_sasl -s wm_app
 
cd `dirname $0`
exec erl -pa $PWD/ebin $PWD/deps/*/ebin $PWD/deps/*/deps/*/ebin $PWD/priv/templates/ebin -boot start_sasl -s wm_app

I currently have all of my dependencies in $ERL_LIBS; when I deploy this to production, I’ll add the libs to the wm_app/deps as either a symlink or copied into the directory.

To have the custom shell means you need the .emacs code to start an Erlang shell with custom flags.

Important note: If you need to specify multiple code paths in the -pa arg, you have to use a -pa for each path, unlike in the shell command version where any path after the -pa (or -pz) is added.

Another caveat: when starting the Erlang shell within emacs, if you’re currently in a erlang-related buffer (.erl, .hrl, etc), the default shell is started without the option to set flags. I typically have the start.sh open anyway to copy the flags so I don’t run into this much anymore; I’m documenting it here just in case anyone stumbles on it.

Now you have a shell within which to execute commands against your webmachine app, load updated modules, etc.

Coming up, I’ll talk about how I’m using ErlyDTL to create templates and using CouchDB/Couchbeam for the document store.

Purely Functional Data Structures and Me

Tuesday, March 23rd, 2010

Quick post to say that I’ve put my dabblings into Chris Okasaki’s Purely Functional Data Structures book up as a github repo. I am picking and choosing what exercises and examples to code, with the goal being to slowly cover the whole book. Some concessions are made to fit the ideas into Erlang (like recursively calling an anonymous function), but overall I think the ideas fit nicely.

There is a streams lib from Richard Carlsson that I found on the Erlang mailing list in the repo as well that I used for reference for a couple things in chapter 4. I stuck with my streams being represented either as an empty list [] or [term() | fun()] with term() being the calculated value and fun() being the suspension function, instead of the tuple that Richard chose. After reading further in the thread, I understand why (don’t confuse people that might use proper list functions on streams) but for my small examples, it was enough to use lists.

PHP, cURL, and POST

Friday, February 12th, 2010

While working on a script today that had been working, I couldn’t for the life of me figure out why it was failing. It uses the PHP curl_* functions to make various requests and processes the results. Turns out when you send a POST body with the CURLOPT_POSTFIELDS and a value field begins with an at symbol(@), you have to escape it (\@). The reason is the at symbol is used by curl to denote a file upload path (“@/path/to/upload.file”). So escape the at symbol and you should be back to good with the curling.

Adding Files To Subversion

Wednesday, February 10th, 2010

Working with symfony, especially when adding to the schema and generating the model, form, and filter classes, it becomes tedious to add each of the new files to your subversion repository. Here’s a succinct line to add all un-versioned files to your repo:

#!/bin/sh
 
svn add `svn st | grep ? | head | awk '{print $2}'`

The key is in the tick-marked section. It takes the output of svn st(atus) and pipes it to grep, selecting only the un-versioned files (denoted by the ?), pipes that to head which outputs the first 10 (by default) lines, and pipes that to awk which prints the second column containing the file path to be added.

But what if you have more than 10 files to add? You can easily pass a -n NUM switch to the head command to increase the number of lines it reads in at a time. I’ll leave it as an exercise to the reader to modify the script to allow a user to pass in what NUM should be.

So when your “svn st” output is filled with un-versioned files, all of which you need, run this little guy and have it done speedily.

More Erlang+Emacs

Wednesday, February 3rd, 2010

I have found that Distel’s built-in shell launcher wasn’t cutting mustard as I needed to start shells with various flags and didn’t see an easy way to accomplish this using what Distel provided. Digging around the Erlang mailing list, I found an elisp function that allowed me to pass flags to the shell. Place this snippet in your .emacs file after you’ve required erlang and distel:

(defun erl-shell-with-flags (flags)
  "Start an erlang shell with flags"
  (interactive (list (read-string "Flags: ")))
  (set 'inferior-erlang-machine-options (split-string flags))
  (erlang-shell))
 
;; map Ctrl-c Ctrl-z to the new function
(global-set-key "\C-c\C-z" 'erl-shell-with-flags)

Now when you start the erlang shell, a “Flags: ” prompt will be presented. Simply add flags as you would on the command line and the shell will start up. Great for when you need multiple shells with different snames, names, cookies, etc…

Connect to remote erlang shell while inside emacs

Thursday, January 7th, 2010

While developing my top secret project, I have been getting into the fun stuff in Erlang and Emacs. Connecting to a running instance of my app from a remote shell wasn’t straightforward to me at first, so below is my documented way of connecting, as well as dropping into the Erlang JCL from within an Emacs erlang shell.

  1. Start yaws: yaws –daemon -sname appname –conf /path/to/yaws.conf
  2. Start emacs, and from within emacs start an Erlang shell with C-c C-z (assuming you have distel configured).
  3. From the Emacs erlang shell, get into Erlang’s JCL by typing C-q C-g and pressing enter. A ^G will be printed at the prompt, but won’t be evaluated until you press enter. You should see the familiar JCL prompt “User switch command –>”.
  4. Type ‘j’ to see current jobs you have running locally, which is probably just the current shell (1 {shell,start,[init]}).
  5. Type ‘r appname@compy’ to connect to the remote node identified by appname ( from the -sname parameter ) on the computer compy (usually whatever hostname returns)
  6. Type ‘j’ to see current jobs, which should list your current shell as “1 {shell,start,[init]}”, and a second shell “2* {appname@compy,shell,start,[]}”.
  7. Type ‘c 2′ to connect to the remote shell. You can now run commands in the node’s shell. You may have to press enter again to bring up a shell prompt.
james@compy 14:33:34 ~/dev/erlang/app
> yaws --daemon -sname app --conf config/yaws.conf
 
james@compy 14:34:00 ~/dev/erlang/app
> emacs
Eshell V5.7.4  (abort with ^G)
1> ^G
 
User switch command
 --> j
   1* {shell,start,[init]}
 --> r app@compy
 --> j
   1  {shell,start,[init]}
   2* {app@compy,shell,start,[]}
 --> c 2
 
1>

PHP’s json_last_error

Monday, December 28th, 2009

A quick note that I hope Google picks up concerning php’s json_last_error function. I was trying to debug a json string I was decoding with json_decode, but was getting NULL. When I tried to use the json_last_error(), a fatal undefined function error was returned. The reason: json_last_error doesn’t exist in php versions < 5.3. Ah, version numbers! So, check your php version if the function is undefined.

Simple, yet a detail easily overlooked.

Erlang, Primes, and Sieves Again (Lazy edition)

Thursday, December 17th, 2009

I was working on the Project Euler problems, specifically problem 7, and decided to implement a prime number generator based of my previous prime number implementation based off the Melissa O’Neill paper.

The problem says that, given that the 6th prime number is 13, find the 10,001st prime. Since we don’t have a known upper bound to use the primes:queue/1 function, I created a lazy sieve that would return the next prime, its position, and an iterator. The code:

43
44
45
46
47
48
49
50
51
52
53
54
-export([lazy_sieve/0]).
 
lazy_sieve() ->
    Table = insert_prime(2, skew_kv:empty()),
    [ { 2, 1 } | fun() -> lazy_sieve({ 3, 2 }, Table) end ].
 
lazy_sieve({ X, Pos }, Table) ->
    {NextComposite, _Value} = skew_kv:min(Table),
    case  NextComposite =< X of
	true -> lazy_sieve({X+1, Pos}, adjust(Table, X));
	_Else -> [ {X, Pos} | fun() -> lazy_sieve({X+1, Pos+1}, insert_prime(X,Table)) end]
    end.

This can be used thusly:

1>[{Prime1, Pos1} | Next1] = primes:lazy_sieve().
% Prime1 = 2, Pos1 = 1, Next1 = fun()
2>[{Prime2, Pos2} | Next2] = Next1().
% Prime2 = 3, Pos2 = 2, Next2 = fun()
3>[{Prime3, Pos3} | Next3] = Next2().
% Prime3 = 5, Pos3 = 3, Next3 = fun()

This makes defining a function to find the nth prime trivial:

58
59
60
61
62
63
64
65
-export([nth/1]).
 
nth(N) ->
    nth(lazy_sieve(), N).
 
nth([ {Prime, N} | _Next], N) -> Prime;
nth([ {_Prime, _Pos} | Next], N) ->
    nth(Next(), N).

Lazy evaluation twice, in the main functionality and in the value of a key in the queue. And the performance of nth is pretty good too (time is the left number, in mirco-seconds, and the 10,000st prime is the right number):

9> timer:tc(primes, nth, [10001]).
{1635839,104743} % 1.6 seconds
10> timer:tc(primes, queue, [104743]).
{1788997, % 1.7 seconds to generate the full list of primes
 [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
  79,83,89,97,101,103|...]}

So it takes 1.6 seconds to find the 10,001st prime and 1.7 seconds to generate the list of primes up to the 10,001st prime.

Red-black Trees in Erlang

Tuesday, December 1st, 2009

Working through Chris Okasaki’s “Purely Functional Data Structures“, I found that I couldn’t find an Erlang version of the red-black tree implementation Chris shows on page 28.

The code, also available as a pastie and download (with more comments of mine) .erl and .hrl:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
%% redblack.hrl
-record(node, {color, data, left, right}).
 
%% redblack.erl
%% Adapted from Okasaki "Purely Functional Data Structures" p. 28
member(_X, undefined) ->
    false;
member(X, #node{left=A, data=Y}) when X < Y ->
    member(X, A);
member(X, #node{data=Y, right=B}) when X > Y ->
    member(X, B);
member(_X, _S) ->
    true.
 
insert(X, undefined) ->
    #node{color=black, data=X};
insert(X, S) ->
    % to do recursive anonymous functions, pass the created fun in as 2nd parameter
    Ins = fun(undefined, _F) ->
                  #node{color=red, data=Data}; % insert the new data as a red node
             (#node{color=Color, left=A, data=Y, right=B}, F) when X < Y->
                  balance(Color, F(A, F), Y, B);
             (#node{color=Color, left=A, data=Y, right=B}, F) when X > Y->
                  balance(Color, A, Y, F(B, F));
             (Node, _F) ->
                  Node
          end,
    #node{left=A, data=Y, right=B} = Ins(S, Ins),
    #node{color=black, left=A, data=Y, right=B}.
 
%% detect Black->Red->Red Patterns and balance the situation
balance(black, #node{color=red, left=#node{color=red, left=A, data=X, right=B}, data=Y, right=C}, Z, D) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, #node{color=red, left=A, data=X, right=#node{color=red, left=B, data=Y, right=C}}, Z, D) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=#node{color=red, left=B, data=Y, right=C}, data=Z, right=D}) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(black, A, X, #node{color=red, left=B, data=Y, right=#node{color=red, left=C, data=Z, right=D}}) ->
    #node{color=red, left=#node{color=black, left=A, data=X, right=B }, data=Y, right=#node{color=black, left=C, data=Z, right=D}};
balance(Color, Left, Data, Right) ->
    #node{color=Color, left=Left, data=Data, right=Right}.

So, I’m going to go through this, mainly to prove to myself that I kinda get it, and because Ben will complain I didn’t dumb it down enough for him.

Line 2 defines the node structure in an Erlang header file. Comparable to a struct in C.

Member

Lines 6-13 define the member function. Each member() -> … is a function clause, essentially a case statement on steroids. The first clause says if the tree is undefined, then whatever X is, it is not a member of undefined, so return false.
The second clause unpacks the current node ( #node{left=A, data=Y} ) and assigns the values of left and data to A and Y, respectively (I have tried to maintain Chris’s variable names as much as possible, though I think Erlang’s record syntax complements the Standard ML syntax found in the book). The clause then compares Y, the node’s data, to the query X, and if X < Y, checks the left branch (the variable A) of the node.
The third clause is nearly identical, except it handles when X > Y, and checks the right branch (B) for X’s existence.
The fourth and final clause handles a match (because when X is not greater than or less than Y, it must be equal), returning true. Simple enough to follow and grok.

Insertion

When I read several different articles online concerning red-black trees, the insertion routine sounded quite complex to me. I was quite pleased when I read the algorithm in the book, as it crystallized my thoughts and allowed me to grok the action.

The first clause of the insert function handles the empty (or undefined) tree, creating and returning a black node (since the root of the tree must be black).
The second clause takes the new data X and a tree S and creates a new function, stored in the var Ins. Now, calling an anonymous function recursively is tricky, and I will explain why this part varies slightly from the text. Since there’s no way to refer to the function (as it isn’t named), we have to pass the anonymous function as a second parameter to the anonymous function, which is what happens on line 28. So when you see F(A, F), mentally think Ins(A).
The first parameter to Ins() is our current node in the tree. If it is undefined, we insert our new data as a red node. Why? That is discussed at length elsewhere, but it boils down to being easier to fix a red violation than a black violation (look up the constraints a red-black tree has on it, and the violations will make more sense).
The second clause unpacks the node and makes our less than/greater than comparison like a normal BST. What’s different is that we call the appropriate recursive insert call (remember F(A, F)? This inserts X somewhere into the left side of Node, and F(B, F) inserts X somewhere into the right side), and pass the result to the balance function.
The final clause handles a match between X and Y and returns the Node and its branches unmodified.

Balance

Why do we need a balance function? In case our data being inserted into the tree is more ordered than we may have expected. A vanilla BST with inputs [3,2,1] will be a glorified list. We force the situation by inserting a red node in the tree and causing a situation where the parent of the newly inserted red node is also a red node. This violates the constraint that a red node’s children must be black nodes. There are four possible combinations of this violation:

  1. (B)-Left->(Red)-Left->(Red)
  2. (B)-Left->(Red)-Right->(Red)
  3. (B)-Right->(Red)-Left->(Red)
  4. (B)-Right->(Red)-Right->(Red)

Each of those possibilities maps to the first four clauses of the balance function. The fifth clause passes a node back unchanged. So let’s look at the first clause. This clause is activated when a black node’s left child is red, and that red node’s left child is also red.

      B3
     /   \
   R2     *
  /   \
R1     *

So we unpack the three nodes and return the three nodes rotated to remove the red violation.

Variable Meaning Value
A R1′s Left node undefined
X R1′s Data 1
B R1′s Right node undefined
Y R2′s Data R2
C R2′s Right node undefined
Z B3′s Data 3
D B3′s Right node undefined

Unpacking from the pattern match, we then pack up a new tree structure according to the clause’s definition:

33
#node{color=red, left=#node{color=black, left=A, data=X, right=B}, data=Y, right=#node{color=black, left=C, data=Z, right=D}};

This essentially replaces the black node having a red child and red grandchild with a red node that has two black nodes for children. The next illustration shows the result of the balance call:

      R2
     /   \
   B1    B3

Now, the bubbling up of the red node may cause a red violation with the parent and grandparent of R2, which will be handled as the recursion of Ins() unwinds.

When the initial Ins() ends (line 28), the resulting node is the root of the new tree, which we then unpack into the data, left, and right fields. We then force the root node to be black by creating the root node with the unpacked parts, forcing color=black.

So there you have it. Simplified, with a lot of things glossed over/ignored, but this is my port of the red-black code listing on page 28. Looking forward to progressing through the book more and increasing my data structure- and Erlang-fu.

Further fun: Read Chris’s reflections 10 years after publishing his book.

Symfony/Propel Memory Issues

Tuesday, October 6th, 2009

I have a bulk import process that needs running nightly. Currently there are around 4,200 “rows” to process, which actually can encompass many tables so row is not entirely appropriate. The problem is that the script poops out after ~200 “rows” with memory limit errors. While increasing the memory limit is do-able, I am not interested in that as a solution currently.

First, I recorded some numbers to benchmark where the script was in memory consumption. The first number is the “row”, the peak is how much memory the script has allocated, and the mem is the delta in memory allocation per row.

0 peak: 41,103,408 mem: 29,884,416
1 peak: 42,440,264 mem: 1,310,720
2 peak: 43,613,848 mem: 1,310,720
3 peak: 43,893,960 mem: 262,144
4 peak: 44,223,040 mem: 262,144
5 peak: 44,896,296 mem: 786,432
6 peak: 45,671,560 mem: 786,432
7 peak: 45,865,952 mem: 0
8 peak: 46,418,272 mem: 786,432
9 peak: 47,917,888 mem: 1,310,720
10 peak: 48,566,312 mem: 786,432

After the initialization phase, memory allocation increases fairly steadily and it is not long until the 128M memory limit is reached. This is unacceptable as I know some “rows” should be much closer to 0 as nothing is imported on the majority of rows.

My first solution was to disable logging:

sfConfig::set('sf_logging_enabled', FALSE);

The initial memory allocation was decreased, but the running deltas remained higher than expected.

Second, I inserted a ton of unset() calls in the various functions. This dropped my deltas a little:

0 peak: 30,611,472 mem: 20,971,520
1 peak: 31,969,760 mem: 1,310,720
2 peak: 32,180,112 mem: 262,144
3 peak: 32,333,216 mem: 262,144
4 peak: 32,525,144 mem: 262,144
5 peak: 32,656,616 mem: 0
6 peak: 32,863,736 mem: 262,144
7 peak: 33,103,264 mem: 262,144
8 peak: 33,455,544 mem: 262,144
9 peak: 33,754,288 mem: 262,144
10 peak: 33,984,976 mem: 262,144

But allocation still killed the import before it could complete. Browsing through various sites, I discovered Propel had a hard time cleaning up circular references, which meant PHP couldn’t garbage-collect that memory. However, to combat this, Propel 1.3 offers a static method disableInstancePooling that allowed me to override Propel’s desire to keep instances around.

Adding:

Propel::disableInstancePooling();

to the beginning of the import gave me these results:

0 peak: 26,569,632 mem: 17,301,504
1 peak: 28,582,152 mem: 2,097,152
2 peak: 30,455,352 mem: 1,835,008
3 peak: 30,536,176 mem: 0
4 peak: 30,536,176 mem: 0
5 peak: 31,517,088 mem: 1,048,576
6 peak: 31,534,152 mem: 0
7 peak: 31,552,120 mem: 0
8 peak: 31,589,632 mem: 0
9 peak: 31,695,504 mem: 0
10 peak: 31,695,504 mem: 0

Now new memory was allocating only when the import was actually doing something of significance. In fact, watching the import proceed with the deltas displayed, I could observe the memory decreasing at times, prolonging the life of the script by orders of magnitude. Whereas before the script was processing ~200 “rows”, it currently processes the whole batch (4,237 “rows” currently) in one go.

As the importable “rows” increase, I know I won’t be butting up against memory limits for some time.