CouchDB/BigCouch Bulk Insert/Update

January 27th, 2012

While writing a bulk importer for Crossbar, I took a look at squeezing some performance out of BigCouch for the actual inserting of documents into the database. My first time running all the documents into BigCouch at the same time resulted in some poor performance, so I went digging around for some ideas on how to improve the insertions. Reading up on the High Performance Guide for CouchDB (which BigCouch is API-compliant with), I started to play with chunking my inserts up to get better overall execution time.

Note: the following are very unscientific results, but I think are fairly instructive for what one might expect.

Docs Per Insertion    Elapsed Time (ms)
26618                 107176
1000                  8325
1500                  5679
2000                  3087
2500                  1644

Based on the CouchDB guide, I decided not to pursue this further, as dropping insertion time by nearly two orders of magnitude (107176 ms down to 1644 ms, roughly 65x) was fine enough for me! I may have to bake this into the platform natively.

For those interested in the Erlang code, it is pretty simple. Taking a list of documents to save, use lists:split/2 to try to split the list at the threshold. lists:split/2 raises badarg when the list is shorter than the split point, so by catching the error we know the list is under our threshold and can save the remaining list to BigCouch. Otherwise, lists:split/2 chunks our list into one for saving, and one for recursing back into the function. Since we don’t really care about the results of couch_mgr:save_docs/2, we could put the calls in the second clause of the case in a spawn to speed this up (relative to the calling process).

-spec save_bulk_rates/1 :: (wh_json:json_objects()) -> no_return().
save_bulk_rates(Rates) ->
    case catch(lists:split(?MAX_BULK_INSERT, Rates)) of
        {'EXIT', _} ->
            couch_mgr:save_docs(?WH_RATES_DB, Rates);
        {Save, Cont} ->
            couch_mgr:save_docs(?WH_RATES_DB, Save),
            save_bulk_rates(Cont)
    end.
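
For comparison, here is a minimal sketch of the same chunking idea without relying on catch. The module name bulk_chunks and the functions chunks/2 and save_in_chunks/3 are hypothetical names made up for illustration. SaveFun stands in for something like fun(Docs) -> couch_mgr:save_docs(Db, Docs) end, which is not called here so the example stays self-contained.

```erlang
-module(bulk_chunks).
-export([chunks/2, save_in_chunks/3]).

%% Split List into sublists of at most Max elements, preserving order.
%% An empty input yields no chunks, so nothing is saved for it.
-spec chunks(pos_integer(), [T]) -> [[T]].
chunks(Max, List) when is_integer(Max), Max > 0, is_list(List) ->
    chunks(Max, List, length(List), []).

%% Len tracks how many elements remain, so lists:split/2 is only
%% called when the list is known to be longer than Max.
chunks(_Max, [], 0, Acc) ->
    lists:reverse(Acc);
chunks(Max, List, Len, Acc) when Len =< Max ->
    lists:reverse([List | Acc]);
chunks(Max, List, Len, Acc) ->
    {Chunk, Rest} = lists:split(Max, List),
    chunks(Max, Rest, Len - Max, [Chunk | Acc]).

%% Apply SaveFun to each chunk in turn (hypothetical stand-in for the
%% real save call; its return value is ignored).
-spec save_in_chunks(pos_integer(), [T], fun(([T]) -> any())) -> ok.
save_in_chunks(Max, Docs, SaveFun) ->
    lists:foreach(SaveFun, chunks(Max, Docs)).
```

Because the chunks are computed up front, a list whose length is an exact multiple of the threshold does not end with a save of an empty list. Passing a SaveFun that wraps the real save in spawn/1 would hand each chunk to its own process, at the cost of not seeing any errors from the save.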

Life Update

January 26th, 2012

Updated the blog to run 3.3.1 – lot of cobwebs around these parts. Hopefully I can be more proactive in blogging about things going on at work, and perhaps starting to write about what I’m up to personally (not that I have much of that right now). Maybe my Google stats will jump over the 0.3 hits I average! Dare to dream!

cURL stripping newlines from your CSV or other file?

January 26th, 2012

I’m in the process of writing a REST endpoint for uploading CSVs to Crossbar as part of our communications platform at 2600hz. Not wanting to invoke the full REST client interface, I generally use cURL to send the HTTP requests. Today, however, I had quite the time figuring out why my CSV files were being stripped of their newline characters.

The initial invocation:

$> curl http://localhost:8000/v1/path/to/upload -H "Content-Type: text/csv" -X POST -d @file.csv

Walking through the code, from where I was processing the CSV down to the webserver handling the connection itself, looking for who was stripping the newlines, I determined it was coming in sans-newlines and decided to check out cURL’s man pages for what might be amiss. I quickly found that the -d option was treating the file as ascii, and although the docs don’t explicitly say so, it appears this option will strip the newlines.

The resolution is to use the --data-binary flag so cURL doesn’t touch the file before sending it to the server.

Cron and infinite loops do not mix

March 9th, 2011

More “expert” code time! From the “expert”:

Please put this script in a cron to run every minute

while true; do
  rsync -a server:remote_dir local_dir
  sleep $freq
done

local_dir is going to be really, really, really up to date after a few minutes…the server crash will be epic. Perhaps we should write a script to find and kill these rogue processes and run it every minute too, but stagger it with the other cron…

You get paid for this?

March 7th, 2011

Spotted in some high-priced “expert”’s code:

switch ($retcode)
{
    case -1:
    case -3:
        if ($retcode==-1)
            log("SOME_CODE", "SOME MSG");
        else
            log("SOME_OTHER_CODE", "SOME OTHER MSG");
...

Resolving Dialyzer “Function foo/n has no local return” errors

November 23rd, 2010

Dialyzer is a great static analysis tool for Erlang and has helped me catch many bugs related to what types I thought I was passing to a function versus what actually gets passed. Some of the errors Dialyzer emits are rather cryptic at first (as seems commonplace in the Erlang language/environment in general) but after you understand the causes of the errors, the fix is easily recognized.

My most common error is Dialyzer inferring a different return type than what I put in my -spec, followed by Dialyzer telling me the same function has no local return. An example:

foo.erl:125: The specification for foo:init/1 states that the function might also return {'ok',tuple()} but the inferred return is none()
foo.erl:126: Function init/1 has no local return

The init/1 function (for a gen_server, btw), at lines 124-126 of foo.erl:

-spec(init/1 :: (Args :: list()) -> tuple(ok, tuple())).
init(_) ->
  {ok, #state{}}.

And the state record definition:

-record(state, {
  var_1 = {} :: tuple(string(), tuple())
  ,var_2 = [] :: list(tuple(string(), tuple()))
}).

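Spot the error? In the record definition, var_1 is initialized to an empty tuple and var_2 to an empty list, yet the field types do not take those defaults into account: {} is not a two-element tuple. (Strictly, list(T) already admits the empty list, so the extra | [] on var_2 is redundant but harmless.) The corrected version: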

-record(state, {
  var_1 = {} :: tuple(string(), tuple()) | {}
  ,var_2 = [] :: list(tuple(string(), tuple())) | []
}).

And now Dialyzer stops emitting the spec error and the no local return error.
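
The tuple(A, B) and list(T) forms above are the older type notation. As a rough sketch, the same fix written in the notation current Erlang/OTP releases accept might look like this. The module name foo_state is a placeholder for illustration; the record and init/1 mirror the ones in this post.

```erlang
-module(foo_state).
-export([init/1]).

-record(state, {
    %% The default {} is not a {string(), tuple()}, so it has to be
    %% listed as an alternative for the default to be well-typed.
    var_1 = {} :: {string(), tuple()} | {},
    %% [T] already includes the empty list, so no alternative is needed.
    var_2 = [] :: [{string(), tuple()}]
}).

%% With the defaults covered by the field types, the inferred return
%% of init/1 agrees with the spec.
-spec init(list()) -> {ok, #state{}}.
init(_Args) ->
    {ok, #state{}}.
```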

IT Expo

October 7th, 2010

Just returned from IT Expo West last night. Three days of learning, hob-nobbing, and talking myself hoarse about the awesomeness that is 2600hz. We got a decent writeup posted on TMC’s site, met quite a few people, collected beaucoup business cards, and generally had a fun time hanging with the team. Super tired but ready to keep building the best hosted PBX software platform!

Bonus: See Darren’s awesome (yet mildly awkward) video interview!

Also, VoIP service providers looking to offset calling costs for their business clients can look at PromoCalling as a way to compete with Google and Skype’s free calling plans.

Still Kicking

September 17th, 2010

I am still alive and well; just busy. I did write a blog entry for my company, 2600hz. More to come…eventually.

Erlang and Webmachine

April 23rd, 2010

I’m currently working on a small startup project, for one to meet a need of some acquaintances, but more importantly to learn me some Erlang with regards to the web.

While I’m further along than I actually expected to be, I thought I’d begin documenting the steps I’ve taken towards building this app.

The current nerdities I’m using:

Installation of all of these on a GNU/Linux system is pretty straightforward, so I won’t cover that here. Defaults were used for Erlang. I installed the other libraries/applications in ~/dev/erlang/lib and pointed $ERL_LIBS there in my .bashrc.

I did follow this guide for setting up Tsung. The BeeBole site has several other pages worth reading for developing web applications in Erlang.

Once installed, build the webmachine project:

$WEBMACHINE_HOME/scripts/new_webmachine.erl wm_app /path/to/root
cd /path/to/root/wm_app
make
./start.sh

You now have a working project! Of course, I like to have my Erlang shell inside of emacs while I’m developing, so I added a comment to the start.sh script that contained the shell parameters. My start.sh looks like this:

#!/bin/sh
 
# for emacs C-c C-z flags:
# -pa ./ebin -pa ./priv/templates/ebin -boot start_sasl -s wm_app
 
cd `dirname $0`
exec erl -pa $PWD/ebin $PWD/deps/*/ebin $PWD/deps/*/deps/*/ebin $PWD/priv/templates/ebin -boot start_sasl -s wm_app

I currently have all of my dependencies in $ERL_LIBS; when I deploy this to production, I’ll add the libs to the wm_app/deps as either a symlink or copied into the directory.

To get the custom shell, you need some .emacs code that starts an Erlang shell with custom flags.

Important note: If you need to specify multiple code paths in the -pa arg, you have to use a -pa for each path, unlike in the shell command version where any path after the -pa (or -pz) is added.

Another caveat: when starting the Erlang shell within emacs, if you’re currently in an Erlang-related buffer (.erl, .hrl, etc), the default shell is started without the option to set flags. I typically have the start.sh open anyway to copy the flags so I don’t run into this much anymore; I’m documenting it here just in case anyone stumbles on it.

Now you have a shell within which to execute commands against your webmachine app, load updated modules, etc.

Coming up, I’ll talk about how I’m using ErlyDTL to create templates and using CouchDB/Couchbeam for the document store.

Erlang, Euler, Primes, and gen_server

March 26th, 2010

I have been working on Project Euler problems for a while now and many of them have centered around prime numbers. I’ve referenced my work with the sieve in other posts but found with a particular problem that some of my functions could benefit from some state being saved (namely the sieve being saved and not re-computed each time).

The problem called for counting the prime factors of numbers and finding consecutive numbers that had the same count of prime factors. My primes module had a prime_factors/1 function that would compute the prime factors and the exponents of those factors (so 644 = 2^2 * 7 * 23, and primes:prime_factors(644) would return [{2,2},{7,1},{23,1}]). The prime_factors/1 function looked something like this:

prime_factors(N) ->
    CandidatePrimes = primes:queue(N div 2),
    PrimeFactors = [ X || X <- CandidatePrimes, N rem X =:= 0 ],
    find_factors(PrimeFactors, N, 0, []).

The call to find_factors/4 takes the factors and finds the exponents and can be ignored for now. The time sink, then, is in generating the CandidatePrimes list. I think my primes:queue/1 function is pretty fast at generating the sieve, and dividing N by two eliminates a lot of unnecessary computation, but when you’re calling prime_factors/1 thousands of times, the calls to queue/1 begin to add up. This is where I needed to save some state (the sieve) in between calls. Erlang, fortunately enough, has a module behavior called gen_server that abstracts away a lot of the server internals and lets you focus on the business bits of the server. I won’t discuss it much here as I’m not an authority on it, but Joe Armstrong’s book and the Erlang docs have been a great help in understanding what’s happening behind the scenes. You can view the prime_server module to see what its current state is code-wise.

To speed up prime_factors/1, I split it into two functions: prime_factors/1 and prime_factors/2. The functions look like this:

prime_factors(N, CandidatePrimes) ->
    PrimeFactors = [ X || X <- CandidatePrimes, N rem X =:= 0 ],
    find_factors(PrimeFactors, N, 0, []).
 
prime_factors(N) ->
    prime_factors(N, primes:queue(N div 2)).

Now, if we don't need to save the queue between calls you can still call prime_factors/1 as usual. The prime_server module utilizes the prime_factors/2 function because it initializes its state to contain a primes sieve (either all primes under 1,000,000 if no arg is passed to start_link, or all primes =< N when using start_link/1) and the current upper bound. Now when the server handles the call for getting the factors, we pass a pared down list of primes to prime_factors/2 and get a nice speed boost.

The cost is that the heavy lifting is front-loaded in the initialization of the server (generating the sieve) and in calls that increase the sieve's size. One improvement there might be to save the Table generated during the initial sieve creation and start the loop back up from where it left off (when N > UpTo) but that is for another time. If you choose your initial value for start_link right, regenerating the sieve should be unnecessary.

The last speed boost was noticing that calculating the exponents was an unnecessary step so I wrote a count_factors/1 and count_factors/2 that skips the call to find_factors/4 and returns the length of the list comprehension.
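
To make the idea concrete, here is a minimal sketch of a gen_server that keeps a list of primes in its state and counts distinct prime factors against it. It is not the prime_server module mentioned above: the module name prime_cache, its API, and the simple filter-based sieve/1 (a slow stand-in for primes:queue/1) are all assumptions made for illustration.

```erlang
-module(prime_cache).
-behaviour(gen_server).

-export([start_link/1, count_factors/1, stop/0]).
-export([init/1, handle_call/3, handle_cast/2]).

%% State: the current upper bound and every prime =< that bound.
-record(state, {upto :: non_neg_integer(), primes :: [pos_integer()]}).

start_link(UpTo) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, UpTo, []).

%% Count the distinct primes =< N div 2 that divide N.
%% Mirrors the post's N div 2 bound, so a prime N itself reports 0.
count_factors(N) ->
    gen_server:call(?MODULE, {count_factors, N}).

stop() ->
    gen_server:stop(?MODULE).

%% The expensive work happens once, when the server starts.
init(UpTo) ->
    {ok, #state{upto = UpTo, primes = sieve(UpTo)}}.

handle_call({count_factors, N}, _From, #state{upto = UpTo} = State0) ->
    Bound = N div 2,
    %% Regenerate the prime list only when the request exceeds the bound.
    State = case Bound > UpTo of
                true -> State0#state{upto = Bound, primes = sieve(Bound)};
                false -> State0
            end,
    %% Pare the cached primes down to the candidates for this N.
    Candidates = lists:takewhile(fun(P) -> P =< Bound end,
                                 State#state.primes),
    Count = length([P || P <- Candidates, N rem P =:= 0]),
    {reply, Count, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Naive prime generation by repeated filtering; fine for a demo,
%% far slower than a real sieve of Eratosthenes.
sieve(N) when N < 2 -> [];
sieve(N) -> sieve(lists:seq(2, N), []).

sieve([], Acc) -> lists:reverse(Acc);
sieve([P | Rest], Acc) ->
    sieve([X || X <- Rest, X rem P =/= 0], [P | Acc]).
```

After prime_cache:start_link(1000), repeated calls to prime_cache:count_factors/1 reuse the stored list and only rebuild it when N div 2 exceeds the current bound, which is the trade-off described above.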

With these changes complete, problem 47 went from taking well over 5 minutes to just under 20 seconds to solve by brute force.