CouchDB/BigCouch Bulk Insert/Update
While writing a bulk importer for Crossbar, I took a look at squeezing some performance out of BigCouch for the actual inserting of documents into the database. My first attempt, pushing all the documents into BigCouch in a single request, resulted in poor performance, so I went digging around for ideas on how to improve the insertions. Reading up on the High Performance Guide for CouchDB (which BigCouch is API-compliant with), I started chunking my inserts to get better overall execution time. Note: the following are very unscientific results, but I think they are fairly instructive for what one might expect.
| Docs Per Insertion | Elapsed Time (ms) |
|---|---|
| 26618 | 107176 |
| 1000 | 8325 |
| 1500 | 5679 |
| 2000 | 3087 |
| 2500 | 1644 |
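Under the hood, each of these chunked saves is just a POST to CouchDB's `_bulk_docs` endpoint. For the curious, here is a bare-bones sketch of that HTTP call using only the stdlib `httpc` client (the host, port, and hand-built JSON body are assumptions for illustration; a real client would use a proper JSON encoder):

```erlang
-module(bulk_example).
-export([bulk_insert/2]).

%% Sketch of the _bulk_docs request that a chunked save performs.
%% Assumes a BigCouch/CouchDB node on localhost:5984; DocsJson is a
%% pre-encoded JSON array of documents, e.g. "[{\"_id\":\"a\"}]".
bulk_insert(Db, DocsJson) ->
    inets:start(),
    Url = "http://localhost:5984/" ++ Db ++ "/_bulk_docs",
    Body = "{\"docs\":" ++ DocsJson ++ "}",
    httpc:request(post, {Url, [], "application/json", Body}, [], []).
```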
Based on the CouchDB guide, I decided not to pursue this further, as dropping insertion time by nearly two orders of magnitude was good enough for me! I may have to bake this into the platform natively.

For those interested, the Erlang code is pretty simple. Taking a list of documents to save, use lists:split/2 to try to split the list at the chunk threshold. If lists:split/2 throws (which we trap with catch), we know the list is shorter than our threshold and can save the remaining documents to BigCouch directly. Otherwise, lists:split/2 gives us one chunk to save and a tail to recurse back into the function with. Since we don't really care about the results of couch_mgr:save_docs/2, we could wrap the calls in the second clause of the case in a spawn to speed this up (relative to the calling process).
```erlang
-spec save_bulk_rates/1 :: (wh_json:json_objects()) -> no_return().
save_bulk_rates(Rates) ->
    case catch(lists:split(?MAX_BULK_INSERT, Rates)) of
        {'EXIT', _} ->
            couch_mgr:save_docs(?WH_RATES_DB, Rates);
        {Save, Cont} ->
            couch_mgr:save_docs(?WH_RATES_DB, Save),
            save_bulk_rates(Cont)
    end.
```
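The spawn variant mentioned above might look like the following (an untested sketch; couch_mgr and the macros are from the original snippet). Each full chunk's save runs in its own short-lived process, so the caller only pays for the list splitting:

```erlang
%% Sketch: fire each chunk's save in its own process, since we
%% discard the results of couch_mgr:save_docs/2 anyway.
-spec save_bulk_rates/1 :: (wh_json:json_objects()) -> no_return().
save_bulk_rates(Rates) ->
    case catch(lists:split(?MAX_BULK_INSERT, Rates)) of
        {'EXIT', _} ->
            %% fewer than ?MAX_BULK_INSERT docs left; save the remainder
            couch_mgr:save_docs(?WH_RATES_DB, Rates);
        {Save, Cont} ->
            spawn(fun() -> couch_mgr:save_docs(?WH_RATES_DB, Save) end),
            save_bulk_rates(Cont)
    end.
```

One caveat with this approach: the spawned processes are unsupervised, so failed saves go unnoticed, which is only acceptable because we ignore the results anyway.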