Bulk inserts in MongoDB, CouchDB, and ArangoDB


In the last couple of posts, we looked at ArangoDB’s performance for individual document insert, delete, and update operations. This time we’ll be looking at batched inserts. As a reference, we’ll compare ArangoDB’s results to what can be achieved with CouchDB and MongoDB.

Test setup

We used the bulk insert benchmark tool to generate results for MongoDB, CouchDB, and ArangoDB. The benchmark tool uses the HTTP bulk document APIs for CouchDB and ArangoDB, and the binary protocol for MongoDB (as MongoDB does not have an HTTP bulk API). The benchmark tool was run on the same machine as the database servers, so network latency can be ruled out as an influencing factor. The test machine specifications are:

  • Linux Kernel 2.6.37.6-0.11, cfq scheduler
  • 64 bit OS
  • 8x Intel(R) Core(TM) i7 CPU, 2.67 GHz
  • 12 GB total RAM
  • SATA II hard drive (7,200 RPM, 32 MB cache)
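
To make the HTTP bulk requests mentioned above more concrete, here is a minimal sketch of how a client might split documents into batches and build one request per batch. It is not the benchmark tool's code. The helper names, the database/collection name "bench", and the batch size are hypothetical. The CouchDB path is its documented _bulk_docs endpoint; for ArangoDB the sketch assumes the bulk import endpoint with one JSON document per line, which may differ from what the benchmark tool actually called.

```python
import json


def chunked(docs, batch_size):
    """Yield consecutive batches of at most batch_size documents."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def couchdb_bulk_request(db, batch):
    """Build path and body for CouchDB: POST /{db}/_bulk_docs."""
    # CouchDB expects a JSON object with a "docs" array.
    return "/%s/_bulk_docs" % db, json.dumps({"docs": batch})


def arangodb_import_request(collection, batch):
    """Build path and body for an ArangoDB bulk import (assumed endpoint)."""
    # Assumption: POST /_api/import with type=documents,
    # body is one self-contained JSON document per line.
    path = "/_api/import?collection=%s&type=documents" % collection
    body = "\n".join(json.dumps(doc) for doc in batch)
    return path, body


if __name__ == "__main__":
    # Illustrative documents; attribute names are placeholders.
    docs = [{"_id": "doc%d" % i, "value": i} for i in range(5)]
    for batch in chunked(docs, 2):  # batch size 2 is an arbitrary choice
        print(couchdb_bulk_request("bench", batch))
        print(arangodb_import_request("bench", batch))
```

Only the request construction is shown; actually sending the requests (and the MongoDB binary-protocol equivalent via a driver) is left out on purpose.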

The total “net insert time” (the time the benchmark tool spends sending the request to the database and waiting for the database response, i.e. excluding the time needed to generate the document data) is reported for several datasets in the following charts.
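
The idea of a “net insert time” can be sketched as follows: only the call that sends a batch and waits for the response is timed, while producing the batch happens outside the timed region. This is an illustrative sketch, not the benchmark tool's implementation; net_insert_time and the send callback are hypothetical names.

```python
import time


def net_insert_time(batches, send):
    """Return the total seconds spent inside send(), summed over all batches.

    `batches` may be a generator: the time it takes to produce each batch
    (i.e. to generate the document data) is not included in the result.
    `send` is any callable that transmits one batch and blocks until the
    database has responded.
    """
    total = 0.0
    for batch in batches:              # batch generation happens here, untimed
        start = time.perf_counter()
        send(batch)                    # timed: request plus waiting for response
        total += time.perf_counter() - start
    return total
```

Passing the send function as a parameter keeps the timing logic identical for all three databases, whatever protocol the callback uses underneath.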

The database versions used for tests were:

  • MongoDB 2.1.3, with preallocation
  • CouchDB 1.2, with delayed_commits, without compression
  • ArangoDB 1.1-alpha, with waitForSync=false

The datasets tested can be categorised in three groups: small, medium, and big. The small datasets tested were:

Dataset name | Description | Number of documents
uniform_1000 | One attribute plus unique “_id” value | 1,000
uniform_10000 | Same, but 10,000 documents | 10,000
names_10000 | Person records containing names and addresses, artificially created with source data from the US Census Bureau, ZIP code and state lists | 10,000

The medium datasets tested were:

Dataset name | Description | Number of documents
enron | Enron e-mail corpus, published by the Federal Energy Regulatory Commission | 41,299
names_100000 | Person records containing names and addresses, artificially created with source data from the US Census Bureau, ZIP code and state lists | 100,000
names_300000 | Same, but 300,000 documents | 300,000
wiki_50000 | Wikipedia articles | 50,000

The big datasets tested consisted of:

Dataset name | Description | Number of documents
uniform_1000000 | One attribute plus unique “_id” value | 1,000,000
uniform_10000000 | Same, but 10,000,000 documents | 10,000,000
aol | Search engine queries published by AOL | 3,459,421
accesslogs | Apache web server access logs | 1,357,246

Results, small datasets

For the smallest dataset (uniform_1000), the results were almost on par, with CouchDB being slightly faster than MongoDB and ArangoDB. For the other small datasets tested, MongoDB was slightly faster than ArangoDB, and both were notably faster than CouchDB.

Results, medium datasets

For the medium datasets, MongoDB was fastest for the first two sets tested, and ArangoDB was fastest for the other two sets. CouchDB was slightly slower for two of the datasets, and substantially slower for the other two.

Results, big datasets

With the bigger datasets tested, ArangoDB had the lowest bulk insert times. MongoDB was slightly slower for three of the cases tested, and substantially slower for the other case (uniform_10000000). CouchDB consistently had the highest insertion time.

Conclusion

With the datasets tested, ArangoDB was on par with MongoDB (with MongoDB being slightly faster in some cases and ArangoDB in others). CouchDB was notably slower than MongoDB and ArangoDB, except in one case.

Caveats

These are benchmarks for specific datasets. The dataset volumes and types might or might not be realistic, depending on what you plan to do with a database. Results might look completely different for other datasets.

In addition, the benchmarks compare the HTTP API of CouchDB and ArangoDB against the binary protocol of MongoDB, which gives MongoDB a slight efficiency advantage. However, real-world applications will also use MongoDB’s binary protocol, so this is an advantage that MongoDB does have in real life (though it comes with the disadvantage that the protocol is not human-readable).

Furthermore, there are of course other aspects that deserve attention, such as datafile size and memory usage. These aspects haven’t been looked at in this post.

So please be sure to run your own tests in your own environment before adopting the results.

About Jan Steemann

Jan is a member of ArangoDB's core development team. He is an expert in data modelling with NoSQL and relational databases and in writing high-performance web applications. For ArangoDB, he wrote much of AQL (ArangoDB's query language).