Comparing ArangoDB with CouchDB and MongoDB


The folks over at MongoDB have an article on their site on Comparing MongoDB and CouchDB.
They write:

“We are getting a lot of questions “how are mongo db and couch different?” It’s a good question: both are document-oriented databases with schemaless JSON-style object data storage. Both products have their place — we are big believers that databases are specializing and “one size fits all” no longer applies.”

The same applies to ArangoDB. We meet a lot of people, especially developers using MongoDB, who are interested in ArangoDB and ask us how it differs from other popular NoSQL databases. In this article we would like to add our answers to the questions asked.

Summary for the reader in a hurry

JavaScript

  • ArangoDB uses Google’s V8 JavaScript engine
  • JavaScript can be used to embed your own business logic and data retrieval operations (similar to stored procedures)
  • Using JavaScript is not required; the database can also be queried without it

Performance

Some figures:

  • Inserting 1 million documents => 24.3 seconds
    41,200 inserts/second
  • Reading 1 million documents => 93.6 seconds
    10,600 requests/second
  • Primary index lookup & return of the document using AQL
    1 million HTTP requests => 94.1 seconds
    10,600 requests/second
Overview by aspect: ArangoDB
Data organization
  • ArangoDB is a document store
  • documents can be connected, so they can be treated and queried as graphs
  • documents are uniquely identified by keys, allowing key/value storage
  • documents are organized in collections
Data types
  • JSON data types: numbers, UTF-8 strings, Boolean values, arrays/lists, documents with named attributes
Update & delete strategy
  • append-only in versions <= 1.2
  • MVCC in versions >= 1.3
  • automatic garbage collection, no manual compaction
Horizontal Scalability
  • automatic and configurable sharding
Querying
  • CRUD via REST (create, read, update, delete, replace)
  • query by example
  • AQL (ArangoDB query language)
  • graph queries (graph algorithms, connections, path traversal) accessible via HTTP REST API and from JavaScript
Durability
  • synchronization is configurable per collection
  • data files are written as journals in append-only fashion
  • writes are checksummed to ensure data integrity

Query Expression

ArangoDB offers a couple of options for getting data out of the database: it has a REST interface for CRUD operations and also allows “querying by example”. “Querying by example” means that you create a JSON document with the attributes you are looking for. The database will return all documents which look like this “example document”.
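The matching idea behind “querying by example” can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not ArangoDB's implementation; the `users` data and the function names `matches` and `by_example` are made up for the example.

```python
def matches(example, doc):
    """Return True if every attribute of the example has the same value in doc."""
    return all(key in doc and doc[key] == value for key, value in example.items())


def by_example(collection, example):
    """Return all documents that look like the example document."""
    return [doc for doc in collection if matches(example, doc)]


# Hypothetical collection of documents.
users = [
    {"_key": "1", "name": "Ann", "city": "Cologne", "active": True},
    {"_key": "2", "name": "Bob", "city": "Berlin", "active": True},
    {"_key": "3", "name": "Cem", "city": "Cologne", "active": False},
]

result = by_example(users, {"city": "Cologne", "active": True})
print([doc["name"] for doc in result])  # ['Ann']
```

Attributes that are missing from the example are simply ignored, which is why an empty example matches every document.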

And how about complex queries? Expressing complex queries as JSON documents can become painful, and it is almost impossible to support joins with this approach. We wanted a convenient and easy-to-learn way to execute even complex queries without the programming that an approach based on map/reduce requires. As ArangoDB supports multiple data models including graphs, it was sufficient neither to stick to SQL nor to simply implement UNQL. We ended up with the “ArangoDB query language” (AQL), a declarative language similar to SQL and JSONiq. AQL supports joins, graph queries, list iteration, results filtering, results projection, sorting, variables, grouping, aggregate functions, unions and intersections.
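To give a feeling for what such a declarative query expresses, here is a rough Python sketch of a join with filtering, sorting and projection. The AQL text in the comment is only indicative, and the collections `users` and `orders`, their attributes, and the function `big_orders` are hypothetical names chosen for this illustration.

```python
# Roughly the kind of query one might write in AQL (illustrative only):
#
#   FOR u IN users
#     FOR o IN orders
#       FILTER o.user == u._key && o.total >= 50
#       SORT o.total DESC
#       RETURN { name: u.name, total: o.total }

users = [{"_key": "1", "name": "Ann"}, {"_key": "2", "name": "Bob"}]
orders = [
    {"user": "1", "total": 80},
    {"user": "2", "total": 20},
    {"user": "1", "total": 55},
]


def big_orders(users, orders, minimum):
    """Join users with their orders, filter, project and sort by total (descending)."""
    rows = [
        {"name": u["name"], "total": o["total"]}  # projection
        for u in users                            # outer iteration
        for o in orders                           # inner iteration (the join)
        if o["user"] == u["_key"] and o["total"] >= minimum  # join condition + filter
    ]
    return sorted(rows, key=lambda row: row["total"], reverse=True)


print(big_orders(users, orders, 50))
```

The point of a declarative language is that the database, not the application, decides how to evaluate the nested iteration, for example by using an index for the join condition.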

AQL will try to make use of indexes defined by the user. ArangoDB supports creation of secondary indexes for attributes and nested attributes (as in MongoDB). It offers the following specialized index types:

  • hash indexes
  • skip list indexes
  • bitmap indexes
  • 2d geo indexes

These index types have different use cases and performance characteristics. Users can define secondary indexes of the appropriate types as required.
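As a rough illustration of why the index type matters, the following Python sketch contrasts a hash-style index, which answers equality lookups, with a sorted index standing in for a skip list, which also answers range lookups. It is a simplified model under our own assumptions, not ArangoDB's index code; the documents and function names are hypothetical.

```python
import bisect
from collections import defaultdict

# Hypothetical documents indexed on the "age" attribute.
docs = [
    {"_key": "a", "age": 31},
    {"_key": "b", "age": 25},
    {"_key": "c", "age": 31},
    {"_key": "d", "age": 47},
]

# Hash-style index: attribute value -> keys of matching documents.
hash_index = defaultdict(list)
for doc in docs:
    hash_index[doc["age"]].append(doc["_key"])

# Sorted index (a stand-in for a skip list): entries ordered by attribute value.
sorted_index = sorted((doc["age"], doc["_key"]) for doc in docs)
sorted_values = [value for value, _ in sorted_index]


def lookup_equal(value):
    """Equality lookup: a single hash probe."""
    return hash_index.get(value, [])


def lookup_range(low, high):
    """Range lookup low <= age <= high: two binary searches on the sorted entries."""
    start = bisect.bisect_left(sorted_values, low)
    end = bisect.bisect_right(sorted_values, high)
    return [key for _, key in sorted_index[start:end]]


print(lookup_equal(31))      # ['a', 'c']
print(lookup_range(26, 40))  # ['a', 'c']
```

A plain hash table cannot serve the range lookup without scanning all entries, which is the typical reason to choose a sorted index for attributes that are queried by range.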

Atomicity

Individual documents are written atomically. Support for cross-document or cross-collection transactions will be available in ArangoDB version 1.3.

Durability

The term “durability” describes the property that ensures that data is saved permanently and does not accidentally disappear or get erased even if the system crashes. As in CouchDB and in MongoDB, data in ArangoDB is written to disk by default, but synchronization to disk is forced only from time to time. This allows higher throughput than would be available with immediate synchronization. In ArangoDB, the end user can change this setting on a per-collection level.
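The trade-off between throughput and immediate synchronization can be sketched with plain file operations in Python. This is a generic illustration, not ArangoDB code; the flag name `wait_for_sync`, the function `append_record` and the file name are assumptions made for the example.

```python
import os
import tempfile


def append_record(handle, record, wait_for_sync):
    """Append a record; optionally force it to stable storage before returning."""
    handle.write(record)
    handle.flush()  # hand the data over to the operating system
    if wait_for_sync:
        os.fsync(handle.fileno())  # block until the OS has synced it to disk


# Hypothetical data file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "collection.journal")
with open(path, "ab") as handle:
    append_record(handle, b"doc-1\n", wait_for_sync=False)  # fast, synced later
    append_record(handle, b"doc-2\n", wait_for_sync=True)   # slower, durable on return
```

Without the forced sync, a write returns as soon as the operating system has accepted the data, so a crash shortly afterwards may lose it; with the forced sync, every write pays the cost of a disk round trip.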

ArangoDB always uses “journaling”. This means that new data is appended at the end of journal files, without overwriting any old data. If ArangoDB goes off-line and then comes back on-line it will re-read the journal files and can restore everything up to the last journal entry.

Because existing data in journal files is never modified, journaling greatly reduces the risk of data corruption caused by partial writes. All writes to the journal files are also checksummed, so data integrity can be verified when the journal files are read after a restart of the server.
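A minimal sketch of an append-only journal with checksummed entries might look like this in Python. The entry layout (length, CRC32, payload) and the function names are our own assumptions for illustration and do not describe ArangoDB's actual file format.

```python
import struct
import zlib

# Assumed entry header: payload length and CRC32 checksum, both 32-bit unsigned.
HEADER = struct.Struct("<II")


def append_entry(journal, payload):
    """Append one entry to the end of the journal; existing bytes are never touched."""
    journal.extend(HEADER.pack(len(payload), zlib.crc32(payload)) + payload)


def replay(journal):
    """Re-read the journal and return all entries up to the last intact one."""
    entries, offset = [], 0
    while offset + HEADER.size <= len(journal):
        length, checksum = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = bytes(journal[start:start + length])
        if len(payload) < length or zlib.crc32(payload) != checksum:
            break  # partial or corrupted write: stop at the last good entry
        entries.append(payload)
        offset = start + length
    return entries


journal = bytearray()
append_entry(journal, b'{"_key": "1"}')
append_entry(journal, b'{"_key": "2"}')

# Simulate a crash that cut off the last three bytes of the final write.
journal_after_crash = journal[:-3]
print(replay(journal_after_crash))  # [b'{"_key": "1"}']
```

Because the damaged entry is the newest one at the end of the file, everything written before it is still recovered.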

Data storage

ArangoDB stores documents in collections. A specialized binary data file format is used for disk storage. Documents that have similar structure (i.e. that have the same attribute names and attribute types) can share their structural information. The structure (named “shape”) is saved just once, and multiple documents can re-use it by storing just a pointer to their “shape”.

In practice, documents in a collection are likely to be homogeneous, and sharing the structure data between multiple documents can greatly reduce disk storage space and memory usage for documents.
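The idea of shared “shapes” can be sketched as follows in Python. This is a simplified model under our own assumptions (a shape is the sorted list of attribute names and type names); the real binary format is more involved, and all names here are hypothetical.

```python
shape_ids = {}   # shape -> shape id
shape_list = []  # shape id -> shape


def shape_of(doc):
    """A shape is the sorted tuple of (attribute name, type name) pairs."""
    return tuple(sorted((name, type(value).__name__) for name, value in doc.items()))


def store(doc):
    """Store a document as (shape id, values); the shape itself is saved only once."""
    shape = shape_of(doc)
    if shape not in shape_ids:
        shape_ids[shape] = len(shape_list)
        shape_list.append(shape)
    values = [doc[name] for name, _ in shape]
    return (shape_ids[shape], values)


def load(stored):
    """Rebuild the document from its shape id and values."""
    shape_id, values = stored
    return {name: value for (name, _), value in zip(shape_list[shape_id], values)}


first = store({"name": "Ann", "age": 31})
second = store({"name": "Bob", "age": 25})
print(first[0] == second[0], len(shape_list))  # True 1
```

Both documents point at the same shape, so the attribute names are stored once rather than once per document.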

JavaScript

ArangoDB uses Google’s V8 engine to execute JavaScript functions server side. Users can write server side business logic in JavaScript and deploy it in ArangoDB. These so-called “actions” are much like stored procedures living close to the data.

For example, with actions it is possible to perform cascading deletes/updates, assign permissions, and do additional calculations and modifications to the data.

ArangoDB also allows users to map URLs to custom actions, making it usable as an application server that handles client HTTP requests with user-defined business logic.

Side note: ArangoDB also supports MRuby as an embedded language.

HTTP REST API

Like CouchDB, ArangoDB offers its functionality via an HTTP REST interface. MongoDB uses a binary protocol instead of HTTP. When running performance tests we found that the HTTP implementation in ArangoDB is generally fast enough, and that the potential gains achievable by using a binary protocol are likely to be small and would not justify giving up the ease of use and interoperability of HTTP. ArangoDB supports keep-alive, and clients can optionally send multiple requests at once using a special batch operation API, further reducing the HTTP overhead where relevant.

Using an HTTP REST API makes it very simple for developers to add a client API for their favorite programming language.
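As a sketch of how little a client needs, the following Python snippet builds a request for creating a document using only the standard library. The endpoint path `/_api/document`, the `collection` query parameter and the port 8529 are assumptions about the server's HTTP API and should be checked against the documentation of the version in use; the function name `build_create_request` is made up for this example.

```python
import json
import urllib.parse
import urllib.request


def build_create_request(base_url, collection, document):
    """Build (but do not send) an HTTP request that creates a document."""
    # Assumed endpoint layout; verify against the server's API documentation.
    query = urllib.parse.urlencode({"collection": collection})
    url = f"{base_url}/_api/document?{query}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_create_request("http://localhost:8529", "users", {"name": "Ann"})
print(request.get_method(), request.full_url)

# Sending the request requires a running server, for example:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Since the wire format is plain HTTP with JSON bodies, the same few lines translate directly to any language that ships an HTTP client.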

Performance

Like MongoDB, ArangoDB is oriented toward maximum performance (and a feature set which gives you maximum flexibility in your real-life projects). Performance is gained from:

  • Documents with similar structures and data types share their structural information, making individual documents smaller and reducing disk and memory usage
  • Specialized index types, from which the user can pick the most appropriate
  • Special batch operations API for bulk operations
  • Multi-threading: multiple client requests can be served concurrently
  • Implementation in C/C++

Our blog has a section with results from performance tests.

MVCC

Database architects can choose between different strategies for dealing with updates and deletes. Especially in early (relational) databases the “update in place” approach was extensively used. “Update in place” means that a record is overwritten in situ with a new version of this record. MVCC, on the other hand, uses append semantics: the new version of a record is appended to a chronological log, leaving the old data untouched.

ArangoDB 1.0 and 1.1 store all revisions of documents in an append-only fashion. The latest version of a document is available to the end user. In these versions there is no concurrency for writes per collection, so it’s not “real” MVCC. More MVCC aspects will be introduced with version 1.3.

A database using an append-only strategy like MVCC has to take care of old, no longer used versions of a document. Those versions have to be removed regularly to keep the database as small as possible. In CouchDB, compaction is triggered manually per database. ArangoDB performs garbage collection automatically.
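The append-only strategy and the need for garbage collection can be illustrated with a small in-memory model in Python. The class `AppendOnlyCollection` and its methods are hypothetical and only mirror the description above; they are not how ArangoDB is implemented.

```python
class AppendOnlyCollection:
    """Toy collection: every write appends a new version, nothing is overwritten."""

    def __init__(self):
        self.log = []      # chronological entries: (key, revision, document or None)
        self.latest = {}   # key -> position of the newest entry in the log
        self.revision = 0

    def _append(self, key, document):
        self.revision += 1
        self.log.append((key, self.revision, document))
        self.latest[key] = len(self.log) - 1

    def save(self, key, document):
        self._append(key, document)

    def remove(self, key):
        self._append(key, None)  # a deletion is just another appended marker

    def read(self, key):
        position = self.latest.get(key)
        return None if position is None else self.log[position][2]

    def collect_garbage(self):
        """Drop superseded versions and deletion markers, keep only live documents."""
        live = [
            entry
            for position, entry in enumerate(self.log)
            if self.latest[entry[0]] == position and entry[2] is not None
        ]
        self.log = live
        self.latest = {entry[0]: position for position, entry in enumerate(live)}


users = AppendOnlyCollection()
users.save("ann", {"city": "Cologne"})
users.save("ann", {"city": "Berlin"})  # the old version stays in the log
users.save("bob", {"city": "Bonn"})
users.remove("bob")
print(len(users.log), users.read("ann"))  # 4 {'city': 'Berlin'}

users.collect_garbage()
print(len(users.log), users.read("ann"))  # 1 {'city': 'Berlin'}
```

Readers only ever see the latest version, while the log keeps growing until the superseded entries are collected.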

Horizontal Scalability

One of the advantages of NoSQL databases is their support for “horizontal scalability”: the ability to add more servers to your cluster and distribute the load across this cluster.

Collections can be distributed among multiple servers since ArangoDB version 2. The sharding key can be either user-defined or chosen automatically.
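One common way to distribute documents is to hash the sharding key and take the result modulo the number of shards. The Python sketch below shows this general technique only; it is not a description of the algorithm ArangoDB uses, and the function name `shard_for` and the attribute `country` are hypothetical.

```python
import hashlib


def shard_for(shard_key_value, number_of_shards):
    """Map a sharding key value to a shard number in a stable, repeatable way."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % number_of_shards


# Hypothetical documents, sharded by a user-defined attribute.
documents = [
    {"_key": "1", "country": "DE"},
    {"_key": "2", "country": "FR"},
    {"_key": "3", "country": "DE"},
]

for doc in documents:
    print(doc["_key"], "-> shard", shard_for(doc["country"], 4))
```

Documents with the same sharding key value always land on the same shard, so a lookup by that value only needs to contact one server.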

About Frank Celler

Frank is both an entrepreneur and a backend developer who has been developing mostly memory databases for two decades. He is the lead developer of ArangoDB and co-founder of triAGENS. Try to challenge Frank by asking him questions on C, C++ and MRuby. Besides that, Frank organizes Cologne’s NoSQL group and NoSQL conferences.