ArangoDB’s design objectives


Before we started programming the new open source NoSQL database, we reflected on which design objectives to pursue and which to drop. This article summarizes our considerations.

In a nutshell:

  • Schema-free schemas with shapes: Inherent structures in the data are recognized automatically and used for optimization.
  • Querying: ArangoDB can perform complex operations on the stored data (query-by-example and a query language).
  • Application Server: ArangoDB can act as an application server that runs routines written in JavaScript.
  • Mostly memory/durability: ArangoDB is memory-based, with frequent synchronization to the file system.
  • AppendOnly/MVCC: Updates generate new versions of a document; obsolete versions are garbage-collected automatically.
  • ArangoDB is multi-threaded.
  • No indices on file: Only raw data is written to disk.
  • ArangoDB supports single nodes and small, homogeneous clusters with zero administration.

Schema-free schemas with “shapes”

ArangoDB organizes data in documents, storing structure-information/metadata separately from user data.

Structure information is stored only once for all documents that share the same structure. This saves storage space and allows fast data access at the same time: the document’s structure does not have to be determined at access time. It has already been determined, so efficient access code can be generated from the “shape”. All of this happens transparently, behind the scenes, from the developer’s point of view.

The “shapes” concept combines the advantages of schema-free systems with those of systems that use a fixed schema.
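
To make the idea more concrete, here is a minimal Python sketch of how structure could be stored once and shared between documents. It is not ArangoDB’s implementation: the class `ShapeStore`, its methods, and the choice to describe a shape as a sorted tuple of attribute names and type names are all assumptions made for illustration.

```python
class ShapeStore:
    """Hypothetical store that keeps each distinct structure (shape) once."""

    def __init__(self):
        self._shape_ids = {}  # shape -> shape id
        self._shapes = []     # shape id -> shape
        self._docs = []       # document id -> (shape id, tuple of values)

    def insert(self, doc):
        names = sorted(doc)
        # Assumption: a shape is the sorted attribute names plus type names.
        shape = tuple((name, type(doc[name]).__name__) for name in names)
        shape_id = self._shape_ids.get(shape)
        if shape_id is None:
            # First document with this structure: register the shape once.
            shape_id = len(self._shapes)
            self._shape_ids[shape] = shape_id
            self._shapes.append(shape)
        # The document itself stores only a shape id and its values.
        self._docs.append((shape_id, tuple(doc[name] for name in names)))
        return len(self._docs) - 1

    def get(self, doc_id):
        shape_id, values = self._docs[doc_id]
        shape = self._shapes[shape_id]
        return {name: value for (name, _type), value in zip(shape, values)}

    def shape_count(self):
        return len(self._shapes)


store = ShapeStore()
a = store.insert({"name": "Ada", "age": 36})
b = store.insert({"age": 41, "name": "Bob"})  # same shape as the first document
c = store.insert({"name": "Cy", "email": "cy@example.org"})
```

In this sketch the first two documents share one shape entry, so the attribute names are stored once and each document carries only its values.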

Querying

ArangoDB is able to run extensive queries. For this purpose we provide a query language in addition to query-by-example. Our language can express queries whose complexity would syntactically overburden other approaches.
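
As a rough illustration of the query-by-example idea, the following Python sketch returns every document that contains all attribute values given in an example document. The functions `matches` and `by_example` are hypothetical names, and the rule that nested objects are matched recursively is an assumption; ArangoDB’s actual matching semantics may differ.

```python
def matches(doc, example):
    """Return True if doc contains every attribute/value pair of example."""
    for key, expected in example.items():
        if key not in doc:
            return False
        if isinstance(expected, dict) and isinstance(doc[key], dict):
            # Assumption: nested objects are compared attribute by attribute.
            if not matches(doc[key], expected):
                return False
        elif doc[key] != expected:
            return False
    return True


def by_example(collection, example):
    """Return all documents of the collection that match the example."""
    return [doc for doc in collection if matches(doc, example)]


users = [
    {"name": "Ada", "city": "London", "address": {"zip": "N1"}},
    {"name": "Bob", "city": "Cologne", "address": {"zip": "50667"}},
    {"name": "Cy", "city": "Cologne"},
]

in_cologne = by_example(users, {"city": "Cologne"})
in_n1 = by_example(users, {"address": {"zip": "N1"}})
```

The example document acts as a template: attributes it does not mention are ignored, which is what keeps this style of query simple for common lookups.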

ArangoDB as application server

ArangoDB is capable of storing and executing JavaScript functions within the database as so-called “actions”, independently of user data. The actions are user-defined and therefore highly flexible.

This makes it possible to implement database triggers or even atomic and isolated transactions. More generally, these “actions” make it possible to treat documents stored in the database as objects with a defined behavior.

Mostly Memory/Durability

Database documents are kept in memory; memory-mapped files are used to store them. This allows the operating system to swap rarely used areas out of main memory. By default, these memory-mapped files are synced to disk frequently, so that all documents are stored durably.
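
The general technique can be sketched with Python’s standard `mmap` module: data is written into a memory-mapped file and then explicitly flushed to disk. The file name, the 4096-byte file size, and the helper names `write_and_sync` and `read_back` are assumptions for this example and say nothing about ArangoDB’s actual file layout or sync policy.

```python
import mmap
import os
import tempfile


def write_and_sync(path, payload, size=4096):
    """Write payload through a memory mapping and flush it to disk."""
    with open(path, "w+b") as f:
        f.truncate(size)  # a mapping needs a file of non-zero length
        with mmap.mmap(f.fileno(), size) as mm:
            mm[:len(payload)] = payload  # plain memory write
            mm.flush()                   # ask the OS to write the pages to disk


def read_back(path, length):
    """Read the first bytes of the file through the normal file API."""
    with open(path, "rb") as f:
        return f.read(length)


payload = b'{"name": "Ada"}'
path = os.path.join(tempfile.mkdtemp(), "datafile.bin")
write_and_sync(path, payload)
```

Between flushes the operating system decides which pages stay in main memory; the explicit flush is the point at which the data becomes durable.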

AppendOnly/MVCC

Instead of overwriting existing documents, a completely new version of the document is generated. The two benefits are:

  1. Objects can be stored coherently and compactly in the main memory.
  2. Objects are preserved: isolated read and write transactions can access them in parallel.

The system recognizes obsolete versions that are no longer needed and collects them as garbage. Garbage collection is asynchronous and runs in parallel to other processes.
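
The following Python sketch illustrates the append-only principle under simplifying assumptions: every update appends a new version with an increasing revision number, readers can look at the state as of an older revision, and garbage collection drops versions that no reader can see any more. The class `AppendOnlyStore` and its methods are hypothetical, and the garbage collector here runs synchronously for simplicity, whereas the article describes an asynchronous one.

```python
class AppendOnlyStore:
    """Hypothetical append-only store with revision-based snapshots."""

    def __init__(self):
        self._log = []      # (key, revision, value), in revision order
        self._revision = 0

    def put(self, key, value):
        # Never overwrite: append a completely new version.
        self._revision += 1
        self._log.append((key, self._revision, value))
        return self._revision

    def snapshot(self):
        """Return the current revision, to be used as a read snapshot."""
        return self._revision

    def get(self, key, snapshot=None):
        """Return the newest version visible at the snapshot (latest if None)."""
        for k, rev, value in reversed(self._log):
            if k == key and (snapshot is None or rev <= snapshot):
                return value
        return None

    def collect_garbage(self, oldest_snapshot):
        """Drop versions that no reader at or after oldest_snapshot can see."""
        newest_visible = {}
        for key, rev, _value in self._log:
            if rev <= oldest_snapshot:
                newest_visible[key] = rev  # log order makes the last one newest
        keep = [
            (key, rev, value)
            for key, rev, value in self._log
            if rev > oldest_snapshot or newest_visible.get(key) == rev
        ]
        removed = len(self._log) - len(keep)
        self._log = keep
        return removed


store = AppendOnlyStore()
store.put("doc1", {"n": 1})
snap = store.snapshot()        # a reader starts here
store.put("doc1", {"n": 2})
store.put("doc1", {"n": 3})
seen_by_reader = store.get("doc1", snapshot=snap)
latest = store.get("doc1")
```

The reader that started at the older snapshot still sees the first version while writers keep appending, which is the isolation benefit named in the list above.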

Multi-threaded/CPU-bound

ArangoDB is not meant as a trivial store for saving and fetching simple objects. It is designed as a database that supports complex operations, from extensive searches and data aggregation to JavaScript code that is stored in and executed by ArangoDB.

We are convinced that an operation should be carried out where the data is: in the database. The limiting factor of this approach is the CPU, rather than exhausted memory or overloaded network connections.

ArangoDB tries to make optimal use of the available hardware. With multi-core and multi-processor machines all around, ArangoDB is, of course, multi-threaded.

No indices on file/startup on runtime

ArangoDB writes only raw data to disk. All supporting data, i.e. indices, is kept only in main memory. On the downside, indices have to be rebuilt after a system failure or a restart. On the upside, this approach performs better for applications with frequent writes.
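
A small Python sketch shows the trade-off: writes only append raw documents to a file, and the index is rebuilt in memory by scanning that file at startup. The JSON-lines file format and the helper names `append_document` and `rebuild_index` are assumptions chosen for this illustration, not ArangoDB’s on-disk format.

```python
import json
import os
import tempfile
from collections import defaultdict


def append_document(path, doc):
    """Write path: only the raw document is appended, no index is touched on disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(doc) + "\n")


def rebuild_index(path, attribute):
    """Startup path: scan the raw data once and build the index in memory."""
    index = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for position, line in enumerate(f):
            doc = json.loads(line)
            if attribute in doc:  # documents without the attribute are skipped
                index[doc[attribute]].append(position)
    return dict(index)


path = os.path.join(tempfile.mkdtemp(), "collection.jsonl")
for doc in (
    {"name": "Ada", "city": "London"},
    {"name": "Bob", "city": "Cologne"},
    {"name": "Cy"},
    {"name": "Dee", "city": "Cologne"},
):
    append_document(path, doc)

city_index = rebuild_index(path, "city")
```

The cost of the full scan grows with the size of the collection, which is the startup penalty mentioned above; in exchange, each write touches only the raw data file.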

No large cluster/zero administration/synchronous master-master replication

Our design aim is zero-administration clusters of a few servers with consistent, synchronous master-master replication. Synchronous replication makes the same data available on all servers with minimal administrative effort.

We expect that most projects will not become the next Amazon, and that a single node or a small cluster is sufficient for 99 percent of use cases.

Martin Schoenert

About Martin Schoenert

Martin is the origin of many fancy ideas in ArangoDB. As chief architect he is responsible for the overall architecture of the system, bringing in his experience from more than 20 years in IT as developer, architect, project manager and entrepreneur. For some time we had a secret competition in the office to ask Martin something he does not know; we finally gave up.
  • atacamo

    Sounds great… About the schema-free schemas with “shapes”: how does it cope with changes in shapes? If, from now on, I have an additional field on each saved object that previously had shape A, do I need to rebuild some indices? Is it considered a different shape? Can I retrieve all objects with shape A and shape A’?

     

  • martin Schönert

    ArangoDB deals with shapes in a fully automatic way.

    So if you modify existing documents and add additional fields, ArangoDB will automatically detect that they have a different shape and use that new shape.

    Note that a collection may contain documents of different shapes.  That is indeed the normal case.  The assumption is that the number of shapes for a collection is much smaller than the number of documents in that collection (though one can of course construct pathological cases where the number of shapes = the number of documents = 2 to the power of the number of attributes ;-).

    And an index will index all documents in a collection (if they have the index key attribute) independent of their shape (i.e. independent of the other attributes they may or may not contain).  It follows that you do not need to rebuild an index when you modify documents in such a way that they have a new shape.

    You cannot directly access the shape of a document.  And so you cannot directly retrieve all documents with shape A.  But remember that the shape reflects the existence and types of attributes.  And you can access that.  So you can retrieve all documents that have a given attribute.  In that way you can indirectly retrieve documents of a certain shape.

    Hope this answers your questions.

    Regards, martin

  • paul_eg_carter

    I’m really liking what you’re aiming at and will try some stress tests, when I get some spare time.
    I’ve got an application with 50 million or so document items needing some sub-set perspectives, e.g. for documents in progress. I would like to use null-suppressed indexes, maintained automatically by the database, but am wondering how long the in-memory index builder would take to retrieve the values. The extra start-up time might be too much for our impatient users.
    An alternative would be graph edges, I suppose. There would be a small performance cost, but nothing too much. That cross-referencing would need to be maintained by application logic, in an action perhaps.
    I’m also wondering whether, with MVCC on its own, data-entry-form code may be significantly more complex to write than if the application had some node lock facility. The lock data could easily live in memory; after a server crash, any row-based locks must be discarded anyway, so I always wondered why some databases put them there and didn’t just hold them in memory. Also, multi-master clustering gets a speed boost if the locks are in memory. I concede there’s quite a lot of complexity in lock negotiation in a cluster, and I guess that’s why I didn’t see it in the project.