ArangoDB 1.1 will come with a new API for batch requests. This batch request API allows clients to send multiple requests to the ArangoDB server inside one multipart HTTP request. The server will then decompose the multipart request into its individual parts and process them as if they had been sent individually. The communication layer can sustain up to 800,000 requests/second, but absolute numbers depend strongly on the number of cores, the type of requests, the network connections, and other factors. More important are the relative numbers: depending on your use case, you can reduce insert/update times by 80%.
How batch requests can help
The primary reason to use the batch API is to reduce network overhead and thus increase performance.
When sending a lot of small requests to the server, the overhead of packaging, transporting, and decomposing the individual HTTP requests and responses can be significant. To mitigate this overhead, we first tried using a binary protocol based on Google’s protocol buffers as an alternative to HTTP. Apart from the obvious downsides (proprietary format, hard to debug etc.) the binary protocol was not substantially faster than the HTTP counterpart. We therefore decided to not go ahead with a binary protocol but offer the batch API instead.
Batching is useful for clients/users that send a lot of requests in a row from one program or process. For example, if a client is to delete 100 documents, this previously required the client to send 100 individual HTTP requests to the server. Using the new batch API, the client can now package all these requests into one (big) request and send that to the server. Packing, transporting, and unpacking the data will then be done just once instead of 100 times. This may save a lot of overhead, both for the client and the server.
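To make the packaging step concrete, here is a minimal Python sketch that assembles 100 delete operations into one multipart body. It only builds the body and does not send anything. The boundary string, the per-part content type `application/x-arango-batchpart`, the collection name `example`, and the document paths are assumptions made for illustration and are not taken from this post, so please check the batch API documentation for the exact wire format.

```python
BOUNDARY = "XXXbatchXXX"  # assumed boundary; any string absent from the parts works


def build_batch_body(requests, boundary=BOUNDARY):
    """Pack (method, path, body) tuples into one multipart body string."""
    lines = []
    for method, path, body in requests:
        lines.append("--" + boundary)
        # Assumed per-part content type; not stated in this post.
        lines.append("Content-Type: application/x-arango-batchpart")
        lines.append("")  # blank line ends the part headers
        lines.append(method + " " + path + " HTTP/1.1")
        lines.append("")  # blank line ends the embedded request's headers
        if body:
            lines.append(body)
    lines.append("--" + boundary + "--")  # closing delimiter
    return "\r\n".join(lines) + "\r\n"


# 100 hypothetical delete operations, one per document.
deletes = [("DELETE", "/_api/document/example/%d" % i, "") for i in range(100)]
batch_body = build_batch_body(deletes)

# The whole batch would be sent as a single POST to /_api/batch.
part_count = batch_body.count("application/x-arango-batchpart")
print(part_count, "operations in one request of", len(batch_body), "bytes")
```

The point of the sketch is that the 100 operations share one set of outer HTTP headers and one round trip, while each embedded operation still looks like the request that would otherwise have been sent on its own.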
Rules of thumb
The following rules of thumb apply for using batching:
- the smaller the individual operations are (in terms of their HTTP request length), the more benefit a client may gain from using the batch API over individual HTTP requests
- the more individual operations the client can put into one batch, the higher the savings in terms of network overhead
- clients may not benefit if they issue requests that contain so much data that the overhead of HTTP is insignificant
- compared to sending a regular HTTP request, batching adds a small overhead for each batch request sent. From a performance point of view, sending batch requests that contain only one operation should therefore be avoided
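These rules follow from a simple cost model, sketched below in Python. This is a rough illustration only: the per-request overhead of 200 bytes and the per-part overhead of 60 bytes are invented figures, not measurements, and the model counts bytes on the wire while ignoring latency and server-side parsing.

```python
import math


def bytes_sent(num_ops, payload_bytes, batch_size,
               http_overhead=200, part_overhead=60):
    """Estimate total bytes for num_ops operations.

    batch_size == 0 means "no batching": every operation pays the full
    HTTP overhead. Otherwise each batch pays the HTTP overhead once and
    every operation inside it pays a smaller per-part overhead.
    All overhead figures are hypothetical.
    """
    if batch_size == 0:
        return num_ops * (http_overhead + payload_bytes)
    num_batches = math.ceil(num_ops / batch_size)
    return num_batches * http_overhead + num_ops * (part_overhead + payload_bytes)


# Compare a small and a large payload, each with batch sizes 1 and 64.
for payload in (20, 2000):
    baseline = bytes_sent(1000, payload, 0)
    for size in (1, 64):
        ratio = bytes_sent(1000, payload, size) / baseline
        print("payload=%d batch_size=%d relative_bytes=%.2f" % (payload, size, ratio))
```

Under these assumed numbers, a batch of one operation costs more than a plain request, small payloads gain the most from larger batches, and large payloads gain very little, which matches the four rules above.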
We have run a few benchmarks that show the effects of using the batch API compared to individual requests. The following charts show the total time required to insert 500,000 documents into a collection. Insertion was done with multiple client concurrency levels (from 1 to 16), both with and without the batch API. When the batch API was used, various batch sizes were tested to show the minimum and maximum effects that batching may have.
The following operations were called in the tests:
- individual requests: POST /_api/document?collection=ArangoBenchmark
- batch requests: POST /_api/batch
The batch requests contained the individual requests as if they were executed individually. The tests were conducted with ArangoDB 1.1, started with authentication turned off and default options otherwise. The test client was arangob with the following command-line:
bin/arangob --server.password "" --delay --requests 500000 \
    --test-case document --complexity $x \
    --batch-size $b --concurrency $c
where batch size ($b) was varied between 0 and 128, concurrency ($c) was varied between 1 and 16. The values 1, 10, and 100 (representing the number of attributes per document) have been used for the complexity parameter ($x).
Client threads connected to the server using TCP/IP connections with Keep-Alive turned on. Test client and server were located on the same physical host (8 CPUs), competing for the same resources.
Small documents (i.e. small payload, high relative HTTP overhead)
As expected, the most extreme effect was observed when the individual operations are very small in size. In this test, 500,000 documents with just one attribute were inserted. When not using the batch API, 500,000 HTTP requests are made. As the document data itself is very small, the overhead of HTTP is very high in this case. The total execution times for different client concurrency levels are depicted as the blue bars in the chart. These are our baseline values.
When using the batch API, we can observe a drastic reduction in execution time: the bigger the batch size, the less total time the execution took. For example, sending 64 operations inside each HTTP request instead of just 1 reduced execution time by more than 80 % when there was no concurrency. This is due to the number of HTTP requests being reduced to 1/64th. Great savings can also be achieved when using multiple client threads to issue the batch requests.
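The reduction in the number of HTTP requests is simple arithmetic, sketched here in Python. The function name is hypothetical, and treating a batch size of 0 as "no batching" is an assumption about how the benchmark parameter was interpreted.

```python
import math

TOTAL_DOCUMENTS = 500000  # number of inserts in the benchmark


def http_requests_needed(total_ops, batch_size):
    """Number of HTTP requests needed; batch_size 0 is assumed to mean no batching."""
    if batch_size == 0:
        return total_ops
    # The last batch may be only partially filled, hence the rounding up.
    return math.ceil(total_ops / batch_size)


for batch_size in (0, 1, 2, 64, 128):
    print(batch_size, http_requests_needed(TOTAL_DOCUMENTS, batch_size))
```

With a batch size of 64, the 500,000 inserts need 7,813 HTTP requests instead of 500,000, so the per-request cost is paid roughly 64 times less often.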
As the chart also shows, it is not a good idea to send batch requests that contain only one operation. In this case, batching even adds a slight overhead compared to the baseline values. These are the orange bars in the chart above. But starting with a batch size of 2, the benefits kick in.
Medium-sized documents (medium payload, medium relative overhead)
Next, we measured inserting documents with 10 attributes each. The relative overhead of adding an HTTP envelope around the document data is not as big in this case as it was in the previous one. So the benefits of using batching diminish, but are still obvious:
Big documents (big payload, low relative overhead)
Finally, we tested insertion with documents having 100 attributes each. In this case, the overhead of adding an HTTP envelope around each document is even lower. That means the benefits of batching requests are again lower than in the previous case. But batching can still help reduce total execution time to a great extent, at least for lower concurrency levels:
As with all benchmarks, your mileage may vary depending on your specific hardware and workload. However, the benchmarks above show that batch operations can have a benefit in some cases. Using batch requests is of course optional, so clients can ignore them completely or can run a mixed workload consisting of both batch and non-batch requests, based on the specific workload and requirements.
The benchmarks above were all about batch document insertion. Though mass document insertion is one use case for the batch request API, I'd like to point out that the batch API is general. It should be able to handle any operation that ArangoDB can handle when called in non-batched mode. That means the batch API can also be used for batch querying, batch document updates, deletion, or a combination of these.
Regarding the availability of the feature: the batch request API is available in ArangoDB 1.1, which is still in development. The feature will become officially available when 1.1 is released. However, ArangoDB is open source, so 1.1 can already be tried and tested. If you want to give it a try, please check out the devel branch of our GitHub repository: https://github.com/triAGENS/ArangoDB/tree/devel
If you’re looking for a way to bulk import documents into ArangoDB 1.0, you might want to have a look at the already existing bulk import API. This API is documented here.