Bulk Importing Relationships
Overview
When setting up a SpiceDB cluster for the first time, there's often a data ingest process required to populate the initial set of relationships.
This can be done by calling WriteRelationships in a loop, but that approach can only create 1,000 relationships per call (by default), and each call commits its own transaction, which creates a new revision and incurs a bit of overhead.
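For instance, a loop-based ingest might look roughly like the following sketch, which assumes the same authzed-dotnet permissionsService client and allRelationshipsToWrite collection used in the example later on this page:
// Sketch only: each WriteRelationships call is limited to 1,000 relationship updates by
// default and commits its own transaction, producing a new revision per call.
foreach (var batch in allRelationshipsToWrite.Chunk(1000)) {
    await permissionsService.WriteRelationshipsAsync(new WriteRelationshipsRequest {
        Updates = {
            batch.Select(r => new RelationshipUpdate {
                Operation = RelationshipUpdate.Types.Operation.Create,
                Relationship = r,
            })
        }
    });
}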
For faster ingest, we provide an ImportBulkRelationships call, which takes advantage of client-side gRPC streaming to accelerate the process and removes the cap on the number of relationships that can be written at once.
Batching
There are two batch sizes to consider: the number of relationships in each chunk written to the stream, and the total number of relationships sent over the lifetime of a single request. Breaking the request into chunks is a network optimization that makes it faster to push relationships from the client to the cluster.
The total number of relationships per request should reflect how many rows your datastore can comfortably write in a single transaction. You probably don't want to push all of your relationships through in a single request, as the resulting transaction could time out in your datastore. In the example below, each request commits 100 relationships in one transaction, streamed to the server in chunks of 10.
Example
We'll use the authzed-dotnet client for this example. Other client libraries have different syntax and structures around streaming and iteration, but this should demonstrate the two levels of chunking used in the process.
var TOTAL_RELATIONSHIPS_TO_WRITE = 1000;
var RELATIONSHIPS_PER_TRANSACTION = 100;
var RELATIONSHIPS_PER_REQUEST_CHUNK = 10;

// Start by breaking the full list into a sequence of chunks where each chunk fits easily
// into a datastore transaction.
var transactionChunks = allRelationshipsToWrite.Chunk(RELATIONSHIPS_PER_TRANSACTION);
foreach (var relationshipsForRequest in transactionChunks) {
    // For each of those transaction chunks, break it down further into chunks that
    // optimize for network throughput.
    var requestChunks = relationshipsForRequest.Chunk(RELATIONSHIPS_PER_REQUEST_CHUNK);

    // Open up a client stream to the server for this transaction chunk.
    using var importCall = permissionsService.ImportBulkRelationships();
    foreach (var requestChunk in requestChunks) {
        // For each network chunk, write to the client stream.
        // NOTE: this makes the calls sequentially rather than concurrently; this could be
        // optimized further by using tasks.
        await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest {
            Relationships = { requestChunk }
        });
    }

    // When we're done with the transaction chunk, complete the call and process the response.
    await importCall.RequestStream.CompleteAsync();
    var importResponse = await importCall;
    Console.WriteLine("request successful");
    Console.WriteLine(importResponse.NumLoaded);

    // Repeat!
}
The full code for this example is available here.
Retrying and Resuming
ImportBulkRelationships's semantics only allow the creation of relationships: if an imported relationship already exists in the database, the call errors.
This can be frustrating when populating an instance if the process fails with a retryable error, such as one caused by transient network conditions.
The authzed-go client offers a RetryableClient with retry logic built into its ImportBulkRelationships logic.
It is used internally by zed and is exposed by the authzed-go library. When a batch fails because relationships already exist, it either skips over the offending batch (the Skip strategy) or falls back to WriteRelationships with touch semantics (the Touch strategy).
Similar logic can be implemented using the other client libraries.
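For example, here is a rough sketch of a Touch-style fallback using the authzed-dotnet client from the example above. The helper name, the assumed namespaces, and the assumption that duplicate creates surface as an ALREADY_EXISTS gRPC status are illustrative, not guaranteed by the library:
// Sketch only: assumes using Authzed.Api.V1, Grpc.Core, and System.Linq, plus the
// permissionsService client from the example above. ImportChunkWithTouchFallback is a
// hypothetical helper, not part of the authzed-dotnet library.
async Task ImportChunkWithTouchFallback(
        PermissionsService.PermissionsServiceClient permissionsService,
        Relationship[] relationshipsForRequest) {
    try {
        // Fast path: stream the transaction chunk through ImportBulkRelationships.
        using var importCall = permissionsService.ImportBulkRelationships();
        foreach (var requestChunk in relationshipsForRequest.Chunk(10)) { // stream chunk size from the example above
            await importCall.RequestStream.WriteAsync(new ImportBulkRelationshipsRequest {
                Relationships = { requestChunk }
            });
        }
        await importCall.RequestStream.CompleteAsync();
        await importCall;
    } catch (RpcException e) when (e.StatusCode == StatusCode.AlreadyExists) {
        // Touch strategy: some relationships in this chunk already exist, so rewrite the
        // whole chunk with TOUCH semantics, which upserts instead of erroring.
        await permissionsService.WriteRelationshipsAsync(new WriteRelationshipsRequest {
            Updates = {
                relationshipsForRequest.Select(r => new RelationshipUpdate {
                    Operation = RelationshipUpdate.Types.Operation.Touch,
                    Relationship = r,
                })
            }
        });
    }
}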
Why does it work this way?
SpiceDB's ImportBulkRelationships service uses gRPC client streaming as a network optimization.
It does not commit relationships to your datastore as it receives them; instead, it opens a database transaction at the start of the call and commits that transaction when the client ends the stream.
We take this approach because there isn't a good way to handle server-side errors in a commit-as-you-go model: if the server committed each chunk as it arrived, the semantics of server-side errors would be ambiguous. For example, you might receive an error that closes the stream, but that doesn't necessarily mean the last chunk you sent is what caused it. The error source could be sent back as error context, but error handling and resumption would still be difficult and cumbersome.
A gRPC bidirectional streaming approach could help address this by ACKing each chunk individually, but that requires a good amount of bookkeeping on the client to ensure that every chunk it writes has been acknowledged by the server. By contrast, requiring multiple client-streaming requests means that you can use your language's normal error-handling flows and know exactly what has been written to the server.