
Durability and Availability


Topics and Tables are stored with redundancy to guard against temporary or permanent node failure or loss of media.
  • N - number of nodes in the cluster. Minimum 3.
  • R - number of copies of each piece of data maintained in the cluster.
  • F+1 - number of positively confirmed copies before a client write is signaled complete.
  • F - number of node/media failures that the system can tolerate without incurring data loss.
Typical production systems must withstand at least a two-node loss. Consequently, at a minimum, F ≥ 2, R ≥ 3, N ≥ 3.
Below are a few examples of configuring these parameters:
  • Three or more nodes, strong durability, three copies of all data. Writes are not considered complete until all three copies are made.
  • Three-node cluster. May incur (bounded) data loss. Suitable for very-low-latency deployments; not for strong data durability.
  • Strong durability with very low and stable tail latencies. The 4th copy can be pruned after hardening.
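The constraints on N, R, and F can be pictured as a small validation routine. This is purely an illustrative sketch (the helper name and checks are ours, not part of the product):

```python
def validate_config(n: int, r: int, f: int) -> None:
    """Sanity-check a hypothetical (N, R, F) durability configuration.

    N: nodes in the cluster, R: copies of each piece of data,
    F: node/media failures tolerated without data loss.
    """
    if r > n:
        raise ValueError("cannot place more copies than nodes (R <= N)")
    if f + 1 > r:
        raise ValueError("confirmed copies F+1 cannot exceed total copies R")
    if not (n >= 3 and r >= 3 and f >= 2):
        raise ValueError("production minimums: N >= 3, R >= 3, F >= 2")

# The (3, 3, 2) configuration discussed below passes these checks:
validate_config(3, 3, 2)
```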

Is a single event (or store, however it works internally) copied to all 3 nodes or 2 of the 3?

In a 3-node cluster (typically configured as 3-3-2, above), each event (and table state) is replicated to all three nodes.

Is this configurable?

Yes, the number of copies R, as well as the fault tolerance F, will be configurable. The minimum number of copies for a production-worthy cluster is 3, with F = 2.

What if we moved to a 4 or 5 node cluster?

For a 4 or 5 node cluster, the minimum copies remain 3. The system can be configured to replicate to all 4 or 5 nodes. Our Administration Guide will document the best practices and trade-offs if any.

Can we lose some number of nodes at runtime? How many? Is that configurable?

The durability and availability constraints are as follows:
Durability: We can withstand the simultaneous and potentially permanent (e.g., disk media failure) loss of F nodes with zero data loss. For production systems, we recommend F to be at least 2.
Availability: For strongly consistent read availability of 100% of the data, we require N - F nodes to be up and responding, with intact media. The system will incur read unavailability only when more than F nodes are absent (e.g., due to a temporary network partition), and will automatically heal once the missing nodes become available.
For write availability, we require only F+1 nodes to be available.
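These two rules reduce to simple threshold checks. The sketch below encodes them as stated (hypothetical helper names, not a product API):

```python
def read_available(n: int, f: int, live_nodes: int) -> bool:
    # Strongly consistent reads of 100% of the data need N - F live nodes.
    return live_nodes >= n - f

def write_available(f: int, live_nodes: int) -> bool:
    # A write completes once F + 1 copies are positively confirmed,
    # so F + 1 nodes must be reachable.
    return live_nodes >= f + 1

# In a (3, 3, 2) cluster, where every node holds all data:
# reads survive down to N - F = 1 live node; writes need F + 1 = 3.
```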

If new nodes come online do they "catch up" automatically?


Can it happen at runtime, or is downtime required to re-sync?

No downtime is required. Depending on how long a node was missing, the re-sync can be a quick incremental catch-up or a background resilvering task whose duration is proportional to the amount of data being rebuilt. Such resilvering is a many-to-many rebuild that evenly taxes the remaining nodes, and it can proceed concurrently with production workloads.
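The even spreading of resilver work across survivors can be illustrated with a round-robin assignment of chunks needing re-replication (purely a toy model; the actual scheduler is internal to the system, and all names here are ours):

```python
from itertools import cycle


def plan_resilver(missing_chunks: list[str],
                  surviving_nodes: list[str]) -> dict[str, list[str]]:
    """Assign chunks needing re-replication evenly across surviving nodes."""
    plan: dict[str, list[str]] = {node: [] for node in surviving_nodes}
    # Round-robin so no single survivor is taxed more than the others.
    for chunk, node in zip(missing_chunks, cycle(surviving_nodes)):
        plan[node].append(chunk)
    return plan


# Four chunks to rebuild, two surviving source nodes: each sources two.
plan = plan_resilver(["c1", "c2", "c3", "c4"], ["node-a", "node-b"])
```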

Do all nodes hold a 100% copy, i.e., a 3:3 distribution? If it's a 2:3 distribution or less, can there be a fourth node that holds 100% of the data but does not serve cluster requests, for the purposes of snapshotting and replaying the data back into the cluster in case of catastrophic failure?

In a (3,3,2) configuration, every node contains 100% of the data. For larger clusters, we envision a distributed snapshot, composed of coordinated per-node snapshots.
Having a single 100% node, as the cluster size grows, could either a) limit the scalability of the cluster, since every write must be reflected on this node, or b) fail to capture 100% of the data, if writes are allowed to proceed without it.
We haven't, but we could consider a lagging 100% node (or even S3, etc.), which would eventually contain all of the data captured by a consistent snapshot operation.