Comment on page
Durability and Availability
Topics and Tables are stored with redundancy to guard against temporary or permanent failure of node/loss of media.
- N - number of nodes in the cluster. Minimum 3.
- R - number of copies of each data that we maintain in the cluster.
- F+1 - number of positively confirmed copies, before a client write is signaled complete
- F - number of node/media failures that the system can tolerate without incurring data loss.
Typical production systems must withstand at least two-node loss. Consequently, minimally, F ≥ 2, R ≥ 3, N ≥ 3.
Below are a few examples of configuring these parameters.
In a 3-node cluster, (typically) configured as 3-3-2 above, each event (and table state) is replicated to all the 3 nodes.
Yes, the number of copies R, as well as fault-tolerance F, will be configurable. The minimum number of copies for the cluster to be production-worthy would be 3, with F = 2.
For a 4 or 5 node cluster, the minimum copies remain 3. The system can be configured to replicate to all 4 or 5 nodes. Our Administration Guide will document the best practices and trade-offs if any.
The durability and availability constraints are as follows:
Durability: We can withstand the simultaneous and potentially permanent (e.g., disk media failure) loss of F nodes with zero data loss. For production systems, we recommend F to be at least 2.
Availability: For strongly consistent read availability of 100% of the data, we require N - F nodes to be up and responding, with unviolated, intact media. The system will incur read unavailability only when more than F nodes are absent (e.g., from a temporary network partition); and automatically heal once missing nodes become available.
For write availability, we require only F+1 nodes to be available.
There will not be downtime required. Depending on the amount of time a node was missing, the re-sync can be very quick/incremental or a background re-silvering task proportional to the amount of data being resilvered. Such resilvering is a many-many rebuild - that evenly taxes the remaining nodes - and can proceed concurrently with production workloads.
In a (3,3,2) configuration, every node contains 100% of the data. For larger clusters, we envision a distributed snapshot - that is composed of distributed-coordinated snapshots of each node.
Having a single 100% node, as the cluster size grows, could either a) limit the scalability of the cluster, since every write must be reflected on this node or b) fail to capture 100% of the data, if writes are allowed to proceed without it.
We haven’t, but we could consider a lagging 100% node (or even S3, etc.), which will eventually contain all of the data marked as a result of a consistent snapshot operation.