Design a key-value store

A Key-Value store is just a hash table, until it isn't. At scale, keys shard across servers via consistent hashing and the CAP Theorem becomes the constraint you design around.
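
To make the sharding concrete, here is a minimal consistent-hashing sketch in Python (the node names and virtual-node count are illustrative, not from any particular implementation):

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Place nodes on a hash ring; a key maps to the first node clockwise from its hash,
    so only ~1/N of keys move when a node joins or leaves."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth out the key distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # e.g. "node-b"
```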

CAP Theorem

Consistency means every read returns the most recent write. Availability means every request receives a response. Partition tolerance means the system keeps operating even when nodes can no longer communicate with each other.

Since partition tolerance is non-negotiable in distributed systems, the real trade-off is between the other two: CP systems sacrifice availability (some requests fail) and AP systems sacrifice consistency (some requests return stale data). ZooKeeper and HBase are CP; Cassandra and CouchDB are AP.

Redundancy w/Consistency and Versioning

With consistent hashing in place, data is replicated to N nodes. W and R are the number of write and read acknowledgments, respectively, that must arrive before an operation is considered successful; these quorums determine the store's consistency guarantees.

Strong consistency guarantees that a read returns the most recent write. W + R > N ensures this because the read and write quorums overlap in at least one node that holds the latest data; the price is availability, since operations fail when too few replicas respond.

W + R <= N optimizes for fast writes (low W) or fast reads (low R), but offers only weak consistency guarantees.
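
A tiny sketch of the overlap rule (the N/W/R values below are just example configurations):

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read quorum intersects every write quorum, so a read is
    guaranteed to reach at least one replica holding the latest acknowledged write."""
    return w + r > n

print(quorum_overlaps(3, 2, 2))  # True  -> strong consistency (2 + 2 > 3)
print(quorum_overlaps(3, 1, 1))  # False -> fast writes/reads, weak consistency only
```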

Eventual consistency is a form of weak consistency achieved through versioning. Clients reconcile divergent data using vector clocks: lists of (node, counter) tuples attached to every version of an object. Conflicting versions, called siblings, occur when neither list is greater than or equal to the other. Since these lists can grow large, Amazon's Dynamo paper noted that timely truncation of old entries, at the expense of complete lineage knowledge, caused no service degradation at production workloads.
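
A short sketch of the sibling check, representing each vector clock as a {node: counter} dict (the node names are illustrative):

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock a is >= clock b for every node that appears in b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a descends from b"
    if dominates(b, a):
        return "b descends from a"
    return "siblings: conflict, client must reconcile"

v1 = {"node-a": 2, "node-b": 1}
v2 = {"node-a": 1, "node-b": 2}
print(compare(v1, v2))  # siblings: conflict, client must reconcile
```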

Handling Failures

With cloud distributed systems, it's not if but when a node will fail.

To detect failures, nodes use a gossip protocol: each node periodically shares its heartbeat table (what it has heard about every member) with a random subset of peers. A node is marked as failed only when multiple members independently report that its heartbeat has stopped advancing.
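
A toy sketch of gossip-based heartbeats (the class name and timeout are assumptions, not a production failure detector):

```python
import time

class GossipNode:
    """Each node keeps the highest heartbeat counter it has seen for every member."""

    def __init__(self, name: str):
        self.name = name
        self.heartbeats = {name: 0}              # member -> highest counter seen
        self.last_updated = {name: time.time()}

    def tick(self):
        # Increment our own heartbeat each period.
        self.heartbeats[self.name] += 1
        self.last_updated[self.name] = time.time()

    def merge(self, peer_table: dict):
        # Gossip step: merge a peer's table, keeping the max counter per member.
        for member, counter in peer_table.items():
            if counter > self.heartbeats.get(member, -1):
                self.heartbeats[member] = counter
                self.last_updated[member] = time.time()

    def suspected_failed(self, timeout_s: float = 10.0) -> list:
        # A member is suspected once its counter hasn't advanced within the timeout.
        now = time.time()
        return [m for m, t in self.last_updated.items() if now - t > timeout_s]
```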

To maintain durability during transient node outages, a sloppy quorum is used: reads and writes go to the first N healthy nodes on the ring rather than strictly to the N preferred nodes. With hinted handoff, the stand-in node keeps a hint for the intended owner and replays the missed writes once it comes back online, restoring eventual consistency.
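
A minimal hinted-handoff sketch (the `send` callback and the structure are assumptions for illustration):

```python
class HintedHandoff:
    """A stand-in replica stores hints for a down node and replays them on recovery."""

    def __init__(self):
        self.hints = {}  # intended owner -> list of (key, value) writes it missed

    def store_hint(self, intended_owner: str, key, value):
        self.hints.setdefault(intended_owner, []).append((key, value))

    def replay(self, intended_owner: str, send):
        # `send(key, value)` delivers each missed write to the recovered owner.
        for key, value in self.hints.pop(intended_owner, []):
            send(key, value)
```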

In the event of permanent failures, replicas use Merkle trees to detect inconsistencies efficiently and synchronize only the key ranges that actually differ.
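
A small sketch of how a Merkle root lets two replicas compare a key range cheaply (the hash choice and leaf layout are illustrative):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list) -> bytes:
    """Hash leaves pairwise up to a single root; matching roots mean the range is in
    sync, mismatching roots mean you descend into subtrees to find the keys that differ."""
    if not leaves:
        return sha(b"")
    level = [sha(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3"]
replica_b = [b"k1=v1", b"k2=stale", b"k3=v3"]
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> sync only the differing range
```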

Resilient applications are configured to run over multiple data centers and even over multiple cloud systems to work in spite of region or provider failures.

System Design

Our decentralized KV store serves each request through a coordinator node, a role any node can play, exposing a simple get(key) and put(key, value) API.
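
A hedged sketch of that API from the coordinator's point of view (`preference_list`, the replica `put`/`get` calls, and the N/W/R defaults are all illustrative assumptions):

```python
class Coordinator:
    """Forwards a request to the key's N replicas and answers the client
    once W write acks or R read responses have arrived."""

    def __init__(self, ring, n: int = 3, w: int = 2, r: int = 2):
        self.ring, self.n, self.w, self.r = ring, n, w, r

    def put(self, key, value) -> bool:
        acks = 0
        for replica in self.ring.preference_list(key, self.n):
            if replica.put(key, value):      # replica acknowledged the write
                acks += 1
                if acks >= self.w:
                    return True              # write quorum reached
        return False

    def get(self, key) -> list:
        versions = []
        for replica in self.ring.preference_list(key, self.n):
            value = replica.get(key)
            if value is not None:
                versions.append(value)
                if len(versions) >= self.r:
                    break
        return versions                      # siblings are reconciled via vector clocks
```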

Each node in our hash-ring-backed key-value store wears many hats: serving the API, failure detection and repair, conflict resolution, replication, and the storage engine. To cope with potentially large amounts of data, an in-memory cache absorbs writes and is backed by on-disk SSTables, with Bloom filters to skip tables that cannot contain a key, efficiently serving both read and write access patterns.
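
A toy version of that read/write path (a plain set stands in for the Bloom filter, and the flush threshold is arbitrary):

```python
class StorageEngine:
    """Writes land in an in-memory table; full tables are flushed to immutable
    SSTables, each guarded by a membership filter so reads can skip most of them."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []                   # newest first: (filter, sorted table)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        table = dict(sorted(self.memtable.items()))
        self.sstables.insert(0, (set(table), table))   # set() stands in for a Bloom filter
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:             # 1. check the in-memory cache first
            return self.memtable[key]
        for bloom, table in self.sstables:   # 2. newest SSTable to oldest
            if key not in bloom:             # 3. filter rules out tables cheaply
                continue
            if key in table:
                return table[key]
        return None
```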

Happy Hackin'

References


This article is part of my system design series, where I summarize chapters from The System Design Interview: Volume 1 / Volume 2 along with other related content.