diff --git a/README.md b/README.md index 6fb7864d..f85fe24f 100755 --- a/README.md +++ b/README.md @@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho ## List of papers +* [Design patterns for container-based distributed systems](https://shagunsodhani.com/papers-I-read/Design-patterns-for-container-based-distributed-systems) +* [Cassandra - a decentralized structured storage system](https://shagunsodhani.com/papers-I-read/Cassandra-a-decentralized-structured-storage-system) * [CAP twelve years later - How the rules have changed](https://shagunsodhani.com/papers-I-read/CAP-twelve-years-later-How-the-rules-have-changed) * [Consistency Tradeoffs in Modern Distributed Database System Design](https://shagunsodhani.com/papers-I-read/Consistency-Tradeoffs-in-Modern-Distributed-Database-System-Design) * [Exploring Simple Siamese Representation Learning](https://shagunsodhani.com/papers-I-read/Exploring-Simple-Siamese-Representation-Learning) diff --git a/site/_posts/2020-12-14-Cassandra - a decentralized structured storage system.md b/site/_posts/2020-12-14-Cassandra - a decentralized structured storage system.md new file mode 100755 index 00000000..ec8ef256 --- /dev/null +++ b/site/_posts/2020-12-14-Cassandra - a decentralized structured storage system.md @@ -0,0 +1,87 @@ +--- +layout: post +title: Cassandra - a decentralized structured storage system +comments: True +excerpt: +tags: ['2010', 'Big Data', 'Distributed Systems', 'Software Engineering', Apache, CAP, Data, Database, DBMS, Engineering, Scale, Software, Systems] + +--- + +## Introduction + +* Cassandra is a distributed storage system that runs over cheap commodity servers and handles high write throughput while maintaining low latency for read operations. +* At the time of writing, it was used to support the search for Facebook Inbox. +* [Link to the paper](https://dl.acm.org/doi/10.1145/1773912.1773922) +* [Link to the implementation](https://cassandra.apache.org/) + +## Data Model + +* A table is a distributed multidimensional map. +* The key is a string (generally 16-36 bytes long), while the value is a structured object. +* Every operation under a single row key is atomic per replica. +* Columns are grouped together into sets called column families. +* There are two types of columns families: + * Simple families. + * Super column families: visualized as a column family within a column family. +* Columns can be sorted by name or time (used to display results in time sorted order). +* The API supports insert, get and delete operations. + +## System Architecture + +### Handling Requests + +* Any read/write request gets routed to any node in the cluster. The node determines the replicas for a given key and routes the request. +* For write query, the system waits for a quorum of replicas to acknowledge the writes' completion. +* For read query, the system either routes the requests to the closest replica (might fetch stale results) or routes the requests to all replicas and waits for a quorum of responses. + +### Partitioning + +* Cassandra partitions data across the cluster using consistent hashing with an order-preserving hash function. +* The hash function's output range is treated as a fixed circular ring, and each node is assigned a random position on the ring. +* An incoming request specifies a key used to route requests. +* One benefit of this approach is that the addition/removal of a node only affects its immediate neighbors. +* However, randomly assigning nodes leads to non-uniform data and load distribution. +* Cassandra uses the load information and moves lightly loaded nodes to reduce the load on other nodes. + +### Replication + +* Each data item is replicated at N hosts, where N is the per-instance replication factor. +* Cassandra supports the following replication policies: Rack Unaware, Rack Aware (within a datacenter), and Datacenter Aware. +* For "Rack Aware" and "Datacenter Aware" strategies, Zookeeper elects a leader among the nodes and holds metadata about which range a node is responsible for. +* In case of node failure and network partitions, the quorum requirements are relaxed. + +### Membership + +* Cluster membership is based on Scuttlebutt, a very efficient anti-entropy Gossip based mechanism. +* Cassandra uses a modified version of $\phi$ Accrual Failure Detector for detecting failures, which provides the suspicion level (of failure) for each node. + +### Bootstrapping + +* A node, starting for the first time, chooses a random position in the ring. +* This information is persisted on the local disk, on Zookeeper, and gossiped around the cluster (so any node can route any query in the cluster). +* During bootstrapping, the newly joined node reads a list of contact points (within the cluster) using a configuration file. + +### Local Persistence + +* Generally, a write operation involves a write into a commit log (for durability and recoverability), followed by a write into the in-memory data structures. +* A read operation starts with querying the in-memory data and then looks into the filesystem. +* Read queries on the filesystem use bloom filters. +* Column indices are maintained to make it faster to look up relevant columns. + +## Implementation Details + +* Components implemented in Java. +* System control messages use UDP while messages for replication and request routing uses TCP. +* A new commit log is rolled out after the older one exceeds 128MB of size. +* All the data is indexed using a primary key. +* Data on the disk is chunked into sequences of blocks. Each block contains at most 128 keys and is demarcated by a block index. +* When the data is written to the disk, a block index is generated and maintained in the memory for faster access. +* A compaction process is performed to merge multiple files (on disk) into one file. + +## Practical Experience + +* Data from MySQL servers is added to Cassandra using MapReduce processes. +* Although Cassandra is a completely decentralized system, adding some coordination (via Zookeeper) is helpful. +* For Inbox Search, a per-user index is maintained for all the messages. +* For "term search", the key is the userid, and the words in the message become the super column. +* For searching all the messages ever sent/received by a user, the key is the userid, and the recipient ids are the super columns. \ No newline at end of file diff --git a/site/_posts/2020-12-21-Design patterns for container-based distributed systems.md b/site/_posts/2020-12-21-Design patterns for container-based distributed systems.md new file mode 100755 index 00000000..e9a4e219 --- /dev/null +++ b/site/_posts/2020-12-21-Design patterns for container-based distributed systems.md @@ -0,0 +1,58 @@ +--- +layout: post +title: Design patterns for container-based distributed systems +comments: True +excerpt: +tags: ['2016', 'Design Pattern', 'Distributed Systems', 'Software Engineering', Container, Engineering, Scale, Systems, USENIX] + +--- + +## Introduction + +* The paper describes three design patterns for container-based distributed systems: single-container pattern, single-node pattern, and multi-node pattern. +* [Link to the paper](https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/burns) + +## Single-container management patterns + +* Traditionally, containers have exposed three functions: run, pause and stop. +* A richer API can be implemented to provide fine-grained control to system developers and operators. +* Similarly, much more application information (including monitoring metrics) can be exposed. +* The container interface can be used to define a contract for a complex lifecycle. For example, instead of arbitrarily shutting down the container, the system could signal that it will be terminated, giving the container some time to perform some cleanup/follow-up actions. + +## Single-node, multi-container pattern + +### Sidecar pattern + +* Multiple containers extend and enhance the main container. +* For example, a web-server serves from the local disk (main container) while a side container updates the data. +* Benefits: + * independent development, deployment, and scaling of containers + * possibility of combining different type of containers + * failure containment boundary, i.e., one failing container, need not bring down the entire system. + +### Ambassador pattern + +* Proxy communication to and from the main container with the ambassador hiding the complexities of communication with a distributed (multi-shard system) that may be written in a different language. + +### Adapter pattern + +* Standardize output and interfaces across the containers to provide a simple, homogenized view to external applications. +* A common example is using a single tool for collecting/processing metrics from multiple applications. +* This is different from the adapter pattern, which aims to provide a simplified view of the external world to an application. + +## Multi-node application patterns + +### Leader election pattern + +* In a sharded (or replication-based) system, the system may have to elect a leader (or multiple leaders) among the replicas (or shards). +* Instead of using a leader election library, a leader election container can be used (that communicates with other containers over, say, HTTP). This removes the restriction of using a leader election library compatible with the containers (e.g., using the same language). + +### Work queue pattern + +* A work coordinator container can queue different containers, each of which may have a different implementation or dependencies, thus removing the restriction that all the works use the same runtime. + +### Scatter/gather pattern + +* An external client sends a request to a root container. +* This container fans out the request to many containers that may perform the computation in parallel. +* The root container gathers these parallel computations' results and aggregates them into a response to the external client. \ No newline at end of file diff --git a/site/_site b/site/_site index 7fe877f6..61fbdfd6 160000 --- a/site/_site +++ b/site/_site @@ -1 +1 @@ -Subproject commit 7fe877f6f12aaae7463431cefb63e1492d4cd555 +Subproject commit 61fbdfd6d42554525bec6603bfc2278112c66a8a