Commit c3bb11f (1 parent: feea037)
Showing 4 changed files with 148 additions and 1 deletion.
87 changes: 87 additions & 0 deletions
site/_posts/2020-12-14-Cassandra - a decentralized structured storage system.md
@@ -0,0 +1,87 @@
---
layout: post
title: Cassandra - a decentralized structured storage system
comments: True
excerpt:
tags: ['2010', 'Big Data', 'Distributed Systems', 'Software Engineering', Apache, CAP, Data, Database, DBMS, Engineering, Scale, Software, Systems]

---

## Introduction

* Cassandra is a distributed storage system that runs on cheap commodity servers and handles high write throughput while maintaining low latency for read operations.
* At the time of writing, it was used to power Inbox Search at Facebook.
* [Link to the paper](https://dl.acm.org/doi/10.1145/1773912.1773922)
* [Link to the implementation](https://cassandra.apache.org/)

## Data Model

* A table is a distributed multidimensional map indexed by a key.
* The key is a string (generally 16-36 bytes long), while the value is a structured object.
* Every operation under a single row key is atomic per replica.
* Columns are grouped together into sets called column families.
* There are two types of column families:
    * Simple column families.
    * Super column families: visualized as a column family within a column family.
* Columns can be sorted by name or by time (used to display results in time-sorted order).
* The API supports insert, get and delete operations (sketched below).
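
A minimal, illustrative sketch of what this three-operation API shape could look like from client code; the class and method names are stand-ins, not Cassandra's actual Thrift interface, and an in-memory dict stands in for the cluster.

```python
# Illustrative sketch of insert/get/delete addressed by (table, row key,
# column path). Names are stand-ins, not the real client API.

class ColumnPath:
    def __init__(self, column_family, super_column=None, column=None):
        self.parts = (column_family, super_column, column)

class CassandraClient:
    def __init__(self):
        self._store = {}   # {(table, row_key): {column_path_parts: value}}

    def insert(self, table, key, path, value):
        self._store.setdefault((table, key), {})[path.parts] = value

    def get(self, table, key, path):
        return self._store.get((table, key), {}).get(path.parts)

    def delete(self, table, key, path):
        self._store.get((table, key), {}).pop(path.parts, None)

client = CassandraClient()
path = ColumnPath("TermSearch", super_column="hello", column="msg:7")
client.insert("Inbox", "user:42", path, b"")
print(client.get("Inbox", "user:42", path))
```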

## System Architecture

### Handling Requests

* Any read/write request can be routed to any node in the cluster; that node determines the replicas for the given key and forwards the request to them.
* For a write query, the system waits for a quorum of replicas to acknowledge the write's completion.
* For a read query, the system either routes the request to the closest replica (which might return stale results) or routes it to all replicas and waits for a quorum of responses (sketched below).
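
A toy sketch of the quorum rule described above; a real coordinator issues the replica calls in parallel with timeouts and read repair, all of which are omitted here.

```python
# Toy quorum coordination: a write succeeds once a majority of replicas
# acknowledge it; a quorum read returns the value a majority reported.

class Replica:
    def __init__(self):
        self.data = {}

    def apply_write(self, key, value):
        self.data[key] = value
        return True                      # acknowledge

def quorum(n: int) -> int:
    return n // 2 + 1

def coordinate_write(replicas, key, value) -> bool:
    acks = sum(1 for r in replicas if r.apply_write(key, value))
    return acks >= quorum(len(replicas))

def coordinate_read(replicas, key):
    responses = [r.data.get(key) for r in replicas]
    seen = [v for v in responses if v is not None]
    # accept a value only if a quorum of replicas reported it
    return next((v for v in set(seen) if seen.count(v) >= quorum(len(replicas))), None)

nodes = [Replica() for _ in range(3)]
coordinate_write(nodes, "user:42", "hello")
print(coordinate_read(nodes, "user:42"))   # -> "hello"
```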

### Partitioning

* Cassandra partitions data across the cluster using consistent hashing with an order-preserving hash function.
* The hash function's output range is treated as a fixed circular ring, and each node is assigned a random position on the ring (sketched below).
* An incoming request specifies a key, and the key's position on the ring determines which node handles the request.
* One benefit of this approach is that the addition/removal of a node only affects its immediate neighbors.
* However, randomly assigning positions leads to non-uniform data and load distribution.
* Cassandra therefore uses load information to move lightly loaded nodes along the ring and relieve heavily loaded ones.
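
A minimal consistent-hashing ring sketch under the assumptions above; Cassandra's actual hash is order-preserving, whereas `md5` here is only a stand-in to place nodes and keys on the ring.

```python
# A key is owned by the first node clockwise from the key's hash position.
import bisect
import hashlib

class Ring:
    def __init__(self):
        self.positions = []   # sorted node positions on the ring
        self.nodes = {}       # position -> node name

    def _hash(self, value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, name: str):
        pos = self._hash(name)             # the paper assigns a random position
        bisect.insort(self.positions, pos)
        self.nodes[pos] = name

    def coordinator(self, key: str) -> str:
        pos = self._hash(key)
        i = bisect.bisect_right(self.positions, pos) % len(self.positions)  # wrap around
        return self.nodes[self.positions[i]]

ring = Ring()
for n in ["node-a", "node-b", "node-c"]:
    ring.add_node(n)
print(ring.coordinator("user:42"))   # first node clockwise from the key's hash
```

A node addition or removal only changes which keys its immediate ring neighbors own, which is the locality benefit noted above.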

### Replication

* Each data item is replicated at N hosts, where N is the per-instance replication factor.
* Cassandra supports the following replication policies: Rack Unaware, Rack Aware (within a datacenter), and Datacenter Aware (the "Rack Unaware" placement is sketched below).
* For the "Rack Aware" and "Datacenter Aware" strategies, Zookeeper is used to elect a leader among the nodes, and the leader holds the metadata about which ranges each node is responsible for.
* In case of node failures and network partitions, the quorum requirements are relaxed.
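
A small sketch of the simplest ("Rack Unaware") placement, reusing the `Ring` class and `ring` instance from the partitioning sketch above; rack and datacenter awareness are not modeled.

```python
# "Rack Unaware" placement: the coordinator for a key plus its N-1 distinct
# successors on the ring hold the replicas.
import bisect

def replicas_for(ring, key: str, n: int):
    start = bisect.bisect_right(ring.positions, ring._hash(key))
    out = []
    for i in range(len(ring.positions)):
        node = ring.nodes[ring.positions[(start + i) % len(ring.positions)]]
        if node not in out:
            out.append(node)
        if len(out) == n:
            break
    return out

print(replicas_for(ring, "user:42", n=3))   # coordinator + 2 successors
```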

### Membership

* Cluster membership is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.
* For failure detection, Cassandra uses a modified version of the $\phi$ Accrual Failure Detector, which reports a suspicion level (of failure) for each node rather than a binary up/down verdict (sketched below).
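
A rough sketch of the accrual idea: the suspicion level $\phi$ grows the longer a heartbeat is overdue. Assuming exponentially distributed heartbeat gaps with mean $\mu$, $\phi = -\log_{10} P(\text{gap} > t) = t / (\mu \ln 10)$; thresholds and sliding windows are omitted.

```python
# Rough accrual failure detector: emit a suspicion level phi instead of a
# binary verdict; phi grows the longer a heartbeat is overdue.
import math
import time

class AccrualFailureDetector:
    def __init__(self):
        self.intervals = []        # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now or time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = now or time.time()
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        gap = now - self.last_heartbeat
        # -log10(exp(-gap/mean)) for an exponential gap distribution
        return gap / (mean * math.log(10))

# A node might be treated as faulty once phi crosses a chosen threshold.
```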

### Bootstrapping

* A node starting for the first time chooses a random position on the ring.
* This information is persisted on the local disk and in Zookeeper, and is gossiped around the cluster (so any node can route any query in the cluster).
* During bootstrapping, the newly joined node reads a list of contact points (within the cluster) from a configuration file.

### Local Persistence

* A write operation generally involves an append to a commit log (for durability and recoverability), followed by an update to the in-memory data structures.
* A read operation first queries the in-memory data and then looks into the files on disk (this write/read path is sketched below).
* Read queries against the filesystem use Bloom filters to skip files that cannot contain the key.
* Column indices are maintained to make it faster to look up the relevant columns within a row.
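
A simplified sketch of this write/read control flow, assuming a single commit log file and exact key sets standing in for Bloom filters; flush thresholds, sorted on-disk files, and column indices are omitted.

```python
# Writes: commit log first (durability), then the in-memory table.
# Reads: in-memory table first, then on-disk files, skipping files whose
# key filter says the key cannot be there.

class LocalStore:
    def __init__(self, commit_log_path="commit.log"):
        self.commit_log = open(commit_log_path, "a")
        self.memtable = {}
        self.sstables = []            # list of (key_filter, data) pairs on "disk"

    def write(self, key, value):
        self.commit_log.write(f"{key}\t{value}\n")   # durability first
        self.commit_log.flush()
        self.memtable[key] = value                   # then the in-memory structure

    def flush_memtable(self):
        # dump the in-memory table to an immutable on-disk structure
        self.sstables.append((set(self.memtable), dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                     # in-memory data first
            return self.memtable[key]
        for key_filter, data in reversed(self.sstables):   # newest file first
            if key in key_filter:                    # Bloom-filter stand-in
                return data[key]
        return None
```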

## Implementation Details

* The components are implemented in Java.
* System control messages use UDP, while messages for replication and request routing use TCP.
* A new commit log is rolled out after the current one exceeds 128 MB in size.
* All the data is indexed using a primary key.
* Data on disk is chunked into sequences of blocks. Each block contains at most 128 keys and is demarcated by a block index.
* When the data is written to disk, the block index is generated and kept in memory for faster access (sketched below).
* A compaction process merges multiple on-disk files into one file.
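
A small illustrative sketch of the block-index idea, using the paper's block size of at most 128 keys; the key format and encoding are made up.

```python
# The in-memory index maps each block's first key to its block number, so a
# lookup only has to scan one 128-key block instead of the whole file.
import bisect

BLOCK_SIZE = 128  # at most 128 keys per block, per the paper

def build_block_index(sorted_keys):
    """First key of each block; block i covers sorted_keys[i*128:(i+1)*128]."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), BLOCK_SIZE)]

def block_for(index, key):
    """Index of the block that could contain `key`."""
    return max(bisect.bisect_right(index, key) - 1, 0)

keys = [f"user:{i:05d}" for i in range(1000)]
index = build_block_index(keys)
print(block_for(index, "user:00300"))   # -> 2 (the third 128-key block)
```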

## Practical Experience

* Data from MySQL servers is imported into Cassandra using MapReduce processes.
* Although Cassandra is a completely decentralized system, adding some coordination (via Zookeeper) turned out to be helpful.
* For Inbox Search, a per-user index is maintained over all of the user's messages.
* For "term search", the key is the user id, and the words in the messages become the super columns.
* For searching all the messages a user has ever sent to or received from a given person, the key is the user id, and the recipients' ids are the super columns (both layouts are sketched below).
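
An illustrative layout of the two schemas above, shown as nested dicts (row key → super column → columns); all ids are invented for the example.

```python
# Term search: per-user row keyed by user id, one super column per word,
# with the matching message ids as columns.
term_search = {
    "user:42": {
        "hello": {"msg:7": None, "msg:19": None},
        "lunch": {"msg:19": None},
    },
}

# Interactions: per-user row keyed by user id, one super column per other
# user, with the exchanged message ids as columns.
interactions = {
    "user:42": {
        "user:99": {"msg:7": None},
        "user:11": {"msg:19": None, "msg:23": None},
    },
}
```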
58 changes: 58 additions & 0 deletions
site/_posts/2020-12-21-Design patterns for container-based distributed systems.md
@@ -0,0 +1,58 @@
---
layout: post
title: Design patterns for container-based distributed systems
comments: True
excerpt:
tags: ['2016', 'Design Pattern', 'Distributed Systems', 'Software Engineering', Container, Engineering, Scale, Systems, USENIX]

---

## Introduction

* The paper describes three classes of design patterns for container-based distributed systems: single-container patterns, single-node (multi-container) patterns, and multi-node patterns.
* [Link to the paper](https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/burns)

## Single-container management patterns

* Traditionally, containers have exposed three functions: run, pause and stop.
* A richer API can be implemented to provide fine-grained control to system developers and operators.
* Similarly, much more application information (including monitoring metrics) can be exposed.
* The container interface can also be used to define a contract for a complex lifecycle. For example, instead of arbitrarily shutting down the container, the system could signal that it is about to be terminated, giving the container some time to perform cleanup/follow-up actions (sketched below).
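
A minimal sketch of the "notify before terminating" part of such a lifecycle contract: the orchestrator sends SIGTERM, the application drains in-flight work, and only later would it be force-killed. The handler and timing are assumptions for illustration.

```python
# Graceful-shutdown hook: stop taking new work on SIGTERM, finish what is
# in flight, then exit on our own terms.
import signal
import time

shutting_down = False

def on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True          # stop accepting new work, start draining

signal.signal(signal.SIGTERM, on_sigterm)

while not shutting_down:
    time.sleep(0.1)               # placeholder for serving real requests

print("draining in-flight work before exit")
```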

## Single-node, multi-container patterns

### Sidecar pattern

* One or more additional containers extend and enhance the main container.
* For example, a web server serves content from the local disk (main container) while a sidecar container keeps that content updated (sketched below).
* Benefits:
    * independent development, deployment, and scaling of the containers;
    * the possibility of combining different types of containers;
    * a failure containment boundary, i.e., one failing container need not bring down the entire system.
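
A toy sketch of the web-server-plus-updater example; here the two "containers" are threads sharing a directory, standing in for separate containers sharing a volume, and the paths, port, and refresh interval are made up.

```python
# Main "container": a static file server. Sidecar "container": a background
# updater that refreshes the served content.
import http.server
import threading
import time
from functools import partial
from pathlib import Path

CONTENT_DIR = Path("/tmp/site")
CONTENT_DIR.mkdir(exist_ok=True)

def sidecar_updater():
    # stand-in for "git pull" or syncing from object storage
    while True:
        (CONTENT_DIR / "index.html").write_text(f"<p>updated at {time.ctime()}</p>")
        time.sleep(30)

threading.Thread(target=sidecar_updater, daemon=True).start()

handler = partial(http.server.SimpleHTTPRequestHandler, directory=str(CONTENT_DIR))
http.server.HTTPServer(("", 8080), handler).serve_forever()   # the main container
```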

### Ambassador pattern

* An ambassador container proxies communication to and from the main container, hiding the complexities of talking to a distributed (e.g., multi-shard) system that may even be written in a different language (a proxy sketch follows below).
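
A minimal sketch of the idea: the application always talks to a local ambassador, which decides which shard actually owns a key; the shard addresses and hashing scheme are invented for illustration.

```python
# The main container asks "localhost" (the ambassador) and never needs to know
# how many shards exist or where they live.
import hashlib

SHARDS = ["shard-0.internal:6379", "shard-1.internal:6379", "shard-2.internal:6379"]

class Ambassador:
    def route(self, key: str) -> str:
        shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(SHARDS)
        return SHARDS[shard]      # a real ambassador would forward the request

print(Ambassador().route("user:42"))
```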

### Adapter pattern

* Adapter containers standardize output and interfaces across containers to provide a simple, homogenized view to external applications.
* A common example is using a single tool for collecting/processing metrics from multiple applications, each fronted by an adapter that converts its native format (sketched below).
* This is the converse of the ambassador pattern, which aims to provide a simplified view of the external world to an application.
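
A small sketch of such an adapter: application-specific stats are rewritten into one common line-oriented metrics format that a single collector could scrape; both formats are made up for the example.

```python
# Convert an application's ad-hoc stats into a uniform "name{label} value"
# line format, so every container looks the same to the collector.

def legacy_stats() -> dict:
    # pretend this came from the application's own status endpoint
    return {"reqs": 1042, "errs": 3, "latency_ms_p50": 12.5}

def to_standard_metrics(stats: dict, app: str) -> str:
    return "\n".join(f'{name}{{app="{app}"}} {value}' for name, value in stats.items())

print(to_standard_metrics(legacy_stats(), app="webserver"))
```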

## Multi-node application patterns

### Leader election pattern

* In a sharded (or replication-based) system, the system may have to elect a leader (or multiple leaders) among the replicas (or shards).
* Instead of linking a leader-election library into the application, a separate leader-election container can be used that the application talks to over, say, HTTP (sketched below). This removes the restriction that the leader-election library must be compatible with the application containers (e.g., written in the same language).
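
A sketch of what the application side could look like, assuming a hypothetical election container that answers `GET /leader` on localhost; the endpoint, port, and response shape are not a real API.

```python
# The application never links an election library; it just asks the
# co-scheduled election container whether this replica is the leader.
import json
import urllib.request

ELECTION_SIDECAR = "http://localhost:4040/leader"   # hypothetical local endpoint

def i_am_leader(my_id: str) -> bool:
    with urllib.request.urlopen(ELECTION_SIDECAR) as resp:
        leader = json.load(resp).get("leader")
    return leader == my_id

def maybe_do_leader_work(my_id: str):
    if i_am_leader(my_id):
        print("this replica holds the lease; run the singleton work")
    else:
        print("standing by as a follower")
```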

### Work queue pattern

* A work-coordinator container can dispatch work items to worker containers, each of which may have a different implementation or different dependencies, thus removing the restriction that all the work items use the same runtime (sketched below).
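
A toy sketch of the coordinator's view: each work item names a container image to run, so workers can use any runtime; the image names are made up and the launch call is a stand-in for a real orchestrator API.

```python
# The coordinator only knows that an item maps to some image that reads an
# input and produces an output; the images can use any language or runtime.
import queue

work = queue.Queue()
work.put({"image": "resize-image:py3", "input": "photo-1.jpg"})
work.put({"image": "transcode-video:go", "input": "clip-9.mp4"})

def run_container(image: str, item_input: str):
    # stand-in for launching a real container via an orchestrator API
    print(f"running {image} on {item_input}")

while not work.empty():
    item = work.get()
    run_container(item["image"], item["input"])
```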

### Scatter/gather pattern

* An external client sends a request to a root container.
* The root container fans the request out to many leaf containers that perform their parts of the computation in parallel.
* The root container then gathers these partial results and aggregates them into a single response for the external client (sketched below).
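
A minimal scatter/gather sketch: the root fans a query out to leaf workers in parallel and merges their partial results; the leaves here are local functions standing in for containers reached over the network.

```python
# Fan out to one worker per shard, gather the partial hit lists, aggregate.
from concurrent.futures import ThreadPoolExecutor

def leaf_search(shard_id: int, query: str) -> list:
    # stand-in for an RPC to the leaf container owning shard `shard_id`
    return [f"shard{shard_id}-hit-{i}" for i in range(2)]

def root_handle(query: str, num_shards: int = 4) -> list:
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = pool.map(lambda s: leaf_search(s, query), range(num_shards))
    return [hit for partial in partials for hit in partial]   # aggregate

print(root_handle("cat pictures"))
```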
Submodule _site updated from 7fe877 to 61fbdf