Skip to content

Commit

Permalink
Add some more papers
Browse files Browse the repository at this point in the history
  • Loading branch information
shagunsodhani committed Dec 29, 2020
1 parent feea037 commit c3bb11f
Show file tree
Hide file tree
Showing 4 changed files with 148 additions and 1 deletion.
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ I am trying a new initiative - a-paper-a-week. This repository will hold all tho

## List of papers

* [Design patterns for container-based distributed systems](https://shagunsodhani.com/papers-I-read/Design-patterns-for-container-based-distributed-systems)
* [Cassandra - a decentralized structured storage system](https://shagunsodhani.com/papers-I-read/Cassandra-a-decentralized-structured-storage-system)
* [CAP twelve years later - How the rules have changed](https://shagunsodhani.com/papers-I-read/CAP-twelve-years-later-How-the-rules-have-changed)
* [Consistency Tradeoffs in Modern Distributed Database System Design](https://shagunsodhani.com/papers-I-read/Consistency-Tradeoffs-in-Modern-Distributed-Database-System-Design)
* [Exploring Simple Siamese Representation Learning](https://shagunsodhani.com/papers-I-read/Exploring-Simple-Siamese-Representation-Learning)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
---
layout: post
title: Cassandra - a decentralized structured storage system
comments: True
excerpt:
tags: ['2010', 'Big Data', 'Distributed Systems', 'Software Engineering', Apache, CAP, Data, Database, DBMS, Engineering, Scale, Software, Systems]

---

## Introduction

* Cassandra is a distributed storage system that runs over cheap commodity servers and handles high write throughput while maintaining low latency for read operations.
* At the time of writing, it was used to support the search for Facebook Inbox.
* [Link to the paper](https://dl.acm.org/doi/10.1145/1773912.1773922)
* [Link to the implementation](https://cassandra.apache.org/)

## Data Model

* A table is a distributed multidimensional map.
* The key is a string (generally 16-36 bytes long), while the value is a structured object.
* Every operation under a single row key is atomic per replica.
* Columns are grouped together into sets called column families.
* There are two types of columns families:
* Simple families.
* Super column families: visualized as a column family within a column family.
* Columns can be sorted by name or time (used to display results in time sorted order).
* The API supports insert, get and delete operations.

## System Architecture

### Handling Requests

* Any read/write request gets routed to any node in the cluster. The node determines the replicas for a given key and routes the request.
* For write query, the system waits for a quorum of replicas to acknowledge the writes' completion.
* For read query, the system either routes the requests to the closest replica (might fetch stale results) or routes the requests to all replicas and waits for a quorum of responses.

### Partitioning

* Cassandra partitions data across the cluster using consistent hashing with an order-preserving hash function.
* The hash function's output range is treated as a fixed circular ring, and each node is assigned a random position on the ring.
* An incoming request specifies a key used to route requests.
* One benefit of this approach is that the addition/removal of a node only affects its immediate neighbors.
* However, randomly assigning nodes leads to non-uniform data and load distribution.
* Cassandra uses the load information and moves lightly loaded nodes to reduce the load on other nodes.

### Replication

* Each data item is replicated at N hosts, where N is the per-instance replication factor.
* Cassandra supports the following replication policies: Rack Unaware, Rack Aware (within a datacenter), and Datacenter Aware.
* For "Rack Aware" and "Datacenter Aware" strategies, Zookeeper elects a leader among the nodes and holds metadata about which range a node is responsible for.
* In case of node failure and network partitions, the quorum requirements are relaxed.

### Membership

* Cluster membership is based on Scuttlebutt, a very efficient anti-entropy Gossip based mechanism.
* Cassandra uses a modified version of $\phi$ Accrual Failure Detector for detecting failures, which provides the suspicion level (of failure) for each node.

### Bootstrapping

* A node, starting for the first time, chooses a random position in the ring.
* This information is persisted on the local disk, on Zookeeper, and gossiped around the cluster (so any node can route any query in the cluster).
* During bootstrapping, the newly joined node reads a list of contact points (within the cluster) using a configuration file.

### Local Persistence

* Generally, a write operation involves a write into a commit log (for durability and recoverability), followed by a write into the in-memory data structures.
* A read operation starts with querying the in-memory data and then looks into the filesystem.
* Read queries on the filesystem use bloom filters.
* Column indices are maintained to make it faster to look up relevant columns.

## Implementation Details

* Components implemented in Java.
* System control messages use UDP while messages for replication and request routing uses TCP.
* A new commit log is rolled out after the older one exceeds 128MB of size.
* All the data is indexed using a primary key.
* Data on the disk is chunked into sequences of blocks. Each block contains at most 128 keys and is demarcated by a block index.
* When the data is written to the disk, a block index is generated and maintained in the memory for faster access.
* A compaction process is performed to merge multiple files (on disk) into one file.

## Practical Experience

* Data from MySQL servers is added to Cassandra using MapReduce processes.
* Although Cassandra is a completely decentralized system, adding some coordination (via Zookeeper) is helpful.
* For Inbox Search, a per-user index is maintained for all the messages.
* For "term search", the key is the userid, and the words in the message become the super column.
* For searching all the messages ever sent/received by a user, the key is the userid, and the recipient ids are the super columns.
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
---
layout: post
title: Design patterns for container-based distributed systems
comments: True
excerpt:
tags: ['2016', 'Design Pattern', 'Distributed Systems', 'Software Engineering', Container, Engineering, Scale, Systems, USENIX]

---

## Introduction

* The paper describes three design patterns for container-based distributed systems: single-container pattern, single-node pattern, and multi-node pattern.
* [Link to the paper](https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/burns)

## Single-container management patterns

* Traditionally, containers have exposed three functions: run, pause and stop.
* A richer API can be implemented to provide fine-grained control to system developers and operators.
* Similarly, much more application information (including monitoring metrics) can be exposed.
* The container interface can be used to define a contract for a complex lifecycle. For example, instead of arbitrarily shutting down the container, the system could signal that it will be terminated, giving the container some time to perform some cleanup/follow-up actions.

## Single-node, multi-container pattern

### Sidecar pattern

* Multiple containers extend and enhance the main container.
* For example, a web-server serves from the local disk (main container) while a side container updates the data.
* Benefits:
* independent development, deployment, and scaling of containers
* possibility of combining different type of containers
* failure containment boundary, i.e., one failing container, need not bring down the entire system.

### Ambassador pattern

* Proxy communication to and from the main container with the ambassador hiding the complexities of communication with a distributed (multi-shard system) that may be written in a different language.

### Adapter pattern

* Standardize output and interfaces across the containers to provide a simple, homogenized view to external applications.
* A common example is using a single tool for collecting/processing metrics from multiple applications.
* This is different from the adapter pattern, which aims to provide a simplified view of the external world to an application.

## Multi-node application patterns

### Leader election pattern

* In a sharded (or replication-based) system, the system may have to elect a leader (or multiple leaders) among the replicas (or shards).
* Instead of using a leader election library, a leader election container can be used (that communicates with other containers over, say, HTTP). This removes the restriction of using a leader election library compatible with the containers (e.g., using the same language).

### Work queue pattern

* A work coordinator container can queue different containers, each of which may have a different implementation or dependencies, thus removing the restriction that all the works use the same runtime.

### Scatter/gather pattern

* An external client sends a request to a root container.
* This container fans out the request to many containers that may perform the computation in parallel.
* The root container gathers these parallel computations' results and aggregates them into a response to the external client.
2 changes: 1 addition & 1 deletion site/_site
Submodule _site updated from 7fe877 to 61fbdf

0 comments on commit c3bb11f

Please sign in to comment.