Commit c3bb11f (1 parent: feea037)
Showing 4 changed files with 148 additions and 1 deletion.
87 changes: 87 additions & 0 deletions
site/_posts/2020-12-14-Cassandra - a decentralized structured storage system.md
@@ -0,0 +1,87 @@
---
layout: post
title: Cassandra - a decentralized structured storage system
comments: True
excerpt:
tags: ['2010', 'Big Data', 'Distributed Systems', 'Software Engineering', Apache, CAP, Data, Database, DBMS, Engineering, Scale, Software, Systems]

---

## Introduction

* Cassandra is a distributed storage system that runs on cheap commodity servers and handles high write throughput while maintaining low latency for read operations.
* At the time of writing, it was used to power Inbox Search at Facebook.
* [Link to the paper](https://dl.acm.org/doi/10.1145/1773912.1773922)
* [Link to the implementation](https://cassandra.apache.org/)

## Data Model

* A table is a distributed multidimensional map indexed by a key.
* The key is a string (generally 16-36 bytes long), while the value is a structured object.
* Every operation under a single row key is atomic per replica.
* Columns are grouped together into sets called column families.
* There are two types of column families:
    * Simple column families.
    * Super column families: visualized as a column family within a column family.
* Columns can be sorted by name or by time (used to display results in time-sorted order).
* The API supports insert, get and delete operations (sketched below).
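
A minimal, illustrative sketch of what this three-operation API shape could look like from client code; the class and method names are stand-ins, not Cassandra's actual Thrift interface, and an in-memory dict stands in for the cluster.

```python
# Illustrative sketch of insert/get/delete addressed by (table, row key,
# column path). Names are stand-ins, not the real client API.

class ColumnPath:
    def __init__(self, column_family, super_column=None, column=None):
        self.parts = (column_family, super_column, column)

class CassandraClient:
    def __init__(self):
        self._store = {}   # {(table, row_key): {column_path_parts: value}}

    def insert(self, table, key, path, value):
        self._store.setdefault((table, key), {})[path.parts] = value

    def get(self, table, key, path):
        return self._store.get((table, key), {}).get(path.parts)

    def delete(self, table, key, path):
        self._store.get((table, key), {}).pop(path.parts, None)

client = CassandraClient()
path = ColumnPath("TermSearch", super_column="hello", column="msg:7")
client.insert("Inbox", "user:42", path, b"")
print(client.get("Inbox", "user:42", path))
```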

## System Architecture

### Handling Requests

* Any read/write request can be routed to any node in the cluster; that node determines the replicas for the given key and forwards the request to them.
* For a write query, the system waits for a quorum of replicas to acknowledge the write's completion.
* For a read query, the system either routes the request to the closest replica (which might return stale results) or routes it to all replicas and waits for a quorum of responses (sketched below).
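
A toy sketch of the quorum rule described above; a real coordinator issues the replica calls in parallel with timeouts and read repair, all of which are omitted here.

```python
# Toy quorum coordination: a write succeeds once a majority of replicas
# acknowledge it; a quorum read returns the value a majority reported.

class Replica:
    def __init__(self):
        self.data = {}

    def apply_write(self, key, value):
        self.data[key] = value
        return True                      # acknowledge

def quorum(n: int) -> int:
    return n // 2 + 1

def coordinate_write(replicas, key, value) -> bool:
    acks = sum(1 for r in replicas if r.apply_write(key, value))
    return acks >= quorum(len(replicas))

def coordinate_read(replicas, key):
    responses = [r.data.get(key) for r in replicas]
    seen = [v for v in responses if v is not None]
    # accept a value only if a quorum of replicas reported it
    return next((v for v in set(seen) if seen.count(v) >= quorum(len(replicas))), None)

nodes = [Replica() for _ in range(3)]
coordinate_write(nodes, "user:42", "hello")
print(coordinate_read(nodes, "user:42"))   # -> "hello"
```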

### Partitioning

* Cassandra partitions data across the cluster using consistent hashing with an order-preserving hash function.
* The hash function's output range is treated as a fixed circular ring, and each node is assigned a random position on the ring (sketched below).
* An incoming request specifies a key, and the key's position on the ring determines which node handles the request.
* One benefit of this approach is that the addition/removal of a node only affects its immediate neighbors.
* However, randomly assigning positions leads to non-uniform data and load distribution.
* Cassandra therefore uses load information to move lightly loaded nodes along the ring and relieve heavily loaded ones.
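
A minimal consistent-hashing ring sketch under the assumptions above; Cassandra's actual hash is order-preserving, whereas `md5` here is only a stand-in to place nodes and keys on the ring.

```python
# A key is owned by the first node clockwise from the key's hash position.
import bisect
import hashlib

class Ring:
    def __init__(self):
        self.positions = []   # sorted node positions on the ring
        self.nodes = {}       # position -> node name

    def _hash(self, value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, name: str):
        pos = self._hash(name)             # the paper assigns a random position
        bisect.insort(self.positions, pos)
        self.nodes[pos] = name

    def coordinator(self, key: str) -> str:
        pos = self._hash(key)
        i = bisect.bisect_right(self.positions, pos) % len(self.positions)  # wrap around
        return self.nodes[self.positions[i]]

ring = Ring()
for n in ["node-a", "node-b", "node-c"]:
    ring.add_node(n)
print(ring.coordinator("user:42"))   # first node clockwise from the key's hash
```

A node addition or removal only changes which keys its immediate ring neighbors own, which is the locality benefit noted above.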

### Replication

* Each data item is replicated at N hosts, where N is the per-instance replication factor.
* Cassandra supports the following replication policies: Rack Unaware, Rack Aware (within a datacenter), and Datacenter Aware (the "Rack Unaware" placement is sketched below).
* For the "Rack Aware" and "Datacenter Aware" strategies, Zookeeper is used to elect a leader among the nodes, and the leader holds the metadata about which ranges each node is responsible for.
* In case of node failures and network partitions, the quorum requirements are relaxed.
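
A small sketch of the simplest ("Rack Unaware") placement, reusing the `Ring` class and `ring` instance from the partitioning sketch above; rack and datacenter awareness are not modeled.

```python
# "Rack Unaware" placement: the coordinator for a key plus its N-1 distinct
# successors on the ring hold the replicas.
import bisect

def replicas_for(ring, key: str, n: int):
    start = bisect.bisect_right(ring.positions, ring._hash(key))
    out = []
    for i in range(len(ring.positions)):
        node = ring.nodes[ring.positions[(start + i) % len(ring.positions)]]
        if node not in out:
            out.append(node)
        if len(out) == n:
            break
    return out

print(replicas_for(ring, "user:42", n=3))   # coordinator + 2 successors
```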

### Membership

* Cluster membership is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.
* For failure detection, Cassandra uses a modified version of the $\phi$ Accrual Failure Detector, which reports a suspicion level (of failure) for each node rather than a binary up/down verdict (sketched below).
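
A rough sketch of the accrual idea: the suspicion level $\phi$ grows the longer a heartbeat is overdue. Assuming exponentially distributed heartbeat gaps with mean $\mu$, $\phi = -\log_{10} P(\text{gap} > t) = t / (\mu \ln 10)$; thresholds and sliding windows are omitted.

```python
# Rough accrual failure detector: emit a suspicion level phi instead of a
# binary verdict; phi grows the longer a heartbeat is overdue.
import math
import time

class AccrualFailureDetector:
    def __init__(self):
        self.intervals = []        # observed heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now or time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = now or time.time()
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        gap = now - self.last_heartbeat
        # -log10(exp(-gap/mean)) for an exponential gap distribution
        return gap / (mean * math.log(10))

# A node might be treated as faulty once phi crosses a chosen threshold.
```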

### Bootstrapping

* A node starting for the first time chooses a random position on the ring.
* This information is persisted on the local disk and in Zookeeper, and is gossiped around the cluster (so any node can route any query in the cluster).
* During bootstrapping, the newly joined node reads a list of contact points (within the cluster) from a configuration file.

### Local Persistence

* A write operation generally involves an append to a commit log (for durability and recoverability), followed by an update to the in-memory data structures.
* A read operation first queries the in-memory data and then looks into the files on disk (this write/read path is sketched below).
* Read queries against the filesystem use Bloom filters to skip files that cannot contain the key.
* Column indices are maintained to make it faster to look up the relevant columns within a row.
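
A simplified sketch of this write/read control flow, assuming a single commit log file and exact key sets standing in for Bloom filters; flush thresholds, sorted on-disk files, and column indices are omitted.

```python
# Writes: commit log first (durability), then the in-memory table.
# Reads: in-memory table first, then on-disk files, skipping files whose
# key filter says the key cannot be there.

class LocalStore:
    def __init__(self, commit_log_path="commit.log"):
        self.commit_log = open(commit_log_path, "a")
        self.memtable = {}
        self.sstables = []            # list of (key_filter, data) pairs on "disk"

    def write(self, key, value):
        self.commit_log.write(f"{key}\t{value}\n")   # durability first
        self.commit_log.flush()
        self.memtable[key] = value                   # then the in-memory structure

    def flush_memtable(self):
        # dump the in-memory table to an immutable on-disk structure
        self.sstables.append((set(self.memtable), dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:                     # in-memory data first
            return self.memtable[key]
        for key_filter, data in reversed(self.sstables):   # newest file first
            if key in key_filter:                    # Bloom-filter stand-in
                return data[key]
        return None
```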

## Implementation Details

* The components are implemented in Java.
* System control messages use UDP, while messages for replication and request routing use TCP.
* A new commit log is rolled out after the current one exceeds 128 MB in size.
* All the data is indexed using a primary key.
* Data on disk is chunked into sequences of blocks. Each block contains at most 128 keys and is demarcated by a block index.
* When the data is written to disk, the block index is generated and kept in memory for faster access (sketched below).
* A compaction process merges multiple on-disk files into one file.
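
A small illustrative sketch of the block-index idea, using the paper's block size of at most 128 keys; the key format and encoding are made up.

```python
# The in-memory index maps each block's first key to its block number, so a
# lookup only has to scan one 128-key block instead of the whole file.
import bisect

BLOCK_SIZE = 128  # at most 128 keys per block, per the paper

def build_block_index(sorted_keys):
    """First key of each block; block i covers sorted_keys[i*128:(i+1)*128]."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), BLOCK_SIZE)]

def block_for(index, key):
    """Index of the block that could contain `key`."""
    return max(bisect.bisect_right(index, key) - 1, 0)

keys = [f"user:{i:05d}" for i in range(1000)]
index = build_block_index(keys)
print(block_for(index, "user:00300"))   # -> 2 (the third 128-key block)
```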

## Practical Experience

* Data from MySQL servers is imported into Cassandra using MapReduce processes.
* Although Cassandra is a completely decentralized system, adding some coordination (via Zookeeper) turned out to be helpful.
* For Inbox Search, a per-user index is maintained over all of the user's messages.
* For "term search", the key is the user id, and the words in the messages become the super columns.
* For searching all the messages a user has ever sent to or received from a given person, the key is the user id, and the recipients' ids are the super columns (both layouts are sketched below).
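
An illustrative layout of the two schemas above, shown as nested dicts (row key → super column → columns); all ids are invented for the example.

```python
# Term search: per-user row keyed by user id, one super column per word,
# with the matching message ids as columns.
term_search = {
    "user:42": {
        "hello": {"msg:7": None, "msg:19": None},
        "lunch": {"msg:19": None},
    },
}

# Interactions: per-user row keyed by user id, one super column per other
# user, with the exchanged message ids as columns.
interactions = {
    "user:42": {
        "user:99": {"msg:7": None},
        "user:11": {"msg:19": None, "msg:23": None},
    },
}
```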
58 changes: 58 additions & 0 deletions
site/_posts/2020-12-21-Design patterns for container-based distributed systems.md
@@ -0,0 +1,58 @@
---
layout: post
title: Design patterns for container-based distributed systems
comments: True
excerpt:
tags: ['2016', 'Design Pattern', 'Distributed Systems', 'Software Engineering', Container, Engineering, Scale, Systems, USENIX]

---

## Introduction

* The paper describes three classes of design patterns for container-based distributed systems: single-container patterns, single-node (multi-container) patterns, and multi-node patterns.
* [Link to the paper](https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/burns)

## Single-container management patterns

* Traditionally, containers have exposed three functions: run, pause and stop.
* A richer API can be implemented to provide fine-grained control to system developers and operators.
* Similarly, much more application information (including monitoring metrics) can be exposed.
* The container interface can also be used to define a contract for a complex lifecycle. For example, instead of arbitrarily shutting down the container, the system could signal that it is about to be terminated, giving the container some time to perform cleanup/follow-up actions (sketched below).
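
A minimal sketch of the "notify before terminating" part of such a lifecycle contract: the orchestrator sends SIGTERM, the application drains in-flight work, and only later would it be force-killed. The handler and timing are assumptions for illustration.

```python
# Graceful-shutdown hook: stop taking new work on SIGTERM, finish what is
# in flight, then exit on our own terms.
import signal
import time

shutting_down = False

def on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True          # stop accepting new work, start draining

signal.signal(signal.SIGTERM, on_sigterm)

while not shutting_down:
    time.sleep(0.1)               # placeholder for serving real requests

print("draining in-flight work before exit")
```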

## Single-node, multi-container patterns

### Sidecar pattern

* One or more additional containers extend and enhance the main container.
* For example, a web server serves content from the local disk (main container) while a sidecar container keeps that content updated (sketched below).
* Benefits:
    * independent development, deployment, and scaling of the containers;
    * the possibility of combining different types of containers;
    * a failure containment boundary, i.e., one failing container need not bring down the entire system.
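
A toy sketch of the web-server-plus-updater example; here the two "containers" are threads sharing a directory, standing in for separate containers sharing a volume, and the paths, port, and refresh interval are made up.

```python
# Main "container": a static file server. Sidecar "container": a background
# updater that refreshes the served content.
import http.server
import threading
import time
from functools import partial
from pathlib import Path

CONTENT_DIR = Path("/tmp/site")
CONTENT_DIR.mkdir(exist_ok=True)

def sidecar_updater():
    # stand-in for "git pull" or syncing from object storage
    while True:
        (CONTENT_DIR / "index.html").write_text(f"<p>updated at {time.ctime()}</p>")
        time.sleep(30)

threading.Thread(target=sidecar_updater, daemon=True).start()

handler = partial(http.server.SimpleHTTPRequestHandler, directory=str(CONTENT_DIR))
http.server.HTTPServer(("", 8080), handler).serve_forever()   # the main container
```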

### Ambassador pattern

* An ambassador container proxies communication to and from the main container, hiding the complexities of talking to a distributed (e.g., multi-shard) system that may even be written in a different language (a proxy sketch follows below).
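
A minimal sketch of the idea: the application always talks to a local ambassador, which decides which shard actually owns a key; the shard addresses and hashing scheme are invented for illustration.

```python
# The main container asks "localhost" (the ambassador) and never needs to know
# how many shards exist or where they live.
import hashlib

SHARDS = ["shard-0.internal:6379", "shard-1.internal:6379", "shard-2.internal:6379"]

class Ambassador:
    def route(self, key: str) -> str:
        shard = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(SHARDS)
        return SHARDS[shard]      # a real ambassador would forward the request

print(Ambassador().route("user:42"))
```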

### Adapter pattern

* Adapter containers standardize output and interfaces across containers to provide a simple, homogenized view to external applications.
* A common example is using a single tool for collecting/processing metrics from multiple applications, each fronted by an adapter that converts its native format (sketched below).
* This is the converse of the ambassador pattern, which aims to provide a simplified view of the external world to an application.
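
A small sketch of such an adapter: application-specific stats are rewritten into one common line-oriented metrics format that a single collector could scrape; both formats are made up for the example.

```python
# Convert an application's ad-hoc stats into a uniform "name{label} value"
# line format, so every container looks the same to the collector.

def legacy_stats() -> dict:
    # pretend this came from the application's own status endpoint
    return {"reqs": 1042, "errs": 3, "latency_ms_p50": 12.5}

def to_standard_metrics(stats: dict, app: str) -> str:
    return "\n".join(f'{name}{{app="{app}"}} {value}' for name, value in stats.items())

print(to_standard_metrics(legacy_stats(), app="webserver"))
```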

## Multi-node application patterns

### Leader election pattern

* In a sharded (or replication-based) system, the system may have to elect a leader (or multiple leaders) among the replicas (or shards).
* Instead of linking a leader-election library into the application, a separate leader-election container can be used that the application talks to over, say, HTTP (sketched below). This removes the restriction that the leader-election library must be compatible with the application containers (e.g., written in the same language).
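
A sketch of what the application side could look like, assuming a hypothetical election container that answers `GET /leader` on localhost; the endpoint, port, and response shape are not a real API.

```python
# The application never links an election library; it just asks the
# co-scheduled election container whether this replica is the leader.
import json
import urllib.request

ELECTION_SIDECAR = "http://localhost:4040/leader"   # hypothetical local endpoint

def i_am_leader(my_id: str) -> bool:
    with urllib.request.urlopen(ELECTION_SIDECAR) as resp:
        leader = json.load(resp).get("leader")
    return leader == my_id

def maybe_do_leader_work(my_id: str):
    if i_am_leader(my_id):
        print("this replica holds the lease; run the singleton work")
    else:
        print("standing by as a follower")
```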

### Work queue pattern

* A work-coordinator container can dispatch work items to worker containers, each of which may have a different implementation or different dependencies, thus removing the restriction that all the work items use the same runtime (sketched below).
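
A toy sketch of the coordinator's view: each work item names a container image to run, so workers can use any runtime; the image names are made up and the launch call is a stand-in for a real orchestrator API.

```python
# The coordinator only knows that an item maps to some image that reads an
# input and produces an output; the images can use any language or runtime.
import queue

work = queue.Queue()
work.put({"image": "resize-image:py3", "input": "photo-1.jpg"})
work.put({"image": "transcode-video:go", "input": "clip-9.mp4"})

def run_container(image: str, item_input: str):
    # stand-in for launching a real container via an orchestrator API
    print(f"running {image} on {item_input}")

while not work.empty():
    item = work.get()
    run_container(item["image"], item["input"])
```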

### Scatter/gather pattern

* An external client sends a request to a root container.
* The root container fans the request out to many leaf containers that perform their parts of the computation in parallel.
* The root container then gathers these partial results and aggregates them into a single response for the external client (sketched below).
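
A minimal scatter/gather sketch: the root fans a query out to leaf workers in parallel and merges their partial results; the leaves here are local functions standing in for containers reached over the network.

```python
# Fan out to one worker per shard, gather the partial hit lists, aggregate.
from concurrent.futures import ThreadPoolExecutor

def leaf_search(shard_id: int, query: str) -> list:
    # stand-in for an RPC to the leaf container owning shard `shard_id`
    return [f"shard{shard_id}-hit-{i}" for i in range(2)]

def root_handle(query: str, num_shards: int = 4) -> list:
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        partials = pool.map(lambda s: leaf_search(s, query), range(num_shards))
    return [hit for partial in partials for hit in partial]   # aggregate

print(root_handle("cat pictures"))
```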
Submodule _site updated from 7fe877 to 61fbdf