
rfc: add a shared storage #247

Closed
wants to merge 5 commits into from

Conversation

huachaohuang (Owner) commented Jan 4, 2022

tisonkun (Contributor) commented Jan 4, 2022

Shall this PR supersede #246, so that we can close #246 in favor of this one?

huachaohuang (Owner, Author) commented:

> Shall this PR supersede #246, so that we can close #246 in favor of this one?

#246 describes an abstraction and this PR describes a more concrete design. But I have closed #246 anyway.

zojw (Contributor) left a comment

I have tried to take the implementation into account and have some detailed questions :)


### Write path

To write an object, a client contacts the master to get a list of locations to store the object. The client must ensure that the object has been persisted in the base storage before claiming a successful write. The client can further ensure that the object has been cached to avoid reading from the base storage later.
Contributor commented:

When we have 100 cache nodes, it is probably only useful to write to 2 or 3 (or ...) of the 100, but do we need that?

Also, should that 2 or 3 (or ...) be configured at the object level, bucket level, or cluster level? If so, we may need "some place" to store it, and it seems a little complex to keep N/M consistent as nodes go down and come up...

huachaohuang (Owner, Author) commented:

We can have some global options like `min_cache_replicas` and `max_cache_replicas`, and allow the storage to decide the appropriate number of replicas for individual objects for load balancing.
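
For illustration, such options might look like the sketch below; `min_cache_replicas` and `max_cache_replicas` come from the comment above, while the struct, its defaults, and the helper function are hypothetical and not part of any existing code.

```rust
/// Hypothetical cluster-level cache options. The two field names follow the
/// comment above; everything else here is illustrative only.
#[derive(Clone, Debug)]
pub struct CacheOptions {
    /// Lower bound on how many cache nodes should hold a copy of an object.
    pub min_cache_replicas: usize,
    /// Upper bound, so the master can add replicas for hot objects.
    pub max_cache_replicas: usize,
}

impl Default for CacheOptions {
    fn default() -> Self {
        Self {
            min_cache_replicas: 2,
            max_cache_replicas: 3,
        }
    }
}

/// The master could then pick a replica count per object within the
/// configured bounds, e.g. based on load statistics (illustrative only).
fn replicas_for_object(opts: &CacheOptions, is_hot: bool) -> usize {
    if is_hot {
        opts.max_cache_replicas
    } else {
        opts.min_cache_replicas
    }
}
```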

Three further review comments on docs/rfcs/2022-01-04-cached-storage.md are outdated and marked as resolved.

### Write path

To write an object, a client contacts the master to get a list of locations to store the object. The client must ensure that the object has been persisted in the base storage before claiming a successful write. The client can further ensure that the object has been cached to avoid reading from the base storage later.
Contributor commented:

> To write an object, a client contacts the master to get a list of locations to store the object.

How about reusing the existing mechanism in the read path, instead of having the writer contact the master to get locations to store the object?

> notifies the master to cache the object for future reads.

huachaohuang (Owner, Author) commented:

What do you mean by "existing mechanism in the read path"?

Contributor commented:

@huachaohuang

> If the object is not cached, the client reads from the base storage and then notifies the master to cache the object for future reads.

huachaohuang (Owner, Author) commented:

I will start prototyping to get more information about the design to further improve the document.

huachaohuang marked this pull request as draft on January 5, 2022, 11:07.
zojw mentioned this pull request on Jan 7, 2022.
huachaohuang mentioned this pull request on Jan 8, 2022.
Comment on lines 3 to 4
- Status: draft
- Pull Request:
Contributor commented:

Suggested change (from):

- Status: draft
- Pull Request:

Suggested change (to):

- Status: accepted
- Pull Request: https://github.com/engula/engula/pull/247
- Tracking Issue: https://github.com/engula/engula/issues/263


huachaohuang marked this pull request as ready for review on January 15, 2022, 03:33.
huachaohuang (Owner, Author) commented:

This RFC is ready. Although some implementation details are missing, it should be good enough at this early stage to get the work in #263 started.

huachaohuang changed the title from "rfc: add a cached storage" to "rfc: add a shared storage" on Jan 15, 2022.

### Implementation

A base storage can be built on a cheap and highly reliable cloud object storage (e.g., AWS S3). A cache storage can be a custom-built storage service that stores data on local SSDs or cloud block storage (e.g., AWS EBS). An orchestrator can be built on Kubernetes, acting as an operator; Kubernetes provides most of the features we need from the orchestrator.
Comment:

If the base storage is S3, at what level do you want to cache? (e.g., SST level, multi-part upload level, block level)

From public benchmarks like https://github.com/dvassallo/s3-benchmark, S3's latency is relatively high and is affected by multiple factors, such as how the file is uploaded to S3, the range of the GET, whether EC2 accesses S3 using a VPC, the configuration of the EC2 instance itself, etc. From my perspective, apart from the cache service design, the cache algorithm and the cached content are also interesting parts to investigate.

huachaohuang (Owner, Author) commented:

Thanks for the information. I'm only considering TP scenarios for now. In this case, I think the basic strategy is to cache all data, since we don't want to see even a single read from S3. In the future, we may leave some cold SSTs in S3 according to the access statistics of individual SSTs to further save costs. But at the current stage, a simple full cache is good enough for the luna engine.
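
Purely as an illustration of that strategy (nothing here is from the codebase; all names are made up), the two stages could be expressed as a small policy switch:

```rust
// Illustrative only: "cache everything now, maybe leave cold SSTs in S3
// later based on access statistics", expressed as a tiny policy enum.
enum CachePolicy {
    /// Current stage: every object is cached, so reads never hit S3.
    Full,
    /// Possible future: only cache objects read at least `min_reads` times
    /// recently, leaving cold SSTs in S3 to save cost.
    ByAccessCount { min_reads: u64 },
}

fn should_cache(policy: &CachePolicy, recent_reads: u64) -> bool {
    match policy {
        CachePolicy::Full => true,
        CachePolicy::ByAccessCount { min_reads } => recent_reads >= *min_reads,
    }
}
```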

tisonkun (Contributor) left a comment

Comments inline for understanding.


![Architecture](images/shared-storage-architecture.drawio.svg)

`SharedStorage` consists of a master, a base storage, a set of cache storages, and a storage orchestrator. The base storage is the single point of truth and should offer reliable object storage. The cache storages cache objects from the base storage to improve read performance.
Contributor commented:

Please define "master", "base storage", and "orchestrator" before you talk about them. They are all new concepts without definitions.

For example, a master is a coordinator of all cache storage instances that lives along with the Kernel. A base storage is... what? A storage server or an S3 cluster?

You may try to connect this architecture with the overall design so that we know which part of Engula it is, instead of an isolated design.
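
For illustration only, here is a rough sketch of how the components named in the excerpt might fit together; none of these traits or method signatures are from Engula, they are placeholders for the concepts discussed above.

```rust
use std::sync::Arc;

/// Durable object storage, e.g. backed by S3; the source of truth.
pub trait BaseStorage: Send + Sync {
    fn put(&self, object: &str, data: &[u8]) -> std::io::Result<()>;
    fn get(&self, object: &str) -> std::io::Result<Vec<u8>>;
}

/// A cache node that holds copies of objects on local SSD or EBS.
pub trait CacheStorage: Send + Sync {
    fn put(&self, object: &str, data: &[u8]) -> std::io::Result<()>;
    fn get(&self, object: &str) -> Option<Vec<u8>>;
}

/// The master tracks which cache nodes hold which objects.
pub trait Master: Send + Sync {
    /// Cache nodes currently serving `object`, if any.
    fn locate(&self, object: &str) -> Vec<usize>;
    /// Record that `object` is now cached on node `node`.
    fn record_cached(&self, object: &str, node: usize);
}

/// The shared storage described in the RFC: one base storage and many cache
/// storages, coordinated by a master (the orchestrator is omitted here).
pub struct SharedStorage {
    pub master: Arc<dyn Master>,
    pub base: Arc<dyn BaseStorage>,
    pub caches: Vec<Arc<dyn CacheStorage>>,
}
```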


### Read path

To read an object, a client contacts the master to get a list of locations that serve the object. If the object is not cached, the client reads from the base storage and then notifies the master to cache the object for future reads.
Contributor commented:

Pseudocode or an ordered list could be better.
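
For illustration, the read path described in the excerpt might look roughly like the following sketch; every type and function name here is a hypothetical placeholder, not an Engula API.

```rust
// A minimal, self-contained sketch of the read path described above.
struct Client;

impl Client {
    fn read(&self, object: &str) -> Vec<u8> {
        // 1. Ask the master for cache locations that serve the object.
        let locations = self.locate_from_master(object);

        // 2. If the object is cached, read it from one of the cache nodes.
        for node in &locations {
            if let Some(data) = self.read_from_cache(*node, object) {
                return data;
            }
        }

        // 3. Otherwise read from the base storage (e.g. S3)...
        let data = self.read_from_base(object);

        // 4. ...and notify the master to cache the object for future reads.
        self.notify_master_to_cache(object);
        data
    }

    // Placeholder stubs so the sketch compiles; real implementations would
    // talk to the master, cache nodes, and base storage over the network.
    fn locate_from_master(&self, _object: &str) -> Vec<usize> { Vec::new() }
    fn read_from_cache(&self, _node: usize, _object: &str) -> Option<Vec<u8>> { None }
    fn read_from_base(&self, _object: &str) -> Vec<u8> { Vec::new() }
    fn notify_master_to_cache(&self, _object: &str) {}
}
```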


### Write path

To write an object, a client contacts the master to get a list of locations to store the object. The client must ensure that the object has been persisted in the base storage before claiming a successful write. The client can further ensure that the object has been cached to avoid reading from the base storage later.
Contributor commented:

Pseudocode or an ordered list could be better.
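
Likewise, a minimal sketch of the write path described in the excerpt, again with hypothetical placeholder names only:

```rust
// A minimal, self-contained sketch of the write path described above.
struct Client;

impl Client {
    fn write(&self, object: &str, data: &[u8]) -> std::io::Result<()> {
        // 1. Ask the master for a list of locations to store the object.
        let cache_nodes = self.locations_from_master(object);

        // 2. The write only succeeds once the object has been persisted in
        //    the base storage, which is the source of truth.
        self.write_to_base(object, data)?;

        // 3. Optionally also write to the cache nodes so that later reads
        //    do not have to hit the base storage.
        for node in cache_nodes {
            self.write_to_cache(node, object, data);
        }
        Ok(())
    }

    // Placeholder stubs so the sketch compiles.
    fn locations_from_master(&self, _object: &str) -> Vec<usize> { Vec::new() }
    fn write_to_base(&self, _object: &str, _data: &[u8]) -> std::io::Result<()> { Ok(()) }
    fn write_to_cache(&self, _node: usize, _object: &str, _data: &[u8]) {}
}
```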

tisonkun (Contributor) commented:

@huachaohuang Given the current state, I tend to regard this RFC and #280 as evolving design & implementation documents, rather than proposals that must be accepted before we start implementing.

In this way, we can keep prototyping while polishing these documents to stabilize our ideas, instead of being forced to merge something indeterminate or unclear.

huachaohuang (Owner, Author) commented:

> @huachaohuang Given the current state, I tend to regard this RFC and #280 as evolving design & implementation documents, rather than proposals that must be accepted before we start implementing.
>
> In this way, we can keep prototyping while polishing these documents to stabilize our ideas, instead of being forced to merge something indeterminate or unclear.

Sounds good to me. My original intention with this document was just to align on the design and guide the development, not to figure out all the details first. I think we can leave these PRs as drafts and evolve them with the implementation until they are stable enough.

huachaohuang marked this pull request as draft on January 18, 2022, 03:21.
huachaohuang (Owner, Author) commented:

Closed in favor of #361.

huachaohuang deleted the cached-storage-rfc branch on February 8, 2022.