Proposal: IPFS Content Providing #31

Merged Jun 23, 2021 (4 commits)

aschmahmann
Contributor

Taking a stab at a content routing proposal. cc @Stebalien @petar for some thoughts.

My take on the high-level content providing issues is that ResourcesPerProvide * NumberOfProvides * ProvideFrequency is too high. Decreasing any of these is valuable. This issue focuses primarily on decreasing the resources required per provide and on enabling our existing work on decreasing the number of things to provide (e.g. the roots Reprovider strategy).

I'm open to discussion on putting focus on other parts of the equation though.

Comment on lines +92 to +97
- Make IPFS public DHT `put`s take <3 seconds (i.e. come close to `get` performance)
  - Some techniques available include:
    - Decreasing DHT message timeouts to more reasonable levels
    - [Not requiring](https://github.com/libp2p/go-libp2p-kad-dht/issues/532) the "followup" phase for puts
    - Not requiring responses from all 20 peers before returning to the user
    - Not requiring responses from the 3 closest peers before aborting the query (e.g. perhaps 5 of the closest 10)
Contributor

having this framed as "do these things" rather than "get to these goals" will make this easier to scope / make it feel more concrete

Contributor Author

Are you referring to just "Make IPFS public DHT puts take <3 seconds" or to more of this section? The "take <3 seconds" part is mostly there because we don't have to do all of them if we hit our target with just a few of the optimizations. I listed them in order from what seems easiest to what seems hardest.

I can be more precise in this section, although I don't want to overly prescribe how this could be implemented.

Contributor

right. the 'puts take <3 seconds' seems like a 'how do we know we're done', rather than a 'plan for work'

Contributor Author

Good news, with some lessons learned: from libp2p/go-libp2p-kad-dht#709 it turns out that we have a prototype that seems to do the job and already comes in under 3s.

The big wins were:

  • Having large routing tables we intermittently refresh means lookups take 0 network hops
  • Changing the number of peers we wait on from a fixed 20 to a more flexible rule (e.g. wait for 30% of the 20 responses, then stop after a few hundred ms with no new responses) dealt with the long-tail slowness issues
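
The second rule can be sketched as a small simulation over response arrival times. This is only an illustration of the idea described above, not the code from libp2p/go-libp2p-kad-dht#709; the function name, the 300 ms quiet window, and the 10 s overall timeout are assumptions chosen for the example.

```python
import math


def stop_time(arrivals_ms, expected=20, min_percent=30, quiet_ms=300, timeout_ms=10_000):
    """Return (time_ms, responses_seen) at which a put would stop waiting.

    Hypothetical rule: wait until min_percent of the expected responses have
    arrived, then stop as soon as quiet_ms pass with no new response.
    """
    needed = math.ceil(expected * min_percent / 100)
    seen = 0
    last = 0
    for t in sorted(arrivals_ms):
        if t > timeout_ms:
            break  # anything later than the overall timeout is ignored
        # Quorum already met and the gap before this response exceeds the
        # quiet window: we would have stopped before it arrived.
        if seen >= needed and t - last > quiet_ms:
            return last + quiet_ms, seen
        seen += 1
        last = t
        if seen == expected:
            return t, seen  # everyone answered, nothing left to wait for
    if seen >= needed:
        return min(last + quiet_ms, timeout_ms), seen
    return timeout_ms, seen  # quorum never reached


if __name__ == "__main__":
    fast = [50, 60, 70, 80, 90, 100]   # six quick responses
    stragglers = [2000, 5000]          # long-tail peers
    print(stop_time(fast + stragglers))
```

With six quick responses and two stragglers, the wait ends one quiet window after the sixth response instead of being held up by the slow peers.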

- The work is useful even though a more comprehensive solution will eventually be put forward, meaning either:
  - Users are not willing to wait, or ecosystem growth is throttled, until we build a more comprehensive content routing solution
  - The changes made here are either useful independent of major content routing changes, or the changes are able to inform or build towards a more comprehensive routing solution

I think these projects are also useful for some byproducts they will have (worth counting):

  • They will probably entail designing/implementing extensible provider records (needed for payment systems, etc.)
  • They will probably entail upgrading the blockstore to a ref-counted, timestamped partial dag store; which is integral going forward for (i) any content routing caching algorithm, (ii) garbage collection.

Contributor Author

This would be nice, but I'm shrinking the scope here so we don't necessarily have to tackle these together

Probably the most visible primitive in the web3 dev stack is content addressing which allows someone to retrieve data via its CID no matter who has it. However, while content addressing allows a user to retrieve data from **anyone** it is still critical that there are systems in place that allow a user to find **someone** who has the data (i.e. content routing).

Executing well here would make it easier for users to utilize the IPFS public DHT, the most widely visible content routing solution in the IPFS space. This would dramatically improve usability, both the onboarding experience for new users and the experience of existing users, likely leading to ecosystem growth.

It would presumably also meet a specific ask from Pinata.

-->

Many of the components of this proposal increase development velocity by either exposing more precise tooling for debugging or working with users, or by directly enabling future work.

These projects will also likely further decouple content routing (and the complex caching algorithms it utilizes) from specific applications like bitswap and graphsync.

Thus enabling higher app developer velocity.

Contributor Author

This might be true, but isn't necessarily the case in the MVP here.

proposals/ipfs-content-providing.md (Outdated)
_How would a developer or user use this new capability?_
<!--(short paragraph)-->

Users who use go-ipfs would be able to tell what percentage of their provider records have made it out to the network in a given interval and would notice more of their content being discoverable via the IPFS public DHT. Additionally, users would have a number of configurable options available to them to both modify the throughput of their provider record advertisements and to advertise fewer provider records (e.g. only advertising pin roots)
Member

I remember discussing this one time. It would be a huge improvement for most real-world uses (package managers, Wikipedia snapshots).

Suggested change
Users who use go-ipfs would be able to tell what percentage of their provider records have made it out to the network in a given interval and would notice more of their content being discoverable via the IPFS public DHT. Additionally, users would have a number of configurable options available to them to both modify the throughput of their provider record advertisements and to advertise fewer provider records (e.g. only advertising pin roots)
Users who use go-ipfs would be able to tell what percentage of their provider records have made it out to the network in a given interval and would notice more of their content being discoverable via the IPFS public DHT. Additionally, users would have a number of configurable options available to them to both modify the throughput of their provider record advertisements and to advertise fewer provider records (e.g. only advertising pin roots, or only the root of each file is unixfs)

Contributor Author

I'd like to add this in too, but it might be out of scope for this project. It's an extra feature which, while valuable, might not be as high value as the other ones here.

proposals/ipfs-content-providing.md
These alternatives are not exclusive with the proposal

1. Focus on decreasing the number of provider records
   - e.g. Add more options for data reproviding such as for UnixFS files only advertising Files and Directories
Member

💯 we should add this as a new Reprovider.Strategy (thinking.. pinned+files-roots)

Agreed, that would be nice. Maybe only announce a file if the node has the whole file in cache?

It may be worth discussing whether that should be the default, for example for browser integrations (like Brave) and ipfs-desktop. If someone just wants to share some files, I don't see a reason to announce all chunks. Hunting for nodes that hold only a few single blocks of a file because of deduplication is probably not worth the effort of connecting to them.

Contributor

@BigLep left a comment

Thanks for putting this together - good stuff!

_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.

<!--Explain why this rating-->
2. We don't have direct market research demonstrating improving the resiliency of content routing will definitely lead to more people choosing IPFS or to work with the stack. However, this is a pain point for many of our users (as noted on the IPFS Matrix, Discuss and GitHub) and something we have encountered as an issue experienced by various major ecosystem members (Protocol Labs infra, Pinata, Infura, etc.).
Contributor

Do we have more data on:

  1. How this pain point has impacted them (e.g., has it prevented certain use cases)?
  2. How have they worked around it?
  3. What kind of performance they're expecting?

Contributor Author

  1. It's been a problem for some use cases like package management (e.g. "ipfs and pacman", ipfs/notes#84, and "IPFS and Gentoo Portage (distfiles)", ipfs/notes#296), and pinning services have had difficulty as well.
  2. Applications can sort of get around this by advertising application names (e.g. myApp) instead of data CIDs. However, this falls apart as the number of application users gets larger. For certain use cases ipfs-cluster could come in handy as well. Pinning services have a few different approaches, which are basically: 1) build a custom reprovider that tries to be a bit faster (although mostly by throwing more resources and parallelism at the problem rather than tweaking the underlying DHT client usage); 2) have really high connection limits so they're connected to tons of people, and permanently connect to major gateways.
  3. I'm not sure, but mostly they just want data added to go-ipfs to be made available for downloading without worrying about it and without it being crazy expensive to run.

Contributor

Thanks.

  1. It's been a problem for some use cases like package management (e.g. [ipfs and pacman ipfs/notes#84]

If you have any questions on this, @BigLep feel free to ask :)

2. We don't have direct market research demonstrating improving the resiliency of content routing will definitely lead to more people choosing IPFS or to work with the stack. However, this is a pain point for many of our users (as noted on the IPFS Matrix, Discuss and GitHub) and something we have encountered as an issue experienced by various major ecosystem members (Protocol Labs infra, Pinata, Infura, etc.).

## Project definition
#### Brief plan of attack
Contributor

Are there any new test scenarios that we'd need to develop? For example, as part of CI, should we have a test that asserts X advertisements can be made within Y seconds?

Contributor Author

It'd be nice to do in CI, especially if those tests are publicly viewable. However, it wouldn't be so bad to just check in on our metrics, since they report performance on go-ipfs master and the latest release and already include metrics on provide speed. If we want to test some of the massive providing strategies (e.g. huge routing tables + many provides), though, we'll likely need some more testing.

Contributor

Got it. I don't know the landscape well enough to have more input. A couple more thoughts:

  1. If there is fear of regression here, then having a test that can catch that seems reasonable.
  2. If we are going to advertise that customers with massive providing strategies will see improved performance, I think we'll want to verify this in some way and should include that in the work plan.

#### What does done look like?
_What specific deliverables should be completed to consider this project done?_

The project is done when users can see how much of their provide queue is complete, are able to allocate resources to increase their provide throughput until satisfied, and allocating resources is either not prohibitively expensive, or it is deemed too much work to decrease the resource allocation.

Thumbs up for "continuous transparency": seeing the state of providing at all times.

_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

- People have other issues that the DHT put performance is just masking, which means we will not immediately be able to see the impact from this project alone
- Users will not want to spend the raw bandwidth of emitting their records even if lookups are instant
Contributor

n00b question: Do any customers complain about bandwidth today?

Contributor Author

Not that I've heard of (although @Stebalien might have more info), but providing is pretty heavily limited so the DHT provide bandwidth is unlikely to be a problem today.

The question is around what happens next, i.e. once putting data in the DHT is fast there will still be users who aren't really able to use it.


Some back of the envelope math here is:

A user with 100M provider records where each record is 100 bytes (this is a large overestimate, it's more like 40, but we may want to add some more data to the records) who puts each record to 20 nodes every 24hrs uses about 200GB/day of upload bandwidth (100M × 100 bytes × 20). AWS egress prices are around $0.09/GB, so that is around $18/day.

Again this is an overestimate and might be dwarfed by the egress costs of serving the actual data or other associated costs, but it's not 0.
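
Worked through with the inputs quoted above, the estimate looks as follows. This is only a sketch of the arithmetic: the function name is made up for the example, the price is the approximate $0.09/GB figure from the comment, and the 30-day month is an added assumption.

```python
def provide_bandwidth_estimate(records, bytes_per_record, replicas, price_per_gb, days_per_month=30):
    """Back-of-the-envelope daily upload volume and egress cost for reproviding."""
    bytes_per_day = records * bytes_per_record * replicas
    gb_per_day = bytes_per_day / 1e9        # decimal gigabytes, as billed
    gib_per_day = bytes_per_day / 2**30     # binary gibibytes, for comparison
    cost_per_day = gb_per_day * price_per_gb
    return {
        "gb_per_day": gb_per_day,
        "gib_per_day": gib_per_day,
        "cost_per_day": cost_per_day,
        "cost_per_month": cost_per_day * days_per_month,
    }


if __name__ == "__main__":
    estimate = provide_bandwidth_estimate(
        records=100_000_000,   # 100M provider records
        bytes_per_record=100,  # deliberate overestimate; closer to 40 today
        replicas=20,           # each record is put to 20 nodes
        price_per_gb=0.09,     # approximate AWS egress price
    )
    for name, value in estimate.items():
        print(f"{name}: {value:,.2f}")
```

With these inputs the volume is 200 GB (about 186 GiB) per day. Dropping the record size to the more realistic 40 bytes scales everything down by a factor of 2.5.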

https://archive.org/ has 538B webpages. If every one of those webpages (the vast majority of which I assume are not normally accessed) were to be individually addressed and advertised in the DHT daily, it would be quite expensive.

Contributor

Thanks for the explanation and back-of-envelope math; makes sense. Given this info, I'm assuming most (something like 99%?) of customers won't care. I assume huge dataset customers have other special requirements/needs/setup that we'll have other work to make their journey delightful anyways. Given the desire to make IPFS an exceptional tool for developers, the bandwidth increase seems acceptable to take given the benefit.

@BigLep added the Steward Priority label (Stewards priority project due to enabling us to move faster and/or safer.) Apr 5, 2021
@aschmahmann marked this pull request as ready for review April 5, 2021 19:20
    - Not requiring responses from all 20 peers before returning to the user
    - Not requiring responses from the 3 closest peers before aborting the query (e.g. perhaps 5 of the closest 10)
- Add a function to the DHT for batch providing (and putting) and utilize it in go-ipfs
  - Tests with https://github.com/libp2p/go-libp2p-kad-dht/pull/709 showed tremendous speedups even in a single threaded provide loop if the provider records were sorted in XOR space
Contributor Author

With a very small number of failures we were able to reach around 3 puts per second for 1k puts and 20 provides per second for 60k puts. The provides per second should increase the more we do at a time. This is as opposed to 1 provide per 30 seconds.
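
To illustrate why sorting helps: if keys are ordered by their position in the Kademlia keyspace, consecutive keys tend to share long ID prefixes, so they are close in XOR distance and are likely to map to overlapping sets of closest peers. The sketch below is not the batch-providing code from the PR; the helper names are hypothetical, and hashing the key with SHA-256 to obtain its keyspace ID is an assumption made for the example.

```python
import hashlib


def kad_id(key: bytes) -> int:
    """Map a key to a 256-bit keyspace ID (assumed here: SHA-256 of the key)."""
    return int.from_bytes(hashlib.sha256(key).digest(), "big")


def sort_in_xor_space(keys):
    """Order keys by keyspace ID so neighbours in the list share long ID prefixes."""
    return sorted(keys, key=kad_id)


def common_prefix_len(a: int, b: int, bits: int = 256) -> int:
    """Number of leading bits two IDs share; larger means closer in XOR distance."""
    return bits - (a ^ b).bit_length()


if __name__ == "__main__":
    keys = [f"block-{i}".encode() for i in range(1000)]
    ordered = sort_in_xor_space(keys)
    ids = [kad_id(k) for k in ordered]
    shared = [common_prefix_len(a, b) for a, b in zip(ids, ids[1:])]
    print("average shared prefix between consecutive keys:", sum(shared) / len(shared))
```

Numeric order on the IDs is used because the XOR-closest pair in any set is always adjacent once the set is sorted, so a single pass over the sorted list keeps each provide near the previous one.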

- Enable downloading sub-DAGs when a user already has the root node, but is only advertising the root node
  - e.g. have Bitswap sessions know about the graph structure and walk up the graph to find providers when low on peers
- Add a new command to `go-ipfs` (e.g. `ipfs provide`) that at minimum allows users to see how many of their total provider records have been published (or failed) in the last 24 hours
- Add an option to go-libp2p-kad-dht for very large routing tables that are stored on disk and are periodically updated by scanning the network
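
The "walk up the graph" idea in the first item can be sketched as follows. This is a hypothetical illustration, not Bitswap's session logic: `parent_of` stands in for whatever child-to-parent links a session has recorded while traversing from the root, and `providers_of` stands in for a content routing lookup.

```python
def find_providers_walking_up(cid, parent_of, providers_of):
    """Return (cid_with_providers, providers) for the nearest ancestor that is advertised.

    parent_of:    dict mapping a child CID to the parent it was reached from
    providers_of: callable returning a list of providers for a CID (may be empty)
    """
    seen = set()
    current = cid
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles in the recorded links
        providers = providers_of(current)
        if providers:
            return current, providers
        current = parent_of.get(current)  # step towards the root
    return None, []


if __name__ == "__main__":
    # root -> dir -> leaf, with only the root advertised
    parents = {"leaf": "dir", "dir": "root"}
    advertised = {"root": ["peerA", "peerB"]}
    print(find_providers_walking_up("leaf", parents, lambda c: advertised.get(c, [])))
```

A peer that advertises only the root is still found for a leaf block, at the cost of one extra lookup per level walked.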

Contributor

BigLep commented Jun 1, 2021

The functionality here is happening as an experimental feature in go-ipfs 0.9 (see ipfs/kubo#8058).

Contributor Author

aschmahmann commented Jun 1, 2021

@BigLep most of it is, however "Enable downloading sub-DAGs when a user already has the root node, but is only advertising the root node" is not done yet.

If you wanted to we could reasonably close this issue and open a new one aimed at decreasing the number of provider records that need to be advertised in the system.

@jacobheun merged commit b712ea4 into main Jun 23, 2021
@jacobheun deleted the proposal/ipfs-content-providing branch June 23, 2021 20:06