
Review Filestore for 300TB Challenge. Update Stories + Specs #85

Closed · flyingzumwalt opened this issue Jan 13, 2017 · 10 comments

flyingzumwalt commented Jan 13, 2017

Review filestore -- declare what needs to be done to land it, at least for the 300TB Challenge, and call out any implications for ipfs-pack

Concise Filestore "spec": https://gist.github.com/whyrusleeping/5565651011b49ddc9ddeec9ffd169050

kevina commented Jan 16, 2017

The PR to review is ipfs/kubo#3368.

flyingzumwalt commented Jan 17, 2017

Relevant notes from Sprint Planning Call:

Conclusions from review:
The current implementation mixes porcelain UX concerns with the underlying implementation/plumbing. This makes the interfaces confusing & complicated. It also makes the underlying plumbing more complicated and less robust than it should be.
Best approach: take the pieces of the code that we need and package it as an experimental feature with simple, straightforward interfaces.

@jbenet & @whyrusleeping need to sit down (probably today) and figure out how they want to proceed with this. @flyingzumwalt will try to capture that info in the filestore Stories & Epics. Main things that need to be specified:

  • How to do the internals/plumbing
  • What the UX should look like

@flyingzumwalt flyingzumwalt changed the title Review Filestore for 300TB Challenge Review Filestore for 300TB Challenge. Update Stories + Specs Jan 17, 2017
@flyingzumwalt

@jbenet @whyrusleeping I'm leaving this open and "in progress" until we've updated the filestore Specs and Stories

kevina commented Jan 17, 2017 via email

jbenet commented Jan 18, 2017

@kevina context: ipfs/team-mgmt#309 (comment)

flyingzumwalt commented Jan 19, 2017

Reviewing Filestore for data.gov Sprint

Agenda:

  • review current Filestore and new, minimal spec
  • make list of possible constraints (features, UX considerations, systems problems)
  • make a clear implementation plan that we can run with

Background:

Notes

Worried about getting the UX wrong. Want to be careful to get it right.

Goals for this sprint

  • Make just enough porcelain now to support the data.gov effort.
  • Test it really well

... so it's important to distinguish between porcelain and plumbing

Bad things that must never happen:

  • files going missing

Examples: Dropbox & BitTorrent

The BitTorrent example is basically just bittorrent under the hood, but the UX is aimed at people who don't want to make torrent files, run torrent trackers, etc.

Dropbox is set up around UX of "track this directory"

  • haven't looked into how dropbox handles UX edge cases like a directory getting deleted remotely when someone adds a file to it

Things to consider

  • where will blocks be stored, how will they be stored, how might they be put into an s3 datastore
  • adding tracking of files that are already in the filestore
  • if we have ipfs-pack and can have an ipfs repo inside a pack, that creates exciting possibilities. How will that impact UX?

Existing flatfs code base

Lots of lessons learned from @kevina's efforts. We can probably re-use the protobuf code and some of the commands.

Constraints

Given we have a repo and are going to add objects to the repo that reference stuff elsewhere on the filesystem at specific offsets with specific sizes

What happens when

  • file is not readable
  • it's not the right data
  • (consistency issues) somehow some of the objects have been GC'ed, so when you download that content based on the available objects, the file is incomplete/corrupt

Possibility: "Filestore" on both sides -- aka ipfs get-as-filestore: pull down the blocks, "export" them to a location on the filesystem, and delete the blocks from the ipfs repo (see the sketch after this list)

  • implies an "export to filestore" function that must be streamable (ie. while downloading blocks, stream them out to filesystem)
  • in order to do this smoothly, you need the entire listing of blocks-to-download before starting the download.
  • runs a risk of breaking the abstraction layers -- forces bitswap to be aware of where the block is going
    • but there might be ways to work around this
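
A minimal Go sketch of that streaming export, assuming a hypothetical precomputed listing of block positions and a channel of arriving blocks (names are invented for illustration, not an existing API):

```go
package export

import "os"

// block is one downloaded block, keyed by its CID.
type block struct {
	cid  string
	data []byte
}

// blockPos is one record from the precomputed listing: where a block's
// bytes belong in the exported file.
type blockPos struct {
	offset int64
}

// exportStreaming writes each arriving block straight to its final position
// in the target file, so blocks never have to land in the local repo first.
// The full position listing must exist before the download starts.
func exportStreaming(path string, positions map[string]blockPos, blocks <-chan block) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	for b := range blocks {
		pos, ok := positions[b.cid]
		if !ok {
			continue // an intermediate (non-leaf) node: no file bytes
		}
		if _, err := f.WriteAt(b.data, pos.offset); err != nil {
			return err
		}
	}
	return nil
}
```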

Interesting use case: I've got a laptop and lots of hard drives. I want to help back up data.gov. I attach a big drive, pull down data, routing it onto the attached drive, and then detach the drive. I frequently close my laptop (ie. to put it in my backpack). Occasionally I reconnect the drive to share the data.

Check-in: Why are we doing this work?

  1. People with existing content want to add that content without duplicating it.
    • the most prominent cases for this involve immutable content. Can we just assume the users want the blocks to be immutable? -- i.e. ditch use cases where users are allowed to modify stuff that's referenced by filestore
  2. We want people to be able to interact with files using posix tools & methodologies
  3. We want data/content to exist independently of devices
    • "filesystem location" is a metaphor

Conclusions

Short-term Assumption: filestore is for files that won't be changed casually

For this current sprint, we're dealing with data that will not be changed/moved. The files & directories that Jack registers using filestore will stay in place. Renaming or moving those files would be a notable event, where we can expect Jack to explicitly rebuild the ipfs-pack manifest (and therefore update the filestore lookup tables).

This allows us to set aside (for now) use cases where users will casually and frequently change or rename files. For those use cases, at least for now, we encourage people to use Fuse to mount IPFS content (which might have been added using filestore), and update/rename the content there.

How filestore will work

When you want "filestore" behavior, you generate an ipfs-pack with its own ipfs repo configured to locate its blocks using "relative paths" to the content you're referencing. The structure of a pack is similar to git repositories -- the .ipfs directory sits in the root of the pack, alongside the other files/directories at the root of the pack. By default the relative path can only reference content inside of the pack -- just as git doesn't let you add content outside the repo.

"descendants only" can be turned off but strongly discourage it because moving the ipfs-pack would break all the paths, etc...

Export to a pack

Exporting IPFS content to an ipfs-pack is an inevitable use case. ipget could be the tool (or starting point for a tool) to do this -- give it an ipfs hash and it will build packs from the existing ipfs content/blocks.

Stories

Register a filesystem Directory in IPFS without Duplicating Content

recorded as #92

Given:
I have a directory on my filesystem whose contents I want to serve through IPFS

Then:
I register the directory with IPFS, which turns it into an ipfs-pack. I will then be able to serve that content through IPFS by either running the ipfs-pack as a node or by relying on a local (intermediary) IPFS node.

How the Internals Should Work

When I tell IPFS to register a directory, it first turns the directory into an IPFS Pack. To do this, it

  • builds an ipfs-pack manifest for the directory
  • writes the ipfs-pack manifest in the root of the directory (one plausible manifest shape is sketched after this list)
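
These notes don't fix a manifest format; as an assumption, a minimal manifest could be one line per file, pairing the hash a file was added as with its pack-relative path (all fields illustrative):

```
<hash-of-file>  <encoding/chunking info>  ./datasets/records.csv
<hash-of-file>  <encoding/chunking info>  ./README
```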

It then registers the pack with your local ipfs node(s). To do this it

  • generates an .ipfs repo in the root of the pack directory
  • configures the repo to use filestore mode, meaning that it uses "relative paths" to resolve the content of its blocks
  • "adds" the ipfs-pack manifest to the repository's filestore

As described above, the structure of a pack is similar to a git repository -- the .ipfs directory sits in the root of the pack, alongside the other files/directories, and by default relative paths can only reference content inside the pack.
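
To make the "relative paths" idea concrete, here is a minimal Go sketch of how a filestore-mode repo might resolve a block from a (relative path, offset, size) record. All names are invented for illustration; this is not the actual kubo filestore code:

```go
package filestore

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// entry is a hypothetical filestore record: instead of the block's bytes,
// the repo stores where to find them inside the pack.
type entry struct {
	relPath string // relative to the pack root
	offset  int64
	size    int64
}

// readBlock reconstructs a block's bytes by reading them out of the
// referenced file. In "descendants only" mode the path must stay inside
// the pack root, just as git refuses paths outside the work tree.
func readBlock(packRoot string, e entry) ([]byte, error) {
	full := filepath.Join(packRoot, e.relPath)
	if !strings.HasPrefix(filepath.Clean(full), filepath.Clean(packRoot)+string(filepath.Separator)) {
		return nil, fmt.Errorf("path %q escapes the pack root", e.relPath)
	}
	f, err := os.Open(full)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, e.size)
	if _, err := f.ReadAt(buf, e.offset); err != nil {
		return nil, err
	}
	// A real implementation would re-hash buf against the block's CID to
	// catch the "it's not the right data" failure listed earlier.
	return buf, nil
}
```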

Serve Content Directly From a Pack to IPFS Network (no intermediary node)

recorded as #108

Given
I have an IPFS pack, which contains an IPFS repository, and I want to serve that content directly to the network.

Then
I start an IPFS node in filestore mode that uses the ipfs-pack as its filestore. It serves the contents of the pack directly to the network.

Serve Content from Local Packs Through a Regular IPFS node

recorded as #109

Given
I have an OS with multiple packs on it, and I want to run a single IPFS node and serve the content from all the packs through that node.

Then
...

Update, Rename or Delete Contents of an IPFS Pack

recorded as #93

Given
I have modified some of the files & directories in an IPFS pack, and I want to update the pack's manifest to reflect those changes.

Then
...

Generate a New IPFS Pack from Existing IPFS Content

recorded as #90

Given
I have the hash of a directory that was previously added to IPFS, I want to build an IPFS pack containing the files & directories corresponding to the hash.

Then
...

Note: ipget could be the tool (or starting point for a tool) to do this -- give it an ipfs hash and it will build packs from the existing ipfs content/blocks.

Selectively Add Files and Directories to the Pack

recorded as #127

Given I only want to add some of the files and/or subdirectories to the IPFS pack.

Then I follow these steps:

  • Initialize an empty ipfs-pack, preventing it from populating the ipfs-pack manifest (ie. --populate-manifest false)
  • Manually add each file or directory to the manifest

Selectively Ignore Files and Directories

recorded as #128

Given
There are files and/or sub-directories that I do not want added to the pack,

Then
I add the files to a .ipfsignore file and then build the pack manifest.
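
The .ipfsignore syntax isn't specified in these stories. Assuming gitignore-style glob patterns, one per line with # comments (an assumption, not a confirmed format), a Go sketch of loading and applying the ignore file:

```go
package pack

import (
	"bufio"
	"os"
	"path/filepath"
	"strings"
)

// loadIgnore reads one glob pattern per line from .ipfsignore, skipping
// blanks and #-comments. The syntax is assumed, not specified.
func loadIgnore(packRoot string) ([]string, error) {
	f, err := os.Open(filepath.Join(packRoot, ".ipfsignore"))
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // no ignore file: nothing is excluded
		}
		return nil, err
	}
	defer f.Close()

	var patterns []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := strings.TrimSpace(s.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		patterns = append(patterns, line)
	}
	return patterns, s.Err()
}

// ignored reports whether a pack-relative path matches any pattern, so the
// manifest builder can skip it.
func ignored(patterns []string, rel string) bool {
	for _, p := range patterns {
		if ok, _ := filepath.Match(p, rel); ok {
			return true
		}
	}
	return false
}
```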

Use a Single Pack to Track Many Files Across an OS

recorded as #129

Given:

  • I have many files spread across my OS
  • I want one ipfs-pack with one manifest for all of my files

Option 1: put the ipfs-pack at the root of your filesystem and selectively add files (see Story: Selectively Add Files)

Option 2: Disable "internal paths only" mode on filestore. By default, filestore uses only "internal paths", meaning that it only allows you to reference files that are inside the IPFS pack's root directory. This is similar to git, whose repositories only allow you to add content that is inside the git repository's working directory.

"internal paths only" mode can be turned off but we strongly discourage it because moving the ipfs-pack would break all the paths, etc...

kevina commented Jan 19, 2017

@flyingzumwalt did you mean to keep this issue closed?

@flyingzumwalt

Sorry @kevina. That was a mistake.

kevina commented Jan 20, 2017

Overall I really like this idea. Depending on what @jbenet thinks of my basic filestore implementation (ipfs/kubo#3368) and what he had in mind for the layout of ipfs-pack (if anything), I might be able to implement this fairly quickly (like over the weekend).

My proposal is to just make the filestore-level DB be version 0 of the ipfs pack, with some additional metadata to determine the root hash, which will be a unixfs directory node. The index will be a unixfs directory node that can be stored in the filestore DB directly (the protobuf format used by the filestore can already store non-leaf nodes). I can understand, however, if we want a simpler format (perhaps text based) that does not depend on implementation details.

I do not like the idea of non-internal paths. Rather, I propose that if a user wants them, we require absolute paths; that is far less brittle. For the purposes of this sprint I propose we just don't allow them.

In order to use multiple packs at the same time we are going to require some sort of multi-datastore. If too many packs are attached to an ipfs node there could be performance problems, as, without some sort of indexing, each pack will need to be searched in turn. A bloom filter for each pack can help, but not eliminate, the problem.
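
To illustrate that lookup cost, a minimal Go sketch of probing attached packs in turn, with a per-pack bloom filter as the only shortcut (interfaces invented for illustration, not an existing kubo API):

```go
package multistore

import "errors"

var errNotFound = errors.New("block not found in any pack")

// bloom is the minimal interface a per-pack filter needs: MayContain may
// return false positives but never false negatives.
type bloom interface {
	MayContain(key string) bool
}

// pack pairs a datastore-like lookup with its filter.
type pack struct {
	filter bloom
	get    func(key string) ([]byte, bool)
}

// getFromPacks probes each attached pack in turn. The bloom filter lets us
// skip most packs without touching disk, but a key living in the last pack
// (or absent everywhere) still costs a pass over all of them -- the
// performance problem noted above.
func getFromPacks(packs []pack, key string) ([]byte, error) {
	for _, p := range packs {
		if !p.filter.MayContain(key) {
			continue // definitely not in this pack
		}
		if data, ok := p.get(key); ok {
			return data, nil
		}
		// false positive: keep searching
	}
	return nil, errNotFound
}
```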

Pinning, and in particular the Garbage Collector, is going to be a problem. I am going to assume that an ipfs pack is immutable, therefore its blocks should never be deleted by the GC (it may even be impossible to delete them when the pack is on a read-only filesystem). One way to solve this would be to use a recursive pin on the root unixfs directory node. However, this will cause the garbage collector to do a huge amount of unnecessary work and use up a lot of unnecessary memory. Basically it will first walk the root node and load the hashes of all the blocks in the pack into memory. It will then iterate over all the hashes of the blocks in the pack only to discover that none of them can be collected. A better way would be to have the GC ignore what is in a pack altogether.
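
A sketch of that "ignore packs" alternative, under the assumption above that pack contents are immutable (types invented for illustration):

```go
package gc

// store is a hypothetical view of one attached datastore.
type store struct {
	isPack bool
	keys   func() []string // enumerate the store's block keys
}

// collectCandidates gathers keys the GC may delete. Blocks living in a pack
// are treated as immutable, so the collector never enumerates them at all --
// no recursive pin walk, no per-block membership set held in memory.
func collectCandidates(stores []store, pinned map[string]bool) []string {
	var candidates []string
	for _, s := range stores {
		if s.isPack {
			continue // pack contents are implicitly protected
		}
		for _, k := range s.keys() {
			if !pinned[k] {
				candidates = append(candidates, k)
			}
		}
	}
	return candidates
}
```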

I tried to propose a solution to the multi-datastore problem in ipfs/kubo#3119 and ipfs/kubo#3257. Due to lack of time on the core team's part, the proposal in #3119 never received any serious consideration. When the implementation was reviewed by @whyrusleeping he (mostly in a private IRC conversation) rejected my idea of considering all filestore objects implicitly pinned, due to the complexities it added to the interface. We might be able to punt on the GC and pinning issues for this sprint, but I hope that we can revisit the issue sometime soon.

@flyingzumwalt

The code has been reviewed. Work is proceeding with issues like #92
