
Review Filestore for 300TB Challenge. Update Stories + Specs #85

Closed · flyingzumwalt opened this issue Jan 13, 2017 · 10 comments

flyingzumwalt commented Jan 13, 2017

Review filestore -- declare what needs to be done to land it, at least for the 300TB Challenge, and call out any implications for ipfs-pack

Concise Filestore "spec": https://gist.github.com/whyrusleeping/5565651011b49ddc9ddeec9ffd169050

kevina commented Jan 16, 2017

The PR to review is ipfs/kubo#3368.

flyingzumwalt commented Jan 17, 2017

Relevant notes from Sprint Planning Call:

Conclusions from review:
The current implementation mixes porcelain UX concerns with the underlying implementation/plumbing. This makes the interfaces confusing & complicated. It also makes the underlying plumbing more complicated and less robust than it should be.
Best approach: take the pieces of the code that we need and package it as an experimental feature with simple, straightforward interfaces.

@jbenet & @whyrusleeping need to sit down (probably today) and figure out how they want to proceed with this. @flyingzumwalt will try to capture that info in the filestore Stories & Epics. Main things that need to be specified:

  • How to do the internals/plumbing
  • What the UX should look like

@flyingzumwalt flyingzumwalt changed the title Review Filestore for 300TB Challenge Review Filestore for 300TB Challenge. Update Stories + Specs Jan 17, 2017
@flyingzumwalt

@jbenet @whyrusleeping I'm leaving this open and "in progress" until we've updated the filestore Specs and Stories

kevina commented Jan 17, 2017 via email

jbenet commented Jan 18, 2017

@kevina context: ipfs/team-mgmt#309 (comment)

flyingzumwalt commented Jan 19, 2017

Reviewing Filestore for data.gov Sprint

Agenda:

  • review current Filestore and new, minimal spec
  • make list of possible constraints (features, UX considerations, systems problems)
  • make a clear implementation plan that we can run with

Background:

Notes

Worried about getting the UX wrong. Want to be careful to get it right.

Goals for this sprint

  • Make just enough porcelain now to support the data.gov effort.
  • Test it really well

... so it's important to distinguish between porcelain and plumbing

Bad things that must never happen:

  • files going missing

Examples: Dropbox & BitTorrent

The BitTorrent example is basically just bittorrent under the hood, but the UX is aimed at people who don't want to make torrent files, run torrent trackers, etc.

Dropbox is set up around UX of "track this directory"

  • haven't looked into how dropbox handles UX edge cases like a directory getting deleted remotely when someone adds a file to it

Things to consider

  • where will blocks be stored, how will they be stored, how might they be put into an s3 datastore
  • adding tracking of files that are already in the filestore
  • if we have ipfs-pack and can have an ipfs repo inside a pack, that creates exciting possibilities. How will that impact UX?

Existing flatfs code base

Lots of lessons learned from @kevina's efforts. We can probably re-use the protobuf code and some of the commands.

Constraints

Given we have a repo and are going to add objects to the repo that reference stuff elsewhere on the filesystem at specific offsets with specific sizes

What happens when

  • file is not readable
  • it's not the right data
  • (consistency issues) somehow some of the objects have been GC'ed, so when you download that content based on the available objects, the file is incomplete/corrupt

Possibility: "Filestore" on both sides -- aka ipfs get-as-filestore: pull down the blocks, "export" them to a location on the filesystem, and delete the blocks from the ipfs repo (see the sketch after this list)

  • implies an "export to filestore" function that must be streamable (ie. while downloading blocks, stream them out to filesystem)
  • in order to do this smoothly, you need the entire listing of blocks-to-download before starting the download.
  • runs a risk of breaking the abstraction layers -- forces bitswap to be aware of where the block is going
    • but there might be ways to work around this
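
A minimal Go sketch of that streaming export, assuming a hypothetical precomputed listing of block positions and a channel of arriving blocks (names are invented for illustration, not an existing API):

```go
package export

import "os"

// block is one downloaded block, keyed by its CID.
type block struct {
	cid  string
	data []byte
}

// blockPos is one record from the precomputed listing: where a block's
// bytes belong in the exported file.
type blockPos struct {
	offset int64
}

// exportStreaming writes each arriving block straight to its final position
// in the target file, so blocks never have to land in the local repo first.
// The full position listing must exist before the download starts.
func exportStreaming(path string, positions map[string]blockPos, blocks <-chan block) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	for b := range blocks {
		pos, ok := positions[b.cid]
		if !ok {
			continue // an intermediate (non-leaf) node: no file bytes
		}
		if _, err := f.WriteAt(b.data, pos.offset); err != nil {
			return err
		}
	}
	return nil
}
```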

Interesting use case: I've got a laptop and lots of hard drives. I want to help back up data.gov. I attach a big drive, pull down data, routing it onto the attached drive, and then detach the drive. I frequently close my laptop (ie. to put it in my backpack). Occasionally I reconnect the drive to share the data.

Check-in: Why are we doing this work?

  1. People with existing content want to add that content without duplicating it.
    • the most prominent cases for this involve immutable content. Can we just assume the users want the blocks to be immutable? -- i.e. ditch use cases where users are allowed to modify stuff that's referenced by filestore
  2. We want people to be able to interact with files using posix tools & methodologies
  3. We want data/content to exist independently of devices
    • "filesystem location" is a metaphor

Conclusions

Short-term Assumption: filestore is for files that won't be changed casually

For this current sprint, we're dealing with data that will not be changed/moved. The files & directories that Jack registers using filestore will stay in place. Renaming or moving those files would be a notable event, where we can expect Jack to explicitly rebuild the ipfs-pack manifest (and therefore update the filestore lookup tables).

This allows us to set aside (for now) use cases where users will casually and frequently change or rename files. For those use cases, at least for now, we encourage people to use Fuse to mount IPFS content (which might have been added using filestore), and update/rename the content there.

How filestore will work

When you want "filestore" behavior, you generate an ipfs-pack with its own ipfs repo configured to locate its blocks using "relative paths" to the content you're referencing. The structure of a pack is similar to git repositories -- the .ipfs directory sits in the root of the pack, alongside the other files/directories at the root of the pack. By default the relative path can only reference content inside of the pack -- just as git doesn't let you add content outside the repo.

"descendants only" can be turned off but strongly discourage it because moving the ipfs-pack would break all the paths, etc...

Export to a pack

Exporting IPFS content to an ipfs-pack is an inevitable use case. ipget could be the tool (or starting point for a tool) to do this -- give it an ipfs hash and it will build packs from the existing ipfs content/blocks.

Stories

Register a filesystem Directory in IPFS without Duplicating Content

recorded as #92

Given:
I have a directory on my filesystem whose contents I want to serve through IPFS

Then:
I register the directory with IPFS, which turns it into an ipfs-pack. I will then be able to serve that content through IPFS by either running the ipfs-pack as a node or by relying on a local (intermediary) IPFS node.

How the Internals Should Work

When I tell IPFS to register a directory, it first turns the directory into an IPFS Pack. To do this, it

  • builds an ipfs-pack manifest for the directory
  • writes the ipfs-pack manifest in the root of the directory (one plausible manifest shape is sketched after this list)
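
These notes don't fix a manifest format; as an assumption, a minimal manifest could be one line per file, pairing the hash a file was added as with its pack-relative path (all fields illustrative):

```
<hash-of-file>  <encoding/chunking info>  ./datasets/records.csv
<hash-of-file>  <encoding/chunking info>  ./README
```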

It then registers the pack with your local ipfs node(s). To do this it

  • generates an .ipfs repo in the root of the pack directory
  • configures the repo to use filestore mode, meaning that it uses "relative paths" to resolve the content of its blocks
  • "adds" the ipfs-pack manifest to the repository's filestore

As described above, the structure of a pack is similar to a git repository -- the .ipfs directory sits in the root of the pack, alongside the other files/directories, and by default relative paths can only reference content inside the pack.
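
To make the "relative paths" idea concrete, here is a minimal Go sketch of how a filestore-mode repo might resolve a block from a (relative path, offset, size) record. All names are invented for illustration; this is not the actual kubo filestore code:

```go
package filestore

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// entry is a hypothetical filestore record: instead of the block's bytes,
// the repo stores where to find them inside the pack.
type entry struct {
	relPath string // relative to the pack root
	offset  int64
	size    int64
}

// readBlock reconstructs a block's bytes by reading them out of the
// referenced file. In "descendants only" mode the path must stay inside
// the pack root, just as git refuses paths outside the work tree.
func readBlock(packRoot string, e entry) ([]byte, error) {
	full := filepath.Join(packRoot, e.relPath)
	if !strings.HasPrefix(filepath.Clean(full), filepath.Clean(packRoot)+string(filepath.Separator)) {
		return nil, fmt.Errorf("path %q escapes the pack root", e.relPath)
	}
	f, err := os.Open(full)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, e.size)
	if _, err := f.ReadAt(buf, e.offset); err != nil {
		return nil, err
	}
	// A real implementation would re-hash buf against the block's CID to
	// catch the "it's not the right data" failure listed earlier.
	return buf, nil
}
```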

Serve Content Directly From a Pack to IPFS Network (no intermediary node)

recorded as #108

Given
I have an IPFS pack, which contains an IPFS repository, and I want to serve that content directly to the network.

Then
I start an IPFS node in filestore mode that uses the ipfs-pack as its filestore. It serves the contents of the pack directly to the network.

Serve Content from Local Packs Through a Regular IPFS node

recorded as #109

Given
I have an OS with multiple packs on it, and I want to run a single IPFS node and serve the content from all the packs through that node.

Then
...

Update, Rename or Delete Contents of an IPFS Pack

recorded as #93

Given
I have modified some of the files & directories in an IPFS pack, and I want to update the pack's manifest to reflect those changes.

Then
...

Generate a New IPFS Pack from Existing IPFS Content

recorded as #90

Given
I have the hash of a directory that was previously added to IPFS, I want to build an IPFS pack containing the files & directories corresponding to the hash.

Then
...

Note: ipget could be the tool (or starting point for a tool) to do this -- give it an ipfs hash and it will build packs from the existing ipfs content/blocks.

Selectively Add Files and Directories to the Pack

recorded as #127

Given I only want to add some of the files and/or subdirectories to the IPFS pack.

Then I follow these steps:

  • Initialize an empty ipfs-pack, preventing it from populating the ipfs-pack manifest (ie. --populate-manifest false)
  • Manually add each file or directory to the manifest

Selectively Ignore Files and Directories

recorded as #128

Given
There are files and/or sub-directories that I do not want added to the pack,

Then
I add the files to a .ipfsignore file and then build the pack manifest.
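
The .ipfsignore syntax isn't specified in these stories. Assuming gitignore-style glob patterns, one per line with # comments (an assumption, not a confirmed format), a Go sketch of loading and applying the ignore file:

```go
package pack

import (
	"bufio"
	"os"
	"path/filepath"
	"strings"
)

// loadIgnore reads one glob pattern per line from .ipfsignore, skipping
// blanks and #-comments. The syntax is assumed, not specified.
func loadIgnore(packRoot string) ([]string, error) {
	f, err := os.Open(filepath.Join(packRoot, ".ipfsignore"))
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // no ignore file: nothing is excluded
		}
		return nil, err
	}
	defer f.Close()

	var patterns []string
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := strings.TrimSpace(s.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		patterns = append(patterns, line)
	}
	return patterns, s.Err()
}

// ignored reports whether a pack-relative path matches any pattern, so the
// manifest builder can skip it.
func ignored(patterns []string, rel string) bool {
	for _, p := range patterns {
		if ok, _ := filepath.Match(p, rel); ok {
			return true
		}
	}
	return false
}
```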

Use a Single Pack to Track Many Files Across an OS

recorded as #129

Given:

  • I have many files spread across my OS
  • I want one ipfs-pack with one manifest for all of my files

Option 1: put the ipfs-pack at the root of your filesystem and selectively add files (see Story: Selectively Add Files)

Option 2: Disable "internal paths only" mode on filestore. By default, filestore uses only "internal paths", meaning that it only allows you to reference files that are inside the IPFS pack's root directory. This is similar to git, whose repositories only allow you to add content that is inside the git repository's working directory.

"internal paths only" mode can be turned off but we strongly discourage it because moving the ipfs-pack would break all the paths, etc...

kevina commented Jan 19, 2017

@flyingzumwalt did you mean to keep this issue closed?

@flyingzumwalt

Sorry @kevina. That was a mistake.

kevina commented Jan 20, 2017

Overall I really like this idea. Depending on what @jbenet thinks of my basic filestore implementation (ipfs/kubo#3368) and what he had in mind for the layout of ipfs-pack (if anything), I might be able to implement this fairly quickly (like over the weekend).

My proposal is to just make the filestore-level DB be version 0 of the ipfs pack, with some additional metadata to determine the root hash, which will be a unixfs directory node. The index will be a unixfs directory node that can be stored in the filestore DB directly (the protobuf format used by the filestore can already store non-leaf nodes). I can understand, however, if we want a simpler format (perhaps text based) that does not depend on implementation details.

I do not like the idea of non-internal paths. Rather, I propose that if a user wants them, we require absolute paths; that is far less brittle. For the purposes of this sprint I propose we just don't allow them.

In order to use multiple packs at the same time we are going to require some sort of multi-datastore. If too many packs are attached to an ipfs node there could be performance problems, as, without some sort of indexing, each pack will need to be searched in turn. A bloom filter for each pack can help, but not eliminate, the problem.
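
To illustrate that lookup cost, a minimal Go sketch of probing attached packs in turn, with a per-pack bloom filter as the only shortcut (interfaces invented for illustration, not an existing kubo API):

```go
package multistore

import "errors"

var errNotFound = errors.New("block not found in any pack")

// bloom is the minimal interface a per-pack filter needs: MayContain may
// return false positives but never false negatives.
type bloom interface {
	MayContain(key string) bool
}

// pack pairs a datastore-like lookup with its filter.
type pack struct {
	filter bloom
	get    func(key string) ([]byte, bool)
}

// getFromPacks probes each attached pack in turn. The bloom filter lets us
// skip most packs without touching disk, but a key living in the last pack
// (or absent everywhere) still costs a pass over all of them -- the
// performance problem noted above.
func getFromPacks(packs []pack, key string) ([]byte, error) {
	for _, p := range packs {
		if !p.filter.MayContain(key) {
			continue // definitely not in this pack
		}
		if data, ok := p.get(key); ok {
			return data, nil
		}
		// false positive: keep searching
	}
	return nil, errNotFound
}
```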

Pinning, and in particular the Garbage Collector, is going to be a problem. I am going to assume that an ipfs pack is immutable, therefore its blocks should never be deleted by the GC (it may even be impossible to delete them when the pack is on a read-only filesystem). One way to solve this would be to use a recursive pin on the root unixfs directory node. However, this will cause the garbage collector to do a huge amount of unnecessary work and use up a lot of unnecessary memory. Basically it will first walk the root node and load the hashes of all the blocks in the pack into memory. It will then iterate over all the hashes of the blocks in the pack only to discover that none of them can be collected. A better way would be to have the GC ignore what is in a pack altogether.
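
A sketch of that "ignore packs" alternative, under the assumption above that pack contents are immutable (types invented for illustration):

```go
package gc

// store is a hypothetical view of one attached datastore.
type store struct {
	isPack bool
	keys   func() []string // enumerate the store's block keys
}

// collectCandidates gathers keys the GC may delete. Blocks living in a pack
// are treated as immutable, so the collector never enumerates them at all --
// no recursive pin walk, no per-block membership set held in memory.
func collectCandidates(stores []store, pinned map[string]bool) []string {
	var candidates []string
	for _, s := range stores {
		if s.isPack {
			continue // pack contents are implicitly protected
		}
		for _, k := range s.keys() {
			if !pinned[k] {
				candidates = append(candidates, k)
			}
		}
	}
	return candidates
}
```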

I tried to propose a solution to the multi-datastore problem in ipfs/kubo#3119 and ipfs/kubo#3257. Due to lack of time on the core team's part, the proposal in #3119 never received any serious consideration. When the implementation was reviewed by @whyrusleeping he (mostly in a private IRC conversation) rejected my idea of considering all filestore objects implicitly pinned, due to the complexities it added to the interface. We might be able to punt on the GC and pinning issues for this sprint, but I hope that we can revisit the issue sometime soon.

@flyingzumwalt

The code has been reviewed. Work is proceeding with issues like #92
