[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600

kevina · 2016-04-22T21:05:40Z

NOT READY FOR MERGE

This is a work in progress, but I wanted to get some early feedback on my work towards #875. Implementing this feature touched a lot of code and requires some API changes.

Basic adding of files without storing the data now works. This currently needs to be done when the node is offline. To use:

add --no-copy <some local file>

will add the file without copying the blocks. If the file is moved or changed than any blocks created from that file will become invalid.

Notes for review:

The first two commits will not be in the final full request. The commit "Refactor Makefile" has been submitted separately and the commit "Temporary disable a failing test" should be dealt with separately.
All of the commits with "Required for avoid duplicating files added to ipfs #875" mean what the commit says and are the ones that I want early feedback on right now.
Some of the commits directly modify go-datastore in the Godeps/ directory. I did this for simplicity and also taked to @whyrusleeping on irc about this. This will be cleaned up before the final pull request.

Move the go commands that should run under cmd/ipfs in the Makefile in cmd/ipfs rather than doing a "cd cmd/ipfs && go ..." in the root Makefile. The "cd cmd/ipfs && go ..." lines causes problems with GNU Emacs's compilation mode. With the current setup Emacs is unable to jump to the location of the error outputted by go compiler as it can not find the source file. The problem is that the embedded "cd" command causes Emacs's compilation mode to lose track of the current directory and thus attempts to look for the source file in the wrong directory. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

jefft0 · 2016-04-26T19:55:15Z

Hi Kevin. Has someone looked at your pull request? Do you need someone to try it? (Or wait for it to be more baked?)

whyrusleeping · 2016-04-26T19:58:42Z

I've looked over a bit of it, holding off on leaving any comments or doing a serious review until its asked for

kevina · 2016-04-26T20:12:21Z

@whyrusleeping Please do a serious review on what functions, keeping in mind this is a work in progress. Here is what works as of now:

ipfs add --no-copy FILE to add a file without copying in offline mode. If the server is not offline you will get a error about a non-absolute pathname.
ipfs filestore ls list files in the filestore
ipfs filestore verify verify the contents in the filestore.

There is still lots to do, but I am hesitant on putting too much work until I get some feedback that I am heading down the right path.

@jefft0 yes please try it out

Both: Commit up to (and including) f26c2df is stable and I won't do any forced updates until this is closer to getting into master. Commits after that are newer and slightly less stable and I might do a forced update if I discover a problem (for example tests failing). If this is a problem for either of you let me know.

Edit: I had to rebase to correct a mistake, now everything up to (and including) f26c2df should be stable.

Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Also change other paths to be absolute. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

None of the other methods in the measure package return this error, instead they only call RecordValue() when the value is []byte. This change makes batch Put consistent with the other methods and allows non []byte data to be passed though the measure datastore. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

The DataPtr points to the location of the data within a file on the file system. It the node is a leaf it also contains an alternative serialization of the Node or Block that does not contain the data. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

The datastore has an optional "advanced" datastore that handles Put requests for non []byte values, a "normal" datastore that handles all other put requests, and then any number of other datastore, some of them that can be designated read-only. Delete requests are passed on to all datastore not designed read-only. For now, querying will only work on a "normal" datastore. Note: Only tested in the case of just a "normal" datastore and the case of an "advanced" and "normal" datastore. Towards #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

This involved: 1) Constructing an alternative data object that instead of raw bytes is a DataPtr with information on where the data is on the file system and enough other information in AltData to reconstruct the Merkle-DAG node. 2) A new datastore "filestore" that stores just the information in DataPtr. When retrieving blocks the Merkle-DAG node is reconstructed from combining AltData with the data from the file in the file system. Because the datastore needs to reconstruct the node it needs access to the Protocol Buffers for "merkledag" and "unixfs" and thus, for now, lives in go-ipfs instead of go-datastore. The filestore uses another datastore to store the protocol buffer encoded DataPtr. By default this is the leveldb datastore, as the size fo the encoded DataPtr is small. Towards #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Note that as per issue #2259 content is not verified for local retrieval so changing the file content will not always be detected. Towards #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Towards #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

In merkledag.Node and blocks.Block maintain a DataPtr License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

whyrusleeping · 2016-04-27T23:19:53Z

This is looking promising, there are definitely a few problems we're going to have to think through, but this is progress.

The first thing I suggest is to not add an extra parameters to the blockstore methods, instead we can look at making Block an interface, and having the current implementation be rawBlock or something, and then your filedataBlock can also implement it. We can do type switching internally to handle things differently.

kevina · 2016-04-28T00:07:02Z

@whyrusleeping I am not sure I am completely following. Are you saying you would like to see something like the filestore.DataWOpts for blocks also and have filedataBlock also contain the add options to avoid adding the addOpts paramater to the blockstore methods?

whyrusleeping · 2016-04-28T00:11:13Z

@kevina yeah, basically. The idea is also to not have to do weird augments to the types like youre doing to Block. Instead, make Block an interface, and create a completely separate type for what youre doing.

kevina · 2016-04-28T00:35:53Z

@whyrusleeping I see what you are saying about not adding the DataPtr field to Block.

Are you also against adding a generic addOpts parameter to the blockstore methods as a way of passing additional parameters when adding blocks? The addOpts parameter is a generic interface{} and can simply be set to nil when not used. Having a way to pass generic parameters to the Add() and Put() methods seams like a good thing to have going forward. In fact, I was going to propose that the datastore methods also accept a generic opts paramater, but it would of touched too much code and would of changed the Datastore interface.

In the future I might want a way to pass additional parameters to the Get() methods. With the Add() and Put() methods I can hack option passing by including the parameters with the data object to be added (like I did with filestore.DataWOpts). With the Get() methods such as hack is not possible unless you also make Key an interface.

whyrusleeping · 2016-04-28T01:51:24Z

I am also against adding extra parameters to the existing methods. I think keeping the methods simple and using the type system for conveying extra information is a better option.

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

kevina · 2016-04-28T07:19:42Z

@whyrusleeping okay. I just pushed two commits to make blocks.Block an interface and remove the addOpts parameter from blockstore methods.

whyrusleeping · 2016-04-28T16:52:40Z

repo/repo.go

@@ -25,6 +25,8 @@ type Repo interface {
 	// SetAPIAddr sets the API address in the repo.
 	SetAPIAddr(addr string) error

+	Self() Repo


why is this needed?

I needed a way to be able to get to the FSRepo and hence get to the filestore.Datastore in the next commit (4ef5531). In this commit node.Repo.(*fsrepo.FSRepo) does not work as node.Repo is repo.ref so I needed to use add the Self() method and use node.Repo.Self().(*fsrepo.FSRepo)

the node.Repo should either be an fsrepo or a mock repo. It should work with a simple type assertion

In this commit node.Repo.(*fsrepo.FSRepo) does not work as node.Repo is repo.ref

I'm not sure what you mean by this. what is repo.ref?

node.Repo is not a FSRepo it is a repo.ref. repo.ref is defined here: https://github.com/ipfs/go-ipfs/blob/c067fb9e83e89cf04226d2c43de7c6fd5ebbccd2/repo/onlyone.go#L50. It may be easier for you to just try it without the Self() and see for yourself.

Ah, I see now. I incorrectly assumed the fsrepo.Open constructor returned an fsrepo

kevina · 2016-04-29T03:04:32Z

@jefft0 if you want to do some preliminary testing now would be a good time

Here is what you can try:

Add a file with ipfs add --no-copy. Try to download the hash from a different ipfs node (such as the gateway). Look at the contents of the filestore with ipfs filestore ls. Change the contents of a file or move a file and run ipfs filestore verify. Use ipfs filestore rm-invalid to clean up any invalid objects due to the file changing.

There is basic help now available for the ipfs filestore commands, please try it out and let me know if my text makes sense.

Objects are pinned as usual when adding, but removing the pin and running ipfs repo gc will not remove the object from the filestore. That is because I am unsure if this is a desirable thing to do as the block stubs take very little space. (It is also unclear if filestore objects should be pinned at all.) So for now objects need to be removed manually with ipfs filestore rm <hash>. This command will also unpin the file if necessary.

Add tests for: filestore ls filestore verify filestore rm-invalid filestore rm Also rename t0046-add-no-copy.sh to t0260-filestore.sh. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

jefft0 · 2016-04-29T13:29:47Z

Boy, that's a lot of work in a short amount of time! add --no-copy and the filestore comands work for me. The --help is useful and makes sense to me.

I found one counter-intuitive thing. I made a short hello file and did ipfs add --no-copy hello.txt. After adding more text to the end of the file, ipfs filestore verify still reports OK. I understand that filestore saves a slice and the slice is still good. But I want to get the hashes of the files I have on disk and ask IPFS if it has them. Now the hashes are different. Do you see the issue? What do you think? Is there a middle-ground solution?

kevina · 2016-04-29T15:22:26Z

@jefft0 Thanks for testing it out!

Do you see the issue? What do you think? Is there a middle-ground solution?

I don't see it as a major issue. The objects are actually still valid, if you try to get the old object the contents would not have changed. "ipfs filestore verify" is meant to verify individual blocks not complete files. It should be fairly easy to add a command that will verify files and check that the "WholeFile" flag is correct, but that is a low priority for me right now.

jefft0 · 2016-04-29T15:39:19Z

OK. Is it ready for me to stress test with a bunch of 200 MB video files (my use case)? If you're still tweaking performance, I'll hold off.

Simplify files.File interface by combining Offset() and AbsPath() methods into one that return a files.ExtraInfo interface that can be extended with additional information. Simplify chunk.Splitter by returning a Bytes struct in the NextBytes() method. This eliminates the need for the AbsPath() method and the need to return the data and offset separately License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

…der. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

kevina · 2016-04-29T16:04:50Z

OK. Is it ready for me to stress test with a bunch of 200 MB video files (my use case)? If you're still tweaking performance, I'll hold off.

I have not tuned it for performance at all but I would be interested in how well it performs on a large file.

kevina · 2016-04-29T16:09:07Z

@whyrusleeping

I think instead of adding an offset here to the return (which will pretty much only be used by this codepath). We should change the []byte return to be some type that allows us to either return data, or to return a file piece, similar to the change made to the block object. This new type (whatever we end up making) can also replace the data field in the merkledag.Node object to simplify the DataPtr stuff you've got going.

I just pushed some commits that basically implemented this and simplified a lot of code. The DataPtr stuff in merkledag.Node contains different information so I can not use this new type for that.

…eader. Remove ExtraInfo() method from the files.File interface as it not strictly necessary. Also add SetExtraInfo() to AdvReader to eliminate the need for the NewReaderWaddOpts wrapper. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

jefft0 · 2016-04-30T08:38:53Z

I would be interested in how well it performs on a large file.

On my MacBook Pro, I added a folder with 24 video files totaling 3GB. ipfs add -r takes 33 seconds and ipfs add -r --no-copy only takes 18 seconds.

vitzli · 2016-05-01T06:52:06Z

I found weird bug/glitch in filestore verify/ls: it outputs the hash that does not start with Qm… (current version), could somebody verify it?
Environment: Debian Stretch, current version, amd64, recent version from kevina/go-ipfs, branch issue-875-wip (built couple hours ago).

It happened when I tried to add/verify ceph-base_10.2.0-1~bpo80%2b1_amd64.deb, (copy for archival reasons):

user@deb-stretch:~$ ipfs add --no-copy ceph-base_10.2.0-1~bpo80+1_amd64.deb 
added QmU1bhY1HN4QScSnJNQRQ5qmnpwfZMA3bDs2fhdCkz2rkr ceph-base_10.2.0-1~bpo80+1_amd64.deb

same with 

user@deb-stretch:~$ ipfs add -n ceph-base_10.2.0-1~bpo80+1_amd64.deb 
added QmU1bhY1HN4QScSnJNQRQ5qmnpwfZMA3bDs2fhdCkz2rkr ceph-base_10.2.0-1~bpo80+1_amd64.deb

but ipfs filestore ls produces:

…
QmXwJZRKWpbEDCsxAq7N4jT5Wu7N78HgLnwFn9WhMN3GvM leaf  /home/user/ceph-base_10.2.0-1~bpo80+1_amd64.deb 38797312 262144
6PKoQ1TrGRFrN7Ch9t9fQUmFLEYHqa7r3YhtrPU9eYvKd leaf  /home/user/ceph-base_10.2.0-1~bpo80+1_amd64.deb 34865152 262144
QmYCKEbjsC6VEWmZp22dVxUJnNYhWv2R2xDU9CeZ9zmrQ8 leaf  /home/user/ceph-base_10.2.0-1~bpo80+1_amd64.deb 45350912 262144
…

verify command return "ok" for this block.

The block could be extracted with dd (offset 34865152 is skip=133 with bs=262144):

dd if=ceph-base_10.2.0-1~bpo80+1_amd64.deb of=badblock-6PKo.bin skip=133 bs=262144 count=1

badblock-6PKo.bin - extracted block

Doing add -n badblock-6PKo.bin gives:

added QmXynZ3WT44YtANcECmEWdYJFtm7KPBDUEwVrzAqAjtMCi badblock-6PKo.bin

and add --no-copy badblock-6PKo.bin gives the same result.

BUT, doing ipfs filestore ls on it returns the same weird hash:

user@deb-stretch:~$ ipfs filestore ls
6PKoQ1TrGRFrN7Ch9t9fQUmFLEYHqa7r3YhtrPU9eYvKd leaf  /home/user/badblock-6PKo.bin - 262144

For filestore rm: doing rm 6PKo… removes the block without "unpinnig" text, but block disappears from ls output, while filestore rm QmXyn… says "unpinng" and removes the block from ls/verify.

kevina · 2016-05-01T07:03:26Z

@vitzli this is due to issue #2601 and #994. It is very annoying but nothing I can control directly. Please direct discussion to one of those bug reports. If this becomes a major problem I am thinking about hacking around the problem in my code, but am holding off for now.

kevina · 2016-05-01T07:11:32Z

@jefft0. I am glad to hear that adding it is faster. Retrieving locally is a bit slower, but my informal test have determined that is due to always verifying the hash. A better solution might be use modification times and only verify when the file's modification timestamp has changed.

jefft0 · 2016-05-05T16:49:58Z

@kevina, If I already did ipfs add cats.mp4 and then do ipfs add --no-copy cats.mp4 should it replace the copied values in the block store with links to cats.mp4 (maybe to recover space in the block store)? Conversely, if I already did ipfs add --no-copy cats.mp4 and then do ipfs add cats.mp4 it will add the blocks to the block store. In this case, does ipfs get read blocks from the block store (ignoring the links to cats.mp4)?

whyrusleeping · 2016-05-05T17:02:55Z

blocks/blockstore/write_cache.go

 	for _, b := range bs {
-		if _, ok := w.cache.Get(b.Key()); !ok {
+		// Don't cache "advance" blocks
+		if _, ok := b.(*blocks.BasicBlock); ok {


This check doesnt need to be just for a single block type. If we have a given block in the flatfs store, theres no need to add a filestore reference to it as well.

The reason for this change is related to #2600 (comment)

whyrusleeping · 2016-05-05T17:48:01Z

I think the next step here is going to be to break up the changes in this PR and start working some of the 'framework' stuff into the codebase. For example: the change to the blocks.Block type could be extracted and merged in separately

kevina · 2016-05-05T18:02:26Z

@kevina, If I already did ipfs add cats.mp4 and then do ipfs add --no-copy cats.mp4 should it replace the copied values in the block store with links to cats.mp4 (maybe to recover space in the block store)?

At the moment the blocks will be in both datastores. I am working on doing something about this.

kevina · 2016-05-05T18:04:41Z

I think the next step here is going to be to break up the changes in this PR and start working some of the 'framework' stuff into the codebase. For example: the change to the blocks.Block type could be extracted and merged in separately

Okay, I will start by separating out the block.Block change to an interface type as I would like to see that go in to prevent bitrot.

whyrusleeping · 2016-05-05T18:47:44Z

as I would like to see that go in to prevent bitrot.

100% agreed, avoiding stagnation is good

kevina · 2016-05-06T05:37:12Z

Closing and creating new request, See #2634.

whyrusleeping self-assigned this Apr 26, 2016

kevina added 14 commits April 27, 2016 05:48

Add Offset() method to files.File and create new AdvReader interface.

27ec622

Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Add AbsPath() to files.File interface.

c410ea8

Also change other paths to be absolute. Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Add extra parameter to indicate how the content should be added.

92c13ba

Required for #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Add basic tests for "add --no-copy".

2b8704e

Note that as per issue #2259 content is not verified for local retrieval so changing the file content will not always be detected. Towards #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

When reconstructing block always verify the result.

f26c2df

Towards #875. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Bug fix to commit e92e9da:

8ff6185

In merkledag.Node and blocks.Block maintain a DataPtr License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Add Basic Query and "Direct" commands to filestore. Needs Testing.

955e763

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Save ref to Filestore in repo for future direct access.

14601a2

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Add Self() method to be able to get to FSRepo.

31e2e21

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Add "filestore" commands to list contents and verify filestore.

4ef5531

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

kevina added 2 commits April 28, 2016 03:13

Make blocks.Block an interface.

a999a72

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

Remove addOpts paramater from blockstore methods.

7addf3e

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

whyrusleeping reviewed Apr 28, 2016
View reviewed changes

Add sharness tests for new "filestore" commands.

dd46d8d

Add tests for: filestore ls filestore verify filestore rm-invalid filestore rm Also rename t0046-add-no-copy.sh to t0260-filestore.sh. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

kevina added 2 commits April 29, 2016 11:59

Eliminate separate addOpts parameter by pushing the info into the rea…

1e21dbe

…der. License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

kevina added 2 commits April 29, 2016 21:01

Report an error when trying to use "--no-copy" when daemon is online.

56a6b43

License: MIT Signed-off-by: Kevin Atkinson <[email protected]>

whyrusleeping reviewed May 5, 2016
View reviewed changes

This was referenced May 5, 2016

Make blocks.Block an interface. #2633

Merged

[WIP] Filestore Implementation #2634

Closed

kevina closed this May 6, 2016

kevina mentioned this pull request May 24, 2016

Cannot restart daemon after files API ops and repo gc #2698

Closed

kevina mentioned this pull request Aug 26, 2016

Stable bits from Pull Request #2634 #2792

Merged

kevina mentioned this pull request Oct 16, 2016

Add PosInfo() method to blocks and merkeldag nodes. #3309

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600

[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600

kevina commented Apr 22, 2016 •

edited

Loading

jefft0 commented Apr 26, 2016

whyrusleeping commented Apr 26, 2016

kevina commented Apr 26, 2016 •

edited

Loading

whyrusleeping commented Apr 27, 2016

kevina commented Apr 28, 2016

whyrusleeping commented Apr 28, 2016

kevina commented Apr 28, 2016

whyrusleeping commented Apr 28, 2016

kevina commented Apr 28, 2016 •

edited

Loading

whyrusleeping Apr 28, 2016

kevina Apr 28, 2016

whyrusleeping Apr 28, 2016

kevina Apr 28, 2016

whyrusleeping Apr 28, 2016

kevina commented Apr 29, 2016

jefft0 commented Apr 29, 2016

kevina commented Apr 29, 2016

jefft0 commented Apr 29, 2016

kevina commented Apr 29, 2016

kevina commented Apr 29, 2016

jefft0 commented Apr 30, 2016

vitzli commented May 1, 2016

kevina commented May 1, 2016 •

edited

Loading

kevina commented May 1, 2016 •

edited

Loading

jefft0 commented May 5, 2016

whyrusleeping May 5, 2016

kevina May 5, 2016

whyrusleeping commented May 5, 2016

kevina commented May 5, 2016

kevina commented May 5, 2016

whyrusleeping commented May 5, 2016 •

edited

Loading

kevina commented May 6, 2016

[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600

[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600

Conversation

kevina commented Apr 22, 2016 • edited Loading

jefft0 commented Apr 26, 2016

whyrusleeping commented Apr 26, 2016

kevina commented Apr 26, 2016 • edited Loading

whyrusleeping commented Apr 27, 2016

kevina commented Apr 28, 2016

whyrusleeping commented Apr 28, 2016

kevina commented Apr 28, 2016

whyrusleeping commented Apr 28, 2016

kevina commented Apr 28, 2016 • edited Loading

whyrusleeping Apr 28, 2016

Choose a reason for hiding this comment

kevina Apr 28, 2016

Choose a reason for hiding this comment

whyrusleeping Apr 28, 2016

Choose a reason for hiding this comment

kevina Apr 28, 2016

Choose a reason for hiding this comment

whyrusleeping Apr 28, 2016

Choose a reason for hiding this comment

kevina commented Apr 29, 2016

jefft0 commented Apr 29, 2016

kevina commented Apr 29, 2016

jefft0 commented Apr 29, 2016

kevina commented Apr 29, 2016

kevina commented Apr 29, 2016

jefft0 commented Apr 30, 2016

vitzli commented May 1, 2016

kevina commented May 1, 2016 • edited Loading

kevina commented May 1, 2016 • edited Loading

jefft0 commented May 5, 2016

whyrusleeping May 5, 2016

Choose a reason for hiding this comment

kevina May 5, 2016

Choose a reason for hiding this comment

whyrusleeping commented May 5, 2016

kevina commented May 5, 2016

kevina commented May 5, 2016

whyrusleeping commented May 5, 2016 • edited Loading

kevina commented May 6, 2016

kevina commented Apr 22, 2016 •

edited

Loading

kevina commented Apr 26, 2016 •

edited

Loading

kevina commented Apr 28, 2016 •

edited

Loading

kevina commented May 1, 2016 •

edited

Loading

kevina commented May 1, 2016 •

edited

Loading

whyrusleeping commented May 5, 2016 •

edited

Loading