
[WIP] Towards Issue #875, avoid duplicating files added to ipfs #2600

Closed
wants to merge 32 commits into from

Conversation

@kevina (Contributor) commented Apr 22, 2016

NOT READY FOR MERGE

This is a work in progress, but I wanted to get some early feedback on my work towards #875. Implementing this feature touched a lot of code and requires some API changes.

Basic adding of files without storing the data now works. This currently needs to be done when the node is offline. To use:

add --no-copy <some local file>

will add the file without copying the blocks. If the file is moved or changed, then any blocks created from that file will become invalid.

Notes for review:

  • The first two commits will not be in the final pull request. The commit "Refactor Makefile" has been submitted separately, and the commit "Temporary disable a failing test" should be dealt with separately.
  • The commits marked "Required for avoid duplicating files added to ipfs #875" are the ones I want early feedback on right now.
  • Some of the commits directly modify go-datastore in the Godeps/ directory. I did this for simplicity and also talked to @whyrusleeping on IRC about it. This will be cleaned up before the final pull request.

Move the go commands that should run under cmd/ipfs into the Makefile in
cmd/ipfs rather than doing a "cd cmd/ipfs && go ..." in the root
Makefile.

The "cd cmd/ipfs && go ..." lines cause problems with GNU Emacs's
compilation mode.  With the current setup Emacs is unable to jump to
the location of an error reported by the Go compiler because it cannot
find the source file.  The problem is that the embedded "cd" command
causes Emacs's compilation mode to lose track of the current directory,
so it looks for the source file in the wrong directory.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@jefft0 (Contributor) commented Apr 26, 2016

Hi Kevin. Has someone looked at your pull request? Do you need someone to try it? (Or wait for it to be more baked?)

@whyrusleeping (Member):

I've looked over a bit of it, holding off on leaving any comments or doing a serious review until it's asked for.

@kevina (Contributor, author) commented Apr 26, 2016

@whyrusleeping Please do a serious review of what currently functions, keeping in mind this is a work in progress. Here is what works as of now:

  • ipfs add --no-copy FILE adds a file without copying in offline mode. If the server is not offline you will get an error about a non-absolute pathname.
  • ipfs filestore ls lists the files in the filestore.
  • ipfs filestore verify verifies the contents of the filestore.

There is still lots to do, but I am hesitant to put in too much more work until I get some feedback that I am heading down the right path.

@jefft0 yes please try it out

Both: Commits up to (and including) f26c2df are stable, and I won't do any forced updates until this is closer to getting into master. Commits after that are newer and slightly less stable, and I might do a forced update if I discover a problem (for example, failing tests). If this is a problem for either of you, let me know.

Edit: I had to rebase to correct a mistake, now everything up to (and including) f26c2df should be stable.

@whyrusleeping whyrusleeping self-assigned this Apr 26, 2016
kevina added 14 commits April 27, 2016 05:48
Required for #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
Also change other paths to be absolute.

Required for #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
Required for #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
None of the other methods in the measure package return this error;
instead they only call RecordValue() when the value is []byte.  This
change makes batch Put consistent with the other methods and allows
non-[]byte data to be passed through the measure datastore.

Required for #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
The DataPtr points to the location of the data within a file on the
file system.  If the node is a leaf it also contains an alternative
serialization of the Node or Block that does not contain the data.
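The shape of such a pointer might look like the following sketch. The field names here are assumptions for illustration, not necessarily those used in this PR:

```go
package main

import "fmt"

// DataPtr records where a leaf's data lives on disk instead of holding
// the bytes themselves; AltData carries the node serialized without its
// data. (Illustrative field names, not the PR's actual layout.)
type DataPtr struct {
	AltData  []byte // leaf node/block serialized without its data
	FilePath string // absolute path to the backing file
	Offset   uint64 // byte offset of the data within the file
	Size     uint64 // length of the data segment
}

// Span returns a human-readable description of the referenced region.
func (p DataPtr) Span() string {
	return fmt.Sprintf("%s@%d+%d", p.FilePath, p.Offset, p.Size)
}

func main() {
	p := DataPtr{FilePath: "/home/user/cats.mp4", Offset: 262144, Size: 262144}
	fmt.Println(p.Span())
}
```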

Required for #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
The datastore has an optional "advanced" datastore that handles Put
requests for non-[]byte values, a "normal" datastore that handles all
other Put requests, and then any number of other datastores, some of
which can be designated read-only.  Delete requests are passed on
to all datastores not designated read-only.

For now, querying will only work on a "normal" datastore.

Note: Only tested in the case of just a "normal" datastore and the
case of an "advanced" and "normal" datastore.
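The dispatch rule described above can be sketched as follows, with the datastore interface heavily simplified (the real code uses the go-datastore interfaces):

```go
package main

import (
	"errors"
	"fmt"
)

// Put calls carrying non-[]byte values go to the "advanced" datastore,
// all other Puts go to the "normal" one, and Deletes fan out to every
// datastore not designated read-only. (Sketch only.)
type Datastore interface {
	Put(key string, value interface{}) error
	Delete(key string) error
}

type memDS struct{ m map[string]interface{} }

func (d *memDS) Put(k string, v interface{}) error { d.m[k] = v; return nil }
func (d *memDS) Delete(k string) error             { delete(d.m, k); return nil }

type multiDS struct {
	advanced Datastore
	normal   Datastore
	others   []Datastore
	readOnly map[Datastore]bool // membership marks a store read-only
}

func (m *multiDS) Put(key string, value interface{}) error {
	if _, isBytes := value.([]byte); !isBytes {
		if m.advanced == nil {
			return errors.New("no datastore handles non-[]byte values")
		}
		return m.advanced.Put(key, value)
	}
	return m.normal.Put(key, value)
}

func (m *multiDS) Delete(key string) error {
	all := append([]Datastore{m.advanced, m.normal}, m.others...)
	for _, d := range all {
		if d == nil || m.readOnly[d] {
			continue
		}
		if err := d.Delete(key); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	adv := &memDS{m: map[string]interface{}{}}
	norm := &memDS{m: map[string]interface{}{}}
	m := &multiDS{advanced: adv, normal: norm, readOnly: map[Datastore]bool{}}
	m.Put("raw", []byte("data"))     // routed to the normal store
	m.Put("ptr", struct{ X int }{1}) // routed to the advanced store
	fmt.Println(len(norm.m), len(adv.m))
}
```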

Towards #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
This involved:

1) Constructing an alternative data object that instead of raw bytes is
a DataPtr with information on where the data is on the file system and
enough other information in AltData to reconstruct the Merkle-DAG node.

2) A new datastore "filestore" that stores just the information in
DataPtr.  When retrieving blocks the Merkle-DAG node is reconstructed
from combining AltData with the data from the file in the file system.
Because the datastore needs to reconstruct the node it needs access to
the Protocol Buffers for "merkledag" and "unixfs" and thus, for now,
lives in go-ipfs instead of go-datastore.

The filestore uses another datastore to store the protocol buffer
encoded DataPtr.  By default this is the leveldb datastore, as the size
of the encoded DataPtr is small.

Towards #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
Note that as per issue #2259 content is not verified for local
retrieval so changing the file content will not always be detected.

Towards #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
Towards #875.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
In merkledag.Node and blocks.Block maintain a DataPtr

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@whyrusleeping (Member):

This is looking promising, there are definitely a few problems we're going to have to think through, but this is progress.

The first thing I suggest is to not add extra parameters to the blockstore methods; instead we can look at making Block an interface, having the current implementation be rawBlock or something, and then your filedataBlock can also implement it. We can do type switching internally to handle things differently.
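The suggestion above can be sketched as follows. All names here (BasicBlock, FilestoreBlock, describePut) are illustrative, not the PR's final identifiers:

```go
package main

import "fmt"

// Block becomes an interface, the existing implementation becomes a
// plain "raw" type, and the filestore-backed block is a separate type
// implementing the same interface; callers type-switch internally.
type Block interface {
	Data() []byte
}

type BasicBlock struct{ data []byte }

func (b *BasicBlock) Data() []byte { return b.data }

type FilestoreBlock struct {
	data     []byte
	FilePath string
	Offset   uint64
}

func (b *FilestoreBlock) Data() []byte { return b.data }

// describePut shows the internal type switch: the blockstore can treat
// a file-backed block differently without extra method parameters.
func describePut(b Block) string {
	switch b.(type) {
	case *FilestoreBlock:
		return "store a reference to the file"
	default:
		return "store the raw bytes"
	}
}

func main() {
	fmt.Println(describePut(&BasicBlock{data: []byte("x")}))
	fmt.Println(describePut(&FilestoreBlock{FilePath: "/home/user/cats.mp4"}))
}
```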

@kevina (Contributor, author) commented Apr 28, 2016

@whyrusleeping I am not sure I am completely following. Are you saying you would like to see something like filestore.DataWOpts for blocks as well, and have filedataBlock also contain the add options, to avoid adding the addOpts parameter to the blockstore methods?

@whyrusleeping (Member):

@kevina yeah, basically. The idea is also to not have to do weird augments to the types like you're doing to Block. Instead, make Block an interface, and create a completely separate type for what you're doing.

@kevina (Contributor, author) commented Apr 28, 2016

@whyrusleeping I see what you are saying about not adding the DataPtr field to Block.

Are you also against adding a generic addOpts parameter to the blockstore methods as a way of passing additional parameters when adding blocks? The addOpts parameter is a generic interface{} and can simply be set to nil when not used. Having a way to pass generic parameters to the Add() and Put() methods seems like a good thing to have going forward. In fact, I was going to propose that the datastore methods also accept a generic opts parameter, but it would have touched too much code and would have changed the Datastore interface.

In the future I might want a way to pass additional parameters to the Get() methods. With the Add() and Put() methods I can hack option passing by including the parameters with the data object to be added (like I did with filestore.DataWOpts). With the Get() methods such a hack is not possible unless you also make Key an interface.
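The wrapping hack described above can be sketched like this. DataWOpts is the name used in this PR, but the field layout and the unwrapping helper here are assumptions for illustration:

```go
package main

import "fmt"

// Rather than adding an addOpts parameter to Put, the options ride
// along inside the value itself and the receiving datastore unwraps
// them before storing.
type DataWOpts struct {
	Data    interface{}
	AddOpts interface{}
}

// put unwraps options if present and returns them alongside the value
// that would actually be stored.
func put(value interface{}) (stored interface{}, opts interface{}) {
	if w, ok := value.(DataWOpts); ok {
		return w.Data, w.AddOpts
	}
	return value, nil
}

func main() {
	stored, opts := put(DataWOpts{Data: []byte("block"), AddOpts: "no-copy"})
	fmt.Printf("stored %d bytes with opts %v\n", len(stored.([]byte)), opts)
}
```

As the comment notes, this works for Put because the value is an interface the caller controls, but there is no equivalent place to smuggle options into a Get(key) call unless Key also becomes an interface.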

@whyrusleeping (Member):

I am also against adding extra parameters to the existing methods. I think keeping the methods simple and using the type system for conveying extra information is a better option.

kevina added 2 commits April 28, 2016 03:13
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@kevina (Contributor, author) commented Apr 28, 2016

@whyrusleeping okay. I just pushed two commits to make blocks.Block an interface and remove the addOpts parameter from blockstore methods.

@@ -25,6 +25,8 @@ type Repo interface {
// SetAPIAddr sets the API address in the repo.
SetAPIAddr(addr string) error

Self() Repo
Member:

why is this needed?

Contributor Author:
I needed a way to get to the FSRepo, and hence to the filestore.Datastore, in the next commit (4ef5531). In this commit node.Repo.(*fsrepo.FSRepo) does not work, as node.Repo is a repo.ref, so I needed to add the Self() method and use node.Repo.Self().(*fsrepo.FSRepo).

Member:

the node.Repo should either be an fsrepo or a mock repo. It should work with a simple type assertion

In this commit node.Repo.(*fsrepo.FSRepo) does not work as node.Repo is repo.ref

I'm not sure what you mean by this. what is repo.ref?

Contributor Author:

node.Repo is not a FSRepo, it is a repo.ref. repo.ref is defined here: https://github.com/ipfs/go-ipfs/blob/c067fb9e83e89cf04226d2c43de7c6fd5ebbccd2/repo/onlyone.go#L50. It may be easier for you to just try it without Self() and see for yourself.
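The situation can be sketched with heavily simplified types (the real repo.ref wraps more state; the point is only why the direct assertion fails and Self() works):

```go
package main

import "fmt"

// fsrepo.Open returns a wrapper (repo.ref) around the *FSRepo, so a
// direct type assertion on node.Repo fails; a Self() method lets
// callers reach the underlying concrete repo.
type Repo interface {
	Self() Repo
}

type FSRepo struct{ path string }

func (r *FSRepo) Self() Repo { return r }

// ref mirrors repo.ref: a wrapper that delegates to the wrapped repo.
type ref struct{ inner Repo }

func (r *ref) Self() Repo { return r.inner.Self() }

func main() {
	var nodeRepo Repo = &ref{inner: &FSRepo{path: "/home/user/.ipfs"}}

	_, direct := nodeRepo.(*FSRepo) // fails: the concrete type is *ref
	_, viaSelf := nodeRepo.Self().(*FSRepo)
	fmt.Println(direct, viaSelf)
}
```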

Member:

Ah, I see now. I incorrectly assumed the fsrepo.Open constructor returned an fsrepo

@kevina (Contributor, author) commented Apr 29, 2016

@jefft0 if you want to do some preliminary testing now would be a good time

Here is what you can try:

  • Add a file with ipfs add --no-copy.
  • Try to download the hash from a different ipfs node (such as the gateway).
  • Look at the contents of the filestore with ipfs filestore ls.
  • Change the contents of a file, or move a file, and run ipfs filestore verify.
  • Use ipfs filestore rm-invalid to clean up any invalid objects left by the file changing.

There is basic help now available for the ipfs filestore commands, please try it out and let me know if my text makes sense.

Objects are pinned as usual when adding, but removing the pin and running ipfs repo gc will not remove the object from the filestore. That is because I am unsure if this is a desirable thing to do as the block stubs take very little space. (It is also unclear if filestore objects should be pinned at all.) So for now objects need to be removed manually with ipfs filestore rm <hash>. This command will also unpin the file if necessary.

Add tests for:
  filestore ls
  filestore verify
  filestore rm-invalid
  filestore rm

Also rename t0046-add-no-copy.sh to t0260-filestore.sh.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@jefft0 (Contributor) commented Apr 29, 2016

Boy, that's a lot of work in a short amount of time! add --no-copy and the filestore commands work for me. The --help is useful and makes sense to me.

I found one counter-intuitive thing. I made a short hello file and did ipfs add --no-copy hello.txt. After adding more text to the end of the file, ipfs filestore verify still reports OK. I understand that filestore saves a slice and the slice is still good. But I want to get the hashes of the files I have on disk and ask IPFS if it has them. Now the hashes are different. Do you see the issue? What do you think? Is there a middle-ground solution?

@kevina (Contributor, author) commented Apr 29, 2016

@jefft0 Thanks for testing it out!

Do you see the issue? What do you think? Is there a middle-ground solution?

I don't see it as a major issue. The objects are actually still valid: if you try to get the old object, the contents will not have changed. ipfs filestore verify is meant to verify individual blocks, not complete files. It should be fairly easy to add a command that verifies files and checks that the "WholeFile" flag is correct, but that is a low priority for me right now.

@jefft0 (Contributor) commented Apr 29, 2016

OK. Is it ready for me to stress test with a bunch of 200 MB video files (my use case)? If you're still tweaking performance, I'll hold off.

kevina added 2 commits April 29, 2016 11:59
Simplify the files.File interface by combining the Offset() and
AbsPath() methods into one that returns a files.ExtraInfo interface
that can be extended with additional information.

Simplify chunk.Splitter by returning a Bytes struct from the
NextBytes() method.  This eliminates the need for the AbsPath() method
and the need to return the data and offset separately.
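The Splitter change can be sketched as follows; the field names on Bytes and the fixed-size splitter here are illustrative, not the PR's exact code:

```go
package main

import "fmt"

// NextBytes returns a single Bytes struct carrying the chunk together
// with its position info, instead of returning data and offset
// separately and exposing AbsPath() as its own method.
type Bytes struct {
	Data    []byte
	Offset  uint64
	AbsPath string
}

type sizeSplitter struct {
	data    []byte
	size    int
	offset  uint64
	absPath string
}

// NextBytes returns the next chunk, or ok=false when exhausted.
func (s *sizeSplitter) NextBytes() (b Bytes, ok bool) {
	if len(s.data) == 0 {
		return Bytes{}, false
	}
	n := s.size
	if n > len(s.data) {
		n = len(s.data)
	}
	b = Bytes{Data: s.data[:n], Offset: s.offset, AbsPath: s.absPath}
	s.data = s.data[n:]
	s.offset += uint64(n)
	return b, true
}

func main() {
	s := &sizeSplitter{data: []byte("abcdef"), size: 4, absPath: "/tmp/f"}
	for b, ok := s.NextBytes(); ok; b, ok = s.NextBytes() {
		fmt.Printf("%q at offset %d in %s\n", b.Data, b.Offset, b.AbsPath)
	}
}
```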

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@kevina (Contributor, author) commented Apr 29, 2016

OK. Is it ready for me to stress test with a bunch of 200 MB video files (my use case)? If you're still tweaking performance, I'll hold off.

I have not tuned it for performance at all but I would be interested in how well it performs on a large file.

@kevina (Contributor, author) commented Apr 29, 2016

@whyrusleeping

I think instead of adding an offset here to the return (which will pretty much only be used by this codepath). We should change the []byte return to be some type that allows us to either return data, or to return a file piece, similar to the change made to the block object. This new type (whatever we end up making) can also replace the data field in the merkledag.Node object to simplify the DataPtr stuff you've got going.

I just pushed some commits that basically implemented this and simplified a lot of code. The DataPtr stuff in merkledag.Node contains different information, so I cannot use this new type for that.

kevina added 2 commits April 29, 2016 21:01
…eader.

Remove the ExtraInfo() method from the files.File interface as it is
not strictly necessary.

Also add SetExtraInfo() to AdvReader to eliminate the need for the
NewReaderWaddOpts wrapper.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@jefft0 (Contributor) commented Apr 30, 2016

I would be interested in how well it performs on a large file.

On my MacBook Pro, I added a folder with 24 video files totaling 3GB. ipfs add -r takes 33 seconds and ipfs add -r --no-copy only takes 18 seconds.

@vitzli (Contributor) commented May 1, 2016

I found a weird bug/glitch in filestore verify/ls: it outputs a hash that does not start with Qm… — could somebody verify it?
Environment: Debian Stretch, amd64, recent version from kevina/go-ipfs, branch issue-875-wip (built a couple of hours ago).

It happened when I tried to add/verify ceph-base_10.2.0-1~bpo80%2b1_amd64.deb, (copy for archival reasons):

user@deb-stretch:~$ ipfs add --no-copy ceph-base_10.2.0-1~bpo80+1_amd64.deb 
added QmU1bhY1HN4QScSnJNQRQ5qmnpwfZMA3bDs2fhdCkz2rkr ceph-base_10.2.0-1~bpo80+1_amd64.deb

same with 

user@deb-stretch:~$ ipfs add -n ceph-base_10.2.0-1~bpo80+1_amd64.deb 
added QmU1bhY1HN4QScSnJNQRQ5qmnpwfZMA3bDs2fhdCkz2rkr ceph-base_10.2.0-1~bpo80+1_amd64.deb

but ipfs filestore ls produces:

…
QmXwJZRKWpbEDCsxAq7N4jT5Wu7N78HgLnwFn9WhMN3GvM leaf  /home/user/ceph-base_10.2.0-1~bpo80+1_amd64.deb 38797312 262144
6PKoQ1TrGRFrN7Ch9t9fQUmFLEYHqa7r3YhtrPU9eYvKd leaf  /home/user/ceph-base_10.2.0-1~bpo80+1_amd64.deb 34865152 262144
QmYCKEbjsC6VEWmZp22dVxUJnNYhWv2R2xDU9CeZ9zmrQ8 leaf  /home/user/ceph-base_10.2.0-1~bpo80+1_amd64.deb 45350912 262144
…

verify command return "ok" for this block.

The block could be extracted with dd (offset 34865152 is skip=133 with bs=262144):

dd if=ceph-base_10.2.0-1~bpo80+1_amd64.deb of=badblock-6PKo.bin skip=133 bs=262144 count=1

badblock-6PKo.bin - extracted block

Doing add -n badblock-6PKo.bin gives:

added QmXynZ3WT44YtANcECmEWdYJFtm7KPBDUEwVrzAqAjtMCi badblock-6PKo.bin

and add --no-copy badblock-6PKo.bin gives the same result.

BUT, doing ipfs filestore ls on it returns the same weird hash:

user@deb-stretch:~$ ipfs filestore ls
6PKoQ1TrGRFrN7Ch9t9fQUmFLEYHqa7r3YhtrPU9eYvKd leaf  /home/user/badblock-6PKo.bin - 262144

For filestore rm: doing rm 6PKo… removes the block without the "unpinning" text, but the block does disappear from the ls output, while filestore rm QmXyn… says "unpinning" and removes the block from ls/verify.

@kevina (Contributor, author) commented May 1, 2016

@vitzli this is due to issues #2601 and #994. It is very annoying, but nothing I can control directly. Please direct discussion to one of those bug reports. If this becomes a major problem I am thinking about hacking around it in my code, but am holding off for now.

@kevina (Contributor, author) commented May 1, 2016

@jefft0 I am glad to hear that adding is faster. Retrieving locally is a bit slower, but my informal tests suggest that is due to always verifying the hash. A better solution might be to use modification times and only verify when the file's modification timestamp has changed.

@jefft0 (Contributor) commented May 5, 2016

@kevina, If I already did ipfs add cats.mp4 and then do ipfs add --no-copy cats.mp4 should it replace the copied values in the block store with links to cats.mp4 (maybe to recover space in the block store)? Conversely, if I already did ipfs add --no-copy cats.mp4 and then do ipfs add cats.mp4 it will add the blocks to the block store. In this case, does ipfs get read blocks from the block store (ignoring the links to cats.mp4)?

for _, b := range bs {
	if _, ok := w.cache.Get(b.Key()); !ok {
		// Don't cache "advance" blocks
		if _, ok := b.(*blocks.BasicBlock); ok {
Member:

This check doesn't need to be just for a single block type. If we have a given block in the flatfs store, there's no need to add a filestore reference to it as well.

Contributor Author:

The reason for this change is related to #2600 (comment)

@whyrusleeping (Member):

I think the next step here is going to be to break up the changes in this PR and start working some of the 'framework' stuff into the codebase. For example: the change to the blocks.Block type could be extracted and merged in separately

@kevina (Contributor, author) commented May 5, 2016

@kevina, If I already did ipfs add cats.mp4 and then do ipfs add --no-copy cats.mp4 should it replace the copied values in the block store with links to cats.mp4 (maybe to recover space in the block store)?

At the moment the blocks will be in both datastores. I am working on doing something about this.

@kevina (Contributor, author) commented May 5, 2016

I think the next step here is going to be to break up the changes in this PR and start working some of the 'framework' stuff into the codebase. For example: the change to the blocks.Block type could be extracted and merged in separately

Okay, I will start by separating out the blocks.Block change to an interface type, as I would like to see that go in to prevent bitrot.

@whyrusleeping (Member) commented May 5, 2016

as I would like to see that go in to prevent bitrot.

100% agreed, avoiding stagnation is good

@kevina (Contributor, author) commented May 6, 2016

Closing and creating a new request; see #2634.
