avoid duplicating files added to ipfs #875
Yep, this can be implemented as either (a) a different repo altogether, or (b) just a different datastore. It should certainly be an advanced feature, as moving or modifying the original file at all would render the objects useless, so users should definitely know what they're doing. Note that it is impossible for ipfs to monitor changes constantly, as it may be shut down when the user modifies the files; this sort of thing requires an explicit intention to use it this way. An intermediate point might be to give ipfs a set of directories to watch/scan and make available locally, as in the sketch below. This may be cpu intensive (may require lots of hashing on each startup, etc). |
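The watch/scan idea above could look roughly like this; a minimal sketch assuming the third-party fsnotify library, with hypothetical directory names and a placeholder rehash step (none of this is in go-ipfs, and as noted, it only catches changes made while the watcher is running):

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// rehash is a placeholder for re-chunking and re-hashing a changed file.
func rehash(path string) { log.Println("would re-hash:", path) }

func main() {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Hypothetical set of directories the user asked ipfs to track.
	for _, dir := range []string{"/data/media", "/data/docs"} {
		if err := w.Add(dir); err != nil {
			log.Fatal(err)
		}
	}

	for ev := range w.Events {
		if ev.Op&(fsnotify.Write|fsnotify.Rename|fsnotify.Remove) != 0 {
			// Objects referencing the old bytes are now stale.
			rehash(ev.Name)
		}
	}
}
```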
The way git-annex deals with this is by moving the file to a hidden directory (.git/annex/objects) and replacing it with a symlink. It's freaking annoying to have all those symlinks there, but at least there's only one copy of the file. |
ipfs could track files the same way a media player tracks its media collection. |
Here's a disk usage plot when adding a large (~3.4 GiB) file:
~12 GiB used while adding the file. Halfway through I needed to kill the add. Somewhat related, is there a way to clean up the partially added stuff after killing it? UPDATE: it seems that … |
A couple of extra notes about the disk usage:
Anyway, I've heard you guys are working on a new repo backend, so I just added this for the sake of completeness. |
@rubiojr the disk space is being consumed by the eventlogs, which are on my short list for removal from ipfs. Check ~/.go-ipfs/logs. |
@whyrusleeping not in this case apparently:
6280 ldb files |
The leveldb files did not average 3.8 MiB each; some of them were smaller, in fact. My bad. |
Wow, that sucks. But it should be fixed quite soon; I just finished the migration tool to move block storage out of leveldb. |
Since this is a highly requested feature, can we get some proposals for how it would work with the present repo? |
My proposal would be a shallow repo that acts like an index of torrent files: it thinks it can serve a block until it tries to open the file from the underlying file system. I'm not sure how to manage chunking; saving something like a (path, offset, length) record per block might work. |
Piecing it together with the repo is trickier. Maybe it can be a special datastore that stores this index info in the flatfs, but delegates looking up the blocks on disk. Something like:

```go
// in shallowfs
// stores things only under dataRoot. dataRoot could be `/`.
// stores paths, offsets, and a hash in metadataDS.
func New(dataRoot string, metadataDS ds.Datastore) { ... }

// use
fds := flatfs.New(...)
sfs := shallowfs.New("/", fds)
```
|
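To make the delegation concrete, here is a minimal sketch of the lookup side under stated assumptions: the entry type, the readBlock helper, and the use of sha256 in place of a real multihash are all hypothetical illustrations, not part of the proposal above:

```go
package shallowfs

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"os"
)

// entry records where a block's bytes live on disk.
type entry struct {
	Path   string // absolute path under dataRoot
	Offset int64  // byte offset of the block within the file
	Length int    // block length in bytes
}

// readBlock re-reads a block from the original file and verifies it.
// If the file was moved or modified, the check fails and the block
// must be treated as missing -- exactly the caveat raised earlier.
func readBlock(e entry, want []byte) ([]byte, error) {
	f, err := os.Open(e.Path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, e.Length)
	if _, err := f.ReadAt(buf, e.Offset); err != nil {
		return nil, err
	}
	sum := sha256.Sum256(buf)
	if !bytes.Equal(sum[:], want) {
		return nil, errors.New("file changed since it was added")
	}
	return buf, nil
}
```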
Would be cool if Linux supported symlinks to segments of a file... |
Perhaps separating out the indexing operation (updating the hash→file-segment map) from actually adding files to the repo might work? The indexing could be done mostly separately from ipfs, and you'd be able to manually control what needs to be (re-)indexed. The blockstore then checks if the block has been indexed already (or passes through to the regular datastore otherwise), as sketched below. |
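A sketch of that pass-through behavior, reusing the hypothetical entry/readBlock from the earlier sketch and a simplified Datastore signature (the index and fallback fields are illustrative stand-ins, not real go-ipfs types):

```go
// Hypothetical pass-through datastore: consult the file index first,
// fall back to the regular datastore otherwise.
type shallowDS struct {
	index    map[string]entry  // block key -> file segment on disk
	hashes   map[string][]byte // block key -> expected hash
	fallback ds.Datastore      // the regular flatfs-backed datastore
}

// Get serves an indexed block from the original file when possible;
// if the file moved or changed, it passes through to the datastore.
func (s *shallowDS) Get(k ds.Key) ([]byte, error) {
	if e, ok := s.index[k.String()]; ok {
		if data, err := readBlock(e, s.hashes[k.String()]); err == nil {
			return data, nil
		}
	}
	return s.fallback.Get(k)
}
```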
Copy-on-write filesystems with native deduplication are relevant here, for example btrfs: https://btrfs.wiki.kernel.org. Copying files just adds a little metadata; the data extents are shared. I use it with big torrents: I can edit the files while still being a good citizen and seeding the originals. The additional disk space usage is only the size of the edits.
Reflinked copies are just files sharing extents. On adding a file that is already in the repo, the same sharing could apply. I am sure there are a lot of other more or less obvious ideas, and some crazier ones, like using union mounts (unionfs/aufs) with ipfs as a read-only fs and a read-write fs mounted over it, for network live distro installation, or going together with other VM stuff floating around here. |
@striepan indeed! this all sounds good. If anyone wants to look into making an fs-repo implementation patch, this could come sooner. (right now this is lower prio than other important protocol things.) |
I agree with @striepan; I even believe that copy-on-write filesystems are the solution to this problem. What needs to be done in ipfs, though, is to make sure that the right modern API (kernel ioctl) is used for the copy to be efficient. Probably go-ipfs just uses the native Go API for copying, so we should eventually benefit from Go supporting recent Linux kernels, right? Can anybody here give a definite status report on that? |
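For reference, this is roughly what the efficient copy looks like from Go on Linux; a minimal sketch assuming golang.org/x/sys/unix and hypothetical file names — it uses the FICLONE ioctl, the same mechanism behind `cp --reflink=always` on btrfs (go-ipfs is not confirmed to do this):

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func reflinkCopy(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	// Shares the source's data extents with the destination; fails
	// (e.g. EOPNOTSUPP) on filesystems without reflink support.
	return unix.IoctlFileClone(int(out.Fd()), int(in.Fd()))
}

func main() {
	if err := reflinkCopy("original.bin", "clone.bin"); err != nil {
		fmt.Fprintln(os.Stderr, "reflink failed:", err)
	}
}
```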
What would happen on Windows? (Are there any copy-on-write filesystems on Windows?) |
@Mithgol not really; the hash depends on how the file was chunked and encoded when it was added. This will be especially visible when IPLD is implemented, as there will be two encodings active in the network. |
Are you implying that I can't expect the same file to always have the same IPFS hash?

If that is so, then that should be seen (and treated) literally as a part of the issue currently titled “avoid duplicating files added to ipfs”. If two files are the same (content-wise), then their distribution and storage should be united in IPFS. Otherwise storage efforts and distribution efforts are doubled and wasted.

Also, elements of the (so called) Permanent Web are suddenly not really permanent: when they are lost, they're designed to never be found again, because even if someone somewhere discovers such a lost file in an offline archive and decides to upload it to the Permanent Web, it is likely to yield a different IPFS hash, and thus an old hyperlink (which references the original IPFS hash) is still doomed to remain broken forever.

If encodings and shardings and IPLD and maybe a dozen other inner principles make it inevitable for the same files to have different IPFS hashes, then maybe yet another DHT should be added to the system: it would map IPFS hashes to cryptographic hashes of whole files (and vice versa), and then some subsystem would be able to deduplicate the distribution and storage of the same files, and would allow lost files to reappear in the network after uploading. |
However, while this problem should be seen (and treated) literally as a part of the issue currently titled “avoid duplicating files added to ipfs”, this issue is still about deduplicating on disk, so it probably is not wise to broaden its discussion here. I've decided to open another issue (ipfs/notes#126) to discuss the advantages (or maybe the necessity) of each file having only one address, determined by its content. |
@kevina you will need to perform the adds without the daemon running, because the daemon and the client aren't necessarily on the same machine. If I try to 'zero-copy add' a file client-side and tell the daemon about it, the daemon has no idea what file I'm talking about, and has no reasonable way to reference that file. |
Just FYI: I am making good progress on this. The first implementation will basically implement the proposal discussed above. @jefft0 You can find my code at https://github.com/kevina/go-ipfs/tree/issue-875-wip. Expect lots of forced updates on this branch. |
Sorry for all this noise. It seems GitHub keeps commits around forever, even after a forced update. I will avoid using issue mentions in most of the commits to avoid this problem. :) |
The code is now available at https://github.com/ipfs-filestore/go-ipfs/tree/kevina/filestore and is being discussed in pull request #2634. |
Because this is a major change that might be too big for a single pull request, I decided to maintain this as a separate fork while I work through the API issues with whyrusleeping. I have created a README for the new filestore, available at https://github.com/ipfs-filestore/go-ipfs/blob/kevina/filestore/filestore/README.md, and some notes on my fork are available at https://github.com/ipfs-filestore/go-ipfs/wiki. At this point I could use testers. |
The GreyLink DC++ client uses an extended NTFS file attribute to store the TigerTree (Merkle DAG) hash of a file: http://p2p.toom.su/gltth. This allows it to avoid rehashing a file on re-addition, and to check and recover a broken file from the network if copies are available. |
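The analogous trick on Linux would use extended attributes; a minimal sketch, with a made-up attribute name and assuming golang.org/x/sys/unix (no real tool uses this attribute, and a real implementation would also have to invalidate the cache when the file changes, e.g. by storing mtime/size alongside the hash):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// Hypothetical attribute name; not used by any real tool.
const attr = "user.example.tigertree"

// cacheHash stores a previously computed hash on the file itself.
func cacheHash(path string, hash []byte) error {
	return unix.Setxattr(path, attr, hash, 0)
}

// cachedHash retrieves it, so an unchanged file need not be rehashed.
func cachedHash(path string) ([]byte, error) {
	buf := make([]byte, 64)
	n, err := unix.Getxattr(path, attr, buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	if err := cacheHash("file.bin", []byte("...hash bytes...")); err != nil {
		fmt.Println("setxattr failed:", err)
	}
}
```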
The filestore code has been merged and will be shipped in 0.4.7 (try out the release candidate here: https://dist.ipfs.io/go-ipfs/v0.4.7-rc1). For some notes and usage instructions, see this comment: #3397 (comment). This issue can finally be closed :) |
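For anyone trying the release candidate, usage is roughly as follows, assuming the experimental filestore flags shipped in 0.4.7 (see the linked comment for the authoritative instructions):

```sh
# enable the experimental filestore, then add without copying
ipfs config --json Experimental.FilestoreEnabled true
ipfs add --nocopy mybigfile.iso
```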
Why do I get "ERROR: merkledag node was not a directory or shard" while trying to add a file to ipfs? Can anyone help, please? |
@mycripto11116 Could you open a new issue and describe what steps you take to reproduce the issue? |
@jeromy thanks, I will do that. Just for a quick reply: I am new to experimenting with IPFS. While trying to add a 370k+ jpg file, it does show the file's hash, but if I try to view the subblocks with ipfs ls <hash>, it shows the error "merkledag node was not a directory or shard". Regards, Jay |
Blocks are not files; for blocks you'll have to use the block-level commands. |
It would be very useful to have files that are passed through `ipfs add` not copied into the `datastore`. For example here, I added a 3.2 GB file, which meant the disk usage for that file now doubled! Basically, it would be nice if the space usage for adding files were O(1) instead of O(n), where n is the file size.