
Blocks don't have a nice size #2053

Closed
robcat opened this issue Dec 10, 2015 · 23 comments
Labels
kind/enhancement: A net-new feature or improvement to an existing feature
need/community-input: Needs input from the wider community

Comments

@robcat

robcat commented Dec 10, 2015

Most filesystems use blocks of a standard size, typically 4 KiB or 16 KiB.

The blocks generated by the default ipfs chunker are stored as files of 2^18 + 14 bytes (262,158 bytes).
Those extra 14 bytes don't play nicely with most filesystems and disks: a whole extra filesystem block gets allocated just to hold them.

Are these bytes an essential part of the block, or could they be moved elsewhere (e.g. into the filename or into the database)?
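For illustration, a quick back-of-the-envelope check of the allocation penalty (a sketch; the 4 KiB filesystem block size is an assumption, the chunk numbers are the ones above):

package main

import "fmt"

func main() {
    const fsBlock = 4096      // assumed filesystem block size
    const chunk = 1 << 18     // 262144 bytes of data from the default chunker
    const stored = chunk + 14 // 262158 bytes once the wrapping is added

    fmt.Println(chunk / fsBlock)                  // 64 blocks: fits exactly
    fmt.Println((stored + fsBlock - 1) / fsBlock) // 65 blocks: a whole extra 4 KiB for 14 bytes
}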

@whyrusleeping
Member

Good point; we should probably adjust the block size so it fits neatly there. The extra 14 bytes are the protobuf wrapping around the data.
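A rough accounting of where those 14 bytes plausibly come from (the field sizes are assumptions about the dag-pb and unixfs protobuf layout, listed here only for illustration):

// Rough accounting for one 2^18-byte leaf (field sizes are assumptions):
const (
    payload = 1 << 18 // 262144 bytes of file data

    unixfsType     = 2 // tag + varint for the unixfs Type field
    unixfsDataHdr  = 4 // tag + 3-byte length varint for the unixfs Data field
    unixfsFilesize = 4 // tag + 3-byte varint for the unixfs filesize field
    dagpbDataHdr   = 4 // tag + 3-byte length varint wrapping the unixfs message

    overhead = unixfsType + unixfsDataHdr + unixfsFilesize + dagpbDataHdr // = 14
    stored   = payload + overhead                                         // = 262158
)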

@jbenet
Member

jbenet commented Dec 11, 2015

@robcat great observation -- agreed. this will shift a bit with incoming ipld, too. we should chunk as

N - overhead
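Concretely, the idea is to size the chunker to the wrapped target rather than to a power of two (a sketch; the 14-byte figure is taken from the comment above and would really be computed from the wrapping format):

const (
    targetBlock  = 256 << 10                  // 262144: the size we want on disk
    wrapOverhead = 14                         // protobuf wrapping, per the comment above
    chunkSize    = targetBlock - wrapOverhead // what the chunker should emit instead of 2^18
)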

@robcat
Author

robcat commented Dec 11, 2015

@jbenet

this will shift a bit with incoming ipld, too. we should chunk as N - overhead

Yes, that would be the obvious fix.
But it would still not be compatible with the deduplication features that many filesystems already have: the chunks would have odd offsets and sizes, while the filesystem is probably storing the original file in 2^n-byte blocks.

An alternative solution would be to make a special exception for blobs: use 2^n sizes for chunks, and let the blockstore transparently strip the wrapping when writing and re-add it when reading from the filesystem (the special status of the stored block could be encoded in the filename or in an extended attribute).
This sounds like a premature optimization, but it would really help by halving the storage requirements when running ipfs alongside httpd... (maybe related to #875)
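A minimal sketch of that transparent strip/re-wrap blockstore (all names are hypothetical, not actual go-ipfs interfaces; the two helpers are stubs for decoding and re-encoding the leaf envelope):

type leafStore struct {
    files map[string][]byte // stands in for one file per block on disk
}

// Put stores plain leaf blocks unwrapped, recording the special status in the key.
func (s *leafStore) Put(key string, block []byte) {
    if payload, ok := unwrapLeaf(block); ok {
        s.files[key+".raw"] = payload // a clean 2^n bytes: friendly to fs blocks and dedup
        return
    }
    s.files[key] = block // objects with links keep their wrapping as-is
}

// Get re-adds the wrapping so callers always see the original block bytes.
func (s *leafStore) Get(key string) []byte {
    if payload, ok := s.files[key+".raw"]; ok {
        return wrapLeaf(payload)
    }
    return s.files[key]
}

// unwrapLeaf and wrapLeaf are placeholders for decoding and re-encoding the
// protobuf envelope of a link-free leaf; the re-encoding must reproduce the
// original bytes exactly so the block still matches its multihash.
func unwrapLeaf(block []byte) ([]byte, bool) { return nil, false }
func wrapLeaf(payload []byte) []byte         { return payload }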

@jbenet
Member

jbenet commented Dec 11, 2015

An alternative solution would be to make a special exception for blobs: use 2^n sizes for chunks, and let the blockstore transparently strip the wrapping when writing and re-add it when reading from the filesystem (the special status of the stored block could be encoded in the filename or in an extended attribute).
This sounds like a premature optimization, but it would really help by halving the storage requirements when running ipfs alongside httpd... (maybe related to #875)

This makes a lot of sense, and we should definitely investigate how it would play out. Optimizing this will matter significantly, as we're trying to put enormous amounts of data on ipfs :) -- the timeline for implementation can wait (if it's easy, do it sooner), but we certainly want to figure it out soon for the specs.

I suspect this would get a TON easier if we were to allow the ipld format that's just the raw data (no wrapping) -- obviously strictly for leaves, without any links. This has been requested many times (and I'm not opposed to it; we just need to make damn sure that we distinguish it correctly).

@robcat
Author

robcat commented Dec 11, 2015

I suspect this would get a TON easier if we were to allow the ipld format that's just the raw data (no wrapping) -- obviously strictly for leaves, without any links.

Sorry to go a bit off-topic, but while I was focused on the file-archival scenario I forgot to think about the links.

Why are the links stored together with the binary data in the first place?
Links beg to be stored in a separate database (it would be much easier to navigate the graph, follow backlinks, etc.).

@robcat
Author

robcat commented Dec 11, 2015

Why are the links stored together with the binary data in the first place?
Links beg to be stored in a separate database (it would be much easier to navigate the graph, follow backlinks, etc.).

@jbenet My last post was not just a rant about links: if they were stored separately, this block size issue would be immediately solved.

@jbenet
Member

jbenet commented Dec 12, 2015

@robcat hash links are part of the data model; this is why it's a merkleized data structure. an entire object is one item.

i understand that this seems odd when you think of ipfs as just a way to track posix files, but ipfs goes well beyond that: the point is to make ipfs objects a data model where people can construct arbitrary data structures -- and dump all their existing ones -- and have merkle links as first-class primitives.

this is not a great rendering, but take a look:

@jbenet
Member

jbenet commented Dec 12, 2015

btw, a storage model that rips out the merkle links, stores them separately, and then puts them back together when constructing the object logically (and when sending it, hashing it to make sure things are ok, etc) screams implementation pain and suffering. this is why #875 will be hard.

the caveat is that by extending IPLD to allow raw-data edge nodes, we can make #875 easy to implement, and get most of the way to what you want here.

also note that none of this prevents us from maintaining indexes of the links in a graph database (for backlinks etc), but that's a different layer. (there's a lot wrapped into these decisions)

@robcat
Author

robcat commented Dec 12, 2015

i understand that this seems odd when you think of ipfs as just a way to track posix files, but ipfs goes well beyond that: the point is to make ipfs objects a data model where people can construct arbitrary data structures -- and dump all their existing ones -- and have merkle links as first-class primitives.

@jbenet
I agree. In a certain sense my view is even more radical than yours: to make links first-class citizens, blocks have to be demoted to a lower class.

The current graph structure (as in the IPFS paper):

type IPFSLink struct {
    Name string
    Hash Multihash
    Size int
}

type IPFSObject struct {
    links []IPFSLink
    data []byte
}

// the specific serialization format is an integral part of the object
func (obj *IPFSObject) Serialize() []byte

If I just want to walk the graph, I necessarily have to fetch all the binary blobs and hash them (to confirm each object's integrity), even if I don't actually care about them. That's because links sit on the same level as data.

But walking the graph is such a common use case! The result is that, in practice, objects become specialized: they usually contain either only links or only binary data.
In fact, I have yet to come across a really compelling case of an object that needs to include both links and data.

This is the structure I have in mind:

type IPFSLink struct {
    Name string
    Hash Multihash
    Size int
    Binary bool    // this is a link to an opaque []byte object
}

type IPFSObject []IPFSLink
func (obj *IPFSObject) Serialize() []byte

// the binary Leaves (i.e. []byte) are already flat: no serialization needed

Pro:

  • sane hashing for Nodes (no hashing of the binary data required)
  • no useless wrapping of the Leaves (i.e. deduplication for free)
  • sane hashing for Leaves (same as sha256sum hashes)
  • possibility to use specialized graph storage for Nodes
  • trivial pruning of the tree, in case we don't need to keep the binary blobs (e.g. storage constrained devices)

Contra:

  • in case an object has to contain both links and a big blob, one more link is needed
  • Leaves are second-class entities and have to be accessed through their parents (whoever receives the blob has to know in advance that they are dealing with unwrapped data)
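To make the first and last "Pro" concrete, here is a sketch of a graph walk over the hypothetical types above (fetch stands in for whatever retrieves and verifies a link-only object):

func walk(fetch func(Multihash) (IPFSObject, error), root Multihash, visit func(IPFSLink)) error {
    node, err := fetch(root) // only the small link-list object is fetched and hashed
    if err != nil {
        return err
    }
    for _, l := range node {
        visit(l)
        if l.Binary {
            continue // opaque leaf: never pulled in just to traverse the graph
        }
        if err := walk(fetch, l.Hash, visit); err != nil {
            return err
        }
    }
    return nil
}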

@jbenet
Member

jbenet commented Dec 13, 2015

We're still on different pages. see https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md and the talk i linked earlier.

ipfs objects will be arbitrary data objects. the great majority of them will not be posix files nor blocks of them. they will be small pieces of interlinked data in applications.

@robcat
Author

robcat commented Dec 14, 2015

We're still on different pages. see https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md and the talk i linked earlier.

Sorry @jbenet, I watched and read all the linked material, but in my view they sit on different levels. IPLD defines a common language for interpreting and canonicalizing the nodes of the graph; here I was instead discussing implementation details (do we really need to specify a serialization format for objects that are already flat?).

ipfs objects will be arbitrary data objects. the great majority of them will not be posix files nor blocks of them. they will be small pieces of interlinked data in applications.

I understand this vision perfectly, but maybe we have "dual" views. In my opinion, the Data property in a Node object (as the current implementation has it) can only be justified by the unix mantra "everything is a file". Ordinary objects should encode all their information using Links, not as an opaque byte stream.

Also, maybe we don't agree on what "small" means. In my view, the content of the Data field is by definition always big (otherwise it could even be encoded as a link with an identity-multihash value) and does not belong in the strongly connected component of the graph (where the small pieces of interlinked data will live). This doesn't force the graph to be a tree at all (as in this example).

But if we are still not understanding each other, I would like to ask:

  • Why does the node object have a special Data field, despite the fact that IPLD doesn't give it any special consideration?
  • How are this title property and this array of links going to be handled by the go-ipfs implementation?
  • What is a compelling case of object that has to store links and binary data at the same time?
  • How is my proposed implementation of the Node object going to prevent arbitrary data objects from being stored in the graph?

@jbenet
Member

jbenet commented Dec 14, 2015

Also, maybe we don't agree on what "small" means. In my view, the content of the Data field is by definition always big (otherwise it could even be encoded as a link with an identity-multihash value) and does not belong in the strongly connected component of the graph (where the small pieces of interlinked data will live). This doesn't force the graph to be a tree at all (as in this example).

  • Even today, we already have many intermediate nodes (nodes with lots of links) larger than most file objects (containing data)
  • Also, an object's "data" won't be segregated off into a Data field any more: the non-link properties are interspersed in the object with the links, and some data can be added to the links themselves. See the example about unix directories with modes again.
  • Why does the node object have a special Data field, despite the fact that IPLD doesn't give it any special consideration?

IPLD is not yet deployed; we're moving from the current format to IPLD, and the transition is not complete yet. The old format will continue to be supported (backwards compatible) so existing links do not break.

  • How are this title property and this array of links going to be handled by the go-ipfs implementation?

I don't understand the question; they will be stored in the same serialized object. See https://github.com/ipfs/go-ipld/

  • What is a compelling case of object that has to store links and binary data at the same time?
  • A filesystem directory.
  • Any webapp data models:
    • social webapp: users, posts, relationships, etc.
    • store webapp: users, products, categories, bundles, orders, ...
    • any api that today uses JSON or XML or Protobuf or ...
  • A file with metadata or extended attributes
  • An archive format
  • Versioning objects (git commits, darcs patches, CRDTs, etc)
  • Cryptographic artifacts (keys, signatures, certificates, ...)

It's not "binary data" specifically; it's just "data".

  • How is my proposed implementation of the Node object going to prevent arbitrary data objects from being stored in the graph?

You distinguish "data-carrying nodes" as leaves and treat them as special. IPLD does not; it embeds data directly in the same object. Also, you do not have the same data model as IPLD (a full JSON-like tree); you have the old merkledag link-list model.
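For contrast, a rough sketch of the kind of object IPLD has in mind, written here as a Go literal (the {"/": ...} link notation follows the linked spec; the field names and values are made up): links and ordinary properties sit side by side in one JSON-like tree.

var directory = map[string]interface{}{
    "mode":  0755,
    "owner": "robcat",
    "files": map[string]interface{}{
        "readme.md": map[string]interface{}{
            "mode": 0644,
            "data": map[string]interface{}{"/": "QmExampleHashOfTheContents"}, // a merkle-link
        },
    },
}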


Other points for consideration:

  • we're moving toward representing a node's entire repo as ipfs objects itself, so that other primitives and tools work equally well with the "control data" (control-plane analogy), and so that it can be moved around with ipfs too.
  • @robcat thank you for taking the time to think about all this, and sorry this is going in a different direction.
  • The idea of using different stores (like graph stores) is a great one, and will still be relevant and used, despite some objects being larger.
  • And yes, some implementation could try splitting off some objects based on their topology; however, this is something we don't have time to focus on now, and it would offer less benefit than optimizing the many other major bottlenecks we still have.

davidar added a commit to davidar/specs that referenced this issue Jun 13, 2016
This patch fixes some of the inconsistencies in the IPLD spec, as well as resolving some existing issues, namely:

- ipfs/kubo#2053 - allow raw data in leaf nodes
- ipfs/kubo#1582 - use EJSON for encoding (non-UTF8) binary data
- mandate CBOR Strict Mode to avoid having to deal with duplicated keys
- change `link` to `content` (although we could also use `data` or something similar), as file content needn't be a merkle-link, and can in fact be directly embedded into small files
- removed the extraneous `subfiles` wrapper, as file content can be either (a merkle-link to) a bytestring, a unicode string, or a list thereof
@davidar
Member

davidar commented Jun 15, 2016

I suspect this would get a TON easier if we were to allow the ipld format that's just the raw data (no wrapping) -- obviously strictly for leaves, without any links. This has been requested many times (and I'm not opposed to it; we just need to make damn sure that we distinguish it correctly).

An alternative solution would be to make a special exception for blobs: use 2^n sizes for chunks, and let the blockstore transparently strip the wrapping when writing and re-add it when reading from the filesystem (the special status of the stored block could be encoded in the filename or in an extended attribute).
This sounds like a premature optimization, but it would really help by halving the storage requirements when running ipfs alongside httpd... (maybe related to #875)

This makes a lot of sense, and we should definitely investigate how it would play out. Optimizing this will matter significantly, as we're trying to put enormous amounts of data on ipfs :) -- the timeline for implementation can wait (if it's easy, do it sooner), but we certainly want to figure it out soon for the specs.

@jbenet I've suggested some changes to the IPLD spec in ipfs/specs#111 to address this issue, comments welcome :)

@Kubuxu
Member

Kubuxu commented Jun 15, 2016

Over in #2122 we are also talking about limiting the block size to 256 KiB to allow better alignment and packing in storage (the filesystem block size is usually 4 KiB but can be bigger), which was the initial topic of this issue.

@kevina
Contributor

kevina commented Jun 15, 2016

Note that it should be possible to adapt my filestore code (#875, #2634) to separate out the payload, store it separately, and basically implement what I think @robcat is proposing. This is actually something I was going to propose, and it can have some nice performance benefits. For example, the garbage collector only cares about the links in a node; by separating out the payload you could add something like a GetLinks() method to the DAGService that can be satisfied without loading the payload from disk.
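A sketch of the kind of interface change that suggests (hypothetical signatures, not the actual go-ipfs DAGService of the time):

type Link struct {
    Name string
    Hash Multihash // target of the merkle-link
    Size uint64
}

type DAGService interface {
    Get(h Multihash) ([]byte, error)      // whole block: wrapping plus payload
    GetLinks(h Multihash) ([]Link, error) // links only, satisfiable without touching the payload
}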

If this idea is implemented, it might not necessarily be a good idea to set the block size to 256 KiB, as the payload could then have odd sizes, and odd sizes might not be compatible with a filesystem's deduplication. If the payload size is exactly 256 KiB, the filesystem has a much better chance of deduplicating the block when that data is also present in another file.

Something to think about.

@Kubuxu
Member

Kubuxu commented Jun 15, 2016

That is true, but you have to consider how many nodes will be pure raw-data nodes that also happen to belong to a file the user publishes, versus how many will be linking nodes, storing metadata and using custom structures. Users are mostly consumers these days (the 1%/99% rule). Your filestore is a nice thing, but for the consuming majority it won't be that useful.

The other thing is that the filesystem will only be able to dedup the content if it begins at a 4 KiB boundary, which would require exactly N * 4 KiB of metadata before it; I don't see that happening.

On the other hand, if raw nodes were introduced (a DAG consisting only of raw data that IPFS does not interpret), they would store precisely 256 KiB of a given file, aligning perfectly with filesystem boundaries and allowing filesystem deduplication. Linking nodes would also be 256 KiB and align to filesystem boundaries, wasting no space.

@kevina
Contributor

kevina commented Jun 15, 2016

@Kubuxu, my idea was that the metadata would be stored separately from the payload (i.e. the contents of the file). For example, the metadata could live in a leveldb (as it will be small) while the payload is stored in a separate file on disk. The size of the metadata would then be irrelevant.
(This idea differs from the filestore in that each block would still be stored in its own file using the same scheme the flatfs datastore uses now, but the contents stored on disk would be only the payload.)

As for raw nodes without any metadata, that would also solve the problem. Has anyone given any thought to how that would be implemented? Without any sort of header, how would you know what you are getting when you retrieve a block?

@kevina
Contributor

kevina commented Jun 15, 2016

To summarize my point from IRC: unless raw nodes are used for all data, at some point in the future it might make sense to store the data separately from the metadata in the repo. If the data is stored separately, then it is the size of the data component of the block that matters as far as optimizing storage goes.

I would propose we limit the size of the data to 256 KiB - 4 KiB unless raw nodes are being used. The block size can then still be limited to 256 KiB.

@davidar
Member

davidar commented Jun 16, 2016

As for raw nodes without any metadata, that would also solve the problem. Has anyone given any thought to how that would be implemented? Without any sort of header, how would you know what you are getting when you retrieve a block?

@kevina see my suggestions in ipfs/specs#111

@whyrusleeping
Member

Let's get this done alongside the IPLD stuff. With the new CIDv1 support we will be able to address completely raw blocks!
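For illustration, addressing a completely raw block with a CIDv1 looks roughly like this (a sketch assuming the github.com/ipfs/go-cid and github.com/multiformats/go-multihash packages; the "raw" codec in the CID marks the bytes as unwrapped data):

package main

import (
    "fmt"

    cid "github.com/ipfs/go-cid"
    mh "github.com/multiformats/go-multihash"
)

func main() {
    payload := make([]byte, 1<<18) // a full 256 KiB leaf, stored on disk as-is

    sum, err := mh.Sum(payload, mh.SHA2_256, -1)
    if err != nil {
        panic(err)
    }

    c := cid.NewCidV1(cid.Raw, sum) // "raw" codec instead of dag-pb wrapping
    fmt.Println(c)
}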

@whyrusleeping whyrusleeping added this to the ipld integration milestone Aug 19, 2016
@jbenet
Member

jbenet commented Aug 21, 2016

@whyrusleeping note this: ipfs/specs#130 (comment)

Worth noting that it also depends on whether people want to be able to use those raw blocks as "files" of their own. If yes, then either unixfs must be taught how to treat a raw block as a file, or the file structure must still exist around the raw block. If not, we then have to generate two objects per block: the raw data, and a very small object pointing to the raw data that makes it a file. (The important thing here is being able to aggregate the indexing data structures of files wherever.) Ultimately it is a discussion trading off the advantages of raw blocks against the expressive advantages of "everything is a file", and how to tune those two.

(btw, this is separate from the "raw blocks should be smaller than ( + overhead)" question, which we discussed previously.)

@em-ly em-ly added kind/enhancement A net-new feature or improvement to an existing feature need/community-input Needs input from the wider community labels Aug 25, 2016
@whyrusleeping
Member

Raw blocks are now a thing: running ipfs add --raw-leaves <other args> will use raw blocks wherever possible. This means the size of a block on disk (for larger files) will now be exactly 256k, not overshooting it by a few bytes.

Closing this issue now; please reopen it or open a new issue if further discussion is needed.

@whyrusleeping
Member

(Also worth noting that you can ipfs cat raw blocks directly.)
