Abstractions in IPLD: high level and low level data representations #91
Comments
Thank you for getting to this, as it can become quite complex, especially when we don't know yet exactly what we are talking about. I don't quite get the difference between layer 2 and layer 3 (IPLD blocks vs IPLD objects). Are you saying that a single logical IPLD object could be made of multiple physical IPLD blocks that would be composed together to form the logical object? I would aim for something simpler by merging the two, making the application layer (4th layer) aware of the physical constraints of the underlying blocks and integrating the chunking into the application layer. I can see the value of having chunking that comes for free for every application. But wouldn't it come with added complexity and inefficiencies? As for the path solutions, I quite like the 1st solution with different prefixes for different usages. You could make an analogy with URI schemes. I would imagine something like:
In my opinion, each application might want to have the full flexibility of being able to define its own path system, with a common structure of course. This makes much more sense. Also, some applications might not necessarily want to implement a filesystem hierarchy at all. It should not be mandatory (and we always have the IPLD paths to debug the data). |
Thanks for your reply @mildred, this concept is very new in my mind and I found it very hard to explain.
Yes, exactly!
So, by merging them (and by merging them I don't mean using two different path schemes in one), then when I add a JSON to IPLD, it could become physically different from what it originally was, so imagine seeing Also, in the paragraph in which I talk about the differences you can see what I mean by these two layers actually being different. One needs paths to be transparent, one needs the urls to be simple (no
They were to me, until I realized that these may need two different path schemes as soon as they diverge in their physical representation (see Note on different path schemes). Of course, don't get me wrong, I am not settled on this idea of separating the two concepts either. I would prefer them to be merged. The way I see the difference between the two is that IPLD objects are a layer (or an application) on top of IPLD blocks. One gives me a very nice path scheme that corresponds to the actual representation of the data (the high level one) and abstracts the fact that an object may be split into multiple objects. |
Makes complete sense. Not much different than protocols running over TLS instead of TCP. In the end, applications will be able to decide over which layer they want to be implemented. Some might like the abstraction and not having to worry about little details while others might welcome the possibility to be closer to the wire format and limit the overhead (I suspect that unixfs will be of the latter). |
@mildred if you want we can take this conversation over IRC/hangout (we need a way to explain this in an easier way - happy to chat). The reason I would like this not to be just yet another app that you can build on IPFS is that when I add some data, I expect to get this data back with the same representation. In other words, I am describing a different, higher level resolver than the one that we envision for IPLD. IPLD blocks are the low level thing on which everything runs, IPFS resolves to files, IPLD objects resolve to structured data (if that makes sense). What I am not convinced about is separating them |
Unfortunately, I don't have time to talk much today but I'll try. The distinction between the object layer and the block layer isn't the same as TLS versus TCP. Layer 3 is the data model layer and layer 2 is the data representation layer. The purpose of having two distinct layers is to get application-specific efficiency/control (IPLD blocks) while still maintaining the logical structure of the data (IPLD objects). That is, the split separates the model from the representation. One core idea that you may be missing is that any valid tree of IPLD blocks is a valid IPLD object. This means that UNIXFS can choose to structure its IPLD blocks in a way that makes sense for unix filesystems while For example, a logical IPLD object might be: {
"a": "lots of data",
"big dir": {
"aa": "first",
"ab": "second",
"b": "third"
}
} where the actual IPLD blocks might be: {
"a": {
@link: hash(blocks),
"size": 24, // mandatory? data only?
},
"big dir": {
"a": {
@prefix: true, // interpret this as a prefix of a directory tree.
@link: hash(a_dir),
size: ??,
},
"b": "third", // inline file
}
};
blocks = {
@blocks: bytes, // Interpret this object as a single byte array.
0: { // chunk starting at byte offset 0
@link: hash(block_first),
size: 7,
},
7: { // chunk starting at byte offset 7
@link: hash(block_second),
size: 5
}
};
// These are, by themselves, valid objects.
block_first = "lots of";
block_second = " data";
// Again, a valid IPLD object.
a_dir = {
"aa": "first",
"ab": "second",
}; |
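To make the relationship between the two layers above concrete, here is a minimal Python sketch of how the layer 2 blocks from this example could be collapsed back into the logical layer 3 object. It is only an illustration under stated assumptions: `STORE`, `ROOT`, `resolve` and the hash strings (`h_first`, `h_blocks`, ...) are hypothetical names, chunks are plain strings rather than bytes, and the handling of `@blocks` and `@prefix` is one possible reading of the comments in the example, not a settled spec.

```python
# Hypothetical in-memory block store: hash string -> layer 2 block.
STORE = {
    "h_first": "lots of",
    "h_second": " data",
    "h_blocks": {
        "@blocks": "bytes",
        0: {"@link": "h_first", "size": 7},
        7: {"@link": "h_second", "size": 5},
    },
    "h_a_dir": {"aa": "first", "ab": "second"},
}

# Root block, mirroring the example (sizes are copied over and unused here).
ROOT = {
    "a": {"@link": "h_blocks", "size": 24},
    "big dir": {
        "a": {"@prefix": True, "@link": "h_a_dir"},
        "b": "third",
    },
}


def resolve(node, store):
    """Collapse layer 2 constructs into the logical (layer 3) value."""
    if not isinstance(node, dict):
        return node  # inline leaf value
    if "@blocks" in node:
        # Concatenate the chunks in byte-offset order.
        offsets = sorted(k for k in node if isinstance(k, int))
        return "".join(resolve(node[o], store) for o in offsets)
    if "@link" in node:
        # Assumption: at this layer every link is followed transparently.
        return resolve(store[node["@link"]], store)
    out = {}
    for key, value in node.items():
        if isinstance(value, dict) and value.get("@prefix"):
            # A prefix shard: its entries belong to the parent directory.
            out.update(resolve(store[value["@link"]], store))
        else:
            out[key] = resolve(value, store)
    return out
```

With these assumptions, `resolve(ROOT, STORE)` yields the logical object from the start of the example: `"a"` becomes `"lots of data"` and `"big dir"` contains `"aa"`, `"ab"` and `"b"`.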
@Stebalien thanks for joining the conversation, this is really great
I think this is a much better way to explain what I explained above (we could try to use IPLD data model vs IPLD data representation naming convention if blocks and objects don't work). Your example gives an actual real use case in which the user may define the lower representation, but when they access via IPLD(data repr. scheme), they want to access the actual representation, without taking care of resolving it manually! |
Ok, I understand, and perhaps we need to find a name for this layer 3 so we can start writing code for it. I have a question here: will layer 3 allow links, or will links be resolved at layer 2 only? If links are in both layers, how to specify if a link is to be resolved in layer 2 or in layer 3? Layer 2 only links would be fine for me. |
Obviously, these are my opinions. IPFS links are invisible in layer 3. However, if we want support for mutable links (IPNS, HTTP, etc.), those would have to be visible. For now, I think it's reasonable to say that turning a layer 3 object into layer 2 blocks is the application's job. In the future, it might be worth adding an API to IPFS that takes a layer 3 object and automatically chunks it up into layer 2 blocks (and tries to re-use existing layer 2 blocks), but this is a future optimization. However, the current discussion assumes that we only allow metadata on links. If the purpose of metadata is to describe properties of linked objects without having to modify them, or to describe properties relevant only to the data representation (size of linked content, etc.), it makes sense to support metadata on links only. If the purpose of metadata is to describe relationships, it makes sense to support metadata on all relationships. Personally, I'd like to support metadata on all relationships. That is: {
"joke": "Why did the chicken cross the road?",
// Describes the relationship "obj1"
"joke/": { // nicola doesn't like this syntax. I'm open to suggestions.
"comment": "A funny file"
},
"sad story": {
"@link": hash,
"size": 99, // Not really metadata. This is a part of the link spec.
},
"sad story/": {
"comment": "A sad file. Do not read."
}
} Incidentally, this allows metadata to be stored in its own linked layer 2 block. |
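As a rough illustration of the `"name/"` convention proposed above, the following Python sketch separates an object's entries from the metadata attached to each relationship. The function name `split_metadata`, the sample object `OBJ` and the hash string `h_sad` are hypothetical; the trailing-slash syntax itself is only the suggestion made in this comment, not an agreed format.

```python
def split_metadata(obj):
    """Separate entries from relationship metadata stored under 'name/' keys."""
    entries, metadata = {}, {}
    for key, value in obj.items():
        if key.endswith("/"):
            # 'joke/' describes the relationship named 'joke'.
            metadata[key[:-1]] = value
        else:
            entries[key] = value
    return entries, metadata


# Sample object following the example above; "h_sad" stands in for a real hash.
OBJ = {
    "joke": "Why did the chicken cross the road?",
    "joke/": {"comment": "A funny file"},
    "sad story": {"@link": "h_sad", "size": 99},
    "sad story/": {"comment": "A sad file. Do not read."},
}

entries, metadata = split_metadata(OBJ)
```

Here `entries` keeps `"joke"` and `"sad story"` (including the link and its `size`), while `metadata` maps each of those names to its comment, for both the inline value and the linked one.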
#important I am not sure if I follow, the way I see this happening is the following. We have some descriptive notation (@jbenet has some hints on the direction we should take) to describe the way shards, for example, could happen. IPLD objects like web pages, should not need to resolve their links. Of course, if you want to resolve an entire object, there must be an API call that allows you to resolve all the links recursively. The idea here is to keep the same structure, but expose the links. tl;dr
Without links, with shards
// original data
{
friends: {
nicola: {name: "Nicola"},
..
zayan: {name: "Zayan"}
}
}
// to ipld blocks
{
friends: {
@merge: [{@link: hash1}, {@link: hash2}, {@link: hash3}],
metadata: "something here"
}
}
// hash1 == { nicola: ..}
// hash2 == { ... }
// hash3 == { .. , zayan: ..}
// to ipld object
{
friends: {
nicola: {name: "Nicola"},
..
zayan: {name: "Zayan"}
}
}
With links, with shards
// original data
{
friends: {
nicola: {@link: "hash1", meta1: "data on this link!"},
..
zayan: {@link: "hash2", meta2: "more about this!"}
}
}
// to ipld blocks
{
friends: {
@merge: [hash1, hash2, hash3]
}
}
// hash1 == { nicola: { @link: ..
// hash2 == { ... }
// hash3 == { .. , zayan: {@link: ..
// to ipld object
{
friends: {
nicola: {@link: "hash1"},
..
zayan: {@link: "hash2"}
}
}
// to ipld object recursively explored
{
friends: {
nicola: {name: "Nicola"},
..
zayan: {name: "Zayan"}
}
}
note: in this example in ipld-object some times I use the dot |
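The "with links, with shards" case above can be sketched in Python as two separate steps: expanding `@merge` shards to get the IPLD object (links still exposed), and then, only on request, exploring links recursively. This is a sketch under assumptions: `STORE`, `BLOCK`, `to_object`, `explore` and the hash strings are hypothetical, `@merge` is taken to be a list of bare shard hashes, and only two shards are shown instead of three.

```python
# Hypothetical block store: shards of the "friends" map plus the linked objects.
STORE = {
    "shard1": {"nicola": {"@link": "hashN"}},
    "shard2": {"zayan": {"@link": "hashZ"}},
    "hashN": {"name": "Nicola"},
    "hashZ": {"name": "Zayan"},
}

# Layer 2 view: "friends" is split into shards.
BLOCK = {"friends": {"@merge": ["shard1", "shard2"]}}


def to_object(node, store):
    """Expand @merge shards, but leave true links in place."""
    if not isinstance(node, dict) or "@link" in node:
        return node
    out = {}
    for shard in node.get("@merge", []):
        out.update(to_object(store[shard], store))
    for key, value in node.items():
        if key != "@merge":
            out[key] = to_object(value, store)
    return out


def explore(node, store):
    """Recursively replace every link with the object it points to."""
    if not isinstance(node, dict):
        return node
    if "@link" in node:
        return explore(to_object(store[node["@link"]], store), store)
    return {key: explore(value, store) for key, value in node.items()}


obj = to_object(BLOCK, STORE)   # shards gone, links still visible
full = explore(obj, STORE)      # links resolved recursively
```

Keeping the two functions apart reflects the point made above: the object keeps the same structure and exposes its links, and resolving everything is a separate, explicit call.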
To me, layer 3 is really a convention or a way of thinking about the data on top of which we can build APIs. For example, to avoid recursively fetching objects, you could ask ipfs for, e.g., a "directory listing" of an object. However, you bring up a good point. In general, there is a distinction between data included in an object, and linked data. The discussion so far (or, at least my interpretation of it) conflates the two. Maybe it would be a good idea to distinguish between links and includes (I believe I mentioned this off-line but don't think we got very far in that discussion). That is, merges, blocks, includes, etc. (things that allow splitting large objects into smaller reusable chunks) would be layer 2 constructs that would disappear in layer 3 but true links could still remain. Links would demarcate clear boundaries between distinct objects that happen to be related in some way while an include would mean "this content is logically part of this object but is stored elsewhere". However, what about distinct objects that should be stored together for efficiency purposes? That is, given the above we have a way of splitting large objects but no way to combine small objects. I'm imagining a deep directory tree such as: {
"a": {
"b" : {
"c": {
"d": "stuff",
}
}
}
} Each "directory" here is a distinct logical object but splitting this up into multiple blocks is wasteful. One way to solve this is through metadata (which could be stored as a tag in CBOR). That is, in layer 2, the above object would look like: {
"a": {
"b" : {
"c": {
"d": "stuff",
"d/": { "distinct": true }, // Could be stored as a tag in CBOR for efficiency.
},
"c/": { "distinct": true },
},
"b/": { "distinct": true },
},
"a/": { "distinct": true },
} In this way, a "link" would just be a "distinct" include. Note: my terminology is probably confusing. |
To make sure we're all on the same page, the purpose of the layer 2/layer 3 distinction as I see it is that logical object boundaries should not be dictated by performance requirements. That is, separation of model and representation. Enumerated:
Did I miss anything? |
I would say: just don't make a link and include the data directly in place of the link. This is a very simple solution with no need to add another concept. edit: by include I mean put the data there without any indirection. And if you need to reference this partial object from another place, this is where you should use IPLD paths in IPLD links. |
@mildred I am not sure if by include you mean @Stebalien I agree with your 3 objectives |
Glad to see this discussion happening. have a lot to say, of course. sorry i haven't gotten to write. Some points for discussion:
I was reminded through a recent conversation that these are really just functional data structures, and that there likely is very good literature on implementing these efficiently. We can look into that for some time to get clarity on layer2 and layer3. Trivial computations (like byte-stream merge (for files), or object merge (for JSON sharding)) are fine to hardcode, but what would be really powerful is to make sure this model is extensible with proper computation. Think of being able to implement CRDTs over this trivially. Please both take a look at these preliminary discussions/experiments:
This is an extremely powerful way to think about data. AND IT IS VERY HARD TO DO RIGHT. And it has been written about extensively. We (I) need to spend a good amount of time searching the literature to gain understanding of results important to this endeavor. Please for now only use "merge" as a tool for thought, or to express how layer3 may be created out of layer2, but do not assume We can repurpose the word blocks for layer 2, but I would caution against it and encourage us to find other words, as "blocks" historically are dumb byte sequences, not nice objects, and it already means that in ipfs:
|
@nginnever can you post the sketches @nicola and I put in your notebook here? |
I have for a long time used physical / logical to describe the 2 main Physical: the domain of hardware and practical engineering constraints Does anyone have specific concrete examples or precedents that would On Mon, Apr 18, 2016 at 2:19 AM, Juan Benet [email protected]
james mcfarland |
IPLD Sketches - https://gateway.ipfs.io/ipfs/QmUz898hhH2Z8X3c8Jd6V1DiJqhSqLNi5u45oNcZ2qWcFp IPLD Pad Agenda - https://pad.riseup.net/p/EByKoWmYHrjz |
This issue follows on my original proposal #90 (see the full scheme https://github.com/nicola/interplanetary-paths) discussed this weekend with the team in New York and conversations with @Stebalien, @jbenet, @diasdavid, @mildred (ping @dignifiedquire )
READ first: because there is so much text already, I wrote #important whenever it is the important part to read (of mine of course - feel free to use this in your posts)
Background
In this issue, I argue that there are two path schemes that we should offer to traverse IPLD objects:
Different layers of abstraction
We can abstract the different forms of data in different layers. For example, imagine we have file1.jpg in the folder dir.
Layer 4 (application)
The nice path that an application like unixfs offers to its final user should allow the following: /$hash/dir/file1.jpg
Layer 3 (IPLD object path)
Let's assume that the unixfs application decides to structure their data in this way:
Note: in the case of unixfs, we could aim at merging Layer 3 with Layer 4, but for the sake of the argument, I just made unixfs a bit more complex than it should.
Layer 2 (IPLD block path)
However, since the folder is very big, our chunker (this can be implemented in many ways, let's assume that The Nicola IPLD Chunker works this way) is going to split the IPLD object in multiple IPLD objects, that we are going to call IPLD blocks
Summary of the 4 layers and their paths
/$hash/dir/file1.jpg
/$hash/dir/files/file1.jpg
/$hash/shard3/files/shard1/file1/0/0/0//
Differences between 2 and 3
In other words the two key layers for IPLD are the 2nd and the 3rd.
The reason why they should have different path schemes is because they are both important to the final application developer (depending whether they are writing higher or lower level application).
The difference between the two is that one traverses the actual IPLD data blocks, while the other one abstracts that. From the previous example:
/hash/file3 would be a path of the final representation
/hash/shard1/file1 is what we will have to use instead.
Note on different path schemes
Also, the way the separator will traverse either layer may have different meaning, for example in Layer 3, maybe there is no need to have transparent links, while it can be important for Layer 2.
So for example:
Layer 3:
/file/name === Nicola
/file/permission === undefined
Layer 2:
/file//name === Nicola (note the // meaning that we are traversing a link)
/file/permission === 0777
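The two path behaviours in this note can be sketched as two small Python resolvers over the same data. Everything here is an assumption made for illustration: the names `resolve_high` and `resolve_low`, the sample `ROOT` and `STORE`, the hash string `h_file`, and the reading that the high level scheme follows links silently and hides link metadata while the low level scheme requires `//` to cross a link. `undefined` is represented as `None`.

```python
# Hypothetical data: "file" is a link carrying a "permission" property.
STORE = {"h_file": {"name": "Nicola"}}
ROOT = {"file": {"@link": "h_file", "permission": 0o777}}


def follow(node, store):
    """Follow links until a non-link node is reached."""
    while isinstance(node, dict) and "@link" in node:
        node = store[node["@link"]]
    return node


def resolve_high(root, path, store):
    """High level: links are followed implicitly, link properties are hidden."""
    node = root
    for part in path.strip("/").split("/"):
        node = follow(node, store)
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return follow(node, store)


def resolve_low(root, path, store):
    """Low level: '/' reads the block as stored, '//' traverses a link."""
    node = root
    for part in path.strip("/").split("/"):
        if not isinstance(node, dict):
            return None
        if part == "":
            # An empty segment comes from '//': cross the link explicitly.
            if "@link" not in node:
                return None
            node = store[node["@link"]]
        elif part in node:
            node = node[part]
        else:
            return None
    return node
```

Under these assumptions, `resolve_high(ROOT, "/file/name", STORE)` returns `"Nicola"` and `resolve_high(ROOT, "/file/permission", STORE)` returns `None`, while `resolve_low(ROOT, "/file//name", STORE)` returns `"Nicola"` and `resolve_low(ROOT, "/file/permission", STORE)` returns `0o777`. Trailing slashes are stripped in this sketch, so a path ending in `//` is not handled.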
Reworking examples in the current spec
For simplicity call high-ipld the high level, and low-ipld the low level (the low level is the current IPLD).
Notes on implementation
Two paths options
Maybe it can be done with different prefixes (/ipld1 , /ipld2)
By using a different separator than /, for example ., we can mix the two path schemes. Assume that / will traverse the high level representation and . the lower level. I am not sure how this would work.
Notes about this conversation
At the beginning my perception of IPLD was that pathing would resolve the high level representation, so that if I have a JSON, I could just traverse it with /friends/0/name; however, the current IPLD pathing may not allow that.
Also, blocks and objects are in reference to file system concepts; they are open for better naming.
#important