-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feedback/proposal: separate concerns more and provide a safe wrapper around the car format #416
Comments
In trying to understand the proposal here I'm struggling a bit to understand the consumer that we're building this for. I think the interface you propose above is the direct one used at the moment in Kubo, but that has ended up being pretty unique. In almost all of our other imports/exports of car files, there is a trust boundary and so we end up with a traversal / block-level validation. things like TraverseV1 provide this, and are safer than allowing writing of blocks in the way you propose. Your argument that "they are specialised", is something I'm interpreting as pointing to this use of ipld-prime or of ipld format.
|
you end with
does this mean this decision of work has already been made and prioritized? |
@willscott no, this was unclear, I'll open a pull request we are gonna review it at some point in the future. |
The core of the argument is that the car format is a stream of blocks, I do not belive it is defined anywhere in the spec that a car file must represent a traversal. I don't see any selector field in the header either.
And I am surprised that
There are two sides of it:
1.That just seems like a wrong premise. 2.There are realistical usecases where the consumer is already doing a traversal, having go-car traversing the same data again is then at best very wastefull at worst impossible. For example linux2ipfs:
The add loop is a pipelined paralised mess (which isn't parallised enough yet). That is in no way realistical to be supported in go-car. An other example I can come up with is a tool that download data over graphsync and then outputs a car file. 2.1
I'm not asking for go-car to support all kind of traversal options, I'm just saying that we could just have a simple API, and trust consumers to write the kind of loops, recursive, paralelised loops, ... traversal pattern they want. Note that this feature is already available, it's just using the dangerous |
1
2
I do also want to point out that the blockstore interface over a car is the other viable alternative that might meet you needs: 2.1The ipld-prime traversal no longer has this limitation due to the preload structure introduced somewhat recently to allow for efficient bitswap. |
1.
SGTM 🙂 2.
I understand the importance of incrementally verifiable car files and the push for their use in Filecoin deals. However, it's worth noting that as you most likely already know incremental verification and pipelined incremental construction are inherently incompatible due to the nature of hashes. In my specific use case with linux2ipfs and my own cluster of IPFS nodes, the API is trusted between me and myself and don't require incremental verification. Regarding the complexity for other teams in the ecosystem, have you considered specifications ✨ ? I believe that clearer specifications could help address these concerns. The IPLD specs differ from what has been discussed here:
This text precisely explain that any order or completness or whatever about the content of the car is not an RFC2119 MUST. For instance, there was an issue last year with estuary failing Filecoin deals due to differing directory name sorting methods. Turns out all unixfs implementations up until this point only ever produced lexicographicaly sorted directories, while linux2ipfs was doing a mix of lexicographical and numerical sorting. I think using specifications to enforce technical requirements is much better than artificially limiting APIs based on what a selected group of wizards 🧙 holds as source of truth in their mind. A possible solution could be a new spec that utilizes a new bit in the V2 bitfield to indicate whether a car is ordered, along with a field for the ipld selector to allow for incrementally verifiable partial cars (like produced by lassie) without having to transfer this OOB. 2.1Thx for pointing out the recent update to ipld-prime traversal with the preload structure. I wasn't aware of this improvement. Out of topic notes:Estuary does not incrementally verify car files, it just require to knows the root and do a complete traversal after the fact (I use a dirty trick to make it accept partial DAGs traversal by building a fake DAG that only contains the blocks of the partial DAG in the precise car being uploaded using some dirty codec hacking). I've considered to use an MD4 or MD5 seed protobuf block, then hot patch it with the final fake root in that car using a length extension attack. |
I drafted a long response to this (longer than this one!) but I'm going to ditch that and say a couple of things, focusing on the CAR-creation process and will ignore the rest of the original post because that doesn't seem actionable. As I've said in at least one other discussion - my preference would be to further evolve go-car (and much of the rest of our stack) toward the go-ipld-prime perspective of IPLD. As it stands, much of the functional undergirding of our stack is already there but we have layering of abstractions in order to be backward compatible with ancient APIs that we don't want to go away. Hence the A distilled form in code that I'm currently working on for a test case that has it all in one place is here (this commit may or may not persist, but if not, a later reader will hopefully find it in Similar code can be found in go-car in WRT Although again, you could do it with func writeCar(root cid.Cid, blocks []block.Block, w io.Writer) error {
carWriter, err := storage.NewWritable(w, []cid.Cid{root}, car.WriteAsCarV1(true), car.AllowDuplicatePuts(false))
if err != nil {
return err
}
for _, blk := range blocks {
if err := carWriter.Put(blk.Cid().KeyString(), blk.RawBytes()); err != nil {
return err
}
}
return nil
} Lastly, I'll note that |
Note this mostly exclude indexes from the picture because I havn't used them and havn't needed them so I can't comment well on their API.
The APIs are either too low level and require consumers to have a copy of the car spec to be used or provide a level above the CAR format and requires consumer to provide a bunch of features they might not have.
APIs that are too level to be used without a copy of the car spec:
HeaderSize
WriteHeader
CarHeader
LoadCar
ReadHeader
ReadVersion
Header
The things above are usefull, but I don't think they are enough to claim this librairy can be used to easily decode car files.
It's like trying to use
encoding/json
but you can only usejson.Decoder.Token
.APIs that are too high level and provides features and types that are not needed to interact with the car format:
DefaultWalkFunc
WriteCar
WriteCarWithWalker
Dag
MaxTraversalLinks
TraverseLinksOnlyOnce
SelectiveCar
SelectiveCarPrepared
TraverseToFile
TraverseV1
Thoses are usefull, but they are specialised helper functions, if I am not creating a car file from a random access CID block interface (
github.com/ipfs/go-ipld-format.NodeGetter
) or if I am not using https://pkg.go.dev/github.com/ipfs/go-merkledag or https://pkg.go.dev/github.com/ipld/go-ipld-prime I cannot use thoses.Things I think are good:
CarReader
It is simple, has one job (provide an iterator that read from an
io.Reader
and return you blocks as they are found in the carv1 stream), with a sane safe API, it does not require consumers to understand deep things about the carv1 spec.BlockReader
Same as above
Streaming a carv1 body from a stream of blocks.
This can't be found in neither the v1 or v2 packages.
You have to write this code:
Which is impossible to figure out for any new comer without a deep read and exploration of the car spec or by looking up some code that already do this.
Note that it's also really easy to messup, the
...[]byte
might lead you to think you can do this for example:util.LdWrite(writer, block.Cid().Bytes(), block.RawData(), block2.Cid().Bytes(), block2.RawData())
but no this does not follow the car spec and will be silently incorrectly serialised.I get why this API exists, I can see edge cases where it would be usefull, I don't think it is acceptable as the only way to stream a stream of blocks.
Things I think would make this better:
Provide an
util.Ldwrite
free way to stream a car body.An API like this would be enough:
I would also move the helpers and lower level functions away in different packages. Given the current state creating a new package like
github.com/ipld/go-car/simple
bundling easy safe wrappers around the car format sounds simpler (no need to have a tool rewrite consumers to a new import path).Somewhat out of scope notes:
It is impossible to do anything allocation free, random example about reading blocks:
It would be nice if Blockreaders object had a
Peek() (cid []byte, block []byte, error)
method, the difference is that it usebufio.Reader.Peek
and returns a pointer tobufio.Reader
's internal pointer, this allows to read a block without allocation.Just so you know I'll make thoses changes to
github.com/ipfs/boxo/car
and provide a lighter API (just exposeBlockReader
andBlockWriter
).The text was updated successfully, but these errors were encountered: