estimate file car size #59

Open
wants to merge 5 commits into
base: main

Conversation

willscott
Collaborator

@willscott willscott commented Aug 6, 2023

fix #58

Having put this together, I wonder whether we really want this as part of this package / core interface. It's very finicky around:

  • Chunking
  • Cid structure
  • Car block sizing

Also note that protobuf encoding, because of its use of unaligned varints, is very difficult to size accurately without actually doing the work. That said, the number of intermediate nodes isn't too bad in any realistic case, so that's probably an okay tradeoff here.

It might be better to pull this into a client implementation rather than keep it here?
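
To illustrate the varint point above, here is a minimal sketch of how the size of a length-delimited protobuf field moves with its payload length. It uses only the standard library; the helper names `varintSize` and `pbBytesFieldSize` are hypothetical and not part of this PR, and the one-byte tag assumes a field number below 16.

```go
package main

import (
	"fmt"
	"math/bits"
)

// varintSize returns the number of bytes needed to encode v as an unsigned
// LEB128 varint, the encoding used for protobuf lengths.
func varintSize(v uint64) int {
	if v == 0 {
		return 1
	}
	// Each varint byte carries 7 bits of payload.
	return (bits.Len64(v) + 6) / 7
}

// pbBytesFieldSize returns the encoded size of a length-delimited protobuf
// field: one tag byte (assumes field number < 16), a varint length, and the
// payload itself.
func pbBytesFieldSize(payloadLen int) int {
	return 1 + varintSize(uint64(payloadLen)) + payloadLen
}

func main() {
	// Crossing a 7-bit boundary adds a byte of length prefix, so sizes do not
	// grow in lockstep with the payload.
	for _, n := range []int{127, 128, 16383, 16384} {
		fmt.Printf("payload %5d -> field %5d bytes\n", n, pbBytesFieldSize(n))
	}
}
```

Because every nested length is itself a varint, the size of a parent node depends on the exact sizes of everything inside it, which is why an estimate tends to end up redoing most of the encoding work.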

@willscott willscott requested review from masih and rvagg August 6, 2023 11:44
@rvagg
Member

rvagg commented Aug 7, 2023

My initial response: in my mental hierarchy this doesn't belong here, but it could belong in go-car if it were made more configurable. go-car is a consumer of go-unixfsnode, which sits lower in the stack; it already has precedent for unixfs-specific code, and I think it'd be OK to extend that. Making go-unixfsnode carry CAR-specific code seems like mixing up the stack hierarchy a little too much.

I'd like to move https://github.com/filecoin-project/lassie/tree/main/pkg/verifiedcar into go-car as a separate utility package. This could be similar, doing a very specific job.

Alternatively, a new repo in either the ipld or ipfs org would be fine.

Regarding pb sizing: protowire has some decent protobuf sizing utilities. Do they help here? https://github.com/ipld/go-codec-dagpb/blob/master/marshal.go#L129-L138

Regarding interface and making it more generic, how about something like what we have in https://github.com/filecoin-project/lassie/tree/main/pkg/verifiedcar where you make the config and use (and reuse) it to estimate sizes:

unixfscar.SizeEstimateConfig{
  BlockSize: 1024 * 128, // defaults to chunk.DefaultBlockSize if 0
  LeafLinkPrototype: ...,
  FileLinkPrototype: ...,
}.Estimate(dataSize)

The default (unixfscar.SizeEstimateConfig{}.Estimate(dataSize)) will do basically what you have here, but the config can be extended as we come up with more configurability concerns. You could even configure one of these with properties to do a full directory walk, and then walk through a directory with it:

cfg := unixfscar.SizeEstimateConfig{}
cfg.StartDir("foo")
cfg.AddFile("file1", 1000)
cfg.AddFile("file2", 2000)
cfg.StartDir("bar")
cfg.AddFile("a", 10)
cfg.EndDir("bar")
cfg.EndDir("foo")

... or just maybe cfg.EstimateDir(dir) and make it do the walk itself.
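
To make the reusable-config idea a bit more concrete, here is a rough sketch of what the leaf-block part of such an estimator could look like. Everything in it is hypothetical: `SizeEstimateConfig`, its `CIDLen` field, and `EstimateLeaves` are illustrative names, the 256 KiB default is assumed to mirror chunk.DefaultBlockSize, the 36-byte CID assumes CIDv1 with sha2-256, and leaves are assumed to be raw blocks. It does not account for intermediate dag-pb nodes or the CAR header.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	defaultBlockSize = 256 * 1024 // assumption: mirrors chunk.DefaultBlockSize
	defaultCIDLen    = 36         // assumption: CIDv1 with a sha2-256 multihash
)

// SizeEstimateConfig is a hypothetical reusable estimator configuration.
type SizeEstimateConfig struct {
	BlockSize int // chunk size in bytes; 0 means defaultBlockSize
	CIDLen    int // encoded CID length in bytes; 0 means defaultCIDLen
}

// varintSize returns the number of bytes in the unsigned varint encoding of v.
func varintSize(v uint64) int {
	if v == 0 {
		return 1
	}
	return (bits.Len64(v) + 6) / 7
}

// section returns the size of one CARv1 block section:
// varint(len(cid)+len(data)) followed by the CID and the data.
func (c SizeEstimateConfig) section(dataLen int64) int64 {
	cidLen := int64(c.CIDLen)
	if cidLen == 0 {
		cidLen = defaultCIDLen
	}
	body := cidLen + dataLen
	return int64(varintSize(uint64(body))) + body
}

// EstimateLeaves estimates the CAR bytes taken by the raw leaf blocks of a
// file of dataSize bytes. Assumption: an empty file still yields one empty
// block.
func (c SizeEstimateConfig) EstimateLeaves(dataSize int64) int64 {
	bs := int64(c.BlockSize)
	if bs <= 0 {
		bs = defaultBlockSize
	}
	if dataSize <= 0 {
		return c.section(0)
	}
	total := (dataSize / bs) * c.section(bs)
	if rem := dataSize % bs; rem > 0 {
		total += c.section(rem)
	}
	return total
}

func main() {
	fmt.Println(SizeEstimateConfig{}.EstimateLeaves(1 << 20))
	fmt.Println(SizeEstimateConfig{BlockSize: 1024}.EstimateLeaves(1000))
}
```

Using value receivers keeps the zero-value form `SizeEstimateConfig{}.EstimateLeaves(n)` usable, which matches the "default config does the obvious thing" shape suggested above.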

@rvagg
Member

rvagg commented Aug 7, 2023

Oh, going back and reviewing #58, the cfg API approach could help with piecing together a full CAR from multiple DAGs:

cfg := unixfscar.SizeEstimateConfig{}
totalSize := 0
totalSize += cfg.EstimateFile(123456)
totalSize += cfg.EstimateFile(223344)
totalSize += cfg.EstimateFile(100)
totalSize += cfg.EstimateV1Header(1) // single root?
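
For the header term, a sketch along these lines might work. It assumes the CARv1 layout of a varint length prefix followed by a DAG-CBOR map `{"roots": [...], "version": 1}`, with each root encoded as CBOR tag 42 over a byte string that starts with a 0x00 prefix. The function names `v1HeaderSize` and `cborHeadSize` are hypothetical, and callers pass the encoded length of each root CID (36 for CIDv1 with sha2-256).

```go
package main

import "fmt"

// varintSize returns the number of bytes in the unsigned varint encoding of v.
func varintSize(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// cborHeadSize returns the size of a CBOR item head whose argument is n
// (used here for array lengths and byte-string lengths).
func cborHeadSize(n uint64) int {
	switch {
	case n < 24:
		return 1
	case n <= 0xff:
		return 2
	case n <= 0xffff:
		return 3
	case n <= 0xffffffff:
		return 5
	default:
		return 9
	}
}

// v1HeaderSize estimates the size of a CARv1 header given the encoded length
// of each root CID.
func v1HeaderSize(rootCIDLens ...int) int {
	body := 1                                       // map head with two entries
	body += 1 + len("roots")                        // text-string key
	body += cborHeadSize(uint64(len(rootCIDLens))) // array head
	for _, l := range rootCIDLens {
		// tag 42 (2 bytes), byte-string head, 0x00 prefix, CID bytes
		body += 2 + cborHeadSize(uint64(l+1)) + 1 + l
	}
	body += 1 + len("version") // text-string key
	body += 1                  // the integer 1
	return varintSize(uint64(body)) + body
}

func main() {
	fmt.Println(v1HeaderSize(36)) // single CIDv1 sha2-256 root: 59
}
```

Taking the CID lengths rather than a root count keeps the estimate correct when roots use different hash functions or CID versions.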

Member

@masih masih left a comment

Thanks Will 🎉

Left questions and minor suggestions, overall LGTM.
In terms of where the functionality would live, I have no preference.

@@ -57,6 +58,91 @@ func BuildUnixFSFile(r io.Reader, chunker string, ls *ipld.LinkSystem) (ipld.Lin
}
}

// EstimateUnixFSFile estimates the byte size of the car file that would be
Member

I am curious why the word "estimate"? I read that as: the actual size may differ. Is that right? If so, would it make sense to update the godoc to elaborate on the discrepancy?

If not, I recommend avoiding this terminology.

Collaborator Author

I think the only real discrepancy in my head at this point is that, depending on your actual data, the resulting CAR may have de-duplicated blocks and so may be smaller than estimated.

data/builder/file.go (outdated)
}
links = nxtLnks
}
fmt.Printf("estimated %d intermediate nodes\n", icnt)
Member

use logger?

ls.StorageWriteOpener = storage.OpenWrite

icnt := 0
for len(links) > 1 {
Member

@masih masih Aug 7, 2023

Would it make sense to update the godoc of this function with the cost complexity of estimating size vs writing the CAR out into a temporary file and getting the size that way?

Collaborator Author

I think the extent of what we can really say is that the size calculation doesn't need memory, but will do the CPU work for generation of intermediate blocks.
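
For a sense of how much intermediate work that is, here is a small sketch that counts the intermediate nodes of a balanced layout from the leaf count alone, reducing level by level until a single root remains. The helper name `intermediateNodes` is hypothetical, and the fanout of 174 links per node is an assumption about the default rather than something taken from this PR.

```go
package main

import "fmt"

// intermediateNodes returns how many non-leaf nodes a balanced tree needs to
// join the given number of leaves under a single root, with at most fanout
// links per node. A single leaf is its own root and needs none.
func intermediateNodes(leaves, fanout int) int {
	if fanout < 2 {
		return 0 // guard: a fanout below 2 can never converge to one root
	}
	count := 0
	for n := leaves; n > 1; {
		n = (n + fanout - 1) / fanout // nodes needed at the next level up
		count += n
	}
	return count
}

func main() {
	const fanout = 174 // assumption: default links per intermediate node
	for _, leaves := range []int{1, 174, 175, 100000} {
		fmt.Printf("%6d leaves -> %d intermediate nodes\n",
			leaves, intermediateNodes(leaves, fanout))
	}
}
```

The count grows roughly as leaves/fanout, so even very large files need comparatively few intermediate blocks to be generated.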

return bwc(lnk)
}, err
}
rt, _, err := builder.BuildUnixFSFile(bytes.NewReader(b), "", &ls)
Member

Compare total size returned with estimated size?

@willscott
Collaborator Author

@masih How would you feel about adapting this method directly into the client code you're working on? Do you think it's preferable to find a place for it in go-car or here instead?

Co-authored-by: Masih H. Derkani <[email protected]>
@masih
Member

masih commented Aug 7, 2023

> how would you feel about adapting this method directly into the client code you're working on?

This seems like a path of least resistance. 👍

In the long run I agree with Rod, in that we probably want to either adopt it in go-car, or set up a separate repo/module to offer unixfs CAR serialisation capabilities.

Successfully merging this pull request may close these issues.

Add function to determine the size of final unixfs DAG given raw bytes
3 participants