"Big formats" model #165
To make this talk more substantial, here's my initial proposal and proof-of-concept: https://github.com/kaitai-io/ksy_java_bytecode Points I'd like to discuss and make "standard":

---

I agree with the need to either bundle sample files, or have an automated test mechanism which can download sample files from external websites. For testing a .ksy specification against sample files, one approach could be to convert an input binary file into an XML document, YAML document or similar, which can then be compared with a common diffing tool.

---
Yeah, that should be a more or less suitable approach; the Web IDE can already dump the whole structure recursively into a JSON file, and there are a few things about that worth discussing. In fact, building a recursive dumper in any language that has basic reflection capabilities should be pretty quick and easy.
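For illustration, a minimal sketch of such a dumper over objects from Kaitai Struct's Python runtime (a sketch only: it covers eagerly parsed fields, skips lazy instances, and the underscore-filtering rule is an assumption):

```python
# Sketch: recursively convert a parsed KaitaiStruct object into plain
# dicts/lists, suitable for serialization to JSON/YAML/etc.
import json

from kaitaistruct import KaitaiStruct

def dump(obj):
    if isinstance(obj, KaitaiStruct):
        # skip internals such as _io, _parent, _root, _raw_*
        return {k: dump(v) for k, v in vars(obj).items() if not k.startswith("_")}
    if isinstance(obj, (list, tuple)):
        return [dump(v) for v in obj]
    if isinstance(obj, bytes):
        return obj.hex()  # render raw byte blobs as hex text
    return obj

# usage: print(json.dumps(dump(parsed_obj), indent=2, sort_keys=True))
```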

---

I think a bit differently. Maybe:

The whole folder should have ONLY the set of .ksy files specific to the format, plus an optional markdown doc. Format-unspecific files must go into the main kaitai_struct_formats repo. We could also have a folder for tests there, but I'm not sure we need it. Most Kaitai-generated parsers need some postprocessing, so I guess the tests should go into the libraries' repos.
What is $TITLE? And the most important part: postprocessing libraries. To build a postprocessing library, the library's repo has the main ks_formats repo as a submodule. It updates submodules, and that way it gets all the repos connected to it, including the one used by the library. This has a drawback: you have to download the whole repo and all its subrepos, which can consume a lot of space and time. The better solution is to download only the .ksy files needed. Fortunately, that can be done within this approach with the help of a simple Python (or any other) script doing graph traversal and HTTP queries; see the sketch below. The optimal way to fetch the repo is chosen by the postprocessing lib dev by passing a parameter to the fetching-and-compiling script. This script should be integrated with the language's package manager; for example, I have started development of a setuptools plugin for Python.
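For illustration, a rough sketch of such a fetch-only-what-is-needed script (the raw URL base, the flat output layout, and the example spec path are assumptions; real import paths may need smarter resolution):

```python
# Sketch: download a .ksy spec and, transitively, everything it imports,
# by reading each spec's meta/imports list.
import urllib.request

import yaml  # pip install pyyaml

BASE_URL = "https://raw.githubusercontent.com/kaitai-io/kaitai_struct_formats/master"

def fetch_with_imports(name, seen=None):
    """Fetch `name`.ksy plus its whole import graph; return the fetched names."""
    seen = set() if seen is None else seen
    if name in seen:
        return seen
    seen.add(name)
    with urllib.request.urlopen(f"{BASE_URL}/{name}.ksy") as resp:
        text = resp.read().decode("utf-8")
    with open(f"{name.rsplit('/', 1)[-1]}.ksy", "w", encoding="utf-8") as out:
        out.write(text)
    spec = yaml.safe_load(text)
    for imp in spec.get("meta", {}).get("imports", []):
        fetch_with_imports(imp, seen)
    return seen

# e.g. fetch_with_imports("image/gif")  # spec path is illustrative
```

---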
Hmm, I think I like it — clear and concise.
Sorry, I don't quite understand. Probably we're talking about the same thing: the main .ksy should be placed in the root directory of a repository, and additional .ksy files should be placed there too (if needed).
What exactly are "unspecific files"?
We do, that's the idea.
Postprocessing? Generally, most of the .ksy files available in the kaitai_struct_formats repo are ready to be used as is.
I'm not sure that's a good idea. Generally, end users are not that interested in all the gory details of .ksy development; they just need a finished product. So, for now I'm proposing to just copy the "finished product", i.e. the .ksy file, to the KSF repo from time to time.
All formats can universally assume that; it's not a matter of a "standalone repo" vs. just some random .ksy file at a random path.

---
I suggest having no main ksy. Which one is "main" is a decision made by the library developer. The layout of the repo is whatever its owner decides: it can contain all the files in a single folder, or it can contain a hierarchy.
The files which are likely to be useful for formats other than the ones in this repo. Building blocks. The examples are now in
Yes, they are, but in most cases it is useful to have a library doing the postprocessing. For example, in most cases we want spectrum data to be loaded as a numpy array, not a plain Python list. So we need some code checking for the presence of numpy and using it if it is available (see the sketch at the end of this comment). In
The repo's master branch should, by convention, be a finished product. Bleeding-edge, but finished. For unfinished work we can have separate branches.
It is the convention.
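As an aside on the numpy point above, a minimal sketch of such an optional-dependency check (the `samples` field name is hypothetical):

```python
# Sketch: use numpy for spectrum data when available, fall back to a plain
# Python list otherwise. The `samples` attribute is a hypothetical field.
try:
    import numpy as np
except ImportError:
    np = None

def spectrum_array(parsed):
    data = parsed.samples
    return np.asarray(data) if np is not None else data
```

---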
Yes, I have thought about this (but with BSON instead of a text-based format, since we are dealing with binaries and since it allows some extensions).

---
How will this model work with https://ide.kaitai.io ? If someone wants to enhance or use an existing file format specification, will the

---
That part of the WebIDE will be completely rewritten. I don't know yet when, but on the devel branch there is already an implementation where you can directly load ksys from your custom GitHub repo, and you will be able to commit changes. We will solve the import question too; probably there will be a search order of data sources, and the first one where the file is found will be used.

---
I've hacked up a quick recursive dumper that can dump to YAML, JSON and XML (more or less standard pretty-printer calls of the respective standard libraries). It is available now as ksdump, inside the ksv visualizer repo (it reuses some of the visualizer code to compile specs and load data). Some results for you to compare:

---
A few things I want to note:

* This dumper fails badly with a stack overflow on infinitely recursive structures (for example, iso9660.ksy)
* File sizes:
  * 16677 — wheel.yaml
  * 27075 — wheel.json (~1.6x YAML)
  * 46994 — wheel.xml (~2.8x YAML)
* YAML wraps long lines with large raw hex dumps; XML and JSON don't do that
* XML is very verbose and noisy (i.e. there is no way to specify a data type except by adding an extra "type"-like attribute), and it also mangles our `lower_underscore_case` into `lower-minus-case`
* JSON has a diffing problem: adding another attribute would likely replace a trailing `}` with `},`, thus making an extra diff line; but, arguably, it is the most legible of these three choices
* Default libyaml output is somewhat ugly :(
* There is no simple way to control the order of output, besides digging deep into the YAML/XML/JSON library and making our own low-level serialization routines. My current attempt is supposed to be alphabetical (to keep it stable), but even that is actually not guaranteed :(

---

On May 4, 2017, 19:33:18 GMT+03:00, Mikhail Yakshin <[email protected]> wrote:

> There is no simple way to control the order of output

Arrays are guaranteed to be ordered.

BTW, how about BSON?

---
Of course.
BSON would probably be worse than these 3: you're more or less unable to view BSON without special tools, there are virtually no BSON diffing utilities (and even if you came up with one, GitHub wouldn't know a thing about diffing BSONs), and it's even less widespread and understood than XML/YAML/JSON, etc.

---
After playing with YAML dumping for some time, I've realized that YAML is not immune to diff problems either. Say, something like this:

```yaml
- bar: 1
  foo: 2
```

will get transformed into

```yaml
- aaa: 0
  bar: 1
  foo: 2
```

if we're adding a new `aaa` key: the `- ` list marker moves to the new first line, so the unchanged `bar` entry still shows up as a modified line in the diff.

---
Aka snapshot testing. See also gron ("make JSON greppable"), which also makes it diffable.
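For a rough idea, a snapshot check could look like this (a sketch reusing the `dump()` helper from the earlier sketch; file paths are assumptions):

```python
# Sketch: compare a freshly dumped structure against a stored JSON snapshot.
# Relies on the dump() helper sketched earlier; paths are illustrative.
import json
from pathlib import Path

def check_snapshot(parsed, snapshot_path="tests/snapshots/expected.json"):
    actual = json.dumps(dump(parsed), indent=2, sort_keys=True)
    snapshot = Path(snapshot_path)
    if not snapshot.exists():
        snapshot.parent.mkdir(parents=True, exist_ok=True)
        snapshot.write_text(actual)  # first run: record the snapshot
        return True
    return actual == snapshot.read_text()
```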
Add a manifest file, something like:

```json
{
  "type": "kaitai_struct_format",
  "name": "java_bytecode",
  "version": "0.1",
  "main": "src/java_bytecode.ksy",
  "import_paths": [
    "https://github.com/kaitai-io/kaitai_struct_formats"
  ],
  "scripts": {
    "test": "./tests/run.sh"
  }
}
```

... or keep ksy files in https://github.com/kaitai-io/kaitai_struct_formats and only out-source test files, to keep the main repo small:

```yaml
meta:
  id: some_format
  tests: https://github.com/kaitai-io/kaitai_struct_formats_some_format_tests
```

---
I currently do this. IMHO it is most convenient to just fork KSF and develop formats in it, and then regularly rebase over
There exists https://github.com/kaitai-io/kaitai_struct_samples , but yes, if we put all the files there, it will soon get pretty big. So if we centralize samples in a single repo, I guess Git LFS is a necessity. But Git LFS usage on repos is limited. Though I guess on HuggingFace.co Git LFS is a "free" feature, I'm not sure whether using that website for anything other than machine learning models counts as abuse.
I guess not JSON, but YAML, since we already use YAML.
Nope, each
Surely needed. My

```json
"scripts": {
  "test": "./tests/run.sh"
}
```

I guess, no. Maybe

```yaml
samples: "dir/with/tests/relative/to/repo/root"
```

Samples should contain subdirs with names matching specs, or

```yaml
samples:
  repo: "https://github.com/kaitai-io/kaitai_struct_samples"
  refspec: "master"
  path: "dir/with/tests/relative/to/repo/root"
```

if samples live in a foreign repo.

---
My idea was that some binary files can be generated, for example: a sqlite database, an ext2 filesystem, a png image (a sketch follows). This is also useful for fuzz testing, or for generating extremely large files (for testing limits and performance). The risk: malicious code hidden in a complex test script. So yeah, it's easier to have a test corpus of static files and expected results.
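For instance, a minimal sketch of generating one such input instead of committing a binary (schema, contents and file name are arbitrary examples):

```python
# Sketch: generate a tiny SQLite database as a test input, rather than
# committing a binary sample to the repo.
import sqlite3

def generate_sqlite_sample(path="minimal.sqlite"):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO t (name) VALUES ('hello')")
    conn.commit()
    conn.close()
```

---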
Good idea, really useful for fuzzing. But I guess in this case the script should not be called by the test runner... instead, the test runner should be called by the script. Kinda:

```sh
# paths in the metadata point to files that don't exist yet
./generateTestFiles ./ks-repo.yaml
# the test files have been generated at the needed paths,
# so the paths in the metadata now point to existing files
kaitai-test-runner ./ks-repo.yaml --junitxml=result.xml
```

For most of the big testing files, I guess it should be possible to use sparse files: files taking less space on disk than they appear to have, with the holes filled with zeros. Some legacy file systems don't support them, but NTFS (perhaps minus some old Windows versions, though I cannot exclude that they implemented sparse files the same way symlink support was implemented), APFS, ext4, btrfs and zfs do. Git supports them and automatically sparsifies files that were not explicitly sparsified.
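A quick sketch of how such a sparse file can be created (the size and file name are arbitrary; on filesystems without hole support the file will occupy its full apparent size):

```python
# Sketch: create a ~1 GB sparse file. Seeking past the end and writing a
# single byte leaves a "hole"; on supporting filesystems it takes almost
# no disk space while appearing to be 10**9 bytes long.
with open("huge_sample.bin", "wb") as f:
    f.seek(10**9 - 1)
    f.write(b"\0")
```

---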
My idea (maybe naive and impractical, but whatever) was that in 99% of formats, you don't need huge binaries to test all the structures defined in a .ksy spec. In most cases, a big file doesn't bring any benefits in terms of "code coverage" (or format structure coverage) over a small one. Usually, it's only big because either:

That's in theory. In practice, unfortunately, trying to get these "small" high-quality files may be difficult, if not impossible. Many samples we can take from the internet will not uphold these criteria, and generating/crafting our own files requires a lot of knowledge and time. So I understand the simplicity of adding whatever file we can find, regardless of how big it is, into the collection; but yeah, the Git repo may become unusable over time after adding a bunch of larger files.

---
The idea was that Git LFS stores hashes within the repo, while the files themselves are stored separately. It allows the git repo itself to stay small, with files stored and fetched separately. But thinking more about this, sparse checkout should also do the trick. However, sparse checkout and partial clones have terrible UX, making them unusable anywhere the commands have to be typed manually.

---
It looks like it - and quite severely, at least on GitHub. From https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage I understand that if you have a 10 MiB file as a Git LFS object and 103 downloads of this file within a month, you've run out of the monthly free bandwidth (103 × 10 MiB = 1030 MiB, just over the 1 GiB free quota) and Git LFS stops working until the next month.

---
So using Git LFS on GitHub Free doesn't seem to bring any advantages - it's just more limited. Normally, you can have files up to 100 MiB:
and it's strongly recommended to keep the repository below 5 GiB:
Although Git LFS technically gives you the ability to store files larger than 100 MiB, in practice you can't do this anyway, because one such file means that the repository can only be cloned about 10 times per month.

---
We have been discussing this from time to time for probably half a year already, so I guess it's time to get this rolling.

The idea is simple: some formats (for example, Adobe Photoshop .psd, Java .class, Corel Draw .cdr, Adobe Flash .swf, Microsoft Office formats, etc.) are pretty complex, and developing .ksy files for them takes a lot of time and effort. It is pretty uncomfortable to develop such a format as a single file in a fork of our standard kaitai_struct_formats repo. It's a much better idea to have a one-repo-per-"big"-format development model.

This would potentially allow having a distinct space to store:

Having distinct repositories also helps a lot with collaboration, as you can just give out write access to anyone you choose, skipping the longer pull request procedure of the main KSF repo.

Thus, I propose to discuss the overall "recommended" layout of such a repository: what it should and should not have, how to name these repos, etc.
Cc @davidhicks @koczkatamas @LogicAndTrick @KOLANICH