Optimization: Read only first 128B from the file when searching for shebang #1040
Conversation
Thank you for your PR! Do you have a benchmark or similar that shows that this is faster than the current implementation?
I cannot use my desktop until I get back home tomorrow, but I'll try to logically break down the detection cost on best/worst cases.

**I/O**

**Searching**

*Note: Current implementation may be slightly faster in this case since*

*Note²: Also not really, because the reduced I/O and allocation cost is much more significant.*

**Heap Allocation**

**Summary**

Both reading and searching are capped at 128 B after this patch, while the current implementation might process the entire file: in the worst scenario, tens of gigabytes from a binary file if someone ran tokei on an LLM project. I don't know if there is binary detection logic in the directory iteration part of tokei to prune that kind of madness, but as someone who plans to use tokei mainly as a filetype detection library, safety features like this on

Hope that sums up the advantage of this PR.
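The worst case described above is easy to reproduce in isolation. The following is a standalone sketch (not tokei's actual code) showing that `BufRead::read_line()` keeps reading until it finds a newline or hits EOF, so a newline-free input is pulled into the `String` in full:

```rust
use std::io::{BufRead, BufReader, Cursor};

fn main() {
    // 1 MiB of data with no newline anywhere, standing in for a binary blob.
    let data = vec![b'A'; 1 << 20];
    let mut reader = BufReader::new(Cursor::new(data));
    let mut line = String::new();
    // read_line() only stops at a newline or EOF, so it reads everything.
    let n = reader.read_line(&mut line).unwrap();
    // The entire 1 MiB was read and heap-allocated just to find "no line".
    assert_eq!(n, 1 << 20);
    println!("read {} bytes in a single read_line() call", n);
}
```

Scale the input up to a multi-gigabyte file and the allocation alone becomes the dominant cost of shebang detection.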
It was a misinput!
Little off-topic question: why does
I couldn't say why; I haven't focused on tokei in a few years.
**Benchmarks**

Still not convinced?

**Test Data**

```shell
# ./chungus: 4 GiB of zeros
$ dd if=/dev/zero of=./chungus bs=4M count=1024
1024+0 records in
1024+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.50768 s, 780 MB/s

# ./script: tiny script with shebang
$ echo -e "#!/usr/bin/env python3\nprint('Hello World!')" > ./script
```
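As a quick spot-check (hypothetical, not part of the PR's benchmark) that a 128 B window is enough to capture the whole shebang of `./script`:

```shell
# Hypothetical spot-check, not from the PR: the first 128 bytes of the
# tiny script already contain the complete shebang line.
printf '#!/usr/bin/env python3\nprint("Hello World!")\n' > ./script
head -c 128 ./script | head -n 1
# prints: #!/usr/bin/env python3
```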
I started using tokei as a library in my recent projects, and hopefully I can find some time to improve the codebase and deal with unresolved issues over the next few months :)
How many CI jobs do you have?
Haha, each target * 3 (stable, beta, nightly). Tokei is now a pretty old project in the Rust space, and has been/is used to help catch regressions in the compiler. I commented out netbsd since there's an issue with that target; if you rebase, it should pass.
Rebased and force-pushed.
Thank you for your PR, and congrats on your first contribution! 🎉
Currently, `LanguageType::from_shebang()` uses `BufReader` and `BufRead::read_line()` to fetch the first line of the file.

- `BufReader::new()` allocates an 8 KiB buffer by default
- `BufRead::read_line()` can end up reading the entire file

A typical shebang line ranges from 10 to 30 characters (e.g. `#!/usr/bin/env python3` is only 22 B), and it is very unlikely that the file contains valid shebang syntax if we don't find a newline character within the first few dozen bytes.

This PR modifies the code so that it only reads the first 128 B of the file into a fixed-size `u8` array, without involving the unnecessary `BufReader` and `String`.
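The technique can be sketched as follows. This is an illustrative standalone version, not the PR's exact code, and `read_shebang` is a hypothetical name rather than tokei's API:

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// Illustrative sketch (read_shebang is a hypothetical name, not tokei's
// API): read at most 128 bytes into a fixed stack array and look for a
// complete "#!..." line inside that window.
fn read_shebang(path: &Path) -> Option<String> {
    let mut buf = [0u8; 128];
    let mut file = File::open(path).ok()?;
    // A single read() may return fewer than 128 bytes; that is fine here,
    // since any plausible shebang line fits well within one read.
    let n = file.read(&mut buf).ok()?;
    let window = &buf[..n];
    if !window.starts_with(b"#!") {
        return None;
    }
    // Require the newline inside the window: if the first "line" is longer
    // than 128 bytes, it is almost certainly not a real shebang.
    let end = window.iter().position(|&b| b == b'\n')?;
    std::str::from_utf8(&window[..end])
        .ok()
        .map(|s| s.trim_end().to_owned())
}

fn main() {
    let path = std::env::temp_dir().join("shebang_demo.py");
    std::fs::write(&path, "#!/usr/bin/env python3\nprint('hi')\n").unwrap();
    assert_eq!(
        read_shebang(&path).as_deref(),
        Some("#!/usr/bin/env python3")
    );
    std::fs::remove_file(&path).unwrap();
    println!("shebang detected within the first 128 bytes");
}
```

Both I/O and searching are bounded by the 128 B window regardless of file size, and the only heap allocation is the small `String` returned on a successful match.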