Add option to cat that accepts the name of a file that contains a list of input CSV data file names. #1293
Comments
This is a good feature to have. Added to the backlog.
Thinking more about this, something like It should also be implemented for
If not too much effort, and for commands like
I'm afraid changing The large How many files were you benchmarking BTW? Was it thousands of files? Also, Anyway, I'll consider streamlining it while implementing
If not file name argument processing, might the slower CSV file concatenation be due to slower file input-output handling? Excessive CSV data validation?
I concatenated thousands of files. Project benchmark_cat_csv compares the CSV file concatenation performance of my naive custom CSV concatenation shell script
Note that all of the shell scripts with the "_2" suffix invoke a sub-shell in order to handicap them in a way similar to shell scripts
I was oversimplifying the implementation of my custom shell script. It actually uses Bash
By the way, here are the versions of the various tools that I used in the performance tests.
Thanks for pulling together the benchmark_cat_csv project and including qsv in it. I just tweaked I'll still work on the
Just added
Unfortunately, the
Interesting. There should be a noticeable performance bump... Can you try running it with:
    export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
    cargo build --release --locked -F lite
I specified If you're so inclined @derekmahar, you can compile qsv with a samply profile and see where the bottlenecks are using samply:
    export CARGO_BUILD_RUSTFLAGS='-C target-cpu=native'
    cargo build --profile release-samply --locked -F lite
How much difference should
Depends on the platform. But you can find out what additional CPU features are enabled by going here: https://github.com/jqnatividad/qsv/blob/master/docs/PERFORMANCE.md#cpu-optimization
… `--flexible` option is enabled in connection with #1293
I ran the original test on an AMD Ryzen 9 6900HX at 3.293GHz, NVMe drive.
I'm not certain, but I think I can't use samply because I'm running qsv and my benchmarks only in command line shells in Windows Subsystem for Linux 2 and on remote Linux servers on my home network. I may be mistaken, but I think that a process that runs in WSL 2 can't launch a local browser on its Windows host. If WSL 2 can launch a browser in the host, I don't know how to do it.
"infile-list" files are qsv's flavor of the "infile-list" support of csvtk, as per #1293. In our implementation, a file with the ".infile-list" extension passed to commands that support it (currently, `sqlp` and `to`) is read as a list of input files to use for the command. Will add ".infile-list" support to the `cat` and `headers` commands as well.
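As a rough illustration of the idea described above, the following shell sketch expands any argument ending in ".infile-list" into the file names it lists. It is not qsv's implementation: the function name `expand_infile_list` is hypothetical, and treating the list as one path per line with blank lines skipped is an assumption made for this example.

```shell
#!/bin/sh
# Hypothetical helper, NOT qsv code: prints one input path per line.
# Arguments ending in ".infile-list" are replaced by the paths they contain;
# all other arguments are passed through unchanged.
expand_infile_list() {
  for arg in "$@"; do
    case $arg in
      *.infile-list)
        # Read one path per line; the "|| [ -n ... ]" part also picks up
        # a final line that has no trailing newline.
        while IFS= read -r path || [ -n "$path" ]; do
          # Assumption for this sketch: blank lines are skipped.
          if [ -n "$path" ]; then
            printf '%s\n' "$path"
          fi
        done < "$arg"
        ;;
      *)
        printf '%s\n' "$arg"
        ;;
    esac
  done
}
```

Calling `expand_infile_list extra.csv inputs.infile-list` would print `extra.csv` followed by each path listed in `inputs.infile-list`, so the command line itself stays short no matter how many files the list names.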
Thank you for implementing this feature!
By the way,
Thanks @derekmahar for compiling these benchmarks! The docopt parser is super-convenient, which is why I chose to stay with it (see #463 for more details), and it's good to have a baseline to keep improving its performance. It's also good to know that qsv is faster than csvtk and just a tad slower than mlr. I may still be able to squeeze some more performance from cat. Do you mind sharing your benchmark so I can use it for tuning?
qsv's poor performance (aside from the input file list argument) in this benchmark may be an example of docopt/docopt.rs#207 to which you referred in #463.
It's encouraging to know that the sluggishness of
Hi @derekmahar, just wanted to give you a heads-up that I tweaked qsv-docopt a bit. Hopefully, it'll perform better on your command line parsing benchmarks...
Your tweaks to qsv-docopt improved qsv's performance by about 5%:
In order to bypass the command line length limit of command shells, please allow `cat` to accept an option that specifies the name of a file that contains a list of input CSV data file names, one per line. This option would be similar to the `--infile-list string` option that `csvtk concat` accepts. (See the output of `csvtk concat --help` for the most current syntax of `csvtk concat`.)
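To make the request concrete, here is a minimal shell sketch of concatenating CSV files whose names come from a list file rather than from the command line. It does not call qsv or csvtk, and it is not the benchmark script discussed in the comments. The function name `concat_csv_from_list` is hypothetical, and the sketch assumes every input has the same single-line header and ends with a newline.

```shell
#!/bin/sh
# Hypothetical sketch: concatenate the CSV files named in a list file
# (one path per line), keeping the header row of the first file only.
# Assumes identical single-line headers and newline-terminated inputs.
concat_csv_from_list() {
  first=1
  while IFS= read -r path || [ -n "$path" ]; do
    # Skip blank lines in the list file.
    [ -n "$path" ] || continue
    if [ "$first" -eq 1 ]; then
      # First file: copy it whole, header included.
      cat -- "$path"
      first=0
    else
      # Later files: start at line 2 to drop the repeated header.
      tail -n +2 -- "$path"
    fi
  done < "$1"
}
```

For example, `concat_csv_from_list files.txt > combined.csv` reads the paths from `files.txt`, so the number of input files is not bounded by the shell's argument length limit.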