Takes a stream of JSON lines and executes a bash command for each one, with the top-level JSON values substituted in as environment variables.
This is just a convenience for the typical use-case of looping over records in some data-set to perform a side-effect, such as a network request per record.
Given a file of JSON lines:
{"user": "123", "document": 456}
{"user": "124", "document": 457}
Execute an arbitrary shell command with parallelism and retries:
cat records.json | stream-exec --exec 'curl -X POST http://example.com/$document -d "{\"user\": \"$user\" }"' --concurrency 10 --continue
In this example the JSON keys 'user' and 'document' are set as environment variables and are accessible to the subprocess.
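Concretely, for the first record the subprocess runs with user=123 and document=456 in its environment. The following sketch simulates that (using echo in place of the real curl call; stream-exec itself is not invoked):

```shell
# Simulate the environment stream-exec gives the subprocess for the
# first record, {"user": "123", "document": 456}:
user=123 document=456 sh -c 'echo "POST /$document with user $user"'
# prints: POST /456 with user 123
```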
JSON string and number values are assigned to their keys directly. JSON objects and arrays are stringified back to JSON and assigned to their key as a single string.
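For example, given a hypothetical record like {"user": "123", "labels": ["a", "b"]} (not from the example file above), the subprocess would see the array serialized as one JSON string. This sketch simulates that environment:

```shell
# The nested array arrives as a single JSON-encoded string under its key:
labels='["a", "b"]' sh -c 'echo "labels is: $labels"'
# prints: labels is: ["a", "b"]
```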
Use-case: This is strictly for:
- safe, trusted inputs only. There are absolutely no security guarantees on untrusted inputs
- JSON lines only, separated by newlines. A single long JSON array won't work either; break it up first with
cat bigfile | jq -c ".[]" > jsonlines
- unix/linux environments, no windows support
- files big or small: input is streamed, so it is never fully loaded into memory
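To make the one-object-per-line requirement concrete, here is a sketch of valid versus invalid input (using printf only; stream-exec is not invoked):

```shell
# Valid: every line is a complete JSON document.
printf '%s\n' '{"user": "123"}' '{"user": "124"}'

# Invalid: a pretty-printed object spans multiple lines, so no single
# line parses as JSON on its own.
printf '{\n  "user": "123"\n}\n'
```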
Comparison with other tools:
Why reinvent the perfectly-good xargs -P?
Because xargs -I doesn't allow passing multiple parameters (unless I'm wildly mistaken - happy to be corrected), and you nearly always end up writing the same near-identical bash wrapper script to pull the parameters out with jq. Logging and retry-handling are also difficult that way.
Why not use a bash for-loop with some combination of wait?
Because it's bash - a bundle of sharp edges - and nearly every time I write the script it comes out nearly identical. It's also very imperative and tedious.
Why not use gnu-parallel? Because I don't think it's integrated with JSON lines. It's near-certainly more powerful, and in nearly every other use-case I'd suggest preferring it or xargs over this.
osx:
curl -L https://github.com/davidporter-id-au/stream-exec/releases/download/0.0.7/stream-exec_darwin \
-o /usr/local/bin/stream-exec && chmod +x /usr/local/bin/stream-exec
linux:
curl -L https://github.com/davidporter-id-au/stream-exec/releases/download/0.0.7/stream-exec_linux \
-o /usr/local/bin/stream-exec && chmod +x /usr/local/bin/stream-exec
-concurrency int
number of concurrent operations (default 10)
-continue
continue on error
-debug
enable debug logging
-dry-run
show what would run
-err-log-path string
where to write the error log, leave as '' for none (default "error-output-2022-03-06__00_05_20-08.log")
-exec string
the command to run for each input line
-output-log-path string
where to write the output log, leave as '' for none
-retries int
the number of attempts to retry failures