miller evaluates all records even when not needed #1653

Open
balki opened this issue Sep 19, 2024 · 5 comments

@balki
Contributor

balki commented Sep 19, 2024

In the example below, only the first 5 records are needed, but the system call in put still runs for all 10 records, as we can see in the tmp file.

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p  put '$v = system("echo hello; echo err >> /tmp/1")' then head -n 5; nl /tmp/1
index v
1     hello
2     hello
3     hello
4     hello
5     hello
     1  err
     2  err
     3  err
     4  err
     5  err
     6  err
     7  err
     8  err
     9  err
    10  err

When head is moved ahead of put, it works fine.

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1 
index v
1     hello
2     hello
3     hello
4     hello
5     hello
     1  err
     2  err
     3  err
     4  err
     5  err

It appears that each verb is run on all records before moving on to the next verb. Can miller be made lazy? I understand it will not be possible when stats/grouping is used, but for the simple case I thought it would work lazily.

@johnkerl
Owner

There is indeed laziness and some early-out logic when head is in the verb list -- however there is some batching (default 500 rows at a time) which was necessary for performance in the port from C to Go ....

If we're getting readahead of over 500 records then that's a bug though ...

@johnkerl
Owner

(In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)
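
To make the batching behavior concrete, here is a toy model in Python. It is only a sketch of the idea and not Miller's actual Go code: the names `batches` and `run_put_then_head` are hypothetical, and the counter stands in for the `system(...)` side effect from the report. It assumes each verb processes a whole batch before the next verb sees it, and that head can stop the reader only between batches.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield lists of up to batch_size records (stand-in for a batched reader)."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_put_then_head(records, batch_size, head_n):
    """Simulate `put ... then head -n head_n` with batch-at-a-time laziness.

    Returns (output_records, number_of_put_evaluations).
    """
    put_calls = 0
    output = []
    for batch in batches(records, batch_size):
        # "put" handles the whole batch before "head" sees any of it.
        processed = []
        for rec in batch:
            put_calls += 1  # stands in for the system(...) side effect
            processed.append({**rec, "v": "hello"})
        # "head" keeps only what it still needs, then signals early-out.
        output.extend(processed[: head_n - len(output)])
        if len(output) >= head_n:
            break  # stop reading further batches
    return output, put_calls


records = [{"index": i} for i in range(1, 11)]
print(run_put_then_head(records, batch_size=500, head_n=5)[1])  # 10
print(run_put_then_head(records, batch_size=1, head_n=5)[1])    # 5
```

With a batch size of 500 and only 10 input records, everything fits in one batch, so all 10 side effects happen before head can cut anything off, which matches the 10 `err` lines in the report. Under this model the readahead should never exceed one batch, so on a larger input the extra work would be capped at the batch size.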

@johnkerl
Owner

OTOH this looks odd to me:

❯ {rm /tmp/1; echo index; seq 10} | mlr --c2p head -n 5 then put '$v = system("echo hello; echo err >> /tmp/1")' ; nl /tmp/1 

🤔 👀

@balki
Contributor Author

balki commented Sep 19, 2024

(In C it was record-at-a-time lazy ... in Go it's 500-records-at-a-time lazy ....)

Thanks for clarifying. Makes sense. I was running the command below on my logs and found it took a long time (11 seconds) when head was used after put, but the other way around was instant. I think I should just move filter and head as early as possible.

❯ mlr --l2p --tz America/Toronto put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri then head caddy.log | wc -l
11

~/tmp/millerexp took 11s
❯ mlr --l2p --tz America/Toronto filter '$status == 200' then head then put '$ts = sec2localtime($ts); $cn = system(format("geoiplookup {} | grep Country", $request.remote_ip))' then filter '$status == 200' then flatten then cut -of ts,cn,request.remote_ip,request.uri caddy.log | wc -l
11

@balki balki closed this as completed Sep 22, 2024
@johnkerl
Owner

it took a long time (11 seconds) when head was used after put but the other way was instant

@balki this needs fixing for sure.

@johnkerl johnkerl reopened this Sep 22, 2024