Speed up in the presence of timeouts. #58
Conversation
Hi. Sorry to intervene, and I may be missing something, but would it be more efficient just to do time.AfterFunc() (which doesn't even start a goroutine) to set the atomic value after the deadline has been reached, rather than waking up every 10ms?
I haven't measured AfterFunc, but it is likely to be slower by a lot (fast matches take only 20-30ns, while AfterFunc does allocation, channel creation, priority queue management, etc.). Furthermore, the AfterFunc result would have to be canceled when the search finishes early (the common case). I can measure it if necessary.
It does an allocation, that's true, but it does not create a channel and does not start a goroutine, and the queue management is done in the runtime, quite efficiently I would hope... It also does not need to be cancelled: if the atomic is set after nobody needs it, it's no big deal. Overall I think you're probably right, but I'd still measure it just in case...
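For concreteness, here is a minimal sketch of the AfterFunc idea under discussion; every name here is hypothetical (nothing is taken from the regexp2 code). The timer fires once at the deadline and flips an atomic flag that the matcher can poll cheaply on its hot path.

```go
package timeoutsketch

import (
	"sync/atomic"
	"time"
)

// afterFuncDeadline illustrates the suggestion above: one timer per match,
// no polling goroutine.
type afterFuncDeadline struct {
	expired atomic.Bool
	timer   *time.Timer
}

func newAfterFuncDeadline(timeout time.Duration) *afterFuncDeadline {
	d := &afterFuncDeadline{}
	// AfterFunc arranges for the flag to be set once, at the deadline.
	d.timer = time.AfterFunc(timeout, func() { d.expired.Store(true) })
	return d
}

// reached is the check the matcher would run on its hot path: a single
// atomic load, no time.Now() call.
func (d *afterFuncDeadline) reached() bool { return d.expired.Load() }

// stop cancels the timer when the search finishes early (the common case);
// as noted above, skipping this is harmless apart from the timer lingering
// until it fires.
func (d *afterFuncDeadline) stop() { d.timer.Stop() }
```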
BTW, do I get it right that with your solution, if I set the deadline to a minute and the search finishes in 20ms, it will still be waking up every 10ms for almost the whole remaining minute? I can't see anything that would stop it earlier.
I still think it's rather wasteful, even if it's just one goroutine. Maybe add a means to reset clockEnd?
I don't understand how to safely reset clockEnd in a simple way. There may be many regexp matches in flight with different deadlines. To safely reset clockEnd, we would need to keep track of all live deadlines (including removal when done). That would add back a priority queue, which is one of the main costs in the time.AfterFunc implementation.
Maybe instead of resetting clockEnd, have an atomic usage counter and stop the goroutine if it goes to zero?
Actually, this counter could replace clockEnd.
The counter is perhaps a good idea; it could simplify the code. Two things that worry me: (1) the logic for starting the clock updater will become more complicated (makeDeadline will need to detect when it is the first thing added to a zero counter); (2) runners will modify shared state, which will make the cache line containing the counter bounce back and forth between cores. I can try to measure it when I get a chance, but I would like to have Doug give his opinion first.
I'm fine with 1 extra goroutine total if the consumer uses a regex with a timeout. It's a plus if the goroutine goes away after the last timeout expires. We should document that behavior, though, in case people see it in debug dumps.
A goroutine in itself is not a problem; the problem is that it is waking up every 10ms doing something rather resource-consuming. And, as the PR stands now, it will continue doing so for the whole duration of the timeout, even if the search finishes earlier. The change is optimised for a very specific case, but it's no silver bullet: the work is shifted to another goroutine, and unless you're constantly running multiple regexes in parallel, the overall impact on performance is most likely going to be negative. Luckily I'm not using timeouts, but if I were, I'd be inclined to fork the project if the change is accepted as is.
@dop251 I appreciate the perspective, and I think it's worth digging into to make sure I understand why this PR would cause you to fork if you used timeouts. The current timeout code checks the wall-clock time every X iterations of the regex engine. This is both very slow and has non-instant granularity even with a single regex running. My understanding is that this PR has the benefit of improving the timeout check performance by ~6000% for the cost of having an extra goroutine and some additional code complexity. If you were using timeouts, then in theory you'd see your regexp2 performance dramatically improve if this PR was merged. It sounds like that's not how you see it, so I'd definitely like to understand.
@ghemawat it makes sense to add a new benchmark to confirm the worst-case performance. Theory is nice, but I'd feel more comfortable with some data. How do small, super-fast regexes with expanding, long timeouts compare between the 2 versions? I.e. compare the lowest-cost scenario in the old code (not enough cycles in the match to trigger the date check) with the highest-cost version in the new code (lots of "move the last date check" synchronization). If there are other "worst case" scenarios you can imagine, feel free to add them as well.
I'll try to explain where I'm coming from. The timeout functionality is most likely used as a safety measure to protect against 'catastrophic backtracking' (at least this is how I would use it), i.e. the timeouts are never meant to trigger, and therefore I would set it to a rather large value (maybe a minute). Now, imagine there is a normal regex that finishes relatively quickly (let's say 1ms) without triggering the timeout, which will be the typical case. Assume this was a one-off run, i.e. the code is not running any regexes for quite some time after this. After the regex has finished, the goroutine will continue to run for the whole minute (and a bit more), waking up every 10ms and doing all the stuff that was causing the slowdown in the first place (like getting the current time and scheduling). But what's worse is that in the described scenario this work is done for nothing. So, this:
> improving the timeout check performance by ~6000% for the cost of having an extra goroutine and some additional code complexity
does not hold in this scenario. The cost is not just an extra goroutine and some additional code complexity. The overall performance will suffer: not the regex itself, but the application in general, because there will be a goroutine running in parallel, wasting CPU. To improve things I would consider terminating the goroutine as soon as possible, i.e. when nobody needs it, rather than when the last timeout expires. My proposal with the usage counter could help, although I share Sanjay's concerns about the cache. It needs to be benchmarked, but I'd argue that even if it results in slightly worse latency, it's still worth doing because the overall performance will be better in the scenario I described above. Also, I would consider reducing the granularity from 10ms to 50 or maybe even 100, bearing in mind that timeouts are a safety measure and not meant to trigger.
Worst case performance
Good idea. It probably won't happen in realistic scenarios, but knowing the worst case is useful. I added a case to the benchmark that increases the timeout on each iteration to force makeDeadline to go into the extendClock path every time. The results:
So yes, it does add about 20ns, but it's still much better than where we started. Does that cover the case you were thinking of?
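As a sketch only (the PR's actual benchmark may differ), a worst-case benchmark along these lines could look like the following, using regexp2's public MustCompile, MatchTimeout, and MatchString API; the growing timeout forces each match's deadline past the clock's current end:

```go
package regexp2_test

import (
	"strings"
	"testing"
	"time"

	"github.com/dlclark/regexp2"
)

// BenchmarkGrowingTimeout is an illustrative reconstruction, not the PR's
// benchmark: the timeout grows every iteration, so every match has to
// extend the shared clock deadline (the extendClock path).
func BenchmarkGrowingTimeout(b *testing.B) {
	re := regexp2.MustCompile("easy", regexp2.None)
	subject := strings.Repeat("x", 100) + "easy" // ~100-byte haystack
	timeout := time.Second
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		timeout += time.Millisecond // monotonically later deadline each time
		re.MatchTimeout = timeout
		if _, err := re.MatchString(subject); err != nil {
			b.Fatal(err)
		}
	}
}
```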
Increasing the clock period
I agree with this.
Wakeup cost
I measured the cost of having a goroutine that wakes up every 10ms. Compared to a program that sleeps without doing anything, the periodic wakeup adds about 0.9s of CPU for a 60-second run. That's 0.9s for 6000 wakeups, which works out to 150µs per wakeup.
Alternative approach
I had another thought, based on the observation that all runners are tracked by getRunner/putRunner. We are already paying the cache-traffic cost for the shared state used by those two functions. If mixing concerns in the code is not an issue, we could change getRunner/putRunner to maintain a count of live runners that have a timeout (protected by the re.muRun mutex). The background clock updater would stop if it observes a zero count. This would mean at most one extra wakeup of the goroutine. Though we have to be careful that we don't fall into the following pattern:
Goroutine creation will happen at most 1/clockPeriod times per second (e.g., 100x/s with the current 10ms clock). This is another reason to increase the clockPeriod. WDYT?
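A rough sketch of the counter idea with a self-stopping clock goroutine; all names here are hypothetical, and the handshake at the bottom shows one way to avoid losing the clock when a new match arrives just as the goroutine exits:

```go
package timeoutsketch

import (
	"sync/atomic"
	"time"
)

const clockPeriod = 100 * time.Millisecond

var (
	liveTimeouts atomic.Int64 // matches in flight that carry a timeout
	clockRunning atomic.Bool  // whether the updater goroutine exists
	approxNow    atomic.Int64 // coarse clock: UnixNano, updated each period
)

// acquireTimeout registers one timed match and lazily starts the clock.
func acquireTimeout() {
	liveTimeouts.Add(1)
	if clockRunning.CompareAndSwap(false, true) {
		go clockLoop()
	}
}

// releaseTimeout must be called when the match finishes (early or not).
func releaseTimeout() { liveTimeouts.Add(-1) }

func clockLoop() {
	for {
		approxNow.Store(time.Now().UnixNano())
		time.Sleep(clockPeriod)
		if liveTimeouts.Load() != 0 {
			continue
		}
		// Nobody needs the clock; advertise that it is stopping.
		clockRunning.Store(false)
		// Handle the race where a new match arrived between the Load
		// above and the Store: try to take ownership back.
		if liveTimeouts.Load() != 0 && clockRunning.CompareAndSwap(false, true) {
			continue
		}
		return
	}
}
```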
I realized a mistake in my suggested alternative approach: re.muRun is per regexp, so it is unsuitable for protecting a count of active deadlines across all runners. I tried out the counter approach (stored in an atomic variable). Serial costs:
Parallel costs (one copy of the benchmark running on each of the 12 hyperthreads on my machine):
I prefer the clockEnd approach with the clock period increased to 100ms (which will bring the background goroutine overhead to 0.15% when running). But I can live with either approach and will let Doug decide which one he prefers.
A few thoughts:
Thanks. I will make the changes as suggested (including tests). I won't be able to get to this until Friday though.
Quick question though: would you prefer me to make further changes in separate commits added to this PR, or do you prefer everything in the PR squashed together into a single commit?
I prefer single commits (mainly to avoid merge commits), but it's not a requirement.
Currently regexp2 can be quite slow when a MatchTimeout is supplied (a micro-benchmark shows 3000ns compared to 45ns when no timeout is supplied). This slowdown is caused by repeated timeout checks which call time.Now(). The new approach introduces a fast but approximate clock that is just an atomic variable updated by a goroutine once every 100ms. The new timeout check just compares this variable to the precomputed deadline. Removed the "timeout check skip" mechanism since a timeout check is now very cheap. Added a simple micro-benchmark that compares the speed of searching 100-byte text with and without a timeout. Performance impact:
1. A micro-benchmark that looks for an "easy" regexp in a 100-byte string goes from ~3000ns to ~45ns.
2. Chroma (syntax highlighter) speeds up from ~500ms to ~50ms on a 24KB source file.
3. A background CPU load of ~0.15% is present until the end of all match deadlines (even for matches that have finished).
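In outline, the mechanism this commit message describes reduces each timeout check to one atomic load plus an integer compare. A minimal self-contained sketch, with illustrative identifiers rather than the PR's actual ones:

```go
package timeoutsketch

import (
	"sync/atomic"
	"time"
)

// approxClock is the "fast but approximate clock": one atomic value that a
// background goroutine refreshes once every 100ms.
var approxClock atomic.Int64

func startClock() {
	go func() {
		for {
			approxClock.Store(time.Now().UnixNano())
			time.Sleep(100 * time.Millisecond)
		}
	}()
}

// deadline is precomputed once per match from the supplied MatchTimeout,
// so the hot path never calls time.Now().
type deadline int64

func makeDeadline(timeout time.Duration) deadline {
	return deadline(approxClock.Load() + int64(timeout))
}

// reached is the per-iteration check: accurate only to the clock period,
// which is fine for a safety-net timeout that is never meant to trigger.
func (d deadline) reached() bool {
	return approxClock.Load() >= int64(d)
}
```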
I pushed a new commit with the requested changes. The new changes are:
LGTM
Currently regexp2 can be quite slow when a MatchTimeout is supplied (a micro-benchmark shows 2300ns compared to 34ns when no timeout is supplied). This slowdown is caused by repeated timeout checks which call time.Now().
The new approach introduces a fast but approximate clock that is just an atomic variable updated by a goroutine once every 10ms. The timeout check can now just compare this variable to the precomputed deadline.
Removed "timeout check skip" mechanism since a timeout check is now very cheap.
Added a simple micro-benchmark that compares the speed of searching 100-byte text with and without a timeout.
Performance impact: