Large amounts of values result in addPatterns() slowdown #363
No other work, definite interest. To be honest, I assumed the current scheme would run out of gas at some point, but the performance was surprisingly good on reasonable numbers of patterns/values, so I never got around to looking at it. A couple of points:
Do you see

The multithreading potential is clearly interesting for this sort of large-scale operation.

Anyhow, I encourage you to have a shot at this. Sounds like it will be reasonably easy for you to create unit-test cases.

One very minor worry: the Quamina unit tests are up to ~14 seconds on my M2 Mac. I hate having long-running unit tests, because people should run them every time they lean back in their chair to think, etc. And it sounds like this PR could have a couple of superheavyweight unit tests, which would force me to clean up some of the others that are unnecessarily heavy. Should probably do that anyhow, tbh.

Thanks for the interest! If it's not a secret, what's the application?
My concern regarding the I will do my best to implement the other patterns as well, but I am not sure how much of what I've written thus far will translate over. I think this will require a look at the structures currently being created by the other patterns before I know for sure. Right now my focus is on strings, though. Regarding the tests, I've currently got them implemented as a comparison between the output of my new functions vs the existing functions. Before I do a PR I'll move all the intensive tests to benchmarks, so they shouldn't impact the total time it takes to run standard tests. This is for matching and filtering events to geographical areas, where the event is a single point (lat/lon) and the patterns are geohash regions that need to be matched against. There's probably an opportunity for an implementation that relies strictly on the coordinates (ie, PiP), but the issue is that the polygons are of variable complexity (could be MBs in size). Maybe if the need arises we'll contribute some |
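To make the geohash use case above concrete, here is a minimal Go sketch of how a lat/lon point can be turned into a geohash string and tested against a region by prefix. It uses the standard public geohash encoding; the function names `encodeGeohash` and `regionContains` are hypothetical and are not part of Quamina or of the fork discussed in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// base32 is the standard geohash alphabet (no a, i, l, o).
const base32 = "0123456789bcdefghjkmnpqrstuvwxyz"

// encodeGeohash (hypothetical name) returns the geohash of a point,
// truncated to the given number of characters.
func encodeGeohash(lat, lon float64, precision int) string {
	latLo, latHi := -90.0, 90.0
	lonLo, lonHi := -180.0, 180.0
	var sb strings.Builder
	bits, ch := 0, 0
	useLon := true // bits alternate between longitude and latitude, longitude first
	for sb.Len() < precision {
		if useLon {
			mid := (lonLo + lonHi) / 2
			if lon >= mid {
				ch = ch<<1 | 1
				lonLo = mid
			} else {
				ch <<= 1
				lonHi = mid
			}
		} else {
			mid := (latLo + latHi) / 2
			if lat >= mid {
				ch = ch<<1 | 1
				latLo = mid
			} else {
				ch <<= 1
				latHi = mid
			}
		}
		useLon = !useLon
		bits++
		if bits == 5 { // every 5 bits become one base32 character
			sb.WriteByte(base32[ch])
			bits, ch = 0, 0
		}
	}
	return sb.String()
}

// regionContains (hypothetical name) treats a geohash region as a prefix:
// a point lies inside the region cell if its geohash starts with the region's.
func regionContains(region, point string) bool {
	return strings.HasPrefix(point, region)
}

func main() {
	point := encodeGeohash(57.64911, 10.40744, 9)
	fmt.Println(point, regionContains("u4p", point))
}
```

With patterns expressed as geohash strings like these, a single area can easily expand into thousands of values, which is the scale this issue is about.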
Interesting. The most straightforward approach for As for restricting this to Maybe related: Quamina has an API to delete patterns, which is done in a brute-force way by remembering all the patterns, simply suppressing the matches to deleted patterns, and after a while rebuilding. That's the code in Another issue is, I did some profiling of automaton merging, found some wins, but eventually lost interest because it was efficient enough to not be a bottleneck. Would love to do some profiling of your million-value "takes minutes" scenario, or at least look at the |
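The brute-force deletion scheme described above (remember all the patterns, suppress matches to deleted ones, rebuild after a while) can be sketched roughly as follows. This is an illustration only, not Quamina's actual code: the type `prunable`, its methods, and the rebuild trigger (a count of suppressed matches) are all assumptions, and a plain map stands in for the automaton.

```go
package main

import "fmt"

// prunable is a hypothetical stand-in for a matcher that supports deletion.
type prunable struct {
	patterns     map[string][]string // live patterns, remembered so the index can be rebuilt
	deleted      map[string]bool     // tombstones for deleted pattern names
	index        map[string][]string // value -> pattern names; may still list deleted names
	suppressed   int                 // matches filtered out since the last rebuild
	rebuildAfter int                 // assumed trigger: rebuild once this many were suppressed
}

func newPrunable(rebuildAfter int) *prunable {
	return &prunable{
		patterns:     make(map[string][]string),
		deleted:      make(map[string]bool),
		index:        make(map[string][]string),
		rebuildAfter: rebuildAfter,
	}
}

// add registers a pattern; this sketch assumes a live name is never added twice.
func (p *prunable) add(name string, values ...string) {
	if p.deleted[name] {
		p.rebuild() // clear stale entries before reusing a deleted name
	}
	p.patterns[name] = values
	for _, v := range values {
		p.index[v] = append(p.index[v], name)
	}
}

// remove only records a tombstone; the index is left untouched.
func (p *prunable) remove(name string) {
	if _, ok := p.patterns[name]; ok {
		delete(p.patterns, name)
		p.deleted[name] = true
	}
}

// match suppresses deleted patterns and rebuilds once enough were filtered.
func (p *prunable) match(value string) []string {
	var out []string
	for _, name := range p.index[value] {
		if p.deleted[name] {
			p.suppressed++
			continue
		}
		out = append(out, name)
	}
	if p.suppressed >= p.rebuildAfter {
		p.rebuild()
	}
	return out
}

// rebuild recreates the index from the live patterns and drops all tombstones.
func (p *prunable) rebuild() {
	p.index = make(map[string][]string)
	for name, values := range p.patterns {
		for _, v := range values {
			p.index[v] = append(p.index[v], name)
		}
	}
	p.deleted = make(map[string]bool)
	p.suppressed = 0
}

func main() {
	p := newPrunable(2)
	p.add("a", "x")
	p.add("b", "x")
	p.remove("a")
	fmt.Println(p.match("x"), len(p.index["x"]))
	fmt.Println(p.match("x"), len(p.index["x"]))
}
```

The relevance to this issue is that the rebuild step re-adds every remembered pattern, so it pays the same bulk-add cost being discussed here.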
I've gone ahead and forked this repo to https://github.com/DigitalPath-Inc/quamina with my changes. I think the performance improvements are too narrow at this point, so I am not going to open a PR for them right now. After getting a little more understanding regarding how everything works, I also think that it would be possible to forego the trie (I just used one to make it easier for me to understand/build with) and assemble the matcher directly, although for now the performance is good enough for us to ship internally. For reference, here are a bunch of test scenarios:
Hmm, don't want to be simple-minded, but it looks like there's some number N where if there's a field with more than that many values, the trie (or a direct-build successor) is basically always a win, and in many cases a big win. We don't know what N is but this data suggests <= 1000. I'm no purist, I've got no objection to putting this kind of heuristic into the code. I'm in the middle of moving house at the moment so kinda distracted but now you've got me curious, going to have a peek at your code. Plus I've had a couple of ideas for improving merge efficiency that I'll have a look at once I get my head back above water. Would love to see a PR from you once you've explored the territory a little more. I know from my experience with Ruler, Quamina's predecessor, that adding some combination of many patterns and big patterns is not an uncommon use case, so this is interesting stuff. |
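A heuristic of the kind suggested above could be as small as the following Go sketch. The constant `trieThreshold`, the strategy names, and `chooseStrategy` are hypothetical; the thread only suggests that N is at most about 1000 and does not settle on a value.

```go
package main

import "fmt"

// trieThreshold is a placeholder for the unknown N from the discussion;
// 1000 is an assumed upper bound, not a measured cutoff.
const trieThreshold = 1000

// buildStrategy names the two ways of constructing the value matcher.
type buildStrategy int

const (
	mergePerValue buildStrategy = iota // existing approach: build a DFA per value and merge
	trieBulkBuild                      // proposed approach: build a trie first, then convert
)

func (s buildStrategy) String() string {
	if s == trieBulkBuild {
		return "trieBulkBuild"
	}
	return "mergePerValue"
}

// chooseStrategy picks the bulk build once a field carries more than N values.
func chooseStrategy(valueCount int) buildStrategy {
	if valueCount > trieThreshold {
		return trieBulkBuild
	}
	return mergePerValue
}

func main() {
	for _, n := range []int{10, 1000, 2000} {
		fmt.Println(n, chooseStrategy(n))
	}
}
```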
I apologize, I'm in the middle of moving house and won't have time to look at this for another week or so. But, this is fascinating stuff and I'm super optimistic that we can work this in. The memory behavior is a little bit scary. Last time I looked at the code, there was no overview comment explaining generally how it works. For everything else in Quamina, particularly where the algorithms and data structures are a little weird, I've tried to make sure there is such an explanation. |
No worries, I've added in some comments, but should probably clean it up a bit more before trying to merge back, not to mention implementing support for The memory behavior pictured above is a result of calculating a hash over all the trie structures, including the leaves, then when building the If you have a chance to review and provide some feedback, I can see about getting some time allocated on my next sprint to implement everything. |
Well, finally got time to look through this and have lots of opinions.
Super appreciative of your work here, and sorry I've been absent-due-to-moving-houses. Sitting in my pretty decent new office now with time to think. |
I've gone ahead and scheduled some time next week to go through everything here and to see about implementing the numeric type (possibly the full version like event-ruler has?), so I'll let you know how that goes. In the meantime, feel free to work off my branch if you'd like, and enjoy your new place!
Hello,
I have a use case that relies on a large number of values in the patterns, up to around 1-2k values per pattern. With the current method of building the `valueMatcher` (creating a new DFA and merging it with the existing DFA), this can result in multiple patterns taking minutes to complete the `addPattern()` calls. This is further exacerbated when using thousands of patterns, for a total of millions of possible values / states.

I took a couple of hours today to work out a PoC that first constructs a trie for all patterns/values, then converts that trie to a DFA with the same structure as is currently being used. In benchmark results, it's roughly a 60% reduction in time (10.6s --> 4.0s on my laptop) with a ~20% peak memory penalty (1.2 GB --> 1.47 GB) for 1 million total values, and it is faster for all scenarios with more than 250 values. I also see some potential to multithread this, whereas the current method is single-threaded.
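As a rough illustration of the trie-first idea, here is a minimal Go sketch that inserts all values into a byte-level trie and then flattens it into a table of states in a single pass. It is not the PoC code and does not use Quamina's real automaton types; `trieNode`, `dfaState`, `toDFA`, and `matches` are hypothetical names, and only exact string matching is covered.

```go
package main

import "fmt"

// trieNode is a hypothetical byte-level trie node, not a Quamina type.
type trieNode struct {
	children map[byte]*trieNode
	terminal bool // true if some value ends at this node
}

func newTrieNode() *trieNode {
	return &trieNode{children: make(map[byte]*trieNode)}
}

// insert adds one value, sharing prefixes with values inserted earlier.
func (n *trieNode) insert(value string) {
	cur := n
	for i := 0; i < len(value); i++ {
		next, ok := cur.children[value[i]]
		if !ok {
			next = newTrieNode()
			cur.children[value[i]] = next
		}
		cur = next
	}
	cur.terminal = true
}

// dfaState is a hypothetical flattened state: byte transitions plus an accept flag.
type dfaState struct {
	next   map[byte]int // byte -> index of the next state
	accept bool
}

// toDFA walks the trie once and emits one state per trie node; state 0 is the start.
func toDFA(root *trieNode) []dfaState {
	var states []dfaState
	var walk func(n *trieNode) int
	walk = func(n *trieNode) int {
		id := len(states)
		states = append(states, dfaState{next: make(map[byte]int), accept: n.terminal})
		for b, child := range n.children {
			childID := walk(child) // may grow the slice, so index it only afterwards
			states[id].next[b] = childID
		}
		return id
	}
	walk(root)
	return states
}

// matches runs the flattened automaton over s.
func matches(states []dfaState, s string) bool {
	cur := 0
	for i := 0; i < len(s); i++ {
		next, ok := states[cur].next[s[i]]
		if !ok {
			return false
		}
		cur = next
	}
	return states[cur].accept
}

func main() {
	root := newTrieNode()
	for _, v := range []string{"9q9p", "9q9n", "9qb0"} {
		root.insert(v)
	}
	states := toDFA(root)
	fmt.Println(len(states), matches(states, "9q9n"), matches(states, "9q9"))
}
```

The point of the sketch is that each value costs one walk down the trie and the automaton is built once at the end, instead of one automaton merge per value.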
I'm thinking about a couple of options for implementation:

- `addPatterns()` method that allows the same functionality as `addPattern()` currently does, but it will use this new methodology (and merge the existing + new DFAs at the end, as `addPattern()` currently does)

Is there any interest in seeing this merged in, or is there other work currently being done on this issue?