
Large amounts of values result in addPatterns() slowdown #363

Open
ehiggins0 opened this issue Oct 1, 2024 · 10 comments

@ehiggins0

Hello,

I have a use case that relies on a large number of values in the patterns - up to around 1-2k values per pattern. With the current method of building the valueMatcher (creating a new DFA and merging it with the existing DFA), addPattern() calls across multiple patterns can take minutes to complete. This is further exacerbated when using thousands of patterns, for a total of millions of possible values/states.

I took a couple of hours today to work out a PoC that first constructs a trie for all patterns/values, then converts that trie to a DFA with the same structure as is currently used. In benchmarks it's roughly a 60% reduction in time (10.6s --> 4.0s on my laptop) with a ~20% peak-memory penalty (1.2GB --> 1.47GB) for 1 million total values, and it is faster in all scenarios with more than 250 values. I also see some potential to multithread this, whereas the current method is single-threaded.
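For readers following along, the trie-then-convert idea can be sketched roughly like this. This is illustrative Go only: `trieNode` and its methods are made-up names, not Quamina's actual `smallTable` machinery, and the real PoC converts the trie into Quamina's DFA structure rather than matching on the trie directly.

```go
package main

import "fmt"

// trieNode is a simplified byte trie; since each node has at most one
// child per byte, the trie is itself already deterministic, so the
// conversion to a DFA is a direct structural translation.
type trieNode struct {
	children map[byte]*trieNode
	terminal bool // a complete value ends here (an accepting state)
}

func newTrieNode() *trieNode { return &trieNode{children: map[byte]*trieNode{}} }

// addValue walks the trie byte by byte, creating nodes as needed.
// Inserting n values is O(total bytes), with no repeated merging.
func (t *trieNode) addValue(v string) {
	n := t
	for i := 0; i < len(v); i++ {
		c := v[i]
		next, ok := n.children[c]
		if !ok {
			next = newTrieNode()
			n.children[c] = next
		}
		n = next
	}
	n.terminal = true
}

// matches treats the trie as a DFA whose accepting states are the
// terminal nodes.
func (t *trieNode) matches(v string) bool {
	n := t
	for i := 0; i < len(v); i++ {
		next, ok := n.children[v[i]]
		if !ok {
			return false
		}
		n = next
	}
	return n.terminal
}

func main() {
	root := newTrieNode()
	for _, v := range []string{"9q8yy", "9q8yz", "9q9p"} { // e.g. geohash values
		root.addValue(v)
	}
	fmt.Println(root.matches("9q8yz"), root.matches("9q8"))
}
```

The point of the approach is that all values are inserted into this one structure up front, instead of building a one-value DFA per value and merging it into the accumulated automaton each time.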

I'm thinking about a couple options for implementation:

  • On init have an option for all the known patterns up front
  • An addPatterns() method that allows the same functionality as addPattern() currently does, but it will use this new methodology (and merge the existing + new DFAs at the end as addPattern() currently does)
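For concreteness, here is a hypothetical sketch of the two proposed API shapes. All names here (`WithPatterns`, `AddPatterns`, this toy `Matcher`) are invented for illustration; they are not Quamina's real API.

```go
package main

import "fmt"

// Matcher is a stand-in for the real matcher; it just records which
// patterns it was given.
type Matcher struct {
	patterns map[string]string // pattern name -> pattern JSON
}

// Option is the usual Go functional-option shape.
type Option func(*Matcher)

// WithPatterns supplies all known patterns up front (option 1 above),
// so the matcher can be built in a single batch at init time.
func WithPatterns(patterns map[string]string) Option {
	return func(m *Matcher) {
		for name, p := range patterns {
			m.patterns[name] = p
		}
	}
}

func New(opts ...Option) *Matcher {
	m := &Matcher{patterns: map[string]string{}}
	for _, o := range opts {
		o(m)
	}
	return m
}

// AddPatterns is the batch sibling of addPattern() (option 2 above):
// in the real proposal it would build a fresh automaton from the batch
// and merge it with the existing one at the end.
func (m *Matcher) AddPatterns(patterns map[string]string) {
	for name, p := range patterns {
		m.patterns[name] = p
	}
}

func main() {
	m := New(WithPatterns(map[string]string{"p1": `{"field": ["a"]}`}))
	m.AddPatterns(map[string]string{"p2": `{"field": ["b"]}`})
	fmt.Println(len(m.patterns))
}
```

The difference between the two shapes is only where the batch happens: at construction time (one big build, no merge) versus at any time (build plus a final merge).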

Is there any interest in seeing this merged in or is there other work currently being done on this issue?

@timbray
Owner

timbray commented Oct 1, 2024

> Is there any interest in seeing this merged in or is there other work currently being done on this issue?

No other work, definite interest. To be honest, I assumed the current scheme would run out of gas at some point, but the performance was surprisingly good on reasonable numbers of patterns/values, so I never got around to looking at it. A couple of points:

  • I don't quite understand your "on init have an option…" idea
  • My initial reaction is positive to the addPatterns() idea.

Do you see addPatterns() working for all the different kinds of patterns or just string literals?

The multithreading potential is clearly interesting for this sort of large-scale operation.

Anyhow, I encourage you to have a shot at this. Sounds like it will be reasonably easy for you to create unit-test cases. One very minor worry: the Quamina unit tests are up to ~14 seconds on my M2 Mac, and I hate having long-running unit tests because people should run them every time they lean back in their chair to think, etc. And it sounds like this PR could have a couple of superheavyweight unit tests, which would force me to clean up some of the others that are unnecessarily heavy; I should probably do that anyhow, tbh.

Thanks for the interest! If it's not a secret, what's the application?

@ehiggins0
Author

My concern regarding the addPatterns() implementation would be the case where that is called multiple times (say, batched 1M value sets of patterns), which would result in a merge between multiple massive FAs. If this was implemented as an option to quamina.New(), it would force only a single call to addPatterns(), then addPattern() could be used for the smaller changes made during runtime.

I will do my best to implement the other patterns as well, but I am not sure how much of what I've written thus far will translate over. I think this will require a look at the structures currently being created by the other patterns before I know for sure. Right now my focus is on strings, though.

Regarding the tests, I've currently got them implemented as a comparison between the output of my new functions vs the existing functions. Before I do a PR I'll move all the intensive tests to benchmarks, so they shouldn't impact the total time it takes to run standard tests.

This is for matching and filtering events to geographical areas, where the event is a single point (lat/lon) and the patterns are geohash regions that need to be matched against. There's probably an opportunity for an implementation that relies strictly on the coordinates (ie, PiP), but the issue is that the polygons are of variable complexity (could be MBs in size). Maybe if the need arises we'll contribute some geo patterns to handle geographic events...

@timbray
Owner

timbray commented Oct 3, 2024

Interesting. The most straightforward approach for addPatterns() is you do your trie-magic on all the pure string patterns and then merge the result with any that are more complicated. There are lots of functions like makeWildcardFA() and makePrefixFA() and so on. We'll document that this is optimized for pure strings and maybe we'll figure out a way to apply your stuff for other kinds of patterns later.

As for restricting this to quamina.New(), I'm a bit dubious… If you have 250K values now and another 250K in an hour, is it more efficient to build from scratch with all 500K or to merge the two 250K automata?

Maybe related: Quamina has an API to delete patterns, which is done in a brute-force way by remembering all the patterns, simply suppressing the matches to deleted patterns, and after a while rebuilding. That's the code in pruner.go. I wonder if it could be used. Obviously the memory cost of remembering the patterns would be significant. Hmm…

Another issue is, I did some profiling of automaton merging, found some wins, but eventually lost interest because it was efficient enough to not be a bottleneck. Would love to do some profiling of your million-value "takes minutes" scenario, or at least look at the .prof output. There may be low-hanging fruit in mergeFAs().
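For readers unfamiliar with the merge step being profiled here, the core idea can be sketched on a toy automaton. The `faState`/`mergeStates` names below are invented; Quamina's real mergeFAs() works on its smallTable representation, but the memoized product construction is the same shape.

```go
package main

import "fmt"

// faState is a toy automaton state with byte transitions.
type faState struct {
	trans map[byte]*faState
	match bool
}

type statePair struct{ a, b *faState }

// mergeStates is a product construction: the memo map guarantees each
// pair of source states is merged exactly once, which is what keeps
// shared substructure from being re-merged repeatedly.
func mergeStates(a, b *faState, memo map[statePair]*faState) *faState {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	pair := statePair{a, b}
	if m, ok := memo[pair]; ok {
		return m
	}
	merged := &faState{trans: map[byte]*faState{}, match: a.match || b.match}
	memo[pair] = merged // record before recursing, in case of cycles
	for c, next := range a.trans {
		merged.trans[c] = mergeStates(next, b.trans[c], memo)
	}
	for c, next := range b.trans {
		if _, seen := a.trans[c]; !seen {
			merged.trans[c] = next
		}
	}
	return merged
}

// accepts runs the merged automaton over v.
func accepts(s *faState, v string) bool {
	for i := 0; i < len(v) && s != nil; i++ {
		s = s.trans[v[i]]
	}
	return s != nil && s.match
}

// build makes a straight-line automaton accepting exactly v.
func build(v string) *faState {
	end := &faState{trans: map[byte]*faState{}, match: true}
	s := end
	for i := len(v) - 1; i >= 0; i-- {
		s = &faState{trans: map[byte]*faState{v[i]: s}}
	}
	return s
}

func main() {
	m := mergeStates(build("cat"), build("car"), map[statePair]*faState{})
	fmt.Println(accepts(m, "cat"), accepts(m, "car"), accepts(m, "cab"))
}
```

Adding n values one at a time means n of these merges over an ever-larger automaton, which is the slowdown the trie approach sidesteps.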

@ehiggins0
Author

I've gone ahead and forked this repo to https://github.com/DigitalPath-Inc/quamina with my changes. I think the performance improvements are too narrow at this point, so I'm not going to open a PR for them right now. After getting a little more understanding of how everything works, I also think it would be possible to forgo the trie (I just used one to make it easier for me to understand and build with) and assemble the matcher directly, although for now the performance is good enough for us to ship internally. For reference, here are a bunch of test scenarios:

| # of Patterns | # of Fields | Length of Fields | Time with Trie | Time with Merge |
|---|---|---|---|---|
| 100 | 1 | [ 1000 ] | 342.15ms | 975.6ms |
| 100 | 2 | [ 1000, 1000 ] | 32.48s | 64.07s |
| 100 | 3 | [ 3, 3, 3 ] | 16.7ms | 14.90ms |
| 100 | 4 | [ 3, 3, 3, 3 ] | 49.4ms | 39.49ms |
| 100 | 5 | [ 3, 3, 3, 3, 3 ] | 155.57ms | 122.05ms |
| 100 | 6 | [ 3, 3, 3, 3, 3, 3 ] | 443.11ms | 326.61ms |
| 1000 | 1 | [ 1000 ] | 3.25s | 12.37s |
| 1000 | 2 | [ 1000, 3 ] | 16.59s | 21.78s |
| 1000 | 3 | [ 1000, 3, 3 ] | 51.88s | 55.34s |
| 1000 | 5 | [ 3, 3, 3, 3, 3 ] | 1.49s | 1.17s |
| 1000 | 6 | [ 3, 3, 3, 3, 3, 3 ] | 4.55s | 3.53s |
| 10000 | 1 | [ 1000 ] | 31.46s | 168.43s |

@timbray
Owner

timbray commented Oct 8, 2024

Hmm, don't want to be simple-minded, but it looks like there's some number N where, if a field has more than that many values, the trie (or a direct-build successor) is basically always a win, and in many cases a big win. We don't know what N is, but this data suggests <= 1000. I'm no purist; I've got no objection to putting this kind of heuristic into the code.

I'm in the middle of moving house at the moment so kinda distracted, but now you've got me curious, going to have a peek at your code. Plus I've had a couple of ideas for improving merge efficiency that I'll have a look at once I get my head back above water. Would love to see a PR from you once you've explored the territory a little more. I know from my experience with Ruler, Quamina's predecessor, that adding some combination of many patterns and big patterns is not an uncommon use case, so this is interesting stuff.

@ehiggins0
Author

ehiggins0 commented Oct 11, 2024

Well, I've gone ahead and implemented prefix and equals-ignore-case with the new methodology, and they both see similar performance improvements, so I expect it'll be possible to implement everything. I've also added a hash algorithm over each leaf of the trie, which has significantly improved performance, since many of the end states can be reused instead of being created multiple times. There's an issue where order matters in trie creation right now, so I'll be digging into that tomorrow, but here are some new numbers:

| # of Patterns | # of Fields | Length of Fields | Previous Time with Trie | New Time with Trie | Time with Merge |
|---|---|---|---|---|---|
| 100 | 1 | [ 1000 ] | 342.15ms | 256.92ms | 967.62ms |
| 100 | 2 | [ 1000, 1000 ] | 32.48s | OOM | OOM |
| 100 | 3 | [ 3, 3, 3 ] | 16.7ms | 10.02ms | 16.59ms |
| 100 | 4 | [ 3, 3, 3, 3 ] | 49.4ms | 29.63ms | 39.49ms |
| 100 | 5 | [ 3, 3, 3, 3, 3 ] | 155.57ms | 74.07ms | 122.05ms |
| 100 | 6 | [ 3, 3, 3, 3, 3, 3 ] | 443.11ms | 208.42ms | 326.61ms |
| 1000 | 1 | [ 1000 ] | 3.25s | 2.56s | 12.37s |
| 1000 | 2 | [ 1000, 3 ] | 16.59s | 8.86s | 23.53s |
| 1000 | 3 | [ 1000, 3, 3 ] | 51.88s | 25.47s | 55.11s |
| 1000 | 5 | [ 3, 3, 3, 3, 3 ] | 1.49s | 729.38ms | 1.17s |
| 1000 | 6 | [ 3, 3, 3, 3, 3, 3 ] | 4.55s | 2.06s | 3.53s |
| 10000 | 1 | [ 1000 ] | 31.46s | 30.41s | 164.67s |

Note

OOM - it looks like these are right on the edge of my available memory of ~38GB, so a little extra overhead from some background procs was enough to kill them, and I am too lazy to kill the other procs.

One important note: the performance improvements come very heavily from being able to pre-compute hashes from the trie leaves, so I don't think an addPatterns() method would be the best implementation, as we would lose a lot of the performance gains. For now I've implemented it as an option on initialization, although the function can be called from anywhere, so if there is a need it could be moved to an addPatterns() method with a merge.
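A minimal sketch of the bottom-up hash/dedupe idea, assuming a DJB2-style hash over a toy trie (none of these names match the fork's actual code). Note that the set of matched patterns is folded into the hash, for the reason discussed later in the thread: two structurally identical subtrees that conclude different patterns must not be merged into one.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a toy trie node. memberOfPatterns lists the patterns that
// match when a value ends at this node.
type node struct {
	children         map[byte]*node
	memberOfPatterns []string
}

// djb2 folds data into a running DJB2-style hash (h*33 + byte).
func djb2(h uint64, data []byte) uint64 {
	for _, b := range data {
		h = h*33 + uint64(b)
	}
	return h
}

// hashNode computes a structural hash bottom-up and dedupes via the
// cache: two subtrees with the same hash are replaced by one shared
// instance, so the matcher built from the trie reuses end states.
func hashNode(n *node, cache map[uint64]*node) (uint64, *node) {
	h := uint64(5381)
	var keys []int
	for c := range n.children {
		keys = append(keys, int(c))
	}
	sort.Ints(keys) // the hash must not depend on map iteration order
	for _, k := range keys {
		childHash, shared := hashNode(n.children[byte(k)], cache)
		n.children[byte(k)] = shared // point at the canonical subtree
		h = djb2(h, []byte{byte(k)})
		h = djb2(h, []byte(fmt.Sprintf("%d", childHash)))
	}
	for _, p := range n.memberOfPatterns {
		h = djb2(h, []byte(p)) // pattern membership is part of identity
	}
	if existing, ok := cache[h]; ok {
		return h, existing
	}
	cache[h] = n
	return h, n
}

func main() {
	// Two values "aa" and "ba" under the same pattern: the leaves (and
	// the intermediate nodes) are structurally identical, so after
	// hashing they are one shared object.
	leaf1 := &node{children: map[byte]*node{}, memberOfPatterns: []string{"P"}}
	leaf2 := &node{children: map[byte]*node{}, memberOfPatterns: []string{"P"}}
	root := &node{children: map[byte]*node{
		'a': {children: map[byte]*node{'a': leaf1}},
		'b': {children: map[byte]*node{'a': leaf2}},
	}}
	hashNode(root, map[uint64]*node{})
	fmt.Println(root.children['a'].children['a'] == root.children['b'].children['a'])
}
```

This sketch ignores hash collisions; a production version would verify structural equality on a cache hit or use a collision-resistant hash.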

One other thing - I've implemented a way to visualize the matchers via mermaid. This has made debugging much easier when looking at the hashing functionality. Here's an example:
[image: mermaid diagram of a matcher]

@timbray
Owner

timbray commented Oct 16, 2024

I apologize, I'm in the middle of moving house and won't have time to look at this for another week or so. But, this is fascinating stuff and I'm super optimistic that we can work this in. The memory behavior is a little bit scary. Last time I looked at the code, there was no overview comment explaining generally how it works. For everything else in Quamina, particularly where the algorithms and data structures are a little weird, I've tried to make sure there is such an explanation.

@ehiggins0
Author

No worries, I've added in some comments, but should probably clean it up a bit more before trying to merge back, not to mention implementing support for exists and anything-but.

The memory behavior pictured above is a result of calculating a hash over all the trie structures, including the leaves; then, when building the coreMatcher, I can look up the precomputed hash to see if we've already created the object. When I put it on the profiler, calculating the hashes was much faster than creating all the objects in memory, so I opted to go that route.

If you have a chance to review and provide some feedback, I can see about getting some time allocated on my next sprint to implement everything.

@timbray
Owner

timbray commented Oct 28, 2024

Well, finally got time to look through this and have lots of opinions.

  1. I like this approach a lot and probably want to pull the code into Quamina mainline
  2. The trie is nice for its own sake, but from my PoV most of all because it facilitates hashing/deduping. You'll notice that mergeFAStates() memoizes itself, but I was left with a gut feeling that I was wastefully creating a lot of extra states and burning memory. A couple of times I tried to think of a better way to hash/dedupe, came up empty. But AFAICT your hash setup should do exactly the right thing.
  3. Could I ask a favor? Have a look at matcherStats(m *coreMatcher) - it yields a description of the number of smallTables and how full they are by recursing around all the states with an accumulator. I'd be super interested to see the difference between what it says about the merge and trie techniques. I'm optimistic that it'll show smaller numbers.
  4. If this were a formal review I'd ask you to shorten up some of the comment lines so I don't need a mongo-huge IDE window. But I suspect that golangci-lint will do that for me.
  5. https://github.com/DigitalPath-Inc/quamina/blob/main/trie.go#L36 var globalStateCache maybe this should have trie in its name, there are lots of different kinds of states around Quamina
  6. Never heard of DJB2 before, looked it up. Wow, it's old, nothing better has come along since? But if it's working well, no biggie.
  7. https://github.com/DigitalPath-Inc/quamina/blob/main/trie.go#L72 this surprised me, aren't sort.Slice() and sort.Strings() smart enough to do that?
  8. Would it all look nicer if we did type trieHash uint64 and have fewer uint64 variables?
  9. It's not obvious to me why the pattern membership needs to be included in the hash, i.e. why you need memberOfPatterns. An explanation would be helpful.
  10. https://github.com/DigitalPath-Inc/quamina/blob/main/trie.go#L197 I'm old fashioned and always use i or fooIndex or some such for array-index variables, so does all the other code in Quamina, but no biggie.
  11. https://github.com/DigitalPath-Inc/quamina/blob/main/trie.go#L394 noted - I could fill this in later if you don't want to
  12. The whole point of this function is to create the smallTable in one shot - excellent. I've been meaning to write general-purpose smallTable appender and iterator, when I got around to doing some serious optimization on this stuff. But as noted earlier, so far it hadn't been a bottleneck.
  13. func getOrCreateState. OK, this is kind of hilarious. Have you looked at the code in aws/event-ruler - that's the Java pattern matcher that I was the originator of back when I worked at AWS. It has a function findOrMakeNextByteState() for exactly the reason that you wrote getOrCreateState().
  14. https://github.com/DigitalPath-Inc/quamina/blob/main/trie.go#L550 return cachedState.states[0] - why [0] - gonna have to deep-dive to figure that out. But if the answer is obvious, maybe a comment?
  15. BTW, you're the first contributor to dive into the NFA stuff and actually work directly with the smallTable structure. If you see anything that looks crazy or could use improvement, don't be shy about saying so.
  16. Would love to see pprof output from your big run.
  17. I'm seeing why you would prefer to do this as a one-time initializer rather than addPatterns() and I guess I'm OK with that. Except for, if the hash/dedupe that falls out of the trie structure significantly reduces the memory burn, I'm going to want to pull it into the mainline. Anyhow, we can talk that over later.

Super appreciative of your work here, and sorry I've been absent-due-to-moving-houses. Sitting in my pretty decent new office now with time to think.
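For readers, the accumulator recursion that matcherStats(m *coreMatcher) is described as doing (point 3 above) can be sketched like this; the toy `state`/`stats` types are invented, and the real function reports smallTable counts, fill averages, and epsilon stats.

```go
package main

import "fmt"

// state is a toy automaton state with byte transitions.
type state struct {
	trans map[byte]*state
}

// stats is the accumulator threaded through the recursion.
type stats struct {
	states      int
	transitions int
	maxFanOut   int
}

// collect recurses over all reachable states exactly once (the seen
// map guards against shared substructure and cycles) and accumulates
// totals, which is how comparing merge vs. trie builds becomes a
// matter of comparing two numbers.
func collect(s *state, seen map[*state]bool, acc *stats) {
	if s == nil || seen[s] {
		return
	}
	seen[s] = true
	acc.states++
	acc.transitions += len(s.trans)
	if len(s.trans) > acc.maxFanOut {
		acc.maxFanOut = len(s.trans)
	}
	for _, next := range s.trans {
		collect(next, seen, acc)
	}
}

func main() {
	leaf := &state{}
	mid := &state{trans: map[byte]*state{'a': leaf, 'b': leaf}}
	root := &state{trans: map[byte]*state{'x': mid, 'y': mid}}
	var acc stats
	collect(root, map[*state]bool{}, &acc)
	fmt.Printf("states=%d transitions=%d maxFanOut=%d\n",
		acc.states, acc.transitions, acc.maxFanOut)
}
```

Note that the shared leaf and mid states are counted once each; a deduped build should therefore show a lower state count than a merge build over the same patterns.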

@ehiggins0
Author

  1. Yep, originally I was just using it to reduce the amount of stuff I had to understand, but with the deduping decided to leave it. The only other way I tried was hashing the objects when we were creating the matcher itself, but it was slow since there was a lot of recalculating when a branch would change. By getting a complete picture up front (in a lighter-weight object), we can remove that time

  2. Here's the results on a small run (8 patterns, 2 fields, variable scenarios):

     ```
     /home/ehiggins/github/dp/quamina/trie_test.go:174: Matcher stats: Field matchers: 15 (avg size 1.250, max 2)
     Value matchers: 5
     SmallTables 78 (unique 67, avg 3.418, max 12, epsilon avg 0.036, max 1) singletons 0
     /home/ehiggins/github/dp/quamina/trie_test.go:175: Old matcher stats: Field matchers: 19 (avg size 1.200, max 2)
     Value matchers: 6
     SmallTables 92 (unique 83, avg 3.309, max 12, epsilon avg 0.029, max 1) singletons 0
     ```

    I would imagine this would scale with the complexity of all the patterns as well.

  3. Noted, I will try to go through this week

  4. I'll add that in

  5. Yeah, that was just the one that came to the top of my head when thinking of a relatively fast hashing algo. I'd have to do some benchmarks but something like murmur could probably be used instead.

  6. Yeah, this is done intentionally due to speed. In some performance tests having the if/else versus just a sort.Slice we'd see a pretty big performance differential (9.9s vs 7.0s in a 1000 pattern, 3 field, 1000 3 3 field length scenario).

  7. Agreed

  8. I can add in a comment, but my thinking was that if we traverse two completely separate paths and get down to the lower levels, the matching section could be entirely identical, except for which patterns we're supposed to match. That could mean that we end up choosing the wrong pattern. A basic example could be just {"field": ["aa"]} and {"field": ["ba"]} - if we don't hash on which pattern we're supposed to match, both would point to the first pattern created, since the trie hashing is bottom-up.

  9. Yeah, since I was iterating on a map where the keys were quamina.X, I liked to use x, but that can go either way.

  10. That should be relatively straightforward, so I can take that one if you'd like

  11. Yeah, this one was a little tricky, since the end states had to be known in advance (especially with wildcards). Lots of back and forth on how to implement wildcards specifically...

  12. Nice, glad that wasn't too crazy of an idea...

  13. I believe the length of the state should always be 1 for this, and haven't found anything to suggest otherwise, but yeah that could be something to look at.

  14. Noted, everything looked pretty self explanatory when I was going through all that, except for the pack/unpack operations, but there's already a comment on that guy

  15. Here's a flamegraph of both runs side by side:
    [image: side-by-side flamegraph of the two runs]
    We're basically running up against the limits of using a map on this implementation, and the rest of the time savings comes from reduced need to create objects due to the hashing.

  16. I do think there is an opportunity to do an addPatterns operation, but to be honest I'm not sure that I'm convinced of the use case of adding patterns to an existing matcher. From my perspective (and in the service I'm building), building a matcher and caching it in a container with a way to invalidate would be preferable over broadcasting pattern change events to the containers, as that way you have somewhat guaranteed sync between your pattern store and the matching containers. I guess it makes sense if you have a lot of frequently changing patterns (as a percent of total), but with proper sharding that should be mitigated. Given your time at AWS, you'd probably know better than me on this one, though...

I've gone ahead and scheduled some time next week to go through everything here and to see about implementing the numeric type (possibly the full version like event-ruler has?), so I'll let you know how that goes. In the meantime, feel free to work off my branch if you'd like, and enjoy your new place!
