Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560
Comments
It is inefficient for large strings. It is pretty efficient for small strings, which are likely the typical usage pattern for IndexOf...
For smaller strings, we could check the first, a middle, and the last char before checking the intermediate chars (pre-processing has some setup cost, so there will be a crossover point), e.g. for words the entropy increases with each char; however, there are also a lot of shared endings, hence checking a middle (e.g. For very short search strings the naive loop would be more efficient (though it could use larger compares, e.g. ulong on x64)
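To make the first/middle/last idea concrete, here is a minimal sketch in Python (chosen only for brevity; it is not the CoreCLR code, and the function name `index_of_with_precheck` is hypothetical). It assumes an ordinal, non-culture-aware comparison and does the three cheap probes before the full comparison at each position.

```python
def index_of_with_precheck(text, pattern):
    """Naive search that probes first, last, and middle chars before a full compare.

    Hypothetical illustration only; ordinal comparison is assumed.
    """
    m, n = len(text), len(pattern)
    if n == 0:
        return 0  # empty pattern matches at index 0, like str.find
    mid = n // 2
    first, middle, last = pattern[0], pattern[mid], pattern[-1]
    for start in range(m - n + 1):
        # The three probes reject most positions without touching the
        # intermediate chars; the slice compare runs only when all three pass.
        if (text[start] == first
                and text[start + n - 1] == last
                and text[start + mid] == middle
                and text[start:start + n] == pattern):
            return start
    return -1
```

There is no table to set up, so the overhead for short patterns is limited to the three extra comparisons per candidate position.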
Also, this would only apply to non-culture-aware strings, as culture-aware strings use the OS compare (at least on Windows)
@jkotas Yeah, for small strings it may be problematic since it needs a dynamically allocated buffer for
@SoftCircuits, saw your varied implementations of Boyer-Moore algorithm via http://stackoverflow.com/a/4916363. Would love to have your take on this. :)
I wonder if we could bypass doing that if
In my experience,
Where did you get that? From what this SO answer has told me, the
Yeah, I realized that earlier. There are a couple of workarounds for this. For example, String has a cached
It will be hard to beat 2 strings in the CPU cache by bringing in a 3rd data structure. This is likely to work better for file-size searches and/or much larger search strings.
@benaadams Hm, maybe. I could see it being less efficient with the bitmap scheme because of all of the bit operations that have to be done (a couple of shl/shr/and). However, if we go down the lookup table route, that will only take up about 8 cache lines, which is only 1/64 of the L1 cache, so it won't cause problems with eviction. A lookup table will also introduce no additional branching to the main loop (see implementation here), and it is O(n) to initialize, where n is the length of the search string, so for smaller strings the added overhead may not be that much.
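For readers following along, a lookup-table variant might look roughly like the sketch below. This is not the implementation linked above: it is a Horspool-style table (bad-character rule only), written in Python for brevity, and the names `TABLE_SIZE`, `build_skip_table` and `index_of` are hypothetical. The 256-slot size is an assumption for illustration and is not meant to reproduce the 8-cache-line layout; chars that share their low 8 bits share a slot, which only makes shifts smaller and never skips a match.

```python
TABLE_SIZE = 256  # assumed size for illustration; must be a power of two


def build_skip_table(pattern):
    """Build the shift table in one O(n) pass over the pattern."""
    n = len(pattern)
    table = [n] * TABLE_SIZE  # default: skip the whole pattern length
    for i in range(n - 1):
        # Later occurrences overwrite earlier ones, so each slot ends up with
        # the smallest (safest) shift among all chars that map to it.
        table[ord(pattern[i]) & (TABLE_SIZE - 1)] = n - 1 - i
    return table


def index_of(text, pattern):
    """Return the first index of pattern in text, or -1 (ordinal compare)."""
    m, n = len(text), len(pattern)
    if n == 0:
        return 0
    table = build_skip_table(pattern)
    pos = 0
    while pos <= m - n:
        if text[pos:pos + n] == pattern:
            return pos
        # Unconditional table read: the shift itself needs no extra branch.
        pos += table[ord(text[pos + n - 1]) & (TABLE_SIZE - 1)]
    return -1
```

A Python list is of course not a packed byte array, so this says nothing about real cache behaviour; it only shows the O(n) setup and the table-driven shift.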
But if you are looking for a < 32 char string that's only 1 cache line?
@jamesqo As I mentioned originally, the code used depends on the comparison type. I don't know what the code you referenced is (and don't have time to dig into it right now). What I do know is that the performance of
This design discussion is over three years old. @jamesqo, you're of course welcome to prototype your suggestion and come back with data, but we don't need the issue open tracking such experiments.
The current implementation of String.IndexOf(string) uses a naive loop to check whether a substring of some given text matches a pattern. Although this implementation is simple and easy to understand, it is very inefficient because, when there is no match, it has to examine at least m - n + 1 characters of the text, where m is the length of the text and n is the length of the pattern string.
We should instead use the Boyer-Moore search algorithm to implement this function. It performs much better for larger patterns, since it allows us to 'skip over' some characters based on what we know from pre-processing the pattern string. Here is an answer on StackOverflow that explains how it works, and the Wikipedia link has sample implementations in C/Java.
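To illustrate the difference in character comparisons, here is a small Python sketch (not the implementation from this issue). It compares a naive loop with the Horspool simplification of Boyer-Moore, which uses only the bad-character rule; the function names are hypothetical and both return the match index together with the number of character comparisons made.

```python
def naive_index_of(text, pattern):
    """Return (index, comparisons) using the straightforward nested loop."""
    m, n = len(text), len(pattern)
    comparisons = 0
    for start in range(m - n + 1):
        j = 0
        while j < n:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == n:
            return start, comparisons
    return -1, comparisons


def horspool_index_of(text, pattern):
    """Return (index, comparisons) using Horspool's bad-character shifts."""
    m, n = len(text), len(pattern)
    if n == 0:
        return 0, 0
    # For each char in pattern[:-1], distance from its last occurrence to the
    # end of the pattern; chars not in the table shift by the full length n.
    shift = {ch: n - 1 - i for i, ch in enumerate(pattern[:-1])}
    comparisons = 0
    pos = 0
    while pos <= m - n:
        j = n - 1
        while j >= 0:  # compare right to left
            comparisons += 1
            if text[pos + j] != pattern[j]:
                break
            j -= 1
        if j < 0:
            return pos, comparisons
        pos += shift.get(text[pos + n - 1], n)
    return -1, comparisons
```

For a text of 1000 'a' characters and a pattern of 10 'b' characters, the naive loop makes m - n + 1 = 991 comparisons while the Horspool sketch makes 100, one per window of n characters. This is the favourable case; inputs with many partial matches narrow the gap.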
edit: Took me a day to wrap my head around the algorithm, but I have an implementation here and it seems to work well.
edit 2: Maybe this would benefit functions like Replace or Split more. Those functions have to make a pass through the entire string anyway, whereas with IndexOf we stop at the first match.