Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560

Closed
jamesqo opened this issue Aug 25, 2016 · 13 comments
Closed
Labels
area-System.Runtime design-discussion Ongoing discussion about design without consensus enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Milestone

Comments

@jamesqo
Copy link
Contributor

jamesqo commented Aug 25, 2016

The current implementation of String.IndexOf(string) uses a naive loop to check whether a substring of some given text matches a pattern. Although this implementation is simple and easy to understand, it is very inefficient because it has to iterate through a minimum of m - n + 1 characters of the text, where m is the length of the text and n is the length of the pattern string.

We should instead use the Boyer-Moore search algorithm to implement this function. It performs much better for larger patterns, since it allows us to 'skip over' some characters based on what we know from pre-processing the string. Here is an answer on StackOverflow that explains how it works, and the Wikipedia link has sample implementations in C/Java.

edit: Took me a day to wrap my head around the algorithm, but I have an implementation here and it seems to work well.

edit 2: Maybe this would benefit functions like Replace or Split more. Those functions have to make a pass through the entire string anyway, whereas with IndexOf we stop at the first match.

@jkotas
Copy link
Member

jkotas commented Aug 25, 2016

it is very inefficient

It is inefficient for large strings. It is pretty efficient for small strings that is likely the typical usage pattern for IndexOf...

@benaadams
Copy link
Member

benaadams commented Aug 25, 2016

For smaller strings, could check first, a middle and last char before checking the intermediate chars (pre-processing will have some setup, so will be cross over point)

e.g. for words the entropy increases with each char; however there are also a lot of shared endings, hence checking a middle (e.g. -ing -ed -s etc)

For very short search strings the naive loop would be more efficient (though could use larger compares e.g. ulong on x64)

@benaadams
Copy link
Member

Also would only apply to non-culture strings as culture aware strings use OS compare (at least on windows)

@jamesqo
Copy link
Contributor Author

jamesqo commented Aug 25, 2016

@jkotas Yeah, for small strings it may be problematic since it needs a dynamically allocated buffer for delta2. For those scenarios I'm planning to use a 'lite' Boyer-Moore and use a tiny statically allocated bitmap, like what's being done with IndexOfAny right now. If any of the 2 bits corresponding to a char are not set in the bitmap, then we will be 100% sure it's not in the pattern and will still be able to apply the 'skip pattern.Length chars if that char isn't in the pattern' optimization. That IMO is the main benefit of this algorithm.

@ghost
Copy link

ghost commented Aug 25, 2016

@SoftCircuits, saw your varied implementations of Boyer-Moore algorithm via http://stackoverflow.com/a/4916363. Would love to have your take on this. :)

@jamesqo
Copy link
Contributor Author

jamesqo commented Aug 25, 2016

@benaadams

Also would only apply to non-culture strings as culture aware strings use OS compare (at least on windows)

I wonder if we could bypass doing that if IsAscii() is true, since I'm pretty sure any sane OS would return uniform results for those kinds of strings.

@SoftCircuits
Copy link

SoftCircuits commented Aug 26, 2016

In my experience, String.IndexOf() with StringComparision.Ordinal will always be faster than Boyer-Moore except for extreme pathological cases. The reason appears to be that String.IndexOf() uses hand-optimized assembly language instructions specifically designed to quickly scan strings. In addition, with Unicode, Boyer-Moore require a very large buffer to be allocated for the skip table unless you perform a couple of tricks, which could potentially make it even slower. For comparisons other than StringComparison.Ordinal, a faster version might be possible. My article Fast Text Search with Boyer-Moore presents a C# implementation of Boyer-Moore and discusses all the issues mentioned here.

@jamesqo
Copy link
Contributor Author

jamesqo commented Aug 26, 2016

@SoftCircuits

hand-optimized assembly language instructions specifically designed to quickly scan strings

Where did you get that? From what this SO answer has told me, the IndexOf implementation is written here in (rather naive) C++, not assembly.

with Unicode, Boyer-Moore require a very large buffer to be allocated for the skip table unless you perform a couple of tricks

Yeah, I realized that earlier. There are a couple of workarounds for this. For example, String has a cached IsAscii flag, so if we only run this algorithm on ASCII strings we will only have to allocate a int[] of size 128 (which can be done on the stack).

@benaadams
Copy link
Member

benaadams commented Aug 26, 2016

Is will be hard to beat 2 strings in cpu cache by bringing in a 3rd data structure. Likely to work better for file size searches and/or much larger search strings.

@jamesqo
Copy link
Contributor Author

jamesqo commented Aug 26, 2016

@benaadams Hm, maybe. I could see it being less efficient with the bitmap scheme because of all of the bit operations that have to be done (couple of shl/shr/and). However, if we go down the lookup table route that will only take up like 8 cache lines, which is only 1/64 of L1 cache so it won't cause problems with eviction. A lookup table will also introduce no additional branching to the main loop (see implementation here), and it is O(n) to initialize depending on the length of the string so for smaller strings the added overhead may not be that much.

@benaadams
Copy link
Member

But if you are looking for a < 32 char string that's only 1 cache line?

@SoftCircuits
Copy link

@jamesqo As I mentioned originally, the code used depends on the comparison type. I don't know what the code you referenced is (and don't have time to dig into it right now). What I do know is that the performance of IndexOf() (in some comparison modes) far surpassed anything I could write in C# (even Boyer-Moore algorithms) at the time I ran my tests. There were other clues at the time that indicated the code was unmanaged, and I have been told by the original .NET developers that a lot of the lower level routines were written in hand-optimized assembler. But I don't have first-hand knowledge of the exact code. And perhaps things have changed since I looked into it.

@danmoseley danmoseley changed the title Use the Boyer-Moore search algorithm for String.IndexOf Use the Boyer-Moore search algorithm for String.IndexOf of large strings Apr 13, 2017
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@stephentoub
Copy link
Member

This design discussion is over three years old. @jamesqo, you're of course welcome to prototype your suggestion and come back with data, but we don't need the issue open tracking such experiments.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 29, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime design-discussion Ongoing discussion about design without consensus enhancement Product code improvement that does NOT require public API changes/additions tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants