Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560

jamesqo · 2016-08-25T18:34:18Z

The current implementation of String.IndexOf(string) uses a naive loop to check whether a substring of some given text matches a pattern. Although this implementation is simple and easy to understand, it is very inefficient because it has to iterate through a minimum of m - n + 1 characters of the text, where m is the length of the text and n is the length of the pattern string.

We should instead use the Boyer-Moore search algorithm to implement this function. It performs much better for larger patterns, since it allows us to 'skip over' some characters based on what we know from pre-processing the string. Here is an answer on StackOverflow that explains how it works, and the Wikipedia link has sample implementations in C/Java.

edit: Took me a day to wrap my head around the algorithm, but I have an implementation here and it seems to work well.

edit 2: Maybe this would benefit functions like Replace or Split more. Those functions have to make a pass through the entire string anyway, whereas with IndexOf we stop at the first match.

The text was updated successfully, but these errors were encountered:

jkotas · 2016-08-25T22:06:21Z

it is very inefficient

It is inefficient for large strings. It is pretty efficient for small strings that is likely the typical usage pattern for IndexOf...

benaadams · 2016-08-25T22:07:45Z

For smaller strings, could check first, a middle and last char before checking the intermediate chars (pre-processing will have some setup, so will be cross over point)

e.g. for words the entropy increases with each char; however there are also a lot of shared endings, hence checking a middle (e.g. -ing -ed -s etc)

For very short search strings the naive loop would be more efficient (though could use larger compares e.g. ulong on x64)

benaadams · 2016-08-25T22:11:10Z

Also would only apply to non-culture strings as culture aware strings use OS compare (at least on windows)

jamesqo · 2016-08-25T22:47:07Z

@jkotas Yeah, for small strings it may be problematic since it needs a dynamically allocated buffer for delta2. For those scenarios I'm planning to use a 'lite' Boyer-Moore and use a tiny statically allocated bitmap, like what's being done with IndexOfAny right now. If any of the 2 bits corresponding to a char are not set in the bitmap, then we will be 100% sure it's not in the pattern and will still be able to apply the 'skip pattern.Length chars if that char isn't in the pattern' optimization. That IMO is the main benefit of this algorithm.

ghost · 2016-08-25T22:47:27Z

@SoftCircuits, saw your varied implementations of Boyer-Moore algorithm via http://stackoverflow.com/a/4916363. Would love to have your take on this. :)

jamesqo · 2016-08-25T22:48:49Z

@benaadams

Also would only apply to non-culture strings as culture aware strings use OS compare (at least on windows)

I wonder if we could bypass doing that if IsAscii() is true, since I'm pretty sure any sane OS would return uniform results for those kinds of strings.

SoftCircuits · 2016-08-26T00:14:43Z

In my experience, String.IndexOf() with StringComparision.Ordinal will always be faster than Boyer-Moore except for extreme pathological cases. The reason appears to be that String.IndexOf() uses hand-optimized assembly language instructions specifically designed to quickly scan strings. In addition, with Unicode, Boyer-Moore require a very large buffer to be allocated for the skip table unless you perform a couple of tricks, which could potentially make it even slower. For comparisons other than StringComparison.Ordinal, a faster version might be possible. My article Fast Text Search with Boyer-Moore presents a C# implementation of Boyer-Moore and discusses all the issues mentioned here.

jamesqo · 2016-08-26T00:26:09Z

@SoftCircuits

hand-optimized assembly language instructions specifically designed to quickly scan strings

Where did you get that? From what this SO answer has told me, the IndexOf implementation is written here in (rather naive) C++, not assembly.

with Unicode, Boyer-Moore require a very large buffer to be allocated for the skip table unless you perform a couple of tricks

Yeah, I realized that earlier. There are a couple of workarounds for this. For example, String has a cached IsAscii flag, so if we only run this algorithm on ASCII strings we will only have to allocate a int[] of size 128 (which can be done on the stack).

benaadams · 2016-08-26T02:36:42Z

Is will be hard to beat 2 strings in cpu cache by bringing in a 3rd data structure. Likely to work better for file size searches and/or much larger search strings.

jamesqo · 2016-08-26T03:04:39Z

@benaadams Hm, maybe. I could see it being less efficient with the bitmap scheme because of all of the bit operations that have to be done (couple of shl/shr/and). However, if we go down the lookup table route that will only take up like 8 cache lines, which is only 1/64 of L1 cache so it won't cause problems with eviction. A lookup table will also introduce no additional branching to the main loop (see implementation here), and it is O(n) to initialize depending on the length of the string so for smaller strings the added overhead may not be that much.

benaadams · 2016-08-26T03:10:37Z

But if you are looking for a < 32 char string that's only 1 cache line?

SoftCircuits · 2016-08-26T14:31:57Z

@jamesqo As I mentioned originally, the code used depends on the comparison type. I don't know what the code you referenced is (and don't have time to dig into it right now). What I do know is that the performance of IndexOf() (in some comparison modes) far surpassed anything I could write in C# (even Boyer-Moore algorithms) at the time I ran my tests. There were other clues at the time that indicated the code was unmanaged, and I have been told by the original .NET developers that a lot of the lower level routines were written in hand-optimized assembler. But I don't have first-hand knowledge of the exact code. And perhaps things have changed since I looked into it.

stephentoub · 2020-02-20T16:21:22Z

This design discussion is over three years old. @jamesqo, you're of course welcome to prototype your suggestion and come back with data, but we don't need the issue open tracking such experiments.

danmoseley changed the title ~~Use the Boyer-Moore search algorithm for String.IndexOf~~ Use the Boyer-Moore search algorithm for String.IndexOf of large strings Apr 13, 2017

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

stephentoub closed this as completed Feb 20, 2020

ghost locked as resolved and limited conversation to collaborators Dec 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560

Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560

jamesqo commented Aug 25, 2016

jkotas commented Aug 25, 2016

benaadams commented Aug 25, 2016 •

edited

Loading

benaadams commented Aug 25, 2016

jamesqo commented Aug 25, 2016

ghost commented Aug 25, 2016

jamesqo commented Aug 25, 2016 •

edited

Loading

SoftCircuits commented Aug 26, 2016 •

edited

Loading

jamesqo commented Aug 26, 2016 •

edited

Loading

benaadams commented Aug 26, 2016 •

edited

Loading

jamesqo commented Aug 26, 2016

benaadams commented Aug 26, 2016

SoftCircuits commented Aug 26, 2016

stephentoub commented Feb 20, 2020

Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560

Use the Boyer-Moore search algorithm for String.IndexOf of large strings #6560

Comments

jamesqo commented Aug 25, 2016

jkotas commented Aug 25, 2016

benaadams commented Aug 25, 2016 • edited Loading

benaadams commented Aug 25, 2016

jamesqo commented Aug 25, 2016

ghost commented Aug 25, 2016

jamesqo commented Aug 25, 2016 • edited Loading

SoftCircuits commented Aug 26, 2016 • edited Loading

jamesqo commented Aug 26, 2016 • edited Loading

benaadams commented Aug 26, 2016 • edited Loading

jamesqo commented Aug 26, 2016

benaadams commented Aug 26, 2016

SoftCircuits commented Aug 26, 2016

stephentoub commented Feb 20, 2020

benaadams commented Aug 25, 2016 •

edited

Loading

jamesqo commented Aug 25, 2016 •

edited

Loading

SoftCircuits commented Aug 26, 2016 •

edited

Loading

jamesqo commented Aug 26, 2016 •

edited

Loading

benaadams commented Aug 26, 2016 •

edited

Loading