Fix a performance bug that leads to worse time complexity and cripplingly slow runtime in some cases #411
On master, `diff` keeps track of an array of `components` (changes) for the current best path terminating on each diagonal of the edit graph. Whenever a vertical move happens, a new copy of this array gets created using `path.components.slice(0)`.

The trouble is that this array copying takes time proportional to the size of the array. This increases the worst case time complexity of `diff` such that it contains an `O(d³)` term, which shouldn't be the case for the Myers algorithm. To illustrate this, consider the following case:

If you log every `execEditLength` call, and also log every `clonePath` call along with the size of the `components` array being cloned, you will note that for every edit length from 100-200, 100 calls to `clonePath` happen, with the average size of the `components` array being 100. Since cloning the `components` array with `.slice` requires iterating over every item of it, that's 100 * 100 * 100 array accesses - i.e. an `O(d³)`
term in our time complexity.

The fix for this is to record components in a way that doesn't necessitate any array cloning. An easy way to do this that I see is to store the components on the `basePath` object as a linked list, with the most recently added component as the head and (0,0) as the tail; this ensures that adding a component to a path object is always an O(1) operation. It doesn't even really make the code more complicated; we just need a little bit of logic at the start of `buildValues` to traverse the list, build an array from it, and reverse it.

A nice benchmark to illustrate the size of the speedup this achieves is to diff these two files using `diffChars`:

- words.txt
- words-formatted.txt
The first contains a JavaScript array of 1379 words, as single-quoted strings on one line. The second contains the same array, but formatted using Prettier, a JavaScript formatter. After making the change I propose in this PR, jsdiff can diff them in 5 seconds on my machine; using the code from master, it takes over 10 minutes. That's over a 100-fold speedup!
(Aside: this was actually a real-world scenario I encountered, when my editor tried to use Prettier to autoformat a dictionary of words for a word game, since Prettier uses jsdiff under the hood to figure out where to position the text cursor after formatting. See prettier/prettier#4801.)
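For anyone who wants a feel for the linked-list idea without reading the patch, here is a minimal sketch. It is not the actual jsdiff code: the helper names `addComponent` and `toComponentArray` and the `lastComponent` / `previous` fields are made up for illustration, and the sketch simply ends the list at `undefined` rather than modelling the (0,0) tail.

```javascript
// Illustrative sketch only; names and object shapes are assumptions,
// not jsdiff's real internals.

// Add a component to a path in O(1): the new node points back at the
// previous head, so nothing is copied and the original path is untouched.
function addComponent(path, component) {
  return {
    newPos: path.newPos,
    lastComponent: { ...component, previous: path.lastComponent },
  };
}

// Done once at the end: walk from the most recent component back to the
// start, collect the components into an array, then reverse it.
function toComponentArray(path) {
  const components = [];
  for (let node = path.lastComponent; node !== undefined; node = node.previous) {
    const { previous, ...component } = node; // drop the link field
    components.push(component);
  }
  return components.reverse();
}

// Two paths that branch from the same base share its nodes.
const emptyPath = { newPos: 0, lastComponent: undefined };
const base = addComponent(emptyPath, { count: 3, added: false, removed: false });
const branchA = addComponent(base, { count: 1, added: true, removed: false });
const branchB = addComponent(base, { count: 2, added: false, removed: true });

console.log(toComponentArray(branchA));
console.log(toComponentArray(branchB));
```

Both branches point at the same `base` node, so branching costs one small allocation however long the path already is; the linear-time walk happens only once, when the final array is built.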
There's more performance to be gained, to be sure; @gliese1337's fast-myers-diff is another JavaScript Myers implementation that manages my benchmark above in well under a second, and Git (configured with CLI flags to do a character diff) can do it in under 30ms! In due course I will look at @gliese1337's implementation and see if I can speed jsdiff up more. In the meantime, though, this is a fairly surgical change that already provides a 100x speedup in some cases, so it seems worthwhile. :)
All 178 tests pass when I run `npm test`, which seems to confirm that I haven't changed the diff results at all - just made things faster, as hoped!