-
Notifications
You must be signed in to change notification settings - Fork 28
Use LCS to Rank Matches #20
Comments
Interesting idea. Found the algo for CoffeeScript: http://rosettacode.org/wiki/Longest_common_subsequence#CoffeeScript Will try to implement it in my PR |
LCS alone will not help for the order of item in the screenshot. We have What you need is to normalize matches against the length of both string. I myself have a project using LCS and use something like this:
Note that you can probably use current algorithm to calculate m too. This is because current algorithm looks like a greedy approximation of LCS, with special score depending of the nature of the character. If you want to see behavior of above scoring among other things, I have a demo page |
Tried the LCS, it's way too slow.
Instead of
Maybe I'm missing something... I also added a test in my PR for the use case above. |
Well spotted @jeancroy, I see now that LCS won't handle my use case. Your scoring function can be greatly simplified into "prefer shorter hits" if used to compare already-filtered hits. Since the hits are already filtered, we actually already have the LCS, and:
So if we replace Which I think sounds just fine actually :) |
@plrenaudin, thanks for trying this and benchmarking it! After the above discussion however I'm not sure this is actually the way to go any more, and I'm not convinced about pure substring matches either. Real-world example, searching for A substring match would put "Find And Replace: Select All" at the bottom (good), but it would also consider "uninstall", "install" and "installed" as equally good (bad). So I'm starting to think that the way to go about this is to:
That's probably better than starting with a solution like I suggested here initially. |
You're right, People would also want to score starting substring with a higher score... I'd like to have similar suggestion as IntelliJ or Sublime text but it would probably require to rewrite the whole scoring system. We really need improvement on this because for projects with lots of files, the pertinence is just so bad that it's rare to have what we search for in first result. |
@walles the part you simplified out is actually important for accuracy should you use LCS. Now I because your scoring function have a no typo policy I'd agree with your simplification. |
I have put something together that I believe account for some of the issues mentioned.
|
@jeancroy Thanks for the idea!
It's missing some results so even if it worked, it would still be quite slow compared to the actual implementation. But there is definitely a potential improvement in the tokenizing of the query / subject. |
@plrenaudin can I play with your implementation of that benchmark? It should not miss an "index" so i'm curious to debug that, also query is not meant to be re-computed each time. Also open question ... Why do this project score special character higher ? I mean any character of query not in subject squash the score to 0. Isn't not being squashed to 0 a good enough outcome ? Also every result not being 0 will have every char including every special char and their bonus, so it doesn't help the sort very much. |
@jeancroy My "implementation" is just to change the score function by your gist, nothing more nothing less. This is where the magic should happen and the score put the relevant results in front of the rest. I'm sure you could improve thist but it looks like the algo has much more complexity than the one in place so having a slower benchmark is not that surprising. I personally wouldn't mind having a slower fuzzy algo for more pertinence though. |
@plrenaudin sorry I did not meant you have changed something in a wrong way. Just I never played with coffeescript / node only before today so I didn't know where to start. I corrected something in my code. Filtering 66672 entries for 'index' took 877ms for 2040 results I'm also OK with matches that are missed. I believe any token implementation (idea 1) will miss them. index vs and some more |
@jeancroy You didn't change anything in a wrong way, it just requires more changes to match the expected results of tests. It bothers me to see that for such an important feature there is so few people concerned in improving it. |
@plrenaudin This issue has many "lurkers", I assure you. But I guess many, like myself, are very busy with other stuff. Wish I had time to invest in improving this... |
I applaud the effort put into this so far by the way. Love it when the community (even if only a handful of people) come together like this ❤️ |
I just got off of a work bender ... pushing for release after my vacation time the last couple weeks. I'm still trying to catch up on all the work you all are doing. One thing I'd like to note though ... I'd like to see more testing of "fuzzy" matches, just to make sure that all bases are covered, such as: This is the way a lot of people search for things, not when they don't know what they're looking for ... but when they do. As another example, when I need to set the grammar of a file by hand, I use my set-syntax package which allows the Command Palette to mimic how I did things in Sublime: |
The question then is: should ssrb match "Set Syntax: Ruby" or a class called "SsrbController" for example? |
Definitely there will be. I just didn't want things getting completely optimized for whole or partial words when scattered single-character matches are also an important use case. |
Indeed, I completely agree with the use case as for the one with CamelCaseFiles |
Ok a new day, a new algorithm :) This one is based on the idea of splitting the speed/accuracy trade off in two.
For 1, i simplified current algorithm will all score = 1. Output is match yes/no.. This mean new algorithm report same positive result than original. For 2, i used a local sequence alignment algorithm with afine gap penalty (Smith Waterman / Gotoh). I also used the measure of similarity between characters to implement lower/upper case bonus, CamelCase and This is an acronym What this mean in layman term a. We get a lot of knob to play with. I've downloaded your updated |
Nicely done @jeancroy |
added test, it pass. |
Ok this version pass the whole suite. Anyone has any other test case ? Once we have cleared some other test/ideas I might do a PR. |
@jeancroy, I'd like to have "Settings View: View Installed Themes" added to the "install" test. |
@jeancroy congrats! I think it covers already a good bunch of them now. |
@walles it now rank like so, make sens? 0: "install" I believe the rule that come into play is both "rank start of match higher" and "prefer smaller subject", there will be some point for "installed" because the proper letter "i" follow a token separator (space) |
That's a question about exact case substring score vs start of the word score. |
note that because the install test is made like so Uninstall have the exact case but not Installed |
Filtering 66672 entries for 'index' took 658ms for 6168 results |
Yes, what I mean is the "exact case substring" score should not beat the "ignore case substring at start of the word" score. But that's totally personal, I'm already fine with the result 😃 Great stuff for the benchmark! |
@jeancroy I think your last list seems splendid! Also I think that a PR wouldn't need to solve everything, so as long as the "installed" test is in the PR I think it's a good (new) starting point for making more refinements from. PR! |
What do I do with credit from |
I assume it's safe to remove it as for the line (https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12) if tests passes afterward |
@lee-dohm I considered your example. They are particular in the sens the "take first concurrence of the matching char after previous match" algorithm magically align with word boundary. By magically I mean, I'm pretty sure someone with experience can craft a query that match, but Sublime, Chrome dev tool and others software setup a higher expectation of user-friendliness . Consider this example
Scattered letter is the best we can do
Here we have to undo 3 greedy matches to get intended result. For that I provide two mechanism:
let
so we have 2 opening 5 extension
so we have 10 extension but only 1 opening So basically:
So for me that is the main contribution of PR #22, to get extra information about the match structure. Once we have that information it's easier to fine-tune for particular cases. But that information have a computing power cost, so we provide quick exit for low quality matches (same==0 rule as before) as well as trivial matches (for now this is exact substring and 100% camel case initialism) To that I'll add that @plrenaudin wanted to nail the exact substring case and provided plenty of real life test case to the spec, his contribution became the base for the "exact substring" part. In addition to exact substring, CamelCase got a special handler to ensure we nail them. Also @walles was very useful in asking the right questions, scouring issues for test cases, and providing a patch for a substantial speedup. Lastly how does this relate to LCS ? |
I suggest using LCS to rank matches:
https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Solution_for_two_sequences
Matches with lower number of substrings would come before matches with higher number of substrings.
This would be a lot more generic than the substring() based approach suggested in some PRs.
If you want to play with this, here's an online thing demonstrating the concept (I didn't write this, just found it):
http://lcs-demo.sourceforge.net/
Start by upping the "Max Size" a bit, then type in two strings and press "Execute LCS Lengths" to see how it works.
One real-world use case where this would come in handy is when searching for "git push", note how it's third place even though "git" and "push" are exact matches. Single-word matches would be caught by this as well.
The text was updated successfully, but these errors were encountered: