Add method to remove any entries from a .voc file from a string. #2987
Useful to clean utterances of common articles or other noise when parsing for specific information. This method is not case sensitive.
Voight Kampff Integration Test Succeeded (Results) |
I think I've missed something... where are the other 2 approaches?
I noticed the extraneous spaces thing too but didn't see a reason to remove them. In one way it helps to show that things have been removed, rather than operating on what is now probably at least grammatically incorrect, and potentially has even changed the meaning slightly. This is kind of silly but the best I can come up with at 5am...
cleaned_string = self.remove_voc("john likes the football", "football")
if "john the football" in cleaned_string:
    ...
On the flipside, I could see that extra characters may impact a confidence rating based on keyword frequency etc.
In terms of case sensitivity the docstring says "is not case sensitive" but maybe this should be worded as "is case insensitive" to match the flag. |
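To make the whitespace point concrete, here is a minimal sketch. It does not use the PR's `remove_voc`; `strip_word` and its `collapse_spaces` flag are hypothetical names invented for illustration, and the removed word is passed in directly rather than read from a .voc file.

```python
import re


def strip_word(utterance, word, collapse_spaces=False):
    """Remove whole-word, case-insensitive occurrences of `word`.

    Hypothetical helper for illustration only, not the PR's code.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    cleaned = pattern.sub("", utterance)
    if collapse_spaces:
        # Squash the gap left behind by the removed word.
        cleaned = " ".join(cleaned.split())
    return cleaned


raw = strip_word("john likes the football", "likes")
tidy = strip_word("john likes the football", "likes", collapse_spaces=True)

# The gap is still visible in `raw`, so a plain substring check misses.
print("john the football" in raw)
print("john the football" in tidy)
```

Leaving the gap keeps a trace of what was removed, while collapsing it makes later substring checks behave the way a reader would expect.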
There was some discussion around the implementation when I tried to point out a bug in #2986. I couldn't find Ken's versions there now, but I put up a gist with a collection of the ones I could remember: https://gist.github.com/forslund/c01636a18ddef0a9bef84dae609f4511
The results for 100000 runs of each: The
' '.join(w for w in phrase.split() if w.lower() not in words)
The lowering makes it lose about 0.1 seconds in the test case, but it's still one of the fastest of the versions above.
Edit: Ken provided the missing version and I have updated with the dont_preserve_order one. Not sure it's ideal since it will rearrange the sentence. |
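The one-liner above can be wrapped into a runnable sketch. The function name `remove_words` is an assumption made for this example, and `words` is taken to be a plain collection of single-word entries already loaded from the .voc file.

```python
def remove_words(phrase, words):
    """Drop every whitespace-separated token that matches an entry in `words`.

    Hypothetical wrapper around the one-liner from the thread; matching is
    case insensitive and the order of the remaining words is preserved.
    """
    lowered = {w.lower() for w in words}  # set lookup keeps the filter cheap
    return ' '.join(w for w in phrase.split() if w.lower() not in lowered)


print(remove_words("Play The Beatles on the radio", ["the", "on"]))
# Multi-word entries never match, because the phrase is split on whitespace.
print(remove_words("Never Forget it", ["forget it"]))
```

Because the phrase is split and re-joined, this variant also collapses any extra whitespace, and it leaves tokens with attached punctuation (such as `the,`) untouched.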
Nice comparison - thanks! I love the simplicity of the
remove_voc("Never Forget it", "forget it") # Never
I've added a "help wanted" flag to see if anyone else has a good suggestion. You can find unit tests in this PR to validate it. |
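One possible way to get the multi-word behaviour shown above is a single case-insensitive regular expression built from the entries. This is only a sketch under stated assumptions: `remove_phrases` is a hypothetical name, and the entries are passed in as a list instead of being read from a .voc file as the PR's `remove_voc` would do.

```python
import re


def remove_phrases(utterance, phrases):
    """Remove whole-word occurrences of each phrase, ignoring case.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    # Longest first, so "forget it" wins over "forget" when both are listed.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        return utterance
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Substitute a space, then normalise whitespace in one pass.
    return " ".join(pattern.sub(" ", utterance).split())


print(remove_phrases("Never Forget it", ["forget it"]))
```

The word-boundary anchors keep an entry such as `the` from eating part of `theory`, at the cost of compiling a pattern for each call unless the caller caches it.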
Closing PR since we're archiving the repo |
Description
The remove_voc method complements the existing vocab matching tools. It is useful to clean utterances of common articles or other noise when parsing for specific information.
This method is not case sensitive, unlike the other vocab matching methods. Is the case sensitivity intentional? Should we make all methods case insensitive?
Context: I needed this type of method today, then saw that the same need had been identified in #2986 and that a method to do it already existed in OVOS.
How to test
Unit tests included
Contributor license agreement signed?