This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

utils.full_process executed when processor=None #319

Open
sdennler opened this issue Aug 1, 2021 · 1 comment

sdennler commented Aug 1, 2021

Great and very helpful tool! Thank you!

One thing I noticed is that even when process.extractOne (and the other process functions) is called with processor=None, utils.full_process is still executed several times, probably because of this line:

pre_processor = partial(utils.full_process, force_ascii=True)

The following produces the same output for both calls:

from fuzzywuzzy import process

query = "123   ....  "
choices = ["123", query]

print(process.extract(query, choices))
print(process.extract(query, choices, processor=None))

Output:

[('123', 100), ('123   ....  ', 100)]
[('123', 100), ('123   ....  ', 100)]

I would expect the exact 1:1 match to rank higher when no processor is used, so something like this:

[('123', 100), ('123   ....  ', 100)]
[('123   ....  ', 100), ('123', 90)]
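
For readers wondering why both choices score 100: FuzzyWuzzy documents utils.full_process as keeping only letters and numbers, trimming whitespace, and lowercasing. The sketch below is a rough stdlib approximation of that behavior, not the library's actual code, and the helper name approx_full_process is hypothetical. Under that assumption, both choices collapse to the same string before scoring:

```python
import re


def approx_full_process(s: str) -> str:
    """Rough stand-in for fuzzywuzzy.utils.full_process (an assumption, not the real code)."""
    # Replace every non-word character with a space, then lowercase and trim.
    return re.sub(r"\W", " ", s).lower().strip()


query = "123   ....  "
choices = ["123", query]

# Both choices normalize to the same string, so a scorer that
# preprocesses its inputs cannot tell them apart.
processed = [approx_full_process(c) for c in choices]
print(processed)  # ['123', '123']
```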

@maxbachmann

In FuzzyWuzzy, the processor argument only lets you apply additional preprocessing. It does not provide a way to disable the preprocessing done inside the scorer. So when calling

process.extract(query, choices, processor=None)

the strings are still preprocessed, because the default scorer fuzz.WRatio preprocesses its inputs by default. To disable this, you would have to use:

process.extract(query, choices, processor=None, scorer=partial(fuzz.WRatio, full_process=False))

I agree that this is very counter-intuitive, which is why I use the behavior you expected in RapidFuzz.
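
To illustrate the mechanism described above without depending on FuzzyWuzzy itself, here is a minimal stdlib-only sketch. Everything prefixed with toy_ is hypothetical: toy_scorer imitates a scorer that preprocesses unless full_process=False is passed, and toy_extract imitates a ranking function that simply calls the scorer. The similarity measure is difflib's ratio, not WRatio, so the scores differ from FuzzyWuzzy's; only the effect of functools.partial is the point.

```python
import re
from difflib import SequenceMatcher
from functools import partial


def toy_full_process(s):
    # Hypothetical stand-in for utils.full_process: keep word characters,
    # lowercase, and trim surrounding whitespace.
    return re.sub(r"\W", " ", s).lower().strip()


def toy_scorer(s1, s2, full_process=True):
    # Like fuzz.WRatio, this scorer preprocesses its inputs by default.
    if full_process:
        s1, s2 = toy_full_process(s1), toy_full_process(s2)
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def toy_extract(query, choices, scorer=toy_scorer):
    # Score every choice and return the results sorted best-first.
    results = [(choice, scorer(query, choice)) for choice in choices]
    return sorted(results, key=lambda r: r[1], reverse=True)


query = "123   ....  "
choices = ["123", query]

# Default scorer: both choices are normalized to "123" and tie at 100.
default_results = toy_extract(query, choices)

# Scorer with preprocessing switched off via partial: only the exact
# match keeps a perfect score, so it ranks first.
raw_results = toy_extract(
    query, choices, scorer=partial(toy_scorer, full_process=False)
)

print(default_results)
print(raw_results)
```

The design point is that the flag lives on the scorer, so binding it with partial is the only way to reach it from the calling code.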
