[PR #5930/2d5597e6 backport][3.8] Switch default fallback encoding detection lib to charset-normalizer
#6108
This is a backport of PR #5930 as merged into master (2d5597e).
This change improves the performance of the encoding detection by substituting the backend lib with the new Charset-Normalizer (used to be Chardet). The patch is backward-compatible API-wise, except that the dependency is different.
PR #5930
Co-authored-by: Sviatoslav Sydorenko [email protected]
(cherry picked from commit 2d5597e)
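The performance claim above can be sanity-checked locally. A hypothetical micro-benchmark could look like the sketch below; the `sample.srt` file name is a placeholder, and the absolute numbers depend entirely on the payload and machine:

```python
# Hypothetical timing comparison of the two detection backends.
# Both packages must be installed; 'sample.srt' is a placeholder payload.
import timeit

setup = "import chardet, charset_normalizer; data = open('sample.srt', 'rb').read()"
print("chardet           :", timeit.timeit("chardet.detect(data)", setup=setup, number=10))
print("charset-normalizer:", timeit.timeit("charset_normalizer.detect(data)", setup=setup, number=10))
```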
What do these changes do?
Switch Chardet dependency to Charset-Normalizer for the fallback encoding detection.
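For context, here is a minimal sketch of what "fallback" detection means in practice. This is not aiohttp's actual implementation; the `guess_encoding` helper and the sample inputs are made up, and only `charset_normalizer.detect` is the real chardet-compatible entry point. The idea: a charset declared in `Content-Type` wins, and the detector is only consulted when nothing is declared.

```python
# Minimal sketch, not aiohttp's actual code: detection is only a fallback
# for payloads that do not declare a charset themselves.
from email.message import Message

from charset_normalizer import detect  # chardet-compatible drop-in


def guess_encoding(content_type: str, body: bytes) -> str:
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_param("charset")  # charset=... from the header, if any
    if declared:
        return declared
    guess = detect(body)  # dict with 'encoding', 'language', 'confidence'
    return guess["encoding"] or "utf-8"  # arbitrary default when detection fails


print(guess_encoding("text/html; charset=utf-8", b""))             # declared charset wins
print(guess_encoding("text/html", "naïve café".encode("cp1252")))  # detector's best guess
```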
Are there changes in behavior for the user?
This change is mostly backward-compatible, with one exception: the encoding that the fallback detection reports can differ from what Chardet used to report, as the examples below show.
Why should you bother with such a change? Is it worth it?
Short answer: absolutely.
Long answer:
Here are examples of payloads where the two backends disagree (a quick A/B check is sketched after this list):

- Windows-1252 according to Chardet, Windows-1254 for charset-normalizer
- ISO-8859-7 according to Chardet, utf_8 for charset-normalizer
- Windows-1252 according to Chardet, utf_8 for charset-normalizer

requests did integrate it first and, for total transparency, the lib needed some minor adjustments. But it is going well so far. It's still a heuristic lib, therefore it cannot be trusted blindly, of course.
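A quick, hypothetical A/B check (the file name is a placeholder; both libraries must be installed) could look like this:

```python
# Hypothetical side-by-side check: both libraries expose detect() returning
# a dict with 'encoding' and 'confidence'; the answers may legitimately differ.
import chardet
import charset_normalizer

with open("legacy_page.html", "rb") as fh:  # placeholder sample file
    payload = fh.read()

old = chardet.detect(payload)
new = charset_normalizer.detect(payload)
print("chardet           :", old["encoding"], old["confidence"])
print("charset-normalizer:", new["encoding"], new["confidence"])
```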
Is UTF-8 everywhere already?
Not really, and assuming so is dangerous. Looking at https://w3techs.com/technologies/overview/character_encoding may make encoding detection seem like a thing of the past, but it is not. Based solely on 33k websites, you will find 3.4k responses without a predefined encoding, and 1.8k of those were not UTF-8, roughly half! (Top 1000 sites from 80 countries in the world, according to Data for SEO.) https://github.com/potiuk/test-charset-normalizer
This statistic (w3techs) does not offer any weighting, so one should not read it as
"I have a 97 % chance of hitting UTF-8 content when fetching HTML".
First of all, neither aiohttp, chardet nor charset-normalizer is dedicated to HTML content. The detection concerns every kind of text document (SubRip subtitles, for example).
It is hard to find any statistics at all on this matter. Users' needs can be very dispersed, so making assumptions is unwise.
The real debate is whether the detection is an HTTP client's concern at all. That is more complicated and not my field.
Related issue number
No related issue.
Checklist

- CONTRIBUTORS.txt
- CHANGES folder