[PR #5930/2d5597e6 backport][3.8] Switch default fallback encoding detection lib to charset-normalizer
#6108
This is a backport of PR #5930 as merged into master (2d5597e).
This change improves the performance of the encoding detection by substituting the backend lib with the new Charset-Normalizer (used to be Chardet). The patch is backward-compatible API-wise, except that the dependency is different.
PR #5930
Co-authored-by: Sviatoslav Sydorenko [email protected]
(cherry picked from commit 2d5597e)
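The performance claim above can be sanity-checked locally. A hypothetical micro-benchmark could look like the sketch below; the `sample.srt` file name is a placeholder, and the absolute numbers depend entirely on the payload and machine:

```python
# Hypothetical timing comparison of the two detection backends.
# Both packages must be installed; 'sample.srt' is a placeholder payload.
import timeit

setup = "import chardet, charset_normalizer; data = open('sample.srt', 'rb').read()"
print("chardet           :", timeit.timeit("chardet.detect(data)", setup=setup, number=10))
print("charset-normalizer:", timeit.timeit("charset_normalizer.detect(data)", setup=setup, number=10))
```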
What do these changes do?
Switch Chardet dependency to Charset-Normalizer for the fallback encoding detection.
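For context, here is a minimal sketch of what "fallback" detection means in practice. This is not aiohttp's actual implementation; the `guess_encoding` helper and the sample inputs are made up, and only `charset_normalizer.detect` is the real chardet-compatible entry point. The idea: a charset declared in `Content-Type` wins, and the detector is only consulted when nothing is declared.

```python
# Minimal sketch, not aiohttp's actual code: detection is only a fallback
# for payloads that do not declare a charset themselves.
from email.message import Message

from charset_normalizer import detect  # chardet-compatible drop-in


def guess_encoding(content_type: str, body: bytes) -> str:
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_param("charset")  # charset=... from the header, if any
    if declared:
        return declared
    guess = detect(body)  # dict with 'encoding', 'language', 'confidence'
    return guess["encoding"] or "utf-8"  # arbitrary default when detection fails


print(guess_encoding("text/html; charset=utf-8", b""))             # declared charset wins
print(guess_encoding("text/html", "naïve café".encode("cp1252")))  # detector's best guess
```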
Are there changes in behavior for the user?
This change is mostly backward-compatible, with one exception: the encoding that the fallback detection reports can differ from what Chardet used to report, as the examples below show.
Why should you bother with such a change? Is it worth it?
Short answer: absolutely.
Long answer:
Here are examples of payloads where the two backends disagree (a quick A/B check is sketched after this list):

- Windows-1252 according to Chardet, Windows-1254 for charset-normalizer
- ISO-8859-7 according to Chardet, utf_8 for charset-normalizer
- Windows-1252 according to Chardet, utf_8 for charset-normalizer

requests did integrate it first and, for total transparency, the lib needed some minor adjustments. But it is going well so far. It's still a heuristic lib, therefore it cannot be trusted blindly, of course.
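A quick, hypothetical A/B check (the file name is a placeholder; both libraries must be installed) could look like this:

```python
# Hypothetical side-by-side check: both libraries expose detect() returning
# a dict with 'encoding' and 'confidence'; the answers may legitimately differ.
import chardet
import charset_normalizer

with open("legacy_page.html", "rb") as fh:  # placeholder sample file
    payload = fh.read()

old = chardet.detect(payload)
new = charset_normalizer.detect(payload)
print("chardet           :", old["encoding"], old["confidence"])
print("charset-normalizer:", new["encoding"], new["confidence"])
```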
Is UTF-8 everywhere already?
Not really, and assuming so is dangerous. Looking at https://w3techs.com/technologies/overview/character_encoding may make encoding detection seem like a thing of the past, but it is not. Based solely on 33k websites, you will find 3.4k responses without a predefined encoding, and 1.8k of those were not UTF-8, roughly half! (Top 1000 sites from 80 countries in the world, according to Data for SEO.) https://github.com/potiuk/test-charset-normalizer
This statistic (w3techs) does not offer any weighting, so one should not read it as
"I have a 97 % chance of hitting UTF-8 content when fetching HTML".
First of all, neither aiohttp, chardet nor charset-normalizer is dedicated to HTML content. The detection concerns every kind of text document (SubRip subtitles, for example).
It is hard to find any statistics at all on this matter. Users' needs can be very dispersed, so making assumptions is unwise.
The real debate is whether the detection is an HTTP client's concern at all. That is more complicated and not my field.
Related issue number
No related issue.
Checklist

- CONTRIBUTORS.txt
- CHANGES folder