Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PR #5930/2d5597e6 backport][3.8] Switch default fallback encoding detection lib to charset-normalizer #6108

Conversation

webknjaz
Copy link
Member

This is a backport of PR #5930 as merged into master (2d5597e).

This change improves the performance of the encoding
detection by substituting the backend lib with the
new Charset-Normalizer (used to be Chardet).

The patch is backward-compatible API wise, except
that the dependency is different.

PR #5930

Co-authored-by: Sviatoslav Sydorenko [email protected]
(cherry picked from commit 2d5597e)

What do these changes do?

Switch Chardet dependency to Charset-Normalizer for the fallback encoding detection.

Are there changes in behavior for the user?

This change is mostly backward-compatible, exception of a thing:

Why should you bother with such a change? Is it worth it?

Short answer, absolutely.

Long answer:

It's still a heuristic lib, therefore cannot be trusted blindly of course.

Is UTF-8 everywhere already?

Not really, that is a dangerous assumption. Looking at https://w3techs.com/technologies/overview/character_encoding may seem like encoding detection is a thing of the past but not really. Solo based on 33k websites, you will find
3,4k responses without predefined encoding. 1,8k websites were not UTF-8, merely half! (Top 1000 sites from 80 countries in the world according to Data for SEO) https://github.com/potiuk/test-charset-normalizer

This statistic (w3techs) does not offer any ponderation, so one should not read it as
"I have a 97 % chance of hitting UTF-8 content on HTML content".

First of all, neither aiohttp, chardet or charset-normalizer are dedicated to HTML content. The detection concern every text document (SubRip Subtitle for ex.).
It is so hard to find any stats at all regarding this matter. Users' usages can be very dispersed, so making assumptions is unwise.

The real debate is to state if the detection is an HTTP client matter or not. That is more complicated and not my field.

Related issue number

No related issue.

Checklist

  • I think the code is well written
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
  • Add a new news fragment into the CHANGES folder

This change improves the performance of the encoding
detection by substituting the backend lib with the
new `Charset-Normalizer` (used to be `Chardet`).

The patch is backward-compatible API wise, except
that the dependency is different.

PR aio-libs#5930

Co-authored-by: Sviatoslav Sydorenko <[email protected]>
(cherry picked from commit 2d5597e)
@webknjaz webknjaz requested a review from asvetlov as a code owner October 20, 2021 11:40
@webknjaz webknjaz self-assigned this Oct 20, 2021
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Oct 20, 2021
@webknjaz webknjaz enabled auto-merge (squash) October 20, 2021 11:40
@codecov
Copy link

codecov bot commented Oct 20, 2021

Codecov Report

Merging #6108 (2d5597e) into 3.8 (66e281f) will decrease coverage by 4.21%.
The diff coverage is 90.31%.

❗ Current head 2d5597e differs from pull request most recent head cfecafa. Consider uploading reports for the commit cfecafa to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##              3.8    #6108      +/-   ##
==========================================
- Coverage   97.52%   93.31%   -4.22%     
==========================================
  Files          44      102      +58     
  Lines        8865    30062   +21197     
  Branches     1429     2690    +1261     
==========================================
+ Hits         8646    28053   +19407     
- Misses        103     1833    +1730     
- Partials      116      176      +60     
Flag Coverage Δ
unit 93.24% <90.11%> (-4.11%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiohttp/web_fileresponse.py 98.62% <ø> (-1.38%) ⬇️
aiohttp/web_log.py 100.00% <ø> (ø)
aiohttp/web_middlewares.py 100.00% <ø> (ø)
aiohttp/web_protocol.py 86.41% <ø> (-4.20%) ⬇️
aiohttp/web_request.py 95.99% <ø> (-1.57%) ⬇️
aiohttp/web_response.py 98.21% <ø> (-0.06%) ⬇️
aiohttp/web_routedef.py 98.11% <ø> (+0.01%) ⬆️
aiohttp/web_runner.py 92.34% <ø> (-5.29%) ⬇️
aiohttp/web_server.py 94.28% <ø> (-5.72%) ⬇️
aiohttp/web_urldispatcher.py 97.54% <ø> (-0.06%) ⬇️
... and 135 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04d9ac4...cfecafa. Read the comment docs.

@webknjaz webknjaz added this to the 3.8 milestone Oct 20, 2021
@webknjaz webknjaz merged commit c81c56d into aio-libs:3.8 Oct 20, 2021
@webknjaz webknjaz deleted the patchback/backports/3.8/2d5597e6743bb4c579adb6f9b67482d5d35978c7/pr-5930 branch April 19, 2024 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot:chronographer:provided There is a change note present in this PR enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants