Fake request headers to reduce bot detection #263

Merged
sissbruecker merged 1 commit into master from fix/fake_request_headers
May 21, 2022

sissbruecker
Owner

Fixes #262

@sissbruecker sissbruecker merged commit e08bf9f into master May 21, 2022
@sissbruecker sissbruecker deleted the fix/fake_request_headers branch May 21, 2022 11:25

    # Use charset_normalizer to determine encoding that best matches the response content
    # Several sites seem to specify the response encoding incorrectly, so we ignore it and use custom logic instead
    # This is different from Response.text which does respect the encoding specified in the response first,
    # before trying to determine one
    results = from_bytes(r.content)
    return str(results.best())


def fake_request_headers():
Contributor

@bachya bachya May 21, 2022

For a future PR, why not make this a constant (since its contents are always the same)?

Owner Author

No reason. One article recommended rotating the user agent, which would have required a function, but I never got that far. It could be changed to a constant.
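
For reference, the two options discussed in this thread could look roughly like the sketch below: a module-level constant for the fixed headers, plus a small function that only exists to rotate the user agent. Everything here is an assumption for illustration. The names `FAKE_REQUEST_HEADERS`, `USER_AGENTS` and `rotating_request_headers` are hypothetical, and the header values are placeholders, not the ones added in this PR.

```python
import random
from urllib.request import Request

# Hypothetical header values for illustration; not the ones from this PR.
FAKE_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
}

# Hypothetical pool of user agents to rotate through.
USER_AGENTS = (
    FAKE_REQUEST_HEADERS["User-Agent"],
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
)


def rotating_request_headers():
    """Return a fresh header dict with a randomly chosen User-Agent."""
    headers = dict(FAKE_REQUEST_HEADERS)  # copy so the constant is never mutated
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return headers


# Building a Request object does not send anything over the network.
request = Request("https://example.com", headers=rotating_request_headers())
```

If rotation is never needed, the constant alone is enough and can be passed directly as the `headers` argument of the HTTP call; the function only earns its place once the headers vary between requests.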

@sissbruecker sissbruecker mentioned this pull request Aug 13, 2022
Development

Successfully merging this pull request may close these issues.

Getting title not working for articles from nytimes.com
2 participants