This is a Python library to retrieve the contents of a given URL via HTTP (or HTTPS) and hash the processed contents.
If an encoding is detected, this package converts the content to UTF-8 before processing it.
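As a rough illustration of why normalizing to UTF-8 matters before hashing, the sketch below re-encodes the raw bytes when an encoding is known and then hashes the result. The function name `hash_as_utf8`, its parameters, and the SHA-256 default are assumptions made for this example and are not this package's API; how the encoding is detected is left out entirely.

```python
import hashlib
from typing import Optional


def hash_as_utf8(
    raw: bytes, detected_encoding: Optional[str], algorithm: str = "sha256"
) -> str:
    """Hash raw bytes, re-encoding them as UTF-8 first when an encoding is known.

    Hypothetical helper for illustration only; not part of this package.
    """
    if detected_encoding:
        # Decode with the detected encoding, then re-encode as UTF-8 so the
        # same text yields the same digest regardless of how it was served.
        raw = raw.decode(detected_encoding).encode("utf-8")
    # With no detected encoding, the bytes are hashed unchanged.
    return hashlib.new(algorithm, raw).hexdigest()
```

With this approach, the same text served as Latin-1 by one server and as UTF-8 by another produces the same digest.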
Additional content processing is currently implemented for the following types of content:
- HTML
- JSON
HTML content is processed by leveraging the pyppeteer package to execute any JavaScript on a retrieved page. The result is then parsed by Beautiful Soup to reduce the content to the human-visible portions of a page.
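The idea of reducing a page to its human-visible text can be sketched as follows. To keep the example free of third-party dependencies, it swaps in the standard library's `html.parser` in place of Beautiful Soup and skips the JavaScript rendering step altogether. The class name, the function name, and the set of tags treated as hidden are assumptions for illustration, not this package's implementation.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside a non-rendered element."""

    # Assumed set of elements whose text is not shown as page content.
    HIDDEN_TAGS = {"script", "style", "head", "title", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # how many hidden elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found outside hidden elements.
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Because scripts, styles, and comments are dropped, two pages that differ only in those parts reduce to the same text and therefore hash identically.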
JSON content is processed by using the json library that is part of the Python standard library. It is read in and then output in a deterministic manner to adjust for any styling differences between content.
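One way to get deterministic JSON output is to parse the text and serialize it again with sorted keys and fixed separators, as in the minimal sketch below. The function name `canonical_json` and the specific `json.dumps` options are assumptions; the package may use different settings.

```python
import json


def canonical_json(text: str) -> str:
    """Parse JSON text and re-serialize it in one fixed, compact form.

    Hypothetical helper for illustration only; not part of this package.
    """
    parsed = json.loads(text)
    # Sorted keys and fixed separators remove differences in key order
    # and whitespace between otherwise equivalent documents.
    return json.dumps(parsed, sort_keys=True, separators=(",", ":"))
```

Two documents that differ only in indentation or key order then serialize to the same string, so they produce the same hash.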
We welcome contributions! Please see CONTRIBUTING.md for details.
This project is in the worldwide public domain.
This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.
All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.