Add an admin page for adding sites to Checkmate's block-list #921

Open
5 tasks
seanh opened this issue Nov 28, 2024 · 0 comments

seanh commented Nov 28, 2024

Context

For additional context see #919.

As well as an allow-list (only sites on the allow-list are allowed to be proxied with Via) Checkmate also has a separate block-list (any site on the block-list is not allowed to be proxied, even if it's on the allow-list). IIRC the block-list serves a few different purposes:

  1. If a site is on the allow-list it can still be blocked by adding the same site to the block-list as well. This can be easier than removing the site from the allow-list. In fact we don't have a documented process for removing a site from the allow-list and I'm not sure it would be trivial to do so.

  2. School assignments created by instructors using Hypothesis's LMS app (https://github.com/hypothesis/lms) bypass the allow-list: instructors are allowed to create assignments to annotate any URL they want even if it's not on the allow-list. But URLs on the block-list will still be blocked even if used in an LMS assignment.

  3. Sub-resources of HTML pages bypass the allow-list. For example if nytimes.com is on the allow-list and someone tries to proxy a page on that site, in order for the page to load properly Via also needs to proxy many requests for JS, CSS, images, fonts, API calls, ads, etc. These requests can cover many different URLs and domains. Adding them all to the allow-list would be impractical. For that reason any sub-resource requests made by an allow-listed HTML page are themselves allowed to bypass the allow-list. But if one of those sub-resource URLs is on the block-list it will still be blocked.
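
The precedence described in the three points above can be sketched as a small decision function. This is only an illustration, not Checkmate's actual code: `ProxyRequest`, `may_proxy` and the exact-hostname matching are all assumptions made for the example (the real rule matching is likely more involved).

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ProxyRequest:
    """A request to proxy a URL (hypothetical type, for illustration only)."""

    url: str
    from_lms: bool = False  # URL was used in an LMS assignment
    is_subresource: bool = False  # requested by an allow-listed HTML page


def _host(url):
    """Return the lower-cased hostname of a URL, or "" if it has none."""
    return (urlparse(url).hostname or "").lower()


def may_proxy(request, allow_list, block_list):
    """Return True if the request may be proxied.

    Assumption: both lists are sets of exact hostnames.
    """
    host = _host(request.url)

    # The block-list always wins, even over the allow-list (point 1).
    if host in block_list:
        return False

    # LMS assignments (point 2) and sub-resources of allow-listed
    # pages (point 3) bypass the allow-list, but not the block-list.
    if request.from_lms or request.is_subresource:
        return True

    return host in allow_list
```

For example, with `allow_list = {"nytimes.com"}` and `block_list = {"ads.example.com"}`, a sub-resource request for `https://ads.example.com/x.js` is refused even though sub-resources normally skip the allow-list check.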

Problem

The process for adding a site to or removing a site from the blocklist is time-consuming and cumbersome. This process wastes the time of Hypothesis developers.

  1. The blocklist is saved in a text file in an S3 bucket. The developer has to download this file from S3, edit it, and then re-upload it.
  2. Checkmate has a Celery task that downloads the blocklist from S3 and imports it into Checkmate's DB:
    @app.task
    def sync_blocklist():
        """Download the online version of the blocklist."""
        # pylint: disable=no-member
        # PyLint doesn't know about the `request_context` method that we add
        with app.request_context() as request:
            url = request.registry.settings["checkmate_blocklist_url"]
            if not url:
                LOG.warning("Not updating blocklist as no URL is present")
                return
            LOG.info("Updating blocklist from '%s'", url)
            with request.tm:
                try:
                    raw_rules = CustomRules(request.db).load_simple_rule_url(url)
                except RequestException as err:
                    LOG.exception("Could not update blocklist: %s", err)
                    return
                LOG.info("Updated %s custom rules", len(raw_rules))
  3. This task is run once per minute by h-periodic: https://github.com/hypothesis/h-periodic/blob/3824e67934a2967428bd149ebbb5e8daaf37ef69/h_periodic/checkmate_beat.py#L21-L25

This process is documented in "How do I block particular URLs in Via?"

Solution

Add a <textarea> to Checkmate's admin pages (https://checkmate.hypothes.is/ui/admin) that shows the current contents of the blocklist and allows an admin to edit the blocklist and save their changes directly into Checkmate's DB, without going through an S3 bucket and Celery task.
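
A rough sketch of what the save handler behind such a `<textarea>` might do with the submitted text: parse it into rules, then work out which rules to add, remove or update in the DB. Everything here is an assumption for illustration: the function names `parse_blocklist` and `diff_rules` are hypothetical, and the one-rule-per-line `<domain> <reason>` format with `#` comments is a guess at the file format, not something confirmed above.

```python
def parse_blocklist(text):
    """Parse textarea contents into a {domain: reason} mapping.

    Assumed format (not confirmed by the issue): one rule per line,
    "<domain> <reason>", with blank lines and "#" comments ignored.
    """
    rules = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        # Drop anything after a "#" and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue

        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected '<domain> <reason>'")

        domain, reason = parts
        rules[domain.lower()] = reason
    return rules


def diff_rules(current, submitted):
    """Compare the stored rules with the submitted ones.

    Returns (added, removed, changed): rules to insert, domains to
    delete, and rules whose reason needs updating.
    """
    added = {d: r for d, r in submitted.items() if d not in current}
    removed = sorted(d for d in current if d not in submitted)
    changed = {
        d: r for d, r in submitted.items() if d in current and current[d] != r
    }
    return added, removed, changed
```

Rejecting malformed lines with an error before touching the DB would let the admin page show the problem next to the `<textarea>` instead of saving a half-valid list.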

Done when
