[HMA] Error fetching large object file during reload #1673

Open
prenner opened this issue Oct 30, 2024 · 1 comment
Labels: bug; hma (Items related to the hasher-matcher-actioner system); successful reproduction (This bug has a consistent reproduction)

prenner commented Oct 30, 2024

Hey,

We're currently running HMA in dev and saw this error in our application logs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/build/OpenMediaMatch/blueprints/matching.py", line 73, in periodic_task
    self.reload_if_needed(storage)
  File "/build/OpenMediaMatch/blueprints/matching.py", line 62, in reload_if_needed
    new_index = store.get_signal_type_index(self.signal_type)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/build/OpenMediaMatch/storage/postgres/impl.py", line 192, in get_signal_type_index
    return db_record.load_signal_index() if db_record is not None else None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/build/OpenMediaMatch/storage/postgres/database.py", line 434, in load_signal_index
    l_obj = raw_conn.lobject(oid, "rb")  # type: ignore[attr-defined]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
psycopg2.OperationalError: ERROR:  large object 22491 does not exist

load_signal_index cannot find the large object, so the raw_conn.lobject(oid, "rb") call at database.py:434 fails. It is reached from reload_if_needed in blueprints/matching.py (via get_signal_type_index) when trying to rebuild the SignalIndexInMemoryCache.

Ideally this would fail more gracefully, but I'm still trying to determine how our DB got into that state.
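
For illustration, a minimal sketch of what a more graceful failure could look like, assuming the caller is happy to treat a missing large object as "no index available". The helper name read_large_object_or_none is hypothetical and not part of HMA, and matching on the "does not exist" message text is a heuristic assumption rather than a documented contract:

```python
import logging

try:
    from psycopg2 import OperationalError
except ImportError:  # keeps the sketch importable without psycopg2 installed
    OperationalError = Exception

logger = logging.getLogger(__name__)


def read_large_object_or_none(raw_conn, oid):
    """Return the bytes of large object `oid`, or None if it is missing.

    Hypothetical helper: `raw_conn` is assumed to be a psycopg2 connection.
    """
    try:
        l_obj = raw_conn.lobject(oid, "rb")
    except OperationalError as exc:
        # Heuristic (assumption): only swallow the "does not exist" case.
        if "does not exist" not in str(exc):
            raise
        # A failed statement leaves the transaction aborted, so reset it.
        raw_conn.rollback()
        logger.warning("large object %s is missing; skipping index reload", oid)
        return None
    try:
        return l_obj.read()
    finally:
        l_obj.close()
```

With something like this, the periodic task could log a warning and keep the previously loaded index instead of raising on every run.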

Dcallies added the bug and hma (Items related to the hasher-matcher-actioner system) labels Oct 30, 2024
@Dcallies

Thanks for the error report, @prenner!

I wrote this code, but I'm less familiar with the large object DB stuff and mostly just followed the Postgres docs. The index objects can get very large (potentially gigabytes), so I tried to use the Large Object API to store them. The way it works is that you open a bytestream connection, send all the bytes, and then save the ID of the created object in the index record. In the matchers, you check the index table, get the object ID, load all the bytes into memory, and then deserialize them into the index again.
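
The round trip described above can be sketched roughly as follows. This is not the HMA implementation: store_index and load_index are hypothetical names, pickle is assumed as the serialization format, and conn is assumed to be a psycopg2 connection that is inside a transaction:

```python
import pickle


def store_index(conn, index) -> int:
    """Write a serialized index to a new large object and return its OID."""
    # oid=0 asks Postgres to allocate a fresh large object.
    l_obj = conn.lobject(0, "wb")
    try:
        l_obj.write(pickle.dumps(index))
        # The caller would save this OID on the index record.
        return l_obj.oid
    finally:
        l_obj.close()


def load_index(conn, oid):
    """Read a large object fully into memory and deserialize it."""
    l_obj = conn.lobject(oid, "rb")
    try:
        return pickle.loads(l_obj.read())
    finally:
        l_obj.close()
```

The record only stores the OID, so nothing in the table itself guarantees that the large object it points at still exists.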

Your stack trace is inside the periodic task, where it should be loading the previously built large object, but it can't find it. This is strange, because we only create the index record after the data has already been written to the large object here:

https://github.com/facebook/ThreatExchange/blob/main/hasher-matcher-actioner/src/OpenMediaMatch/storage/postgres/database.py#L410-L412

So that implies something deleted your large object without deleting the record. This could be a partial failure in the deallocation of signal_type_index, which may not follow transactions the same way as the main record. I assumed the transactional logic would keep us safe, but it could be that unlink happens outside a transaction.
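
To check whether the database is in that state, something like the following could list index records whose large object is gone. It is a sketch only: the table name signal_index and the columns signal_type and serialized_index_large_object_oid are assumptions about the schema and may need adjusting, and conn is assumed to be a psycopg2 connection. pg_largeobject_metadata is the Postgres catalog that holds one row per large object:

```python
# Assumed table/column names; adjust to the actual HMA schema.
DANGLING_INDEX_SQL = """
SELECT idx.signal_type, idx.serialized_index_large_object_oid
FROM signal_index AS idx
LEFT JOIN pg_largeobject_metadata AS lo
       ON lo.oid = idx.serialized_index_large_object_oid
WHERE idx.serialized_index_large_object_oid IS NOT NULL
  AND lo.oid IS NULL
"""


def find_dangling_index_records(conn):
    """Return (signal_type, oid) rows whose large object no longer exists."""
    with conn.cursor() as cur:
        cur.execute(DANGLING_INDEX_SQL)
        return cur.fetchall()
```

Any rows returned would be records that point at a missing large object, which is the situation the traceback suggests.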

Dcallies added the successful reproduction (This bug has a consistent reproduction) label Oct 30, 2024
Dcallies self-assigned this Oct 30, 2024