Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some Metropolitan reingestion days fail due to DuplicateTable #1322

Closed
stacimc opened this issue Dec 13, 2022 · 4 comments · Fixed by WordPress/openverse-catalog#1009
Closed
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@stacimc
Copy link
Collaborator

stacimc commented Dec 13, 2022

Description

On each run of the metropolitan_reingestion_workflow, a handful of reingestion days fail during the create_loading_table step due to the error:

psycopg2.errors.DuplicateTable: relation "provider_data_image_metropolitan_museum_reingestion_20221204t00" already exists

In each case I checked, this is the first error in the create_loading_table task (ie it's not only throwing this on retries).

Reproduction

See day_shift_2 or day_shift_1 for the 2022-12-04 run in production.

@stacimc stacimc added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Dec 13, 2022
@stacimc
Copy link
Collaborator Author

stacimc commented Dec 13, 2022

Adding to the DAG Stability milestone as this is high priority.

@AetherUnbound
Copy link
Collaborator

Similar to #1417, I wonder if adding the day shift to the table name would make the value sufficiently unique enough to prevent this issue.

@stacimc
Copy link
Collaborator Author

stacimc commented Jan 4, 2023

We do already append the day_shift to the load table name, via the identifier param created here, though 🤔

I could see there being an issue specifically on day_shift 0 because that ends up being appended for both a normal ingestion run and for the first run of a reingestion workflow, so it makes sense to add some logic to only append the day_shift here if it's a reingestion run. But I'm not immediately sure why this is an issue, much less only on certain days. From the logs for the 2022-10-30 run, create_loading_table_day_shift_4 task:

[2022-11-02, 05:03:42 UTC] {sql.py:315} INFO - Running statement: 
  CREATE TABLE public.provider_data_image_metropolitan_museum_reingestion_20221023T000000_4(
  foreign_identifier character varying(3000),
  [...]

[2022-11-02, 05:03:42 UTC] {taskinstance.py:1851} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 193, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 95, in create_loading_table
    postgres.run(table_creation_query)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/common/sql/hooks/sql.py", line 295, in run
    self._run_command(cur, sql_statement, parameters)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/providers/common/sql/hooks/sql.py", line 320, in _run_command
    cur.execute(sql_statement)
psycopg2.errors.DuplicateTable: relation "provider_data_image_metropolitan_museum_reingestion_20221023t00" already exists

So it looks like identifier is passed in appropriately in the logged CREATE command, yet the table name in the DuplicateTable error is truncated.

That particular example is notable because it succeeded on its third attempt.

@AetherUnbound
Copy link
Collaborator

AetherUnbound commented Jan 5, 2023

Oh dear 😅 if it's getting truncated then we must be hitting the maximum table name length for Postgres1... (emphasis mine)

The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.

Looks like we may need to figure out a shorthand for these temp tables 😬

Footnotes

  1. https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS

@stacimc stacimc self-assigned this Jan 27, 2023
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@obulat obulat moved this from 📋 Backlog to ✅ Done in Openverse Backlog Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants