Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest): Fix parse error when profiling BQ monthly partitioned tables #4591

Closed

Conversation

sebkim
Copy link
Contributor

@sebkim sebkim commented Apr 6, 2022

I encountered ParserError while ingesting BigQuery profiling. This error occurs only when profiling monthly partitioned table. The error message is following:

File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/datahub/ingestion/run/pipeline.py", line 181, in run
    for wu in itertools.islice(
File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/datahub/ingestion/source/sql/bigquery.py", line 655, in get_workunits
    for wu in super().get_workunits():
File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/datahub/ingestion/source/sql/sql_common.py", line 656, in get_workunits
    profile_requests += list(
File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/datahub/ingestion/source/sql/sql_common.py", line 1157, in loop_profiler_requests
    (partition, custom_sql) = self.generate_partition_profiler_query(
File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/datahub/ingestion/source/sql/bigquery.py", line 585, in generate_partition_profiler_query
    partition_datetime = parser.parse(partition.partition_id)
File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1368, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
File "/Users/seb.kim/github/datahub/dh/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 651, in parse
    six.raise_from(ParserError(str(e) + ": %s", timestr), e)
File "<string>", line 3, in raise_from

ParserError: month must be in 1..12: 202203

bigquery partition_id of the MONTHLY partition is a format of {year}{month} such as 202204.
dateutil.parser.parse considers six digits string as {day}{month}{year}.
Thus we need to append a seperator like '-' between 2022 and 04 to fix this error.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.

@github-actions
Copy link

github-actions bot commented Apr 6, 2022

Unit Test Results (metadata ingestion)

       5 files         5 suites   1h 6m 21s ⏱️
   430 tests    429 ✔️   0 💤 1
2 075 runs  2 028 ✔️ 46 💤 1

For more details on these failures, see this check.

Results for commit 5facd74.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Apr 6, 2022

Unit Test Results (build & test)

  96 files  ±0    96 suites  ±0   24m 28s ⏱️ + 5m 50s
689 tests ±0  630 ✔️ ±0  59 💤 ±0  0 ±0 

Results for commit 5facd74. ± Comparison against base commit c5b8f57.

♻️ This comment has been updated with latest results.

"""
partition_id_for_parse = partition.partition_id
if len(partition_id_for_parse) == 6:
partition_id_for_parse = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do something like this which I think is much clearer?:

partition_datetime = datetime.strptime(partition_id_for_parse,"%Y%M")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to use the 'dateutil.parser' anyway because we can have YEARLY / DAILY partition as well. @treff7es

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use 'datetime.strptime' only for monthly case, we have to use two parsing functions ('datetime.strptime' and 'dateutil.parser.parse') to cover all cases. I think we'd better to use one function for the consistency.
Also, if we use 'datetime.strptime', there will be a difference. For example, when we parse '202204' with strptime, the result will be '2022-04-01 00:00:00', and if we parse '2022-04' with 'dateutil.parser.parse' in 2022-04-21 (the actual date), the result will be '2022-04-21 00:00:00'. @treff7es

@sb-sebkim
Copy link

It seems like these test fails are not triggered from the code change in this PR.
How can I make this pass the test? @treff7es

@sb-sebkim
Copy link

Can you also check this PR? @anshbansal

@anshbansal
Copy link
Collaborator

@sb-sebkim Let me talk with @treff7es who is assigned to this PR.

@sb-sebkim
Copy link

sb-sebkim commented Apr 27, 2022

I will close this PR given that #4658 PR fixes the issue here.
I do not have a permission to close this. Please close this PR @treff7es .

@anshbansal anshbansal closed this Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants