Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion): improve logging, docs for bigquery, snowflake, redshift #4344

Merged
merged 24 commits into from
Mar 14, 2022

Conversation

anshbansal
Copy link
Collaborator

@anshbansal anshbansal commented Mar 8, 2022

Changes

  • Add new environment variables CLI_TELEMETRY_ENABLED:true and INGESTION_REPORTING_ENABLED:false on GMS to control whether to send telemetry or not. CLI currently respects CLI_TELEMETRY_ENABLED. Handling INGESTION_REPORTING_ENABLED will be done in later PR.
  • New configs shown in gms config managedIngestion and telemetry so we can get them on CLI side
{
  "models" : { },
  "versions" : {
    "linkedin/datahub" : {
      "version" : "v0.8.17",
      "commit" : "6d81a96fafa15b27dc47f79ba00d27c91ccbf5c3"
    }
  },
  "managedIngestion" : {
    "defaultCliVersion" : "0.8.26.6",
    "enabled" : true
  },
  "statefulIngestionCapable" : true,
  "supportsImpactAnalysis" : true,
  "telemetry" : {
    "enabledCli" : true,
    "enabledIngestion" : false
  },
  "retention" : "true",
  "noCode" : "true"
}
  • Basic debugging information in all source reports
 'cli_version': 'unavailable (installed in develop mode)',
 'cli_entry_location': '/Users/aseembansal/code/datahub/metadata-ingestion/src/datahub/__init__.py',
 'py_version': '3.8.9 (default, Oct 26 2021, 07:25:53) \n[Clang 13.0.0 (clang-1300.0.29.30)]',
 'py_exec_path': '/Users/aseembansal/code/datahub/metadata-ingestion/venv/bin/python',
 'os_details': 'macOS-12.2.1-arm64-arm-64bit',
  • On any exception we get basic debugging information at the very end always (gms_config comes in case we were able to connect to GMS). e.g.
(Background on this error at: http://sqlalche.me/e/13/4xp6)

[2022-03-10 23:06:40,251] INFO     {datahub.entrypoints:168} - DataHub CLI version: 0.0.0.dev0 at /Users/aseembansal/code/datahub/metadata-ingestion/src/datahub/__init__.py
[2022-03-10 23:06:40,251] INFO     {datahub.entrypoints:171} - Python version: 3.8.9 (default, Oct 26 2021, 07:25:53) 
[Clang 13.0.0 (clang-1300.0.29.30)] at /Users/aseembansal/code/datahub/metadata-ingestion/venv/bin/python on macOS-12.2.1-arm64-arm-64bit
[2022-03-10 23:06:40,251] INFO     {datahub.entrypoints:174} - Global debug {'gms_config': {'supportsImpactAnalysis': True, 'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.28', 'commit': 'd474387eeb2e092bdfc50654363bc5f6edc2b7a2'}}, 'statefulIngestionCapable': True, 'retention': 'true', 'noCode': 'true'}}
  • Fix docs for bigquery, snowflake and redshift for the sidebar to make sense
  • Change wording of snowflake source for clarity and avoid using accountadmin in example config files as that is more like aws root account and should not be used everywhere for security reasons
  • Placed some placeholders for source reports for sources
  • Improve redshift source logging
  • Add warning of missing email_domain which was in the snowflake-usage source into the Snowflake usage source report
  • Move some variables to be inside the __init__ methods to avoid class variables
  • Move snowflake and snowflake-usage configs into separate files (prep to move all configs/reports to separate file to avoid circular dependencies)
  • Add validation of host_port in snowflake config
  • Add provision_role block in snowflake recipe to automatically do the provisioning of the role required. The recipe looks like
source:
  type: "snowflake"
  config:
    provision_role:
      enabled: true
      admin_role: ACCOUNTADMIN
      admin_username: "${SNOWFLAKE_ADMIN_USER}"
      admin_password: "${SNOWFLAKE_ADMIN_PASS}"

    host_port: "${SNOWFLAKE_HOST}"
    warehouse: "COMPUTE_WH"
    role: "datahub_role"

The dry run shows logs like this

[2022-03-09 19:31:50,472] INFO     {datahub.ingestion.source.sql.snowflake:530} - Creating connection for provision_role
[2022-03-09 19:31:57,237] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql CREATE ROLE IF NOT EXISTS datahub_role_aseem_test
[2022-03-09 19:31:57,237] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant operate, usage on warehouse COMPUTE_WH to role datahub_role_aseem_test
[2022-03-09 19:31:57,237] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant usage on DATABASE aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,237] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant usage on all schemas in database aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,237] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant select on all tables in database aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,238] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant select on all external tables in database aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,238] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant select on all views in database aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,238] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant usage on future schemas in database aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,238] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant select on future tables in database aseem_test_db to role datahub_role_aseem_test
[2022-03-09 19:31:57,238] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant role datahub_role_aseem_test to user service_account_tests
[2022-03-09 19:31:57,238] INFO     {datahub.ingestion.source.sql.snowflake:564} - [DRY RUN]  Attempting to run sql grant imported privileges on database snowflake to role datahub_role_aseem_test

The source report has these additions for debugging for setup

'provision_role_done': False,
'provision_role_success': False,
  • Add more stats for snowflake-usage source reports
'min_access_history_time': datetime.datetime(2021, 6, 11, 18, 35, 30, 734000, tzinfo=datetime.timezone.utc),
 'max_access_history_time': datetime.datetime(2022, 3, 9, 17, 54, 42, 919000, tzinfo=datetime.timezone.utc),
 'access_history_range_query_secs': 18.88,
 'access_history_query_secs': 11.79,
 'rows_processed': 649,
 'rows_missing_query_text': 0,
 'rows_zero_base_objects_accessed': 573,
 'rows_zero_direct_objects_accessed': 573,
 'rows_missing_email': 0,
 'rows_parsing_error': 0}

  • Had to refactor snowflake-usage a bit when adding debug information due to linting of code complexity

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable)

@github-actions
Copy link

github-actions bot commented Mar 8, 2022

Unit Test Results (metadata ingestion)

       5 files         5 suites   45m 27s ⏱️
   361 tests    361 ✔️   0 💤 0
1 652 runs  1 621 ✔️ 31 💤 0

Results for commit 8bd9ef1.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Mar 8, 2022

Unit Test Results (build & test)

  92 files  ±0    92 suites  ±0   16m 43s ⏱️ - 6m 3s
672 tests ±0  611 ✔️  - 2  59 💤 ±0  2 +2 

For more details on these failures, see this check.

Results for commit 8bd9ef1. ± Comparison against base commit bb413be.

♻️ This comment has been updated with latest results.

@anshbansal anshbansal marked this pull request as draft March 9, 2022 11:02
@anshbansal anshbansal marked this pull request as ready for review March 9, 2022 14:13
@anshbansal anshbansal marked this pull request as draft March 9, 2022 15:45
@anshbansal anshbansal marked this pull request as ready for review March 9, 2022 15:52
@anshbansal
Copy link
Collaborator Author

@treff7es Pushed my fixes based on code review comments. Can you check if this LGTM?

@treff7es
Copy link
Contributor

fyi, 3 tests are currently failing:
FAILED tests/unit/test_snowflake_source.py::test_snowflake_uri_default_authentication
FAILED tests/unit/test_snowflake_source.py::test_snowflake_uri_external_browser_authentication
FAILED tests/unit/test_snowflake_source.py::test_snowflake_uri_key_pair_authentication

@treff7es
Copy link
Contributor

@treff7es Pushed my fixes based on code review comments. Can you check if this LGTM?
LGTM but please fix the test failures, I think there are some which is due to these code changes.

@shirshanka shirshanka merged commit 4bcc2b3 into datahub-project:master Mar 14, 2022
@anshbansal anshbansal deleted the improve_logging_and_docs branch March 14, 2022 15:53
aditya-radhakrishnan pushed a commit to aditya-radhakrishnan/datahub that referenced this pull request Mar 14, 2022
aditya-radhakrishnan pushed a commit to aditya-radhakrishnan/datahub that referenced this pull request Mar 14, 2022
maggiehays pushed a commit to maggiehays/datahub that referenced this pull request Aug 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants