feat(integration/fivetran): Fivetran connector integration #9018
Conversation
I would like to suggest a few changes to code structure:
- Separate the current data_classes.py into data_classes.py (classes Connector, Job) and fivetran_log_api.py (class FivetranLogDataDictionary, or rename it as FivetranPlatformConnectorApi) - a rough sketch follows after this list.
- Classes in fivetran_log_api act as the interface for interacting with the Fivetran log API and return models defined in data_classes.py. I prefer separate public methods for each data model extraction rather than nested calls to the Fivetran log APIs, but I don't feel strongly here.
- All orchestration logic required for metadata event generation is handled by FivetranSource - e.g. pattern-based filtering. It would help not to pass the entire FivetranSourceConfig to fivetran_api but only what's required to invoke the API. That usually helps with separation of responsibilities.
- If the logic in FivetranSource becomes too complex or long, we can introduce an extractor layer in the middle that shares the orchestration responsibility of a particular feature.
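For illustration, a rough sketch of the suggested split; other than Connector, Job, and the two module names above, the class names, fields, and method signatures below are hypothetical and not taken from the PR:

```python
# data_classes.py - plain models returned by the log API layer (field names are illustrative)
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    job_id: str
    start_time: int
    end_time: int
    status: str


@dataclass
class Connector:
    connector_id: str
    connector_name: str
    connector_type: str
    destination_id: str
    jobs: List[Job] = field(default_factory=list)


# fivetran_log_api.py - thin interface over the Fivetran log destination;
# one public method per data model instead of nested calls
class FivetranLogAPI:
    def __init__(self, engine) -> None:
        self.engine = engine  # SQLAlchemy engine for the log destination

    def get_connectors(self) -> List[Connector]:
        ...

    def get_jobs(self, connector_id: str) -> List[Job]:
        ...
```

FivetranSource would then only orchestrate: call these methods, apply pattern-based filtering, and emit metadata events.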
class FivetranLogConfig(ConfigModel):
    destination_platform: str = pydantic.Field(
        default="snowflake",
For my understanding:
- Is Snowflake the only supported destination?
- Is the destination platform related to the connector's destination?
- What would change in the DestinationConfig class when we want to support more destination platforms?
- No, there are other supported destinations, but for now we are just starting with the Snowflake destination.
- No, not every connector's destination. This destination platform only refers to the Fivetran log connector's destination, from which we are fetching all metadata.
- In the future, if we want to support more destination platforms, we will rename DestinationConfig to SnowflakeDestinationConfig, and there will be a separate class for each new destination platform, since different destination platforms require different configurations to create a connection.
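Purely as an illustration of that plan (these classes and fields are hypothetical, not from the PR), one config class per platform could look roughly like this, since each destination needs its own connection details:

```python
import pydantic
from datahub.configuration.common import ConfigModel


class SnowflakeDestinationConfig(ConfigModel):
    # connection details specific to a Snowflake log destination
    account_id: str = pydantic.Field(description="Snowflake account identifier")
    warehouse: str = pydantic.Field(description="Warehouse used to run the log queries")
    log_database: str = pydantic.Field(description="Database holding the Fivetran log tables")


class BigQueryDestinationConfig(ConfigModel):
    # a different destination platform needs entirely different fields
    project_id: str = pydantic.Field(description="GCP project holding the Fivetran log dataset")
    dataset: str = pydantic.Field(description="BigQuery dataset with the Fivetran log tables")
```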
}
for sync_id in sync_start_logs.keys():
    if sync_end_logs.get(sync_id) is None:
        # If no sync-end event log for this sync id that means sync is still in progress
How about ingesting in-progress jobs as DataProcessInstance with status STARTED?
As we are not ingesting in-progress jobs in other sources like Airflow, I didn't ingest them here either.
Some initial thoughts, but overall looks good
I haven't actually tested this locally yet though
def _get_log_destination_engine(self) -> Any:
    destination_platform = self.fivetran_log_config.destination_platform
    engine = None
    if destination_platform == "snowflake":
as long as the config type has a get_sql_alchemy_url and get_options method, we should be able to call create_engine() right? what's the reason we need to do this per destination platform?
Yes, creating a SQLAlchemy engine from a SQLAlchemy URL to execute the queries is common to the other destinations as well, so we will definitely need one common method, get_sql_alchemy_url.
We can remove the get_options method since it isn't strictly required.
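A minimal sketch of a destination-agnostic version, assuming every destination config exposes get_sql_alchemy_url() and get_options(); the destination_config attribute name here is an assumption, not the PR's actual field:

```python
from typing import Any

from sqlalchemy import create_engine


def _get_log_destination_engine(self) -> Any:
    # No per-platform branching: any destination whose config can produce a
    # SQLAlchemy URL (and optional engine kwargs) is handled the same way.
    destination_config = self.fivetran_log_config.destination_config
    return create_engine(
        destination_config.get_sql_alchemy_url(),
        **destination_config.get_options(),
    )
```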
@staticmethod
def get_connectors_query() -> str:
    return """
        SELECT connector_id as "CONNECTOR_ID",
do these queries work across multiple underlying databases?
Haven't checked that, but syntactically it should work since most of the destinations use SQL as the query language.
If there is some syntax that isn't common, we can modify the logic here in the future to issue a different query for a different destination.
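If dialect differences ever do show up, a small per-platform lookup would keep the calling code unchanged; a sketch (query text abbreviated, platform keys illustrative):

```python
_CONNECTORS_QUERY_BY_PLATFORM = {
    # the query used today works for Snowflake
    "snowflake": 'SELECT connector_id AS "CONNECTOR_ID" FROM connector',
    # add overrides only where a destination's SQL dialect actually differs
}


def get_connectors_query(destination_platform: str) -> str:
    # fall back to the default query when no override exists
    return _CONNECTORS_QUERY_BY_PLATFORM.get(
        destination_platform, _CONNECTORS_QUERY_BY_PLATFORM["snowflake"]
    )
```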
    not isinstance(wu.metadata, MetadataChangeEventClass)
    and wu.metadata.entityType in skip_entities
):
    # If any entity does not support aspect 'status' then skip that entity from adding status aspect.
we have a map of exactly what entity types support what aspects - can we look this information up there instead?
I can provide some pointers on how to do this
Please provide some pointers on this.
Use the helper method from here #9120
@@ -0,0 +1 @@
from datahub.ingestion.source.fivetran.fivetran import FivetranSource
this probably isn't necessary - make init empty, and make setup.py point directly at the fivetran source class instead of re-exporting it here
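For reference, a sketch of what that could look like in setup.py, assuming DataHub's usual source-plugin entry-point group (worth double-checking against the repo):

```python
# metadata-ingestion/setup.py (excerpt): point the plugin at the source class directly
entry_points = {
    "datahub.ingestion.source.plugins": [
        "fivetran = datahub.ingestion.source.fivetran.fivetran:FivetranSource",
    ],
}
```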
Source and destination are mapped to Dataset as an Input and Output of Connector.

## Snowflake destination Configuration Guide
explicitly call out that this only works with snowflake for now
metadata-ingestion/setup.py
@@ -389,6 +389,7 @@
    "powerbi-report-server": powerbi_report_server,
    "vertica": sql_common | {"vertica-sqlalchemy-dialect[vertica-python]==0.0.8.1"},
    "unity-catalog": databricks | sqllineage_lib,
    "fivetran": {"requests"},
this isn't really accurate - fivetran requires snowflake, right?
we need to add fivetran to datahub/metadata-service/war/src/main/resources/boot/data_platforms.json (plus a fivetran logo/icon)
For example, a similar change was made here https://github.com/datahub-project/datahub/pull/7971/files
import pydantic
from pydantic import Field
from pydantic.class_validators import root_validator
Suggested change: replace
from pydantic.class_validators import root_validator
with
from pydantic import root_validator
    continue

message_data = json.loads(sync_end_logs[sync_id][Constant.MESSAGE_DATA])
if type(message_data) is str:
use isinstance, not type(...) is ...
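i.e. the check above would become (only the type check shown; the branch body stays as in the PR):

```python
message_data = json.loads(sync_end_logs[sync_id][Constant.MESSAGE_DATA])
if isinstance(message_data, str):
    ...  # same handling as in the PR's original branch
```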
user_name=self._get_user_name(
    connector[Constant.CONNECTING_USER_ID]
),
table_lineage=self._get_table_lineage(
they said column lineage went live a few weeks ago - how easy is it to capture that too?
    )
)
output_dataset_urn_list.append(
    DatasetUrn.create_from_ids(
it looks like this produces urns of this form:
urn:li:dataset:(urn:li:dataPlatform:snowflake,[<optional_platform_instance.]<schema>.<table>,PROD)
However, that's not correct - snowflake urns (and in fact, urns for all "three-tier" sources) should be urn:li:dataset:(urn:li:dataPlatform:snowflake,[<optional_platform_instance.]<database>.<schema>.<table>,PROD)
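For illustration, a minimal sketch of building the expected three-tier name with a helper from datahub.emitter.mce_builder; the variable names and how this slots into the PR's code are assumptions:

```python
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

# example values; in the source these would come from the Fivetran connector metadata
database, schema, table = "analytics_db", "public", "orders"
platform_instance = None

# three-tier sources like snowflake need database.schema.table in the dataset name
urn = make_dataset_urn_with_platform_instance(
    platform="snowflake",
    name=f"{database}.{schema}.{table}".lower(),
    platform_instance=platform_instance,
    env="PROD",
)
```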
):
    yield mcp.as_workunit()

def _get_connector_workunit(
make this plural
Suggested change: rename
def _get_connector_workunit(
to
def _get_connector_workunits(