feat(ingest): replace custom hive-etl with sql-based ETL (#1713)
This offloads most of the heavy lifting to SQLAlchemy.
It also adds a Docker Compose file (hive.yml) for testing.
mars-lan authored Jun 26, 2020
1 parent 5da55fe commit 682bb87
Showing 7 changed files with 91 additions and 95 deletions.
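
The shared `common.run` helper that the new per-source scripts call (see hive_etl.py below) is not part of this diff, so the following is only a hypothetical sketch of what a SQLAlchemy-based runner of this kind could look like; the function body, and printing results instead of emitting DataHub metadata events, are assumptions for illustration.
```
# Hypothetical sketch only; the actual common.py is not shown in this commit.
from sqlalchemy import create_engine, inspect


def run(url, options, platform):
    """Enumerate the schemas, tables and columns of any SQLAlchemy-supported store."""
    engine = create_engine(url, **options)   # the dialect (e.g. PyHive) is picked from the URL
    inspector = inspect(engine)
    for schema in inspector.get_schema_names():
        for table in inspector.get_table_names(schema=schema):
            columns = inspector.get_columns(table, schema=schema)
            # The real ETL would build a DataHub dataset snapshot here instead of printing.
            print(platform, f"{schema}.{table}", [c["name"] for c in columns])
```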
16 changes: 0 additions & 16 deletions metadata-ingestion/README.md
@@ -81,22 +81,6 @@ The ldap_etl provides an ETL channel to communicate with your LDAP server.
```
This will bootstrap DataHub with your metadata in the LDAP server as a user entity.

## Ingest metadata from Hive to DataHub
The hive_etl provides an ETL channel to communicate with your Hive store.
```
➜ Config your hive store environmental variable in the file.
HIVESTORE # Your store host.
➜ Config your Kafka broker environmental variable in the file.
AVROLOADPATH # Your model event in avro format.
KAFKATOPIC # Your event topic.
BOOTSTRAP # Kafka bootstrap server.
SCHEMAREGISTRY # Kafka schema registry host.
➜ python hive_etl.py
```
This will bootstrap DataHub with your metadata in the hive store as a dataset entity.

## Ingest metadata from Kafka to DataHub
The kafka_etl provides an ETL channel to communicate with your Kafka.
```
75 changes: 0 additions & 75 deletions metadata-ingestion/hive-etl/hive_etl.py

This file was deleted.

4 changes: 0 additions & 4 deletions metadata-ingestion/hive-etl/requirements.txt

This file was deleted.

30 changes: 30 additions & 0 deletions metadata-ingestion/sql-etl/hive.env
@@ -0,0 +1,30 @@
HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore
HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver
HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false

CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
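
In the big-data-europe images used below, these prefixed variables are rewritten by the containers' startup scripts into the matching config files (HIVE_SITE_CONF_* into hive-site.xml, CORE_CONF_* into core-site.xml, HDFS_CONF_* into hdfs-site.xml, YARN_CONF_* into yarn-site.xml). The helper below is only an approximate illustration of that naming convention as those images document it (a single underscore becomes a dot, a triple underscore becomes a dash); the authoritative escaping rules live in the images' entrypoint scripts.
```
# Approximate mapping from the env-var names above to Hadoop/Hive property names.
def env_to_property(name: str, prefix: str) -> str:
    key = name[len(prefix):]
    # Assumed convention: "___" becomes "-", a single "_" becomes "."
    return key.replace("___", "-").replace("_", ".")


assert env_to_property("HIVE_SITE_CONF_hive_metastore_uris",
                       "HIVE_SITE_CONF_") == "hive.metastore.uris"
assert env_to_property("HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check",
                       "HDFS_CONF_") == "dfs.namenode.datanode.registration.ip-hostname-check"
```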
52 changes: 52 additions & 0 deletions metadata-ingestion/sql-etl/hive.yml
@@ -0,0 +1,52 @@
# Based on https://github.com/big-data-europe/docker-hive
version: "3"

services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
    volumes:
      - namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hive.env
    ports:
      - "50070:50070"
  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8
    volumes:
      - datanode:/hadoop/dfs/data
    env_file:
      - ./hive.env
    environment:
      SERVICE_PRECONDITION: "namenode:50070"
    ports:
      - "50075:50075"
  hive-server:
    image: bde2020/hive:2.3.2-postgresql-metastore
    env_file:
      - ./hive.env
    environment:
      HIVE_CORE_CONF_javax_jdo_option_ConnectionURL: "jdbc:postgresql://hive-metastore/metastore"
      SERVICE_PRECONDITION: "hive-metastore:9083"
    ports:
      - "10000:10000"
  hive-metastore:
    image: bde2020/hive:2.3.2-postgresql-metastore
    env_file:
      - ./hive.env
    command: /opt/hive/bin/hive --service metastore
    environment:
      SERVICE_PRECONDITION: "namenode:50070 datanode:50075 hive-metastore-postgresql:5432"
    ports:
      - "9083:9083"
  hive-metastore-postgresql:
    image: bde2020/hive-metastore-postgresql:2.3.0
  presto-coordinator:
    image: shawnzhu/prestodb:0.181
    ports:
      - "8080:8080"

volumes:
  namenode:
  datanode:
8 changes: 8 additions & 0 deletions metadata-ingestion/sql-etl/hive_etl.py
@@ -0,0 +1,8 @@
from common import run

# See https://github.com/dropbox/PyHive for more details
URL = '' # e.g. hive://username:password@hostname:port
OPTIONS = {} # e.g. {"connect_args": {"configuration": {"hive.exec.reducers.max": "123"}}}
PLATFORM = 'hive'

run(URL, OPTIONS, PLATFORM)
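
For reference, a hypothetical filled-in version of this script for the test stack defined in hive.yml above; it assumes the stack has been started (for example with `docker-compose -f hive.yml up -d`) so that hive-server is reachable on localhost:10000, and that the unmodified common.run is on the import path.
```
# Hypothetical values for the local docker-compose test stack; adjust to your environment.
from common import run

URL = 'hive://localhost:10000/default'   # hive-server port published by hive.yml
OPTIONS = {}                             # no extra PyHive/SQLAlchemy options assumed
PLATFORM = 'hive'

run(URL, OPTIONS, PLATFORM)
```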
1 change: 1 addition & 0 deletions metadata-ingestion/sql-etl/hive_etl.txt
@@ -0,0 +1 @@
pyhive[hive]==0.6.1
