feat(ingest): support hive over http (#2486)
hsheth2 authored May 4, 2021
1 parent 7948226 commit 6f1f0a4
Showing 3 changed files with 51 additions and 5 deletions.
42 changes: 38 additions & 4 deletions metadata-ingestion/README.md
@@ -223,14 +223,48 @@ Extracts:
source:
  type: hive
  config:
    username: user
    password: pass
    # For more details on authentication, see the PyHive docs:
    # https://github.com/dropbox/PyHive#passing-session-configuration.
    # LDAP, Kerberos, etc. are supported using connect_args, which can be
    # added under the `options` config parameter.
    #scheme: 'hive+http' # set this if Thrift should use the HTTP transport
    #scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
    username: user # optional
    password: pass # optional
    host_port: localhost:10000
    database: DemoDatabase
    database: DemoDatabase # optional, defaults to 'default'
    # table_pattern/schema_pattern is same as above
    # options is same as above
```
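
As a rough illustration of how the `scheme`, `username`, `password`, `host_port`, and `database` fields relate to one another, the sketch below assembles them into a SQLAlchemy-style connection URL. The helper name `build_hive_url` is hypothetical, and the assumption that the recipe fields map onto a URL in exactly this way is ours; the actual DataHub source may build its connection differently.

```python
from typing import Optional
from urllib.parse import quote


def build_hive_url(
    host_port: str,
    scheme: str = "hive",
    username: Optional[str] = None,
    password: Optional[str] = None,
    database: Optional[str] = None,
) -> str:
    """Assemble a SQLAlchemy-style URL from recipe fields (hypothetical helper)."""
    credentials = ""
    if username is not None:
        # Percent-encode credentials so characters like '@' or ':' stay unambiguous.
        credentials = quote(username, safe="")
        if password is not None:
            credentials += ":" + quote(password, safe="")
        credentials += "@"
    url = f"{scheme}://{credentials}{host_port}"
    if database is not None:
        url += "/" + database
    return url


# Default binary Thrift transport with credentials and an explicit database.
print(build_hive_url("localhost:10000", username="user", password="pass", database="DemoDatabase"))
# HTTP transport with the optional fields left out.
print(build_hive_url("localhost:10001", scheme="hive+http"))
```

Leaving out `username`, `password`, and `database` simply shortens the URL, which mirrors their optional status in the config above.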

<details>
<summary>Using ingestion with Azure HDInsight</summary>

HDInsight [does not expose](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-port-settings-for-services#hive-ports) the HiveServer2 port 10001 publicly. There are two possible workarounds:

1. Run `datahub` directly on the cluster's node.
2. Use ssh to forward the Hive server's port 10001 to the local machine before running ingestion.
```sh
# In first terminal window. Keep this running during ingestion.
ssh -L 10001:localhost:10001 'sshuser@<clusterName>-ssh.azurehdinsight.net'
# In a second terminal window.
datahub ingest -c ...
```

In both cases, the config is fairly similar:

```yml
# Connecting to Microsoft HDInsight. See above for required setup steps.
source:
  type: hive
  config:
    scheme: "hive+http"
    host_port: localhost:10001
    # other options from above are still available as well
```
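
Before starting ingestion through the SSH tunnel described above, it can help to confirm that something is listening on the forwarded port. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical and not part of `datahub`. It only checks that a TCP connection to the local end succeeds, so it cannot tell whether HiveServer2 itself is healthy on the far side of the tunnel.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 10001):
        print("Port 10001 is reachable; the tunnel appears to be up.")
    else:
        print("Nothing is listening on localhost:10001; start the ssh tunnel first.")
```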

</details>

### PostgreSQL `postgres`

Extracts:
@@ -288,7 +322,7 @@ source:
connect_uri: http://localhost:8088
```
See documentation for superset's `/security/login` at https://superset.apache.org/docs/rest-api for more details on superset's login api.

### Oracle `oracle`

7 changes: 7 additions & 0 deletions metadata-ingestion/scripts/install_editable_versions.sh
@@ -0,0 +1,7 @@
#!/bin/bash

set -euxo pipefail

pip install -e 'git+https://github.com/hsheth2/avro_gen#egg=avro-gen3'
pip install -e 'git+https://github.com/hsheth2/PyHive#egg=acryl-pyhive[hive]'
pip install -e '.[dev]'
7 changes: 6 additions & 1 deletion metadata-ingestion/setup.py
@@ -65,7 +65,12 @@ def get_long_description():
    "sqlalchemy": sql_common,
    "athena": sql_common | {"PyAthena[SQLAlchemy]"},
    "bigquery": sql_common | {"pybigquery >= 0.6.0"},
    "hive": sql_common | {"pyhive[hive]"},
    "hive": sql_common
    | {
        # Acryl maintains a fork of PyHive, which adds support for table comments
        # and column comments, and also releases HTTP and HTTPS transport schemes.
        "acryl-pyhive[hive]"
    },
    "mssql": sql_common | {"sqlalchemy-pytds>=0.3"},
    "mysql": sql_common | {"pymysql>=1.0.2"},
    "postgres": sql_common | {"psycopg2-binary", "GeoAlchemy2"},