The Apache Hive data warehouse software facilitates reading, writing, and managing datasets residing in distributed storage using SQL.
First, download the Cloudera Hive JDBC driver.
To enable the ExaLoader to fetch data from the external database, you must register the driver for the ExaLoader as described in the installation procedure for JDBC drivers.
- ExaLoader expects the driver in the BucketFS folder default/drivers/jdbc. If you uploaded the driver for the UDF to a different folder, then you need to upload the driver again.
- Additionally, you need to create a file settings.cfg and upload it to the same folder in BucketFS:
DRIVERNAME=HIVE
JAR=HiveJDBC42.jar
DRIVERMAIN=com.cloudera.hive.jdbc.HS2Driver
PREFIX=jdbc:hive2:
FETCHSIZE=100000
INSERTSIZE=-1
NOSECURITY=YES
Please ensure that the file contains a trailing newline.
- Download the Cloudera Hive JDBC driver.
- Upload the driver to BucketFS, see the BucketFS documentation for details (an example upload command is shown after this list).
  Hint: Put the driver into the folder default/drivers/jdbc/ to register it for ExaLoader, too.
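For illustration, uploading the driver via HTTP with curl typically looks like the sketch below. The host, BucketFS port, and write password are placeholders for your own setup, and the BucketFS documentation remains the authoritative reference:
curl -X PUT -T HiveJDBC42.jar http://w:<write password>@<Exasol host>:<BucketFS port>/default/drivers/jdbc/HiveJDBC42.jar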
We are using the Cloudera Hive JDBC driver because it provides the metadata query support needed to build the Virtual Schema tables, column names, and column types.
With this in mind, please use the Cloudera JDBC properties when building a connection URL string. For example, use SSLTrustStorePwd instead of sslTrustStorePassword. You can check out the Cloudera Hive JDBC Manual for all available properties.
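For illustration, a connection that enables SSL with the Cloudera-style property names could look like the sketch below. The SSL, SSLTrustStore, and SSLTrustStorePwd properties and all placeholder values are examples only; check the Cloudera Hive JDBC Manual for the exact properties your driver version supports:
CREATE OR REPLACE CONNECTION HIVE_SSL_CONNECTION
TO 'jdbc:hive2://<Hive host>:<port>;SSL=1;SSLTrustStore=<path to trust store>;SSLTrustStorePwd=<trust store password>'
USER '<user>'
IDENTIFIED BY '<password>';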
Upload the latest available release of the Hive Virtual Schema to BucketFS.
Then create a schema to hold the adapter script.
CREATE SCHEMA ADAPTER;
The SQL statement below creates the adapter script, defines the Java class that serves as entry point and tells the UDF framework where to find the libraries (JAR files) for Virtual Schema and database driver.
--/
CREATE OR REPLACE JAVA ADAPTER SCRIPT ADAPTER.JDBC_ADAPTER AS
%scriptclass com.exasol.adapter.RequestDispatcher;
%jar /buckets/<BFS service>/<bucket>/jars/virtual-schema-dist-12.0.0-hive-3.0.0.jar;
%jar /buckets/<BFS service>/<bucket>/jars/HiveJDBC42.jar;
/
Define the connection to Hive as shown below.
CREATE OR REPLACE CONNECTION HIVE_CONNECTION
TO 'jdbc:hive2://<Hive host>:<port>'
USER '<user>'
IDENTIFIED BY '<password>';
Below you see how a Hive Virtual Schema is created. Please note that you have to provide the name of a schema.
CREATE VIRTUAL SCHEMA <virtual schema name>
USING ADAPTER.JDBC_ADAPTER
WITH
CONNECTION_NAME = 'HIVE_CONNECTION'
SCHEMA_NAME = '<schema name>';
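Once the statement completes, you can query the mapped tables like regular Exasol tables. A minimal usage sketch with placeholder names:
SELECT * FROM <virtual schema name>.<table name>;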
Connecting to a Kerberos-secured Hive service only differs in one aspect: you have a CONNECTION object which contains all the relevant information for the Kerberos authentication. This section describes how Kerberos authentication works and how to create such a CONNECTION.
Both the adapter script and the internally used IMPORT FROM JDBC statement support Kerberos authentication. They detect that the connection is a Kerberos connection by a special prefix in the IDENTIFIED BY field. In that case, the authentication happens using a Kerberos keytab and Kerberos configuration file (using the JAAS Java API).
The CONNECTION object stores all relevant information and files in its fields:
- The TO field contains the JDBC connection string.
- The USER field contains the Kerberos principal.
- The IDENTIFIED BY field contains the Kerberos configuration file and keytab file (base64 encoded), along with an internal prefix ExaAuthType=Kerberos; to identify the CONNECTION as a Kerberos CONNECTION.
In order to simplify the creation of Kerberos CONNECTION objects, the create_kerberos_conn.py Python script has been provided. The script takes the following arguments:
- CONNECTION name (an arbitrary name for the new CONNECTION)
- Kerberos principal for Hadoop (i.e., Hadoop user)
- Kerberos configuration file path (e.g., krb5.conf)
- Kerberos keytab file path, which contains keys for the Kerberos principal
- JDBC connection string
Example command (the JDBC connection string is omitted here and added to the connection afterwards, as described below):
python tools/create_kerberos_conn.py krb_conn [email protected] /etc/krb5.conf ./krbuser.keytab
This outputs a CREATE CONNECTION statement:
CREATE CONNECTION krb_conn
TO ''
USER '[email protected]'
IDENTIFIED BY 'ExaAuthType=Kerberos;enp6Cg==;YWFhCg==';
The regular Kerberos configuration file /etc/krb5.conf may contain includedir directives that include additional properties. To accommodate all properties in a single file, please make a copy of the /etc/krb5.conf file and update it accordingly.
Example of /etc/krb5.conf contents:
includedir /etc/krb5.conf.d/
includedir /var/lib/sss/pubconf/krb5.include.d/
[libdefaults]
dns_lookup_kdc = false
dns_lookup_realm = false
...
Copied and changed krb5.conf file after adding the properties from the includedir directories:
# includedir /etc/krb5.conf.d/
# includedir /var/lib/sss/pubconf/krb5.include.d/
[libdefaults]
dns_lookup_kdc = false
dns_lookup_realm = false
udp_preference_limit = 1
[capaths]
...
[plugins]
...
In this example, the includedir folders contain a [libdefaults] setting udp_preference_limit = 1 and two new sections, [capaths] and [plugins], that were not in the original krb5.conf file. We add these settings to the copied krb5.conf file and use it to create the connection object.
This is required because the Virtual Schema runs in a sandboxed container that does not have access to the host operating system's files.
Next, add the JDBC connection URL to the TO part of the connection object. For the Thrift protocol, update the TO part of the connection string as follows:
CREATE OR REPLACE CONNECTION krb_conn
TO 'jdbc:hive2://<Hive host>:<port>;AuthMech=1;KrbAuthType=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=_HOST;KrbServiceName=hive'
USER '[email protected]'
IDENTIFIED BY 'ExaAuthType=Kerberos;enp6Cg==;YWFhCg==';
Similarly, if the HTTP protocol is enabled, update the TO part of the connection string with an HTTP-enabled URL:
CREATE OR REPLACE CONNECTION krb_conn
TO 'jdbc:hive2://<Hive host>:<port>;AuthMech=1;KrbAuthType=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=_HOST;KrbServiceName=hive;transportMode=http;httpPath=cliservice'
USER '[email protected]'
IDENTIFIED BY 'ExaAuthType=Kerberos;enp6Cg==;YWFhCg==';
As you can see, we are using KrbAuthType=1. This allows the Hive JDBC driver to check the java.security.auth.login.config system property for a JAAS configuration. If a JAAS configuration is specified, the driver uses that information to create a LoginContext and then uses the Subject associated with it. The Exasol Virtual Schema creates a JAAS configuration at runtime using the information from the IDENTIFIED BY part of the connection object.
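For reference, a generic JAAS Kerberos entry of this kind looks roughly like the sketch below. The entry name, file paths, and principal are placeholders, and the configuration the adapter generates at runtime may differ in its details:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/krbuser.keytab"
  principal="[email protected]"
  doNotPrompt=true;
};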
You have to execute the generated CREATE CONNECTION statement directly in Exasol to actually create the Kerberos CONNECTION object. For more detailed information about the script, use its help option.
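Assuming the script follows the standard Python command-line conventions, the help text can be displayed like this:
python tools/create_kerberos_conn.py -h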
You can now create a virtual schema using the Kerberos connection created before.
CREATE VIRTUAL SCHEMA <virtual schema name>
USING ADAPTER.JDBC_ADAPTER
WITH
CONNECTION_NAME = 'KRB_CONN'
SCHEMA_NAME = '<schema name>';
You can add the LogLevel and LogPath parameters to the JDBC URL (the TO part of the connection object) to enable Hive JDBC driver and connection logs.
For example:
TO 'jdbc:hive2://<Hive host>:<port>;AuthMech=1;KrbAuthType=1;KrbRealm=EXAMPLE.COM;KrbHostFQDN=_HOST;KrbServiceName=hive;transportMode=http;httpPath=cliservice;LogLevel=6;LogPath=/tmp/'
This will create log files in the sandbox /tmp/ folder. Please check the exasol_container_sockets folder for the HiveJDBC_driver.log and HiveJDBC_connection_*.log files during Virtual Schema creation or a query run. These files will contain exception logs if there are any.
You can set the LogLevel to a number from 0 (disables all logging) to 6 (logs all driver activity). For more information, please check out the Cloudera Hive JDBC Manual section on "Configuring Logging".
It is not possible to create a Virtual Schema with Hive version 1.1.0, because the JDBC drivers for it do not support the method we use for retrieving the metadata.
The unsupported driver we tested with: hive-jdbc-1.1.0-cdh5.16.2-standalone.jar
Hive operates on schemaless data. Virtual Schemas, as the name suggests, require a schema. In the case of string variables this creates a situation where the Hive JDBC driver cannot tell how long strings are when asked for schema data. To determine this, it would have to scan all data first, which is not an option.
Instead, the driver simply reports a configurable fixed value as the maximum string size. By default this is 255 characters. In the Virtual Schema this is mapped to VARCHAR(255) columns.
If you need larger strings, you have to override the default setting of the Hive JDBC driver by adding the following parameter to the JDBC connection string when creating the Virtual Schema:
DefaultStringColumnLength=<length>
An example connection string would then look like this:
jdbc:hive2://localhost:10000;DefaultStringColumnLength=32767;
Please also note that 32 KiB is the maximum string size the driver accepts.
Sometimes the Virtual Schema is not created, but it also does not fail with any exceptions. In such cases, you can enable remote logging and check the logs.
2020-06-12 06:53:12.309 WARNING [c.e.a.j.BaseTableMetadataReader] Table scan did not find any tables. This can mean that either a) the source does not contain tables (yet), b) the table type is not supported or c) the table scan filter criteria is incorrect. Please check that the source actually contains tables. Also check the spelling and exact case of any catalog or schema name you provided.
If you see the warning message as above, please ensure that the user specified in the connection string has correct Hive access privileges in addition to above suggestions. Otherwise, the Virtual Schema may not return any metadata.
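To enable remote logging, set the logging properties on the virtual schema. A minimal sketch, assuming the standard DEBUG_ADDRESS and LOG_LEVEL properties of the JDBC-based virtual schemas and a placeholder listener address:
ALTER VIRTUAL SCHEMA <virtual schema name> SET DEBUG_ADDRESS='<listener host>:3000' LOG_LEVEL='ALL';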
This error means you have specified the Thrift protocol, and the driver cannot establish a connection using Thrift. Check whether the Hive cluster accepts Thrift connections. Or, if the HTTP protocol is enabled, change the JDBC connection string as specified for the HTTP protocol.
An example error message ("Invalid status 72") is shown below:
2020-05-19 13:56:54.573 FINE [c.e.a.RequestDispatcher] Stack trace:
com.exasol.adapter.jdbc.RemoteMetadataReaderException: Unable to create Hive remote metadata reader.
at com.exasol.adapter.dialects.hive.HiveSqlDialect.createRemoteMetadataReader(HiveSqlDialect.java:126)
at com.exasol.adapter.dialects.AbstractSqlDialect.readSchemaMetadata(AbstractSqlDialect.java:138)
at com.exasol.adapter.jdbc.JdbcAdapter.readMetadata(JdbcAdapter.java:56)
at com.exasol.adapter.jdbc.JdbcAdapter.createVirtualSchema(JdbcAdapter.java:33)
at com.exasol.adapter.RequestDispatcher.dispatchCreateVirtualSchemaRequestToAdapter(RequestDispatcher.java:110)
at com.exasol.adapter.RequestDispatcher.processRequest(RequestDispatcher.java:70)
at com.exasol.adapter.RequestDispatcher.executeAdapterCall(RequestDispatcher.java:52)
at com.exasol.adapter.RequestDispatcher.adapterCall(RequestDispatcher.java:41)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.exasol.ExaWrapper.runSingleCall(ExaWrapper.java:95)
Caused by: java.sql.SQLException: Could not open client transport with JDBC Uri: <JDBC_URL>: Invalid status 72
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:333)
at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:107)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
... 12 more
Caused by: org.apache.hive.org.apache.thrift.transport.TTransportException: Invalid status 72
at org.apache.hive.org.apache.thrift.transport.TSaslTransport.sendAndThrowMessage(TSaslTransport.java:232)
at org.apache.hive.org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:184)
at org.apache.hive.org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:307)
at org.apache.hive.org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:420)
at org.apache.hive.jdbc.HiveConnection.<init>(HiveConnection.java:301)
... 19 more
The GSS initiate issue happens when the internal Hive host is unreachable. Please check and ensure that the Hive host specified via the KrbHostFQDN parameter is correct.
An example failure exception is shown below:
2020-05-25 14:44:02.707 FINE [c.e.a.RequestDispatcher] Stack trace:
com.exasol.adapter.jdbc.RemoteMetadataReaderException: Unable to create Hive remote metadata reader.
at com.exasol.adapter.dialects.hive.HiveSqlDialect.createRemoteMetadataReader(HiveSqlDialect.java:126)
at com.exasol.adapter.dialects.AbstractSqlDialect.readSchemaMetadata(AbstractSqlDialect.java:138)
at com.exasol.adapter.jdbc.JdbcAdapter.readMetadata(JdbcAdapter.java:56)
at com.exasol.adapter.jdbc.JdbcAdapter.createVirtualSchema(JdbcAdapter.java:33)
at com.exasol.adapter.RequestDispatcher.dispatchCreateVirtualSchemaRequestToAdapter(RequestDispatcher.java:110)
at com.exasol.adapter.RequestDispatcher.processRequest(RequestDispatcher.java:70)
at com.exasol.adapter.RequestDispatcher.executeAdapterCall(RequestDispatcher.java:52)
at com.exasol.adapter.RequestDispatcher.adapterCall(RequestDispatcher.java:41)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.exasol.ExaWrapper.runSingleCall(ExaWrapper.java:95)
Caused by: java.sql.SQLException: [Cloudera][HiveJDBCDriver](500164) Error initialized or created transport for authentication: [Cloudera][HiveJDBCDriver](500169) Unable to connect to server: GSS initiate failed.
at com.cloudera.hiveserver2.hivecommon.api.HiveServer2ClientFactory.createTransport(Unknown Source)
at com.cloudera.hiveserver2.hivecommon.api.ServiceDiscoveryFactory.createClient(Unknown Source)
at com.cloudera.hiveserver2.hivecommon.core.HiveJDBCCommonConnection.establishConnection(Unknown Source)
at com.cloudera.hiveserver2.jdbc.core.LoginTimeoutConnection.connect(Unknown Source)
at com.cloudera.hiveserver2.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
at com.cloudera.hiveserver2.jdbc.common.AbstractDriver.connect(Unknown Source)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:189)
at com.exasol.adapter.jdbc.RemoteConnectionFactory.establishConnectionWithKerberos(RemoteConnectionFactory.java:74)
at com.exasol.adapter.jdbc.RemoteConnectionFactory.createConnection(RemoteConnectionFactory.java:55)
at com.exasol.adapter.jdbc.RemoteConnectionFactory.getConnection(RemoteConnectionFactory.java:38)
at com.exasol.adapter.dialects.hive.HiveSqlDialect.createRemoteMetadataReader(HiveSqlDialect.java:124)
at com.exasol.adapter.dialects.AbstractSqlDialect.readSchemaMetadata(AbstractSqlDialect.java:138)
at com.exasol.adapter.jdbc.JdbcAdapter.readMetadata(JdbcAdapter.java:56)
at com.exasol.adapter.jdbc.JdbcAdapter.createVirtualSchema(JdbcAdapter.java:33)
at com.exasol.adapter.RequestDispatcher.dispatchCreateVirtualSchemaRequestToAdapter(RequestDispatcher.java:110)
at com.exasol.adapter.RequestDispatcher.processRequest(RequestDispatcher.java:70)
at com.exasol.adapter.RequestDispatcher.executeAdapterCall(RequestDispatcher.java:52)
at com.exasol.adapter.RequestDispatcher.adapterCall(RequestDispatcher.java:41)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
... 23 more
In Java JDK version 1.8.0_242, we found out that the Kerberos client alias is used instead of the Kerberos principal. This can cause permission issues, since the alias does not have access privileges, for example, when connecting to the Hive server.
You can read about similar issues at these links:
- https://bugs.launchpad.net/ubuntu/+source/openjdk-8/+bug/1861883
- https://bugs.openjdk.java.net/browse/JDK-8239385
One possible solution to this problem is to disable Kerberos principal name aliasing by setting the JVM property sun.security.krb5.disableReferrals to true. This property should be set in both the Virtual Schema adapter and the ExaLoader.
In Virtual Schema adapter:
--/
CREATE OR REPLACE JAVA ADAPTER SCRIPT ADAPTER.JDBC_ADAPTER AS
%jvmoption -Dsun.security.krb5.disableReferrals=true;
%scriptclass com.exasol.adapter.RequestDispatcher;
%jar /buckets/<BFS service>/<bucket>/jars/virtual-schema-dist-12.0.0-hive-3.0.0.jar;
%jar /buckets/<BFS service>/<bucket>/jars/HiveJDBC42.jar;
/
In ExaLoader:
-etlJdbcJavaEnv=-Dsun.security.krb5.disableReferrals=true
Setting this property in ExaLoader requires a database restart and a short downtime. It affects only the Kerberos-related Java user-defined functions and adapters.
More references about the Kerberos principal aliasing:
- The introduction of Kerberos cross-realm support: https://bugs.openjdk.java.net/browse/JDK-8215032
- The cross-realm RFC, which also describes how aliasing works: https://datatracker.ietf.org/doc/rfc6806/?include_text=1
- Introduction of the system property to disable the referrals: https://bugs.openjdk.java.net/browse/JDK-8223172
- Similar issues and suggestion: https://stackoverflow.com/questions/60041120/impersonation-issue-after-migrating-from-oracle-jdk-8-to-open-jdk-8-in-cloudera
This table includes data types available for Hive versions < 3.0.0.
Hive Data Type | Supported | Converted Exasol Data Type | Known limitations |
---|---|---|---|
ARRAY | ✓ | VARCHAR | |
BIGINT | ✓ | DECIMAL(19,0) | |
BINARY | ✓ | VARCHAR(2000000) | |
BOOLEAN | ✓ | BOOLEAN | |
CHAR | ✓ | CHAR | |
DECIMAL | ✓ | DECIMAL, VARCHAR | *DECIMAL with precision > 36 is cast to VARCHAR to prevent a loss of precision. |
DATE | ✓ | DATE | |
DOUBLE | ✓ | DOUBLE PRECISION | |
FLOAT | ✓ | DOUBLE PRECISION | |
INT | ✓ | DECIMAL(10,0) | |
INTERVAL | ? | Not tested | |
MAP | ✓ | VARCHAR | |
SMALLINT | ✓ | DECIMAL(5,0) | |
STRING | ✓ | VARCHAR | |
STRUCT | ✓ | VARCHAR | |
TIMESTAMP | ✓ | TIMESTAMP | |
TINYINT | ✓ | DECIMAL(3,0) | |
UNIONTYPE | ? | Not tested | |
VARCHAR | ✓ | VARCHAR | |
- If you want to return a DECIMAL type instead of VARCHAR when the precision is > 36, you can set the property HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE, as shown in the example after this note:
HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE='36,20'
This will cast DECIMAL with precision > 36 and DECIMAL without a specified precision to DECIMAL(36,20). Please keep in mind that this will yield errors if the data in the Hive database does not fit into the specified DECIMAL type.
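A minimal sketch of setting this property when creating the virtual schema, reusing the connection and placeholder names from earlier in this guide:
CREATE VIRTUAL SCHEMA <virtual schema name>
USING ADAPTER.JDBC_ADAPTER
WITH
CONNECTION_NAME = 'HIVE_CONNECTION'
SCHEMA_NAME = '<schema name>'
HIVE_CAST_NUMBER_TO_DECIMAL_WITH_PRECISION_AND_SCALE = '36,20';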
In the following matrix you find the combinations of JDBC driver and dialect version that we tested. The dialect was tested with the Cloudera Hive JDBC driver available on the Cloudera downloads page. The driver is also available directly from Simba Technologies, who developed the driver.
Virtual Schema Version | Hive Version | Driver Name | Driver Version |
---|---|---|---|
4.0.3 | 2.3.2 | HiveJDBC | 4.1 |
2.0.5 | 2.3.2 | HiveJDBC42.jar | 2.6.23 |