I’m building a data lakehouse using Docker, and I’m attempting to create a Hive table via Spark SQL. I initialize my Spark session with:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import os
from pyspark.sql.functions import *
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
spark = (
    SparkSession.builder
    .appName("appname")
    # S3A / MinIO object-store settings
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    # Delta Lake extension and catalog
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # SQL behavior
    .config("spark.sql.legacy.charVarcharAsString", "true")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # External Hive metastore connection
    .config("hive.metastore.uris", "thrift://metastore:9083")
    .config("spark.sql.hive.metastore.version", "3.1.3")
    .config("spark.sql.hive.metastore.jars", "maven")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()
)
I also configure the metastore connection explicitly with a custom hive-site.xml for Spark:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore:9083</value>
  <description></description>
</property>
However, during table creation, Spark appears to generate its own Thrift endpoint:
15:34:35.380 [Thread-4] ERROR org.apache.hadoop.hive.metastore.utils.MetaStoreUtils - Got exception: java.net.URISyntaxException Illegal character in hostname at index 23: thrift://metastore.data_lakehouse2_lakehouse:9083
which conflicts with my Hive container's connection. In "thrift://metastore.data_lakehouse2_lakehouse:9083", "data_lakehouse2" is the Compose project (folder) name and "lakehouse" is the network name. I don't know where this Thrift URI comes from.
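My current guess (unverified): Docker's DNS expands the service name metastore with the Compose network's domain, data_lakehouse2_lakehouse, and index 23 in the error lands exactly on its first underscore; underscores are not legal characters in a URI hostname, so the rebuilt URI fails to parse. If that is the cause, a hypothetical docker-compose.yml fragment that gives the network an explicit, underscore-free name should sidestep the expansion:

# Hypothetical docker-compose.yml fragment: name the network explicitly,
# without underscores, so any DNS expansion of "metastore" produces a
# hostname that is still legal inside a URI.
services:
  metastore:
    networks:
      - lakehouse
networks:
  lakehouse:
    name: lakehouse-net   # instead of the generated "data_lakehouse2_lakehouse"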
If needed, these are my SQL queries:
spark.sql("DROP SCHEMA IF EXISTS test CASCADE")
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("USE test")
Can someone tell me the cause of this?
Edit: the error started after I set .config("spark.sql.hive.metastore.jars", "maven"). Maybe one of the downloaded jars is generating that Thrift URI? If I use the default "2.3.9" metastore version with the builtin jars it works, but then I get other issues unrelated to this topic; that's why I want to resolve this.
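In case it helps isolate the problem: a variant I am considering (a sketch, assuming Spark 3.1+ and that the Hive 3.1.3 client jars are already staged under /opt/hive-jars, a hypothetical path) pins the metastore jars locally instead of resolving them from Maven at startup, which should at least show whether the Maven-fetched jars are involved:

from pyspark.sql import SparkSession

# Sketch: use locally staged Hive 3.1.3 client jars ("path") instead of
# "maven". "/opt/hive-jars/*" is an assumed location, not part of my setup.
spark = (
    SparkSession.builder
    .appName("appname")
    .config("hive.metastore.uris", "thrift://metastore:9083")
    .config("spark.sql.hive.metastore.version", "3.1.3")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", "/opt/hive-jars/*")
    .enableHiveSupport()
    .getOrCreate()
)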
1 Answer
Fixed this issue by changing the hive.metastore.uris value to "thrift://host.docker.internal:9083" in Spark's hive-site.xml.
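For reference, the changed fragment looks like this (a sketch; host.docker.internal resolves out of the box on Docker Desktop, while on Linux you may need an extra_hosts: "host.docker.internal:host-gateway" mapping on the Spark container):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://host.docker.internal:9083</value>
  <description>Reach the metastore via the Docker host so the Compose
  network domain (with its underscore) is never appended</description>
</property>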