
Spark SQL creates Hive table but generates conflicting Thrift port preventing connection to Hive container


I’m building a data lakehouse using Docker, and I’m attempting to create a Hive table via Spark SQL. I initialize my Spark session with:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import os
from pyspark.sql.functions import *
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("appname") \
    .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "true") \
    .config("spark.hadoop.fs.s3a.impl", ".apache.hadoop.fs.s3a.S3AFileSystem")\
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", ".apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio-host:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minio") \
    .config("spark.hadoop.fs.s3a.secret.key", "minio123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")\
    .config("spark.sql.legacy.charVarcharAsString", True)\
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")\
    .config("hive.metastore.uris", "thrift://metastore:9083") \
    .config("spark.sql.hive.metastore.version","3.1.3")\
    .config("spark.sql.hive.metastore.jars","maven") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict")\
    .enableHiveSupport() \
    .getOrCreate()

I also configure the metastore connection explicitly with a custom hive-site.xml for Spark:

<property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore:9083</value>
      <description></description>
  </property>
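
To confirm which URI the session actually picked up from these settings, a quick check (a sketch; both lookups are standard PySpark APIs, assuming the session built above):

# Inspect what the running session resolved for hive.metastore.uris,
# to verify the builder config / hive-site.xml was read as written.
print(spark.sparkContext.getConf().get("hive.metastore.uris", "not set"))
print(spark.conf.get("hive.metastore.uris", "not set"))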

However, during table creation, Spark appears to generate its own Thrift endpoint:

15:34:35.380 [Thread-4] ERROR org.apache.hadoop.hive.metastore.utils.MetaStoreUtils - Got exception: java.net.URISyntaxException Illegal character in hostname at index 23: thrift://metastore.data_lakehouse2_lakehouse:9083

which conflicts with my Hive container's connection. In "thrift://metastore.data_lakehouse2_lakehouse:9083", "data_lakehouse2" is the folder name/Compose stack and "lakehouse" is the network name. I don't know how this Thrift URI is generated.
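
One way to see where that name can come from (a sketch to run inside the Spark container; it assumes Docker's embedded DNS, which also answers network-scoped names of the form <service>.<project>_<network>):

import socket

# Docker's embedded DNS resolves both the short Compose service name and a
# network-scoped alias <service>.<project>_<network>. The alias contains
# underscores, which java.net.URI rejects as illegal hostname characters.
for name in ("metastore", "metastore.data_lakehouse2_lakehouse"):
    try:
        print(name, "->", socket.gethostbyname(name))
    except socket.gaierror as exc:
        print(name, "-> unresolved:", exc)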

If needed, this is my SQL query:

spark.sql("DROP SCHEMA IF EXISTS test CASCADE")
spark.sql("CREATE DATABASE IF NOT EXISTS test")
spark.sql("USE test")

Can someone tell me the cause of this?

Edit: when configuring .config("spark.sql.hive.metastore.jars", "maven"), maybe some of those jars are generating this Thrift URI? If I use the default 2.3.9 version with the built-in jars it works, but I run into other issues unrelated to this topic; that's why I want to resolve this.
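
For context, a sketch of the two client setups being compared; the exception surfacing in org.apache.hadoop.hive.metastore.utils.MetaStoreUtils suggests (an assumption, not confirmed by the question) that the 3.1.3 client canonicalizes the metastore host, which is how the underscored Compose FQDN could appear:

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("appname").enableHiveSupport()

# Works: Spark's built-in 2.3.9 Hive client uses the URI as written.
builder = builder \
    .config("spark.sql.hive.metastore.version", "2.3.9") \
    .config("spark.sql.hive.metastore.jars", "builtin")

# Fails with the URISyntaxException above: the 3.1.3 client fetched from
# Maven appears to resolve the host to its canonical name,
# metastore.data_lakehouse2_lakehouse, which java.net.URI rejects.
# builder = builder \
#     .config("spark.sql.hive.metastore.version", "3.1.3") \
#     .config("spark.sql.hive.metastore.jars", "maven")

spark = builder.getOrCreate()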

1 Answer

Fixed this issue by changing the hive.metastore.uris value to "thrift://host.docker.internal:9083" in the hive-site.xml for Spark.
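
For reference, the same fix expressed on the session builder instead of hive-site.xml (a sketch mirroring the question's configuration; host.docker.internal sidesteps the underscore-containing Compose name):

from pyspark.sql import SparkSession

# host.docker.internal resolves to the Docker host, avoiding the
# network-scoped alias whose underscores java.net.URI rejects.
spark = SparkSession.builder \
    .appName("appname") \
    .config("hive.metastore.uris", "thrift://host.docker.internal:9083") \
    .enableHiveSupport() \
    .getOrCreate()

Note that host.docker.internal is available by default on Docker Desktop; on Linux it may require an extra_hosts entry in the Compose file.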
