
Dataproc PySpark Job Fails with BigQuery Connector Issue - java.util.ServiceConfigurationError


I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error:

java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated

Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12
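For context, the failing call is a plain BigQuery read. A minimal sketch of what the script does around line 14 (the project and table names here are placeholders, not the actual script):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-read").getOrCreate()

# Hypothetical table; the real job reads its own project.dataset.table
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()  # this is the call that raises the ServiceConfigurationError
)
df.show(5)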

What I Have Tried So Far:

1️⃣ Verified Dataproc Version and Spark Version

I checked the Dataproc image version and Scala/Spark version using:

gcloud dataproc clusters describe cluster-be90 --region us-central1 | grep imageVersion

It returned:

imageVersion: 2.2.43-debian12

And the cluster uses:

  • Spark 3.5.1 / Scala 2.12.18
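To double-check the Spark/Scala pairing on the cluster itself, one option (a sketch; the zone is a placeholder, and the master node follows the <cluster>-m naming convention) is to run spark-submit --version on the master:

gcloud compute ssh cluster-be90-m --zone=us-central1-a
spark-submit --version   # prints the Spark version and "Using Scala version 2.12.18"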

2️⃣ Used Correct BigQuery Connector JAR

Initially, I used:

gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12.jar

This JAR was missing, so I tried:

gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.1.jar

Still getting the same error.
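Another route I have not tried yet would be to pull the connector by Maven coordinates instead of a GCS path (a sketch; the coordinate assumes connector 0.41.1 built for Scala 2.12):

gcloud dataproc jobs submit pyspark gs://my-bucket/dataproc_bigquery.py \
    --cluster=cluster-be90 \
    --region=us-central1 \
    --properties=spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.1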

3️⃣ Manually Uploaded the JAR to Cloud Storage

I tried downloading and uploading the JAR manually:

wget https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.1.jar
gsutil cp spark-bigquery-with-dependencies_2.12-0.41.1.jar gs://my-bucket/libs/

Then ran the job using:

gcloud dataproc jobs submit pyspark gs://my-bucket/dataproc_bigquery.py \
    --cluster=cluster-be90 \
    --region=us-central1 \
    --jars=gs://my-bucket/libs/spark-bigquery-with-dependencies_2.12-0.41.1.jar

Still no luck.

4️⃣ Verified IAM Permissions for Dataproc Service Account

I granted what I believed were sufficient roles to my Dataproc service account (a typical grant is sketched after this step).

Yet, the error persists.
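For reference, a grant along these lines is what I mean (the project ID, service account, and exact roles here are placeholders and may differ per setup):

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"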

What Else Can I Try?

I've tried everything I could think of. Could it be an issue with Scala 2.12 or Dataproc image compatibility? Or is there a different way to integrate BigQuery with Dataproc?

Any help would be greatly appreciated!

My log:

25/01/29 16:27:46 INFO SparkEnv: Registering MapOutputTracker
25/01/29 16:27:46 INFO SparkEnv: Registering BlockManagerMaster
25/01/29 16:27:46 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/01/29 16:27:46 INFO SparkEnv: Registering OutputCommitCoordinator
25/01/29 16:27:46 INFO MetricsConfig: Loaded properties from hadoop-metrics2.properties
25/01/29 16:27:46 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
25/01/29 16:27:46 INFO MetricsSystemImpl: google-hadoop-file-system metrics system started
25/01/29 16:27:47 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/01/29 16:27:47 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at my-cluster.local./10.128.0.2:8032
25/01/29 16:27:47 INFO AHSProxy: Connecting to Application History server at my-cluster.local./10.128.0.2:10200
25/01/29 16:27:48 INFO Configuration: resource-types.xml not found
25/01/29 16:27:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/01/29 16:27:49 INFO YarnClientImpl: Submitted application application_1234567890123_0009
25/01/29 16:27:50 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at my-cluster.local./10.128.0.2:8030
25/01/29 16:27:51 INFO GhfsGlobalStorageStatistics: periodic connector metrics: {gcs_api_client_non_found_response_count=1, gcs_api_client_side_error_count=1, gcs_api_time=316, gcs_api_total_request_count=2, gcs_connector_time=398, gcs_list_file_request=1, gcs_list_file_request_duration=158, gcs_list_file_request_max=158, gcs_list_file_request_mean=158, gcs_list_file_request_min=158, gcs_metadata_request=1, gcs_metadata_request_duration=158, gcs_metadata_request_max=158, gcs_metadata_request_mean=158, gcs_metadata_request_min=158, gs_filesystem_create=3, gs_filesystem_initialize=2, op_get_file_status=1, op_get_file_status_duration=398, op_get_file_status_max=398, op_get_file_status_mean=398, op_get_file_status_min=398, uptimeSeconds=6} [CONTEXT ratelimit_period="5 MINUTES" ]
25/01/29 16:27:51 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/01/29 16:27:52 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://my-temp-bucket/dataproc-job-history/application_1234567890123_0009.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]

Traceback (most recent call last):
  File "/tmp/1b7be59758994608b4125ab846fb6826/submitting_job.py", line 14, in <module>
    .load()
     ^^^^^^
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
    at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
    at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
    ...
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.newInstance(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:780)
    ... 31 more

25/01/29 16:27:54 INFO DataprocSparkPlugin: Shutting down driver plugin.


1 Answer

Quick fix: I got this answer from a member of the Google Dataproc organization:

Submit the job without the --jars flag. Dataproc 2.x images already ship the Spark BigQuery connector, so supplying a second copy via --jars appears to be what causes the Scala/classpath conflict:

gcloud dataproc jobs submit pyspark gs://your-bucket/script.py \
    --cluster=your-cluster --region=your-region
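
With the preinstalled connector picked up, the read-process-write pattern from the question should work without any extra JARs. A hedged sketch of the write-back side (table names are placeholders; the classic indirect write method would use a temporaryGcsBucket option instead of writeMethod=direct):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-roundtrip").getOrCreate()

# Hypothetical source and destination tables
df = spark.read.format("bigquery").option("table", "my-project.my_dataset.src_table").load()

result = df.limit(100)  # stand-in for the real processing

# Write back via the BigQuery Storage Write API ("direct" method);
# the indirect method uses .option("temporaryGcsBucket", "some-bucket") instead
(
    result.write.format("bigquery")
    .option("writeMethod", "direct")
    .mode("append")
    .save("my-project.my_dataset.dst_table")
)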