
Dataproc PySpark Job Fails with BigQuery Connector Issue - java.util.ServiceConfigurationError


I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error:

java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated

Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12
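For context, the failing call is a plain BigQuery read. A minimal sketch of what the script does around line 14 (the project and table names here are placeholders, not the actual script):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-read").getOrCreate()

# Hypothetical table; the real job reads its own project.dataset.table
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")
    .load()  # this is the call that raises the ServiceConfigurationError
)
df.show(5)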

What I Have Tried So Far:

1️⃣ Verified Dataproc Version and Spark Version

I checked the Dataproc image version and Scala/Spark version using:

gcloud dataproc clusters describe cluster-be90 --region us-central1 | grep imageVersion

It returned:

imageVersion: 2.2.43-debian12

And the cluster uses:

  • Spark 3.5.1 / Scala 2.12.18
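To double-check the Spark/Scala pairing on the cluster itself, one option (a sketch; the zone is a placeholder, and the master node follows the <cluster>-m naming convention) is to run spark-submit --version on the master:

gcloud compute ssh cluster-be90-m --zone=us-central1-a
spark-submit --version   # prints the Spark version and "Using Scala version 2.12.18"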

2️⃣ Used Correct BigQuery Connector JAR

Initially, I used:

gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12.jar

This JAR was missing, so I tried:

gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.1.jar

Still getting the same error.
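Another route I have not tried yet would be to pull the connector by Maven coordinates instead of a GCS path (a sketch; the coordinate assumes connector 0.41.1 built for Scala 2.12):

gcloud dataproc jobs submit pyspark gs://my-bucket/dataproc_bigquery.py \
    --cluster=cluster-be90 \
    --region=us-central1 \
    --properties=spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.41.1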

3️⃣ Manually Uploaded the JAR to Cloud Storage

I tried downloading and uploading the JAR manually:

wget https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.1.jar
gsutil cp spark-bigquery-with-dependencies_2.12-0.41.1.jar gs://my-bucket/libs/

Then ran the job using:

gcloud dataproc jobs submit pyspark gs://my-bucket/dataproc_bigquery.py \
    --cluster=cluster-be90 \
    --region=us-central1 \
    --jars=gs://my-bucket/libs/spark-bigquery-with-dependencies_2.12-0.41.1.jar

Still no luck.

4️⃣ Verified IAM Permissions for Dataproc Service Account

I granted what I believed were sufficient roles to my Dataproc service account (a typical grant is sketched after this step).

Yet, the error persists.
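For reference, a grant along these lines is what I mean (the project ID, service account, and exact roles here are placeholders and may differ per setup):

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-dataproc-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"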

What Else Can I Try?

I've tried everything I could think of. Could it be an issue with Scala 2.12 or Dataproc image compatibility? Or is there a different way to integrate BigQuery with Dataproc?

Any help would be greatly appreciated!

My log:

25/01/29 16:27:46 INFO SparkEnv: Registering MapOutputTracker
25/01/29 16:27:46 INFO SparkEnv: Registering BlockManagerMaster
25/01/29 16:27:46 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/01/29 16:27:46 INFO SparkEnv: Registering OutputCommitCoordinator
25/01/29 16:27:46 INFO MetricsConfig: Loaded properties from hadoop-metrics2.properties
25/01/29 16:27:46 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
25/01/29 16:27:46 INFO MetricsSystemImpl: google-hadoop-file-system metrics system started
25/01/29 16:27:47 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/01/29 16:27:47 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at my-cluster.local./10.128.0.2:8032
25/01/29 16:27:47 INFO AHSProxy: Connecting to Application History server at my-cluster.local./10.128.0.2:10200
25/01/29 16:27:48 INFO Configuration: resource-types.xml not found
25/01/29 16:27:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/01/29 16:27:49 INFO YarnClientImpl: Submitted application application_1234567890123_0009
25/01/29 16:27:50 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at my-cluster.local./10.128.0.2:8030
25/01/29 16:27:51 INFO GhfsGlobalStorageStatistics: periodic connector metrics: {gcs_api_client_non_found_response_count=1, gcs_api_client_side_error_count=1, gcs_api_time=316, gcs_api_total_request_count=2, gcs_connector_time=398, gcs_list_file_request=1, gcs_list_file_request_duration=158, gcs_list_file_request_max=158, gcs_list_file_request_mean=158, gcs_list_file_request_min=158, gcs_metadata_request=1, gcs_metadata_request_duration=158, gcs_metadata_request_max=158, gcs_metadata_request_mean=158, gcs_metadata_request_min=158, gs_filesystem_create=3, gs_filesystem_initialize=2, op_get_file_status=1, op_get_file_status_duration=398, op_get_file_status_max=398, op_get_file_status_mean=398, op_get_file_status_min=398, uptimeSeconds=6} [CONTEXT ratelimit_period="5 MINUTES" ]
25/01/29 16:27:51 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/01/29 16:27:52 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://my-temp-bucket/dataproc-job-history/application_1234567890123_0009.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]

Traceback (most recent call last):
  File "/tmp/1b7be59758994608b4125ab846fb6826/submitting_job.py", line 14, in <module>
    .load()
     ^^^^^^
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
    at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
    at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
    ...
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.newInstance(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:780)
    ... 31 more

25/01/29 16:27:54 INFO DataprocSparkPlugin: Shutting down driver plugin.


1 Answer

Quick fix: I got this answer from a member of the Google Dataproc organization:

Submit the job without the --jars flag. Dataproc 2.x images already ship the Spark BigQuery connector, so supplying a second copy via --jars appears to be what causes the Scala/classpath conflict:

gcloud dataproc jobs submit pyspark gs://your-bucket/script.py \
    --cluster=your-cluster --region=your-region
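
With the preinstalled connector picked up, the read-process-write pattern from the question should work without any extra JARs. A hedged sketch of the write-back side (table names are placeholders; the classic indirect write method would use a temporaryGcsBucket option instead of writeMethod=direct):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-roundtrip").getOrCreate()

# Hypothetical source and destination tables
df = spark.read.format("bigquery").option("table", "my-project.my_dataset.src_table").load()

result = df.limit(100)  # stand-in for the real processing

# Write back via the BigQuery Storage Write API ("direct" method);
# the indirect method uses .option("temporaryGcsBucket", "some-bucket") instead
(
    result.write.format("bigquery")
    .option("writeMethod", "direct")
    .mode("append")
    .save("my-project.my_dataset.dst_table")
)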