I need help debugging a connect timeout error that occurs while reading data from Google Cloud Storage. In the example below I am using a publicly accessible dataset, so I think I can rule out authentication issues.
I have deployed a JupyterHub instance on GKE using the Zero to JupyterHub Helm chart. It can be deployed easily with
helm upgrade --cleanup-on-fail \
  --install jupyterhub-test jupyterhub/jupyterhub \
  --namespace $NAMESPACE \
  --create-namespace \
  --version=4.1.0 \
  --values config.yaml

with only a basic config file (config.yaml):
cull:
  enabled: false
hub:
  config:
    Authenticator:
      admin_users:
        - test
      allowed_users:
        - test
      # ...
    DummyAuthenticator:
      password: test
    JupyterHub:
      authenticator_class: dummy
singleuser:
  storage:
    dynamic:
      storageClass: jupyterhub-user-pd-balanced
  memory:
    limit: 4G
    guarantee: 4G
  cpu:
    limit: 1
    guarantee: 1
  extraEnv:
    JUPYTERHUB_SINGLEUSER_APP: "jupyter_server.serverapp.ServerApp"
  image:
    name: jupyter/pyspark-notebook
    tag: 5.2.1
  cmd: null
debug:
  enabled: true
and the following StorageClass configuration:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: jupyterhub-user-pd-balanced
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-balanced
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - us-east1-b
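For reference, I applied this with kubectl before installing the chart (the file name storageclass.yaml is just what I called it locally):

kubectl apply -f storageclass.yaml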
To access the files on GCS I need the Hadoop GCS connector. I spawned a new notebook and downloaded the dependencies there:
curl https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.8.0/iceberg-spark-runtime-3.5_2.12-1.8.0.jar -Lo /home/jovyan/.ivy2/jars/iceberg-spark-runtime-3.5_2.12-1.8.0.jar
curl https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/3.0.4/gcs-connector-3.0.4-shaded.jar -Lo /home/jovyan/.ivy2/jars/gcs-connector-3.0.4-shaded.jar
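As a quick sanity check (this snippet is my own diagnostic, not part of the original setup), one can verify from the notebook that the jars landed where spark.jars will look for them:

# Confirm the jar paths referenced in spark.jars actually exist.
import os

jars = [
    "/home/jovyan/.ivy2/jars/gcs-connector-3.0.4-shaded.jar",
    "/home/jovyan/.ivy2/jars/iceberg-spark-runtime-3.5_2.12-1.8.0.jar",
]
for jar in jars:
    # A missing or zero-byte file would cause a ClassNotFoundException rather
    # than a timeout, so this mainly rules out a bad download.
    print(jar, os.path.exists(jar), os.path.getsize(jar) if os.path.exists(jar) else 0)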
Then in the user notebook I tried to access a publicly available dataset with the following configuration.
import pyspark
from pyspark.sql import SparkSession

spark_conf = (
    pyspark.SparkConf()
    .set("spark.jars", "file:///home/jovyan/.ivy2/jars/gcs-connector-3.0.4-shaded.jar,file:///home/jovyan/.ivy2/jars/iceberg-spark-runtime-3.5_2.12-1.8.0.jar")
    # Intended to disable service-account auth and allow anonymous access,
    # since the dataset is public.
    .set("fs.gs.auth.service.account.enable", "false")
    .set("spark.hadoop.fs.gs.auth.null.enable", "true")
    .set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .set("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
)

spark = (
    SparkSession.builder
    .appName("test_gcs")
    .config(conf=spark_conf)
    .getOrCreate()
)

df = spark.read.csv("gs://gcp-public-data-landsat/index.csv.gz")
df.show()
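As a side check (this snippet is mine, not part of the original notebook), the effective Hadoop configuration can be read back from the live session; note that fs.gs.auth.service.account.enable above is set without the spark.hadoop. prefix, so it may never reach the Hadoop configuration at all:

# Inspect the Hadoop configuration the GCS connector actually sees.
# Keys set on SparkConf without the "spark.hadoop." prefix are not copied in.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in [
    "fs.gs.impl",
    "fs.gs.auth.service.account.enable",  # likely None: prefix was missing
    "fs.gs.auth.null.enable",
]:
    print(key, "=", hadoop_conf.get(key))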
However, I get a connect timeout error:
Py4JJavaError: An error occurred while calling o36.csv.
: java.io.IOException: Error accessing gs://gcp-public-data-landsat/index.csv.gz
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2181)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:2070)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystemImpl.getFileInfoInternal(GoogleCloudStorageFileSystemImpl.java:1030)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystemImpl.getFileInfoInternal(GoogleCloudStorageFileSystemImpl.java:1001)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystemImpl.getFileInfo(GoogleCloudStorageFileSystemImpl.java:969)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.lambda$getFileStatus$15(GoogleHadoopFileSystem.java:902)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)
at com.google.cloud.hadoop.fs.gcs.GhfsGlobalStorageStatistics.trackDuration(GhfsGlobalStorageStatistics.java:117)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.trackDurationWithTracing(GoogleHadoopFileSystem.java:763)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getFileStatus(GoogleHadoopFileSystem.java:890)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.exists(GoogleHadoopFileSystem.java:1048)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:756)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:754)
at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:380)
at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
at scala.util.Success.$anonfun$map$1(Try.scala:255)
at scala.util.Success.map(Try.scala:213)
at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1395)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
Caused by: java.net.SocketTimeoutException: Connect timed out
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:551)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:602)
at java.base/java.net.Socket.connect(Socket.java:633)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:178)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:533)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:638)
at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:281)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:386)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:408)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1309)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1057)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:151)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.ComputeEngineCredentials.getMetadataResponse(ComputeEngineCredentials.java:364)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.ComputeEngineCredentials.refreshAccessToken(ComputeEngineCredentials.java:271)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.OAuth2Credentials$1.call(OAuth2Credentials.java:270)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.OAuth2Credentials$1.call(OAuth2Credentials.java:267)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.OAuth2Credentials$RefreshTask.run(OAuth2Credentials.java:635)
at com.google.cloud.hadoop.repackaged.gcs.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.OAuth2Credentials$AsyncRefreshResult.executeIfNew(OAuth2Credentials.java:582)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.OAuth2Credentials.asyncFetch(OAuth2Credentials.java:233)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:183)
at com.google.cloud.hadoop.repackaged.gcs.google.auth.http.HttpCredentialsAdapter.initialize(HttpCredentialsAdapter.java:96)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.util.RetryHttpInitializer.initialize(RetryHttpInitializer.java:80)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.util.ChainingHttpRequestInitializer.initialize(ChainingHttpRequestInitializer.java:52)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.http.HttpRequestFactory.buildRequest(HttpRequestFactory.java:91)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.googleapis.services.AbstractGoogleClientRequest.buildHttpRequest(AbstractGoogleClientRequest.java:415)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:525)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:466)
at com.google.cloud.hadoop.repackaged.gcs.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:576)
at com.google.cloud.hadoop.repackaged.gcs.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2174)
... 28 more
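From the Caused by section it looks like the connector is trying to refresh Compute Engine credentials against the GCE metadata server (ComputeEngineCredentials.getMetadataResponse), and that is the request timing out, even though I intended anonymous access. To narrow down whether the pod has any egress at all or only the metadata server is unreachable, I ran a quick probe from the same notebook (this snippet is my own diagnostic; the hostnames are the standard GCP endpoints):

# Probe both endpoints involved: the public GCS API (general egress) and the
# GCE metadata server (what the stack trace is actually timing out on).
import socket

for host, port in [
    ("storage.googleapis.com", 443),    # public GCS endpoint
    ("metadata.google.internal", 80),   # GCE metadata server
]:
    try:
        socket.create_connection((host, port), timeout=5).close()
        print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")

If the second probe times out as well, that would match the stack trace. Is there a way to make the connector skip the metadata-server credential lookup entirely for a public bucket, or does this point to a networking issue on the cluster?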