I want my local AWS_PROFILE to write and a different account to read. I therefore use the per-bucket "fs.s3a.bucket.<bucket>" configuration so that locally I use the DefaultAWSCredentialsProviderChain, and an access/secret key pair for the other account.
This works:
spark_session = (SparkSession
    .builder
    .appName("my_app")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.access.key", accessKeyId)
    .config("spark.hadoop.fs.s3a.secret.key", secretAccessKey)
    .config("spark.hadoop.fs.s3a.session.token", sessionToken)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate())
reader = spark_session.read
main_df = reader.option("basePath", "s3://MyBigBucket/").load("s3://MyBigBucket/year=2025/month=3/day=3/{hour=1,hour=2}")
main_df.write.parquet("s3://MyLocalBucket/something/")
But I would rather pass a list of strings to load, like so:
main_df = reader.option("basePath", "s3://MyBigBucket/").load(["s3://MyBigBucket/year=2025/month=3/day=3/hour=1", "s3://MyBigBucket/year=2025/month=3/day=3/hour=2"])
But this fails, for some reason:
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.nio.file.AccessDeniedException: s3://MyBigBucket/year=2025/month=3/day=3/hour=1: org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider TemporaryAWSCredentialsProvider has no credentials
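One thing I might try (an untested hunch): the access/secret/session keys above are set globally ("fs.s3a.access.key", etc.) while the TemporaryAWSCredentialsProvider is scoped to the bucket, and Hadoop also accepts bucket-scoped keys via the same per-bucket override mechanism. A sketch, assuming the same accessKeyId/secretAccessKey/sessionToken variables as above:

```python
# Untested sketch: scope the temporary keys to the bucket as well, using
# Hadoop's per-bucket "fs.s3a.bucket.<name>.*" overrides, so that the
# bucket-scoped TemporaryAWSCredentialsProvider can find them.
spark_session = (SparkSession
    .builder
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.access.key", accessKeyId)
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.secret.key", secretAccessKey)
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.session.token", sessionToken)
    .getOrCreate())
```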
I don't understand why a glob works but a list does not. A list would be easier, especially for paths that cross a day boundary:
["s3://MyBigBucket/year=2025/month=3/day=3/hour=23", "s3://MyBigBucket/year=2025/month=3/day=4/hour=0"]
I actually do not know how to express that in the "/{sub1, sub2}/" glob style.
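For reference, the "{a,b}" glob is just a string, so it can at least be built from a list of subdirectories. A sketch (`brace_glob` is a name I made up; I believe Hadoop globs take no space after the comma):

```python
def brace_glob(base: str, parts: list[str]) -> str:
    """Collapse sibling partition directories into one "{a,b}" glob path."""
    return f"{base}/{{{','.join(parts)}}}"

path = brace_glob("s3a://MyBigBucket/year=2025/month=3/day=3", ["hour=1", "hour=2"])
# path == "s3a://MyBigBucket/year=2025/month=3/day=3/{hour=1,hour=2}"
```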
So I'd love to know whether it is possible to pass a list to load together with the fs.s3a.bucket config, or whether there is a way to build a glob string that spans a day (and a month, and a year!).
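Failing a single glob, the list itself can at least be generated instead of hand-written; a sketch that walks hour by hour across day, month, and year boundaries (`hourly_partition_paths` is a hypothetical helper):

```python
from datetime import datetime, timedelta

def hourly_partition_paths(bucket: str, start: datetime, end: datetime) -> list[str]:
    """One Hive-style partition path per hour from start to end, inclusive."""
    paths = []
    current = start
    while current <= end:
        paths.append(
            f"s3a://{bucket}/year={current.year}/month={current.month}/"
            f"day={current.day}/hour={current.hour}"
        )
        current += timedelta(hours=1)
    return paths

# Crossing a day boundary:
paths = hourly_partition_paths("MyBigBucket",
                               datetime(2025, 3, 3, 23), datetime(2025, 3, 4, 0))
# paths == ["s3a://MyBigBucket/year=2025/month=3/day=3/hour=23",
#           "s3a://MyBigBucket/year=2025/month=3/day=4/hour=0"]
```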
Edit: I eventually got load working in a loop. Not efficient, but at least my list of paths works.
log.info(f"The input paths are '{input_paths}'.")
reader = reader.option("basePath", "s3a://MyBigBucket/")
main_df = reader.parquet(input_paths[0])
for path in input_paths[1:]:
    log.info(f"Loading data from '{path}'.")
    df = reader.parquet(path)
    main_df = main_df.unionByName(df)
    df.unpersist(True)
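A tidier variant of the same workaround, untested on my side: DataFrameReader.parquet accepts several paths in one call, and the manual loop can be replaced by a functools.reduce fold (the single-call form may of course hit the same credentials issue as load with a list):

```python
from functools import reduce
from pyspark.sql import DataFrame

# Either read all paths in one call (DataFrameReader.parquet takes varargs)...
main_df = reader.parquet(*input_paths)

# ...or keep one read per path and fold the unions without a manual loop:
main_df = reduce(DataFrame.unionByName, (reader.parquet(p) for p in input_paths))
```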