I want my local AWS_PROFILE to write and a different account to read. I therefore use the per-bucket "fs.s3a.bucket.<bucket>" configuration so that locally I use the DefaultAWSCredentialsProviderChain, and an access/secret key pair for the other account.
This works:
spark_session = (SparkSession
    .builder
    .appName("my_app")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.access.key", accessKeyId)
    .config("spark.hadoop.fs.s3a.secret.key", secretAccessKey)
    .config("spark.hadoop.fs.s3a.session.token", sessionToken)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate())
reader = spark_session.read
main_df = reader.option("basePath", "s3://MyBigBucket/").load("s3://MyBigBucket/year=2025/month=3/day=3/{hour=1,hour=2}")
main_df.write.parquet("s3://MyLocalBucket/something/")
But I would rather pass a list of strings to load, like so:
main_df = reader.option("basePath", "s3://MyBigBucket/").load(["s3://MyBigBucket/year=2025/month=3/day=3/hour=1", "s3://MyBigBucket/year=2025/month=3/day=3/hour=2"])
But this fails, for some reason:
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o44.load.
: java.nio.file.AccessDeniedException: s3://MyBigBucket/year=2025/month=3/day=3/hour=1: org.apache.hadoop.fs.s3a.CredentialInitializationException: Provider TemporaryAWSCredentialsProvider has no credentials
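One thing I might try (an untested hunch): the access/secret/session keys above are set globally ("fs.s3a.access.key", etc.) while the TemporaryAWSCredentialsProvider is scoped to the bucket, and Hadoop also accepts bucket-scoped keys via the same per-bucket override mechanism. A sketch, assuming the same accessKeyId/secretAccessKey/sessionToken variables as above:

```python
# Untested sketch: scope the temporary keys to the bucket as well, using
# Hadoop's per-bucket "fs.s3a.bucket.<name>.*" overrides, so that the
# bucket-scoped TemporaryAWSCredentialsProvider can find them.
spark_session = (SparkSession
    .builder
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.access.key", accessKeyId)
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.secret.key", secretAccessKey)
    .config("spark.hadoop.fs.s3a.bucket.MyBigBucket.session.token", sessionToken)
    .getOrCreate())
```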
I don't understand why a glob works but a list does not. A list would be easier, especially for paths that cross a day boundary:
["s3://MyBigBucket/year=2025/month=3/day=3/hour=23", "s3://MyBigBucket/year=2025/month=3/day=4/hour=0"]
I actually do not know how to express that in the "/{sub1, sub2}/" glob style.
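For reference, the "{a,b}" glob is just a string, so it can at least be built from a list of subdirectories. A sketch (`brace_glob` is a name I made up; I believe Hadoop globs take no space after the comma):

```python
def brace_glob(base: str, parts: list[str]) -> str:
    """Collapse sibling partition directories into one "{a,b}" glob path."""
    return f"{base}/{{{','.join(parts)}}}"

path = brace_glob("s3a://MyBigBucket/year=2025/month=3/day=3", ["hour=1", "hour=2"])
# path == "s3a://MyBigBucket/year=2025/month=3/day=3/{hour=1,hour=2}"
```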
So I'd love to know whether it is possible to pass a list to load together with the fs.s3a.bucket config, or whether there is a way to build a glob string that spans a day (and a month, and a year!).
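Failing a single glob, the list itself can at least be generated instead of hand-written; a sketch that walks hour by hour across day, month, and year boundaries (`hourly_partition_paths` is a hypothetical helper):

```python
from datetime import datetime, timedelta

def hourly_partition_paths(bucket: str, start: datetime, end: datetime) -> list[str]:
    """One Hive-style partition path per hour from start to end, inclusive."""
    paths = []
    current = start
    while current <= end:
        paths.append(
            f"s3a://{bucket}/year={current.year}/month={current.month}/"
            f"day={current.day}/hour={current.hour}"
        )
        current += timedelta(hours=1)
    return paths

# Crossing a day boundary:
paths = hourly_partition_paths("MyBigBucket",
                               datetime(2025, 3, 3, 23), datetime(2025, 3, 4, 0))
# paths == ["s3a://MyBigBucket/year=2025/month=3/day=3/hour=23",
#           "s3a://MyBigBucket/year=2025/month=3/day=4/hour=0"]
```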
Edit: I eventually got load working in a loop. Not efficient, but at least my list of paths works.
log.info(f"The input paths are '{input_paths}'.")
reader = reader.option("basePath", "s3a://MyBigBucket/")
main_df = reader.parquet(input_paths[0])
for path in input_paths[1:]:
    log.info(f"Loading data from '{path}'.")
    df = reader.parquet(path)
    main_df = main_df.unionByName(df)
    df.unpersist(True)
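A tidier variant of the same workaround, untested on my side: DataFrameReader.parquet accepts several paths in one call, and the manual loop can be replaced by a functools.reduce fold (the single-call form may of course hit the same credentials issue as load with a list):

```python
from functools import reduce
from pyspark.sql import DataFrame

# Either read all paths in one call (DataFrameReader.parquet takes varargs)...
main_df = reader.parquet(*input_paths)

# ...or keep one read per path and fold the unions without a manual loop:
main_df = reduce(DataFrame.unionByName, (reader.parquet(p) for p in input_paths))
```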