amazon web services - AWS Athena is not processing any data from glue table if partition projection is enabled

I have a glue table that is fed by partitioned data in s3. The issue at hand is in Athena that if the partition projection is turned off, and I run MSCK REPAIR TABLE <my table>; and SELECT * FROM <my table>; I get everything as expected.

The same query returns only the column names when I enable partition projection and Athena reports to have scanned no data at all (data scanned: -).

My glue table is defined via CloudFormation as follows and is deployed using serverless:

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: ${self:custom.glueDB}
      TableInput:
        Name: reporting-${self:provider.stage}
        Description: Table for reporting data
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: "parquet"
          projection.enabled: true
          projection.year.type: "integer"
          projection.year.range: 2025, 2030
          projection.month.type: "integer"
          projection.month.range: 1, 12
          projection.day.type: "integer"
          projection.day.range: 1, 31
        StorageDescriptor:
          Location: "s3://my-bucket/reporting/"
          InputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: .apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
          Columns:
            - Name: someColumn
              Type: string
        PartitionKeys:
          - Name: year
            Type: int
          - Name: month
            Type: int
          - Name: day
            Type: int

I have already checked the data in my s3 bucket, and the files are both valid and stored as expected e.g. my-bucket/reporting/year=2025/month=04/day=03/data.parquet but Athena is happy to just not look at them if partition projection is enabled. I'm at a loss.

The same query returns only the column names when I enable partition projection and Athena reports to have scanned no data at all (data scanned: -).

My glue table is defined via CloudFormation as follows and is deployed using serverless:

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: ${self:custom.glueDB}
      TableInput:
        Name: reporting-${self:provider.stage}
        Description: Table for reporting data
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: "parquet"
          projection.enabled: true
          projection.year.type: "integer"
          projection.year.range: 2025, 2030
          projection.month.type: "integer"
          projection.month.range: 1, 12
          projection.day.type: "integer"
          projection.day.range: 1, 31
        StorageDescriptor:
          Location: "s3://my-bucket/reporting/"
          InputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: .apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
          Columns:
            - Name: someColumn
              Type: string
        PartitionKeys:
          - Name: year
            Type: int
          - Name: month
            Type: int
          - Name: day
            Type: int

Share Improve this question asked yesterday Raisin 214 bronze badges

Add a comment |

1 Answer 1

Sorted by: Reset to default 0

Answering my own question in case someone stumbles on the same issue:

The problem ended up being a configuration error on my side. The partitions in S3 had zeropadding i.e. April = 04, the projected partitions as can be seen by the glue table parameters do not. I didn't think this would be an issue using the type "integer" but apparently it is.

So the projected partitions would be i.e. month 4, day 3; but in s3 it would be month=04 and day=03 which is the reason Athena couldn't access the data.

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

amazon web services - AWS Athena is not processing any data from glue table if partition projection is enabled - Stack Overflow

1 Answer 1

与本文相关的文章

评论列表(0)