最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

amazon web services - AWS Athena is not processing any data from glue table if partition projection is enabled - Stack Overflow

programmeradmin3浏览0评论

I have a glue table that is fed by partitioned data in s3. The issue at hand is in Athena that if the partition projection is turned off, and I run MSCK REPAIR TABLE <my table>; and SELECT * FROM <my table>; I get everything as expected.

The same query returns only the column names when I enable partition projection and Athena reports to have scanned no data at all (data scanned: -).

My glue table is defined via CloudFormation as follows and is deployed using serverless:

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: ${self:custom.glueDB}
      TableInput:
        Name: reporting-${self:provider.stage}
        Description: Table for reporting data
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: "parquet"
          projection.enabled: true
          projection.year.type: "integer"
          projection.year.range: 2025, 2030
          projection.month.type: "integer"
          projection.month.range: 1, 12
          projection.day.type: "integer"
          projection.day.range: 1, 31
        StorageDescriptor:
          Location: "s3://my-bucket/reporting/"
          InputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: .apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
          Columns:
            - Name: someColumn
              Type: string
        PartitionKeys:
          - Name: year
            Type: int
          - Name: month
            Type: int
          - Name: day
            Type: int

I have already checked the data in my s3 bucket, and the files are both valid and stored as expected e.g. my-bucket/reporting/year=2025/month=04/day=03/data.parquet but Athena is happy to just not look at them if partition projection is enabled. I'm at a loss.

I have a glue table that is fed by partitioned data in s3. The issue at hand is in Athena that if the partition projection is turned off, and I run MSCK REPAIR TABLE <my table>; and SELECT * FROM <my table>; I get everything as expected.

The same query returns only the column names when I enable partition projection and Athena reports to have scanned no data at all (data scanned: -).

My glue table is defined via CloudFormation as follows and is deployed using serverless:

  MyGlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: ${self:custom.glueDB}
      TableInput:
        Name: reporting-${self:provider.stage}
        Description: Table for reporting data
        TableType: EXTERNAL_TABLE
        Parameters:
          classification: "parquet"
          projection.enabled: true
          projection.year.type: "integer"
          projection.year.range: 2025, 2030
          projection.month.type: "integer"
          projection.month.range: 1, 12
          projection.day.type: "integer"
          projection.day.range: 1, 31
        StorageDescriptor:
          Location: "s3://my-bucket/reporting/"
          InputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: .apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
          Columns:
            - Name: someColumn
              Type: string
        PartitionKeys:
          - Name: year
            Type: int
          - Name: month
            Type: int
          - Name: day
            Type: int

I have already checked the data in my s3 bucket, and the files are both valid and stored as expected e.g. my-bucket/reporting/year=2025/month=04/day=03/data.parquet but Athena is happy to just not look at them if partition projection is enabled. I'm at a loss.

Share Improve this question asked yesterday RaisinRaisin 214 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 0

Answering my own question in case someone stumbles on the same issue:

The problem ended up being a configuration error on my side. The partitions in S3 had zeropadding i.e. April = 04, the projected partitions as can be seen by the glue table parameters do not. I didn't think this would be an issue using the type "integer" but apparently it is.

So the projected partitions would be i.e. month 4, day 3; but in s3 it would be month=04 and day=03 which is the reason Athena couldn't access the data.

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论