I have a glue table that is fed by partitioned data in s3. The issue at hand is in Athena that if the partition projection is turned off, and I run MSCK REPAIR TABLE <my table>;
and SELECT * FROM <my table>;
I get everything as expected.
The same query returns only the column names when I enable partition projection and Athena reports to have scanned no data at all (data scanned: -).
My glue table is defined via CloudFormation as follows and is deployed using serverless:
MyGlueTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: ${self:custom.glueDB}
TableInput:
Name: reporting-${self:provider.stage}
Description: Table for reporting data
TableType: EXTERNAL_TABLE
Parameters:
classification: "parquet"
projection.enabled: true
projection.year.type: "integer"
projection.year.range: 2025, 2030
projection.month.type: "integer"
projection.month.range: 1, 12
projection.day.type: "integer"
projection.day.range: 1, 31
StorageDescriptor:
Location: "s3://my-bucket/reporting/"
InputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerdeInfo:
SerializationLibrary: .apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Columns:
- Name: someColumn
Type: string
PartitionKeys:
- Name: year
Type: int
- Name: month
Type: int
- Name: day
Type: int
I have already checked the data in my s3 bucket, and the files are both valid and stored as expected e.g. my-bucket/reporting/year=2025/month=04/day=03/data.parquet
but Athena is happy to just not look at them if partition projection is enabled. I'm at a loss.
I have a glue table that is fed by partitioned data in s3. The issue at hand is in Athena that if the partition projection is turned off, and I run MSCK REPAIR TABLE <my table>;
and SELECT * FROM <my table>;
I get everything as expected.
The same query returns only the column names when I enable partition projection and Athena reports to have scanned no data at all (data scanned: -).
My glue table is defined via CloudFormation as follows and is deployed using serverless:
MyGlueTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: ${self:custom.glueDB}
TableInput:
Name: reporting-${self:provider.stage}
Description: Table for reporting data
TableType: EXTERNAL_TABLE
Parameters:
classification: "parquet"
projection.enabled: true
projection.year.type: "integer"
projection.year.range: 2025, 2030
projection.month.type: "integer"
projection.month.range: 1, 12
projection.day.type: "integer"
projection.day.range: 1, 31
StorageDescriptor:
Location: "s3://my-bucket/reporting/"
InputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: .apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerdeInfo:
SerializationLibrary: .apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Columns:
- Name: someColumn
Type: string
PartitionKeys:
- Name: year
Type: int
- Name: month
Type: int
- Name: day
Type: int
I have already checked the data in my s3 bucket, and the files are both valid and stored as expected e.g. my-bucket/reporting/year=2025/month=04/day=03/data.parquet
but Athena is happy to just not look at them if partition projection is enabled. I'm at a loss.
1 Answer
Reset to default 0Answering my own question in case someone stumbles on the same issue:
The problem ended up being a configuration error on my side. The partitions in S3 had zeropadding i.e. April = 04, the projected partitions as can be seen by the glue table parameters do not. I didn't think this would be an issue using the type "integer" but apparently it is.
So the projected partitions would be i.e. month 4, day 3; but in s3 it would be month=04 and day=03 which is the reason Athena couldn't access the data.