
When using Firehose + Iceberg table, there is a problem that data is updated later in Firehose


I’m running into an issue while testing data ingestion into an Iceberg table in the us-east-1 region, so I’d like to ask for some help.

Here are the services I’m currently using:

  • Lake Formation + Iceberg Table: The table consists of columns datetime, exchange, symbol, open, high, low, close, and volume, and it is partitioned by day.
  • Data Firehose: Direct Input + Output to Iceberg Table.
  • Athena: Used for data verification.

The data consists of coin price information from various exchanges. I’m receiving price data for multiple exchanges and coins at once and using Direct Input to put records into Firehose. During the initial data collection phase, historical data is not being ingested sequentially by datetime; instead, it comes in variably by exchange and symbol.
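A minimal sketch of the Direct PUT side, assuming the records are JSON rows and the stream name is hypothetical. `PutRecordBatch` accepts at most 500 records per call, so the helper below chunks the input; note that individual records can fail even when the API call succeeds, so `FailedPutCount` should be checked:

```python
import json
from typing import Iterator

def batches(records: list[dict], size: int = 500) -> Iterator[list[dict]]:
    """Yield chunks of at most `size` records (PutRecordBatch accepts up to 500)."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def send_to_firehose(stream_name: str, records: list[dict]) -> list[dict]:
    """Push price rows into a Firehose stream via Direct PUT; returns records
    that failed and should be retried. Stream name is an assumption."""
    import boto3  # imported lazily so the pure batching helper works offline
    firehose = boto3.client("firehose")
    failed: list[dict] = []
    for chunk in batches(records):
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": (json.dumps(r) + "\n").encode()} for r in chunk],
        )
        if resp["FailedPutCount"]:
            # Keep only the records whose per-record status carries an ErrorCode.
            failed += [r for r, s in zip(chunk, resp["RequestResponses"]) if "ErrorCode" in s]
    return failed
```

Out-of-order arrival by exchange and symbol is fine for Iceberg itself; the table tracks files in metadata, so ingestion order does not have to match the `datetime` partition order.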

Firehose settings

Initially, when data was being ingested, I could query it in Athena with a delay of about 1–2 minutes. However, as the number of exchanges increased significantly and the amount of data being ingested at once grew, the updates started taking much longer.

Additionally, I’ve noticed that queries in Athena have not been working for over a day, so I’m trying to troubleshoot the issue.

Observations and issues

Here are some strange things I’ve noticed:

  • Even after pausing data ingestion via Direct Input for an hour or two to investigate, the row count of the table continues to increase when I check.

  • For a specific exchange, Athena can query data only up to 2014-01-10, even though the ingesting program confirms it has inserted data up to 2015-12-30; everything after 2014-01-10 is invisible to Athena.

  • To troubleshoot, I stopped ingesting data, then only ingested data for specific exchange symbols via Firehose, including data after 2015-12-30. However, in Athena, I can only see data up to 2014-01-13. This is the most confusing part.

It seems that when data is ingested via Firehose -> {buffer} -> Iceberg table, the data in the buffer is updated much later, or it might not be written until a certain amount of data accumulates. Is there a way to reduce this delay? Isn’t the buffer interval setting in Firehose meant to control this?
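Yes, the buffer is the knob for this: for the Iceberg destination, Firehose accumulates records until either the buffer size or the buffer interval is hit, and only then commits to the table, so a large backlog plus a long interval can make data appear late. A hedged sketch of inspecting and tightening those hints via boto3 follows; the `IcebergDestinationUpdate` parameter name and the 0–900 s / 1–128 MB ranges are assumptions to verify against the current boto3 docs:

```python
def buffering_hints(interval_seconds: int, size_mb: int) -> dict:
    """Build a BufferingHints block; Firehose flushes when either threshold is hit.
    The accepted ranges below are assumptions taken from commonly documented limits."""
    if not 0 <= interval_seconds <= 900:
        raise ValueError("IntervalInSeconds must be 0-900")
    if not 1 <= size_mb <= 128:
        raise ValueError("SizeInMBs must be 1-128")
    return {"IntervalInSeconds": interval_seconds, "SizeInMBs": size_mb}

def lower_iceberg_buffering(stream_name: str) -> None:
    """Tighten buffering so commits to the Iceberg table happen sooner (sketch)."""
    import boto3
    fh = boto3.client("firehose")
    desc = fh.describe_delivery_stream(DeliveryStreamName=stream_name)["DeliveryStreamDescription"]
    fh.update_destination(
        DeliveryStreamName=stream_name,
        CurrentDeliveryStreamVersionId=desc["VersionId"],
        DestinationId=desc["Destinations"][0]["DestinationId"],
        # Parameter name for the Iceberg destination is an assumption; check the boto3 docs.
        IcebergDestinationUpdate={"BufferingHints": buffering_hints(60, 1)},
    )
```

One caveat: each flush is an Iceberg commit that produces small files and new snapshots, so very aggressive settings trade freshness for commit overhead and compaction work later.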

Additional question

When data is added, it is stored under the data directory in a hash folder, followed by a datetime partition folder. I wanted to avoid the hash folder, so when creating the table in EMR Spark I set the option 'write.distribution-mode'='range'. However, it seems this might not be the correct option to achieve that...?
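Correct: `write.distribution-mode` only controls how rows are shuffled across write tasks (none/hash/range), not file paths. The hash folder under `data/` is most likely Iceberg's object-storage layout, which prepends a hash to spread S3 prefixes and is governed by the table property `write.object-storage.enabled`. A sketch of a Spark DDL with that property disabled follows; whether Firehose honors it for the files it writes is an assumption worth testing:

```python
def create_table_ddl(table: str, object_storage: bool = False) -> str:
    """Build a Spark CREATE TABLE for the price table. Assumption: the
    write.object-storage.enabled property, not write.distribution-mode,
    controls the hash prefix in data file paths."""
    props = {
        "format-version": "2",
        "write.object-storage.enabled": str(object_storage).lower(),
    }
    tblprops = ", ".join(f"'{k}'='{v}'" for k, v in props.items())
    return (
        f"CREATE TABLE {table} (\n"
        "  datetime timestamp, exchange string, symbol string,\n"
        "  open double, high double, low double, close double, volume double\n"
        ") USING iceberg\n"
        "PARTITIONED BY (days(datetime))\n"
        f"TBLPROPERTIES ({tblprops})"
    )
```

Either way, the physical layout should not affect queryability: Athena and Spark resolve data files through Iceberg metadata, not by listing S3 prefixes.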

Also, possibly because of the hash folder structure in the S3 directory, when I try to rewrite data in Spark and specify the datetime, it fails to find the data.
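Since Iceberg engines locate files through table metadata rather than by walking S3 paths, a Spark rewrite should filter on the table's own columns (a `datetime` predicate) rather than on directory names. A hedged sketch that builds the standard `rewrite_data_files` procedure call follows; the catalog name is hypothetical:

```python
def rewrite_call(catalog: str, table: str, predicate: str) -> str:
    """Build an Iceberg rewrite_data_files CALL statement. Filter on columns,
    not S3 paths, since Iceberg resolves data files through metadata. The
    where clause is double-quoted so the predicate may use single quotes."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', where => \"{predicate}\")"
    )
```

In Spark this would be run as `spark.sql(rewrite_call(...))` with a predicate such as a `datetime` range; if even a metadata-based rewrite finds nothing past 2014-01, that points at the Firehose commits themselves (or the snapshot Athena reads) rather than at the folder layout.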
