
When using Firehose + Iceberg table, there is a problem that data is updated later in Firehose


I’m running into an issue while testing data ingestion into an Iceberg table in the us-east-1 region, so I’d like to ask for some help.

Here are the services I’m currently using:

  • Lake Formation + Iceberg Table: The table consists of columns datetime, exchange, symbol, open, high, low, close, and volume, and it is partitioned by day.
  • Data Firehose: Direct Input + Output to Iceberg Table.
  • Athena: Used for data verification.

The data consists of coin price information from various exchanges. I’m receiving price data for multiple exchanges and coins at once and using Direct Input to put records into Firehose. During the initial data collection phase, historical data is not being ingested sequentially by datetime; instead, it comes in variably by exchange and symbol.
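A minimal sketch of the Direct PUT side, assuming the records are JSON rows and the stream name is hypothetical. `PutRecordBatch` accepts at most 500 records per call, so the helper below chunks the input; note that individual records can fail even when the API call succeeds, so `FailedPutCount` should be checked:

```python
import json
from typing import Iterator

def batches(records: list[dict], size: int = 500) -> Iterator[list[dict]]:
    """Yield chunks of at most `size` records (PutRecordBatch accepts up to 500)."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def send_to_firehose(stream_name: str, records: list[dict]) -> list[dict]:
    """Push price rows into a Firehose stream via Direct PUT; returns records
    that failed and should be retried. Stream name is an assumption."""
    import boto3  # imported lazily so the pure batching helper works offline
    firehose = boto3.client("firehose")
    failed: list[dict] = []
    for chunk in batches(records):
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": (json.dumps(r) + "\n").encode()} for r in chunk],
        )
        if resp["FailedPutCount"]:
            # Keep only the records whose per-record status carries an ErrorCode.
            failed += [r for r, s in zip(chunk, resp["RequestResponses"]) if "ErrorCode" in s]
    return failed
```

Out-of-order arrival by exchange and symbol is fine for Iceberg itself; the table tracks files in metadata, so ingestion order does not have to match the `datetime` partition order.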

Firehose settings

Initially, when data was being ingested, I could query it in Athena with a delay of about 1–2 minutes. However, as the number of exchanges increased significantly and the amount of data being ingested at once grew, the updates started taking much longer.

Additionally, I’ve noticed that queries in Athena have not been working for over a day, so I’m trying to troubleshoot the issue.

Observations and issues

Here are some strange things I’ve noticed:

  • Even after pausing data ingestion via Direct Input for an hour or two to investigate, the row count of the table continues to increase when I check.

  • For a specific exchange, Athena can query data only up to 2014-01-10, even though the ingesting program confirms it has inserted data up to 2015-12-30; everything after 2014-01-10 is invisible to Athena.

  • To troubleshoot, I stopped ingesting data, then only ingested data for specific exchange symbols via Firehose, including data after 2015-12-30. However, in Athena, I can only see data up to 2014-01-13. This is the most confusing part.

It seems that when data is ingested via Firehose -> {buffer} -> Iceberg table, the data in the buffer is updated much later, or it might not be written until a certain amount of data accumulates. Is there a way to reduce this delay? Isn’t the buffer interval setting in Firehose meant to control this?
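Yes, the buffer is the knob for this: for the Iceberg destination, Firehose accumulates records until either the buffer size or the buffer interval is hit, and only then commits to the table, so a large backlog plus a long interval can make data appear late. A hedged sketch of inspecting and tightening those hints via boto3 follows; the `IcebergDestinationUpdate` parameter name and the 0–900 s / 1–128 MB ranges are assumptions to verify against the current boto3 docs:

```python
def buffering_hints(interval_seconds: int, size_mb: int) -> dict:
    """Build a BufferingHints block; Firehose flushes when either threshold is hit.
    The accepted ranges below are assumptions taken from commonly documented limits."""
    if not 0 <= interval_seconds <= 900:
        raise ValueError("IntervalInSeconds must be 0-900")
    if not 1 <= size_mb <= 128:
        raise ValueError("SizeInMBs must be 1-128")
    return {"IntervalInSeconds": interval_seconds, "SizeInMBs": size_mb}

def lower_iceberg_buffering(stream_name: str) -> None:
    """Tighten buffering so commits to the Iceberg table happen sooner (sketch)."""
    import boto3
    fh = boto3.client("firehose")
    desc = fh.describe_delivery_stream(DeliveryStreamName=stream_name)["DeliveryStreamDescription"]
    fh.update_destination(
        DeliveryStreamName=stream_name,
        CurrentDeliveryStreamVersionId=desc["VersionId"],
        DestinationId=desc["Destinations"][0]["DestinationId"],
        # Parameter name for the Iceberg destination is an assumption; check the boto3 docs.
        IcebergDestinationUpdate={"BufferingHints": buffering_hints(60, 1)},
    )
```

One caveat: each flush is an Iceberg commit that produces small files and new snapshots, so very aggressive settings trade freshness for commit overhead and compaction work later.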

Additional question

When data is added, it is stored under the data directory in a hash folder, followed by a datetime partition folder. I wanted to avoid the hash folder, so when creating the table in EMR Spark I set the option 'write.distribution-mode'='range'. However, it seems this might not be the correct option to achieve that...?
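Correct: `write.distribution-mode` only controls how rows are shuffled across write tasks (none/hash/range), not file paths. The hash folder under `data/` is most likely Iceberg's object-storage layout, which prepends a hash to spread S3 prefixes and is governed by the table property `write.object-storage.enabled`. A sketch of a Spark DDL with that property disabled follows; whether Firehose honors it for the files it writes is an assumption worth testing:

```python
def create_table_ddl(table: str, object_storage: bool = False) -> str:
    """Build a Spark CREATE TABLE for the price table. Assumption: the
    write.object-storage.enabled property, not write.distribution-mode,
    controls the hash prefix in data file paths."""
    props = {
        "format-version": "2",
        "write.object-storage.enabled": str(object_storage).lower(),
    }
    tblprops = ", ".join(f"'{k}'='{v}'" for k, v in props.items())
    return (
        f"CREATE TABLE {table} (\n"
        "  datetime timestamp, exchange string, symbol string,\n"
        "  open double, high double, low double, close double, volume double\n"
        ") USING iceberg\n"
        "PARTITIONED BY (days(datetime))\n"
        f"TBLPROPERTIES ({tblprops})"
    )
```

Either way, the physical layout should not affect queryability: Athena and Spark resolve data files through Iceberg metadata, not by listing S3 prefixes.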

Also, possibly because of the hash folder structure in the S3 directory, when I try to rewrite data in Spark and specify the datetime, it fails to find the data.
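Since Iceberg engines locate files through table metadata rather than by walking S3 paths, a Spark rewrite should filter on the table's own columns (a `datetime` predicate) rather than on directory names. A hedged sketch that builds the standard `rewrite_data_files` procedure call follows; the catalog name is hypothetical:

```python
def rewrite_call(catalog: str, table: str, predicate: str) -> str:
    """Build an Iceberg rewrite_data_files CALL statement. Filter on columns,
    not S3 paths, since Iceberg resolves data files through metadata. The
    where clause is double-quoted so the predicate may use single quotes."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', where => \"{predicate}\")"
    )
```

In Spark this would be run as `spark.sql(rewrite_call(...))` with a predicate such as a `datetime` range; if even a metadata-based rewrite finds nothing past 2014-01, that points at the Firehose commits themselves (or the snapshot Athena reads) rather than at the folder layout.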
