TL;DR: If a topic has an unbounded (effectively infinite) number of unique keys (although all messages with the same key end up in the same or adjacent segments), is topic compaction even possible/feasible?
Scenario:
Assume we have a Kafka topic for weather forecasts which periodically (every 15 minutes) receives updated forecasts for the next 24 hours at a 15-minute resolution. The forecasts are computed for a significant but reasonable number of coordinates (10-100k). Under these conditions, each (coordinate, timestamp) pair receives 96 forecast versions over time, one from every 15-minute forecast run whose 24-hour horizon covers that timestamp.
A forecast consists of:
- Key: (coordinate, timestamp)
- Value: (... weather parameters ...)
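For context, a minimal sketch of how one such record would be produced, assuming a plain string key and a JSON string value (the topic name, serializers, and field names are illustrative, not our actual code):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ForecastProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key: (coordinate, timestamp) encoded deterministically, because compaction
            // matches keys byte-for-byte, not semantically.
            String key = "52.5200,13.4050|2024-06-01T12:00:00Z";
            // Value: the weather parameters, here a placeholder JSON document.
            String value = "{\"temperatureC\":21.3,\"windMs\":4.2,\"precipMm\":0.0}";
            producer.send(new ProducerRecord<>("weather-forecasts", key, value));
        }
    }
}
```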
We would like to keep our forecasts forever since they can be valuable for model training and other things in the future. However, we do not need all versions of the forecasts - only the latest forecast for each (coordinate, timestamp)
tuple is of interest and worth archiving since those (presumably) have the highest quality.
So the logical choice would be to use topic compaction, since keeping only 1 of 96 versions could in theory save up to ~98% of the space (yes, I know, that is a very optimistic upper bound). But that raises the question: does topic compaction even work when the topic has an effectively unbounded key cardinality?
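For context, this is roughly the topic setup we have in mind, sketched with the AdminClient API; the topic name, partition count, and the concrete config values are untuned guesses rather than a worked-out configuration:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedForecastTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("weather-forecasts", 12, (short) 3)
                    .configs(Map.of(
                            // Compact instead of delete: keep the latest value per key.
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            // Roll segments daily so closed segments become eligible for cleaning.
                            TopicConfig.SEGMENT_MS_CONFIG, String.valueOf(24L * 60 * 60 * 1000),
                            // Start cleaning once a modest fraction of the log is "dirty".
                            TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1",
                            // Make records eligible for compaction after at most this long.
                            TopicConfig.MAX_COMPACTION_LAG_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000)
                    ));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```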
According to our back-of-the-envelope estimate, we would create somewhere between 500 million and 2 billion unique keys per year. Whatever mechanism log compaction uses to deduplicate keys (e.g. a hash map), it obviously cannot hold a key set that grows without bound. So our only glimmer of hope is that all records with the same key occur close to each other in the stream (within 24 hours, so in the same or adjacent segments). Does that make a difference? Is log compaction able to work in such a scenario? And if so, which parameters are worth looking at to make this work?
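For reference, the kind of arithmetic behind that estimate (the coordinate counts below are illustrative values consistent with the quoted range, not our exact assumptions):

```java
// Back-of-the-envelope estimate of unique (coordinate, timestamp) keys created per year.
public class KeyCardinalityEstimate {
    public static void main(String[] args) {
        long timestampsPerDay = 24 * 4;                    // 15-minute resolution -> 96 timestamps/day
        long timestampsPerYear = timestampsPerDay * 365L;  // ~35,040 timestamps/year

        // Roughly 15k-60k active coordinates lands in the quoted 500 million - 2 billion
        // keys/year; the full 10k-100k band would give ~350 million to ~3.5 billion.
        for (long coordinates : new long[] {15_000, 60_000}) {
            System.out.println(coordinates + " coordinates -> "
                    + coordinates * timestampsPerYear + " unique keys/year");
        }
    }
}
```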