We are using Flink's FileSink to write publisher events to S3, with multiple Kafka topics as the source. The job is experiencing back pressure, which is causing huge lag on the Kafka topics. How should I resolve this back pressure and make the sink fast enough to catch up with the source throughput?

We tried increasing parallelism by adding more KPUs, but that didn't help: the lag kept increasing and the checkpoint duration kept growing as well. Next, we tried a RollingPolicy with a larger part size so that we make fewer S3 calls, but that didn't help either. Our thinking was that we were making a lot of S3 calls, getting throttled, and therefore taking a long time to push records. How can I resolve the back pressure so that checkpointing is faster and the job keeps up with the source throughput?
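For reference, the sink is wired up roughly along the lines of the sketch below. This is a simplified illustration, not our exact code: the bucket/path, part size, and intervals are placeholders, and it assumes Flink 1.15+ where DefaultRollingPolicy's builder takes Duration/MemorySize (older versions take plain longs).

```java
import java.time.Duration;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class PublisherEventsJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // FileSink only finalizes part files when a checkpoint completes, so
        // checkpointing must be enabled; its interval also bounds how quickly
        // in-progress files are committed to S3.
        env.enableCheckpointing(60_000L);

        // Stand-in for the real Kafka sources; the sink below is the part
        // relevant to the rolling-policy question.
        DataStream<String> events = env.fromElements("event-1", "event-2");

        FileSink<String> sink = FileSink
                .forRowFormat(new Path("s3://my-bucket/publisher-events/"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                // Roll a part file once it reaches ~256 MiB ...
                                .withMaxPartSize(MemorySize.ofMebiBytes(256))
                                // ... or once it has been open for 15 minutes ...
                                .withRolloverInterval(Duration.ofMinutes(15))
                                // ... or has received no records for 5 minutes.
                                .withInactivityInterval(Duration.ofMinutes(5))
                                .build())
                .build();

        events.sinkTo(sink);
        env.execute("publisher-events-to-s3");
    }
}
```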
- When you view your Flink job via the Web UI, where is the back-pressure starting? And do you see a steady increase in the checkpoint size, as well as the duration? – kkrugler
- From the web UI, the back pressure shows up at the source level, which I believe is not the actual place. I added some logs in each operator and all of the operators are fast except the sink, which takes a lot of time. So I think S3 is the real bottleneck here, as it fails to keep up with the source throughput. Yes, both the large checkpoint size and the long duration show up every time. – Vivek Rajendran
1 Answer
Although the documentation isn't clear on this point, I believe you can use entropy injection with the S3 file sink, which should help. Have you tried this?
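In case it helps, here is a sketch of what that would look like. The s3.entropy.* keys are the ones Flink documents for checkpoint entropy injection; whether the file sink's paths get the same substitution is exactly the part the documentation is vague about, and the bucket/path below is a placeholder.

```java
// flink-conf.yaml (or the equivalent configuration mechanism on a managed
// service): define the entropy marker and how many random characters to
// substitute for it.
//
//   s3.entropy.key: _entropy_
//   s3.entropy.length: 4
//
// Then include the marker in the sink's base path. For checkpoint data files
// Flink replaces "_entropy_" with random characters; the hope here is that the
// same substitution applies to the sink path, spreading objects across S3 key
// prefixes.
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;

public class EntropyInjectedSinkSketch {

    public static FileSink<String> buildSink() {
        return FileSink
                .forRowFormat(new Path("s3://my-bucket/_entropy_/publisher-events/"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();
    }
}
```

The idea is that spreading part files across many key prefixes avoids concentrating requests on a single S3 prefix, which is what tends to show up as throttling and a slow sink.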