I am facing an issue where my Spark application fails approximately once every 50 days. However, I don't see any errors in the application logs. The only clue I found is in the NodeManager logs, which show the following warning:
```
WARN org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Exception from container-launch with container ID: container_e225_1708884103504_1826568_02_000002 and exit code: 1
```
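For reference, this is how I pull the full aggregated container logs (assuming YARN log aggregation is enabled on the cluster); the application ID below is derived from the container ID in the warning:

```sh
# application_<clusterTimestamp>_<appId> is embedded in the container ID above
yarn logs -applicationId application_1708884103504_1826568 > app_logs.txt
```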
After the application restarted, I checked memory usage for both the executors and the driver. In the Spark UI, the driver's memory usage looks unusual: it shows 98.1 GB / 19.1 GB (used / total).
- My Spark version is 2.4.0.
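For what it's worth, my understanding is that in Spark 2.4 the 19.1 GB denominator is the total storage memory derived from the driver heap: (heap - 300 MB reserved) × spark.memory.fraction. A quick sanity check, where the 32 GiB driver heap is an assumption for illustration rather than my exact setting:

```python
# Rough reconstruction of the Storage Memory "total" shown in the Spark UI.
# Spark 2.4 unified memory: total = (heap - 300 MB reserved) * spark.memory.fraction
driver_heap_mib = 32 * 1024   # assumed driver -Xmx for illustration, not my real config
reserved_mib = 300            # Spark's reserved system memory
memory_fraction = 0.6         # default spark.memory.fraction

total_storage_gib = (driver_heap_mib - reserved_mib) * memory_fraction / 1024
print(f"~{total_storage_gib:.1f} GiB")  # ~19.0 GiB, close to the 19.1 GB the UI reports
```

If that reading is right, the used side (98.1 GB) exceeds the total several times over, which is why the number looks wrong to me.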
My Questions:
- What does 98.1 GB / 19.1 GB in the Spark UI's Storage Memory column for the driver indicate?
- Could this excessive driver memory usage be the reason for my application's failure?
- How can I debug or find the root cause of an application that fails only once every ~50 days? (I've sketched the instrumentation I'm considering below.)
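Here is that sketch: standard Java 8 GC-logging and heap-dump flags passed via extraJavaOptions, so the next failure leaves more evidence behind. The paths are placeholders, and the rest of my submit command is omitted:

```sh
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/driver_gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver_heap.hprof" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/executor_gc.log" \
  ...
```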
Any insights or suggestions would be greatly appreciated!