I have a single-node setup on an Ubuntu server where:
- Spark is running outside a Docker container.
- A Django application is running inside a Docker container.
I’m trying to decide whether it’s better to:
- Install Spark inside the Docker image, or
- Keep Spark outside the Docker container (on the same server) so it can be shared across multiple Docker containers.
Questions:
- Will the latency be the same in both setups?
- What Spark configurations are recommended for this setup?
Currently, I’ve added the following host entry in the Docker image so the container can resolve the Spark master’s hostname:
echo "<private_ip_address> app02" >> /etc/hosts
However, I’m encountering a warning in Spark:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Additionally, the Spark job is not utilizing all of the server's resources (currently, only 20% is being used). How can I resolve this issue and optimize the configuration?
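For reference, the standalone master's web UI should list the registered workers and their free cores/memory; assuming it is running on its default port 8080 on app02, something like this dumps that state as JSON:

```sh
# Assumes the standalone master web UI is on its default port 8080 on app02.
# The output should show registered workers and their available cores/memory.
curl http://app02:8080/json/
```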
Additional Context:
- Spark is running in standalone mode.
- The server has sufficient resources (CPU, RAM) for both Spark and the Django app.
- The goal is to ensure Spark jobs run efficiently while being accessible to the Django app.
Spark Configuration:
Key | Value |
---|---|
spark.dynamicAllocation.initialExecutors | 1 |
spark.serializer | org.apache.spark.serializer.KryoSerializer |
spark.driver.memory | 12365m |
spark.executor.memory | 12365m |
spark.dynamicAllocation.enabled | true |
spark.driver.memory | 6g |
spark.executor.cores | 5 |
spark.executor.memory | 8g |
spark.dynamicAllocation.maxExecutors | 6 |
spark.driver.cores | 2 |
spark.master | spark://app02:7077 |
spark.executor.instances | 5 |
spark.dynamicAllocation.minExecutors | 2 |
spark.dynamicAllocation.initialExecutors | 3 |
spark.executor.memoryOverheadFactor | 0.1 |
total_cores_allocated | 25 |
spark.driver.port | 7078 |
spark.driver.host | private_ip |
spark.driver.bindAddress | 0.0.0.0 |
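For reference, a minimal PySpark sketch of how the driver inside the container connects with the settings above (the app name is made up; `<private_ip_address>` is a placeholder for the host's private IP):

```python
from pyspark.sql import SparkSession

# Minimal sketch of the driver running inside the Docker container.
# "django-spark-job" is a made-up app name; <private_ip_address> is the host's private IP.
spark = (
    SparkSession.builder
    .appName("django-spark-job")
    .master("spark://app02:7077")
    # The driver binds to all interfaces inside the container...
    .config("spark.driver.bindAddress", "0.0.0.0")
    # ...but advertises an address and port that the workers outside the container can reach.
    .config("spark.driver.host", "<private_ip_address>")
    .config("spark.driver.port", "7078")
    # Executor requests must fit what the standalone worker actually offers,
    # otherwise the "Initial job has not accepted any resources" warning appears.
    .config("spark.executor.cores", "5")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```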