I set up a Docker environment containing Spark, PostgreSQL, and Jupyter Notebook, all connected on the same network. The idea is to process large datasets using Spark, then save the results to PostgreSQL.
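For context, the eventual write to PostgreSQL is meant to go through Spark's JDBC data source, roughly like the sketch below (I haven't got that far yet; the table name is a placeholder and the PostgreSQL JDBC driver would still need to be on the Spark classpath, e.g. via spark.jars.packages):

# Rough sketch of the intended write to the postgres container;
# "results" is a placeholder DataFrame and org.postgresql:postgresql
# would have to be available to Spark for this to work.
(results.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/airflow")
    .option("dbtable", "public.results")
    .option("user", "airflow")
    .option("password", "airflow")
    .mode("append")
    .save())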
My docker-compose.yml is as follows:
version: '3.8'

services:
  spark-master:
    image: bitnami/spark:3.4.1
    container_name: spark-master
    command: bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
    ports:
      - "7077:7077"   # RPC port for workers
      - "8080:8080"   # Web UI for Spark Master
    networks:
      - spark-network

  spark-worker:
    image: bitnami/spark:3.4.1
    container_name: spark-worker
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
      - SPARK_EXECUTOR_MEMORY=3g
      - SPARK_DRIVER_MEMORY=1g
      - SPARK_RPC_MESSAGE_MAX_SIZE=512
      - SPARK_NETWORK_TIMEOUT=300s
      - SPARK_EXECUTOR_HEARTBEAT_INTERVAL=30s
    depends_on:
      - spark-master
    ports:
      - "8081:8081"   # Web UI for Spark Worker
    networks:
      - spark-network

  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"   # Expose PostgreSQL for Airflow and pgAdmin
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - spark-network

  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin
    depends_on:
      - postgres
    environment:
      PGADMIN_DEFAULT_EMAIL: [email protected]
      PGADMIN_DEFAULT_PASSWORD: admin
    ports:
      - "5050:80"   # pgAdmin Web UI
    volumes:
      - pgadmin_data:/var/lib/pgadmin
    networks:
      - spark-network

  jupyter:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    networks:
      - spark-network
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_DRIVER_HOST=jupyter   # Use the Jupyter container hostname
      - SPARK_DRIVER_PORT=4040      # Set a static driver port
      - PYSPARK_PYTHON=python3
      - PYSPARK_DRIVER_PYTHON=jupyter
      - PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --allow-root"
    command: >
      start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''

volumes:
  postgres_data:
  pgadmin_data:
  spark-logs:

networks:
  spark-network:
    driver: bridge
Everything starts fine, and I can access the Spark Master and Worker web UIs. As a quick sanity check I print the Spark version from a Jupyter cell, and that works immediately.
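That check is just something along these lines:

import pyspark
print(pyspark.__version__)   # returns instantly, so PySpark itself is available in the kernel

However, when I run the following code: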
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("JupyterETL") \
    .getOrCreate()

data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
data.show()
The cell runs indefinitely in Jupyter. When I check the Spark Master UI, it shows that the worker has picked up the application I submitted from Jupyter, but the container logs show no meaningful progress (no stages completing, for example) even after 10 minutes, although this should be a trivial job.
Could you kindly help identify any issues with my docker-compose.yml or suggest what might be going wrong? Thank you!