
Spark in Docker: Jupyter Task Running Indefinitely

I set up a Docker environment containing Spark, PostgreSQL, and Jupyter Notebook, all connected on the same network. The idea is to process large datasets using Spark, then save the results to PostgreSQL.

My docker-compose is as follows:

version: '3.8'

services:
  spark-master:
    image: bitnami/spark:3.4.1
    container_name: spark-master
    command: bin/spark-class org.apache.spark.deploy.master.Master
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
    ports:
      - "7077:7077"  # RPC port for workers
      - "8080:8080"  # Web UI for Spark Master
    networks:
      - spark-network

  spark-worker:
    image: bitnami/spark:3.4.1
    container_name: spark-worker
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_CORES=2
      - SPARK_WORKER_MEMORY=4g
      - SPARK_EXECUTOR_MEMORY=3g
      - SPARK_DRIVER_MEMORY=1g
      - SPARK_RPC_MESSAGE_MAX_SIZE=512
      - SPARK_NETWORK_TIMEOUT=300s
      - SPARK_EXECUTOR_HEARTBEAT_INTERVAL=30s
    depends_on:
      - spark-master
    ports:
      - "8081:8081"  # Web UI for Spark Worker
    networks:
      - spark-network

  postgres:
    image: postgres:13
    container_name: postgres
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"  # Expose PostgreSQL for Airflow and pgAdmin
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - spark-network

  pgadmin:
    image: dpage/pgadmin4
    container_name: pgadmin
    depends_on:
      - postgres
    environment:
      PGADMIN_DEFAULT_EMAIL: [email protected]
      PGADMIN_DEFAULT_PASSWORD: admin
    ports:
      - "5050:80"  # pgAdmin Web UI
    volumes:
      - pgadmin_data:/var/lib/pgadmin
    networks:
      - spark-network

  jupyter:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/work
    networks:
      - spark-network
    environment:
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_DRIVER_HOST=jupyter  # Use the Jupyter container hostname
      - SPARK_DRIVER_PORT=4040    # Set a static driver port
      - PYSPARK_PYTHON=python3
      - PYSPARK_DRIVER_PYTHON=jupyter
      - PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --allow-root"
    command: >
      start-notebook.sh --NotebookApp.token='' --NotebookApp.password=''

volumes:
  postgres_data:
  pgadmin_data:
  spark-logs:

networks:
  spark-network:
    driver: bridge
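
As a quick sanity check that the containers actually share the network, the master's RPC port can be probed from inside the Jupyter container with a few lines of Python (a minimal sketch; the hostname and port come from the compose file above):

import socket

# Open a TCP connection to the Spark master's RPC port over spark-network.
# If this raises, the Jupyter container cannot reach spark-master at all,
# which would make any job submission hang.
with socket.create_connection(("spark-master", 7077), timeout=5):
    print("spark-master:7077 is reachable from this container")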

Everything starts fine, and I can access the Spark Master and Worker UIs. When I test Spark in Jupyter by printing its version, it works immediately. However, when I run the following code:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("JupyterETL") \
    .getOrCreate()

data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
data.show()

It runs indefinitely in Jupyter. When I check the Spark UI, it shows that the worker has picked up the task I submitted from Jupyter. However, the container logs show no meaningful progress (e.g., no stage completing) even after 10 minutes, although this should be a trivial job.
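
In case it helps narrow things down: as far as I understand, the executors need to open connections back to the driver running inside the Jupyter container. Below is a minimal sketch of pinning the driver endpoint directly in the Spark conf instead of relying on the environment variables (spark.driver.host, spark.driver.port, and spark.driver.blockManager.port are standard Spark settings; the hostname jupyter matches the container name above, and the port numbers are arbitrary choices that would also need to be reachable on the jupyter container):

from pyspark.sql import SparkSession

# Pin the driver endpoint so executors on spark-worker can connect back to
# this container. The ports are arbitrary but must be open on "jupyter".
spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("JupyterETL") \
    .config("spark.driver.host", "jupyter") \
    .config("spark.driver.port", "7078") \
    .config("spark.driver.blockManager.port", "7079") \
    .getOrCreate()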

Could you kindly help identify any issues with my docker-compose.yml or suggest what might be going wrong? Thank you!
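
For completeness, the eventual goal is a JDBC write from Spark into the postgres container, roughly along these lines (a sketch; the results table name and the spark.jars.packages Maven coordinate for the PostgreSQL driver are my assumptions, while the credentials mirror the compose file):

from pyspark.sql import SparkSession

# The PostgreSQL JDBC driver is not bundled with Spark; one way to pull it in
# is via spark.jars.packages (org.postgresql:postgresql on Maven Central).
spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("JupyterETL") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0") \
    .getOrCreate()

data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Write to the postgres service over the shared Docker network.
data.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://postgres:5432/airflow") \
    .option("dbtable", "results") \
    .option("user", "airflow") \
    .option("password", "airflow") \
    .option("driver", "org.postgresql.Driver") \
    .mode("overwrite") \
    .save()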
