Spin up an Apache Spark cluster: how-to in 5 minutes

The purpose of this flash tip is to give the reader the information required to spin up an Apache Spark cluster composed of

  • One Master
  • One or more workers

The cluster can be used for several purposes, such as experimenting with Apache Spark or building an integration test suite.

Prerequisites

There are only two requirements for executing what we will describe in the next sections: Docker and Docker Compose.

Once they are installed, run the following commands to make sure everything is working.
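Something like the following should print a version string without errors (the exact Compose command depends on whether you installed the standalone docker-compose binary or the Compose plugin for the Docker CLI):

> docker --version
> docker-compose --version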

Environment variables

It’s usually a good idea to isolate context variables in a dedicated file, in this case called “.env”, placed beside the main file (docker-compose.yml).
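As an illustration, here is what the “.env” file could look like for the docker-compose.yml shown below. Every value (image tag, ports, subnet and addresses) is just an example; adapt it to your environment.

# Spark image tag (apache/spark on Docker Hub)
SPARK_TAG=3.5.1

# Static network and container addresses
CONTAINERS_NETWORK_SUBNET=172.28.0.0/16
SPARK_MASTER=172.28.1.1
SPARK_WORKER_1=172.28.1.2
SPARK_WORKER_2=172.28.1.3

# Master ports (RPC and web UI)
SPARK_MASTER_CONTAINER_PORT=7077
SPARK_MASTER_HOST_PORT=7077
SPARK_MASTER_HTTP_HOST_PORT=8080

# Worker RPC ports and worker web UI ports exposed on the host
SPARK_WORKER_1_CONTAINER_PORT=7078
SPARK_WORKER_1_HOST_PORT=7078
SPARK_WORKER_1_WEB_CONTAINER_PORT=8081
SPARK_WORKER_2_CONTAINER_PORT=7079
SPARK_WORKER_2_HOST_PORT=7079
SPARK_WORKER_2_WEB_CONTAINER_PORT=8082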

version: "3.6"
volumes:
  # Shared volume used to "emulate" a distributed filesystem across the nodes
  dfs:
    name: "dfs"
    driver: local
services:
  spark-master:
    image: apache/spark:${SPARK_TAG}
    ports:
      # Master web UI (8080 inside the container) and master RPC port
      - ${SPARK_MASTER_HTTP_HOST_PORT}:8080
      - ${SPARK_MASTER_HOST_PORT}:${SPARK_MASTER_CONTAINER_PORT}
    # Logs go to the container's stdout (docker-compose logs spark-master)
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    volumes:
      - dfs:/var/data
    networks:
      static-network:
        ipv4_address: ${SPARK_MASTER}
  spark-worker-1:
    image: apache/spark:${SPARK_TAG}
    environment:
      - SPARK_WORKER_DIR=/opt/spark/work-dir
      - SPARK_WORKER_PORT=${SPARK_WORKER_1_CONTAINER_PORT}
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=2048m
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER}:${SPARK_MASTER_CONTAINER_PORT}
    ports:
      # Worker RPC port and worker web UI (8081 inside the container)
      - ${SPARK_WORKER_1_HOST_PORT}:${SPARK_WORKER_1_CONTAINER_PORT}
      - ${SPARK_WORKER_1_WEB_CONTAINER_PORT}:8081
    volumes:
      - dfs:/var/data
    depends_on:
      - spark-master
    networks:
      static-network:
        ipv4_address: ${SPARK_WORKER_1}
  spark-worker-2:
    image: apache/spark:${SPARK_TAG}
    environment:
      - SPARK_WORKER_DIR=/opt/spark/work-dir
      - SPARK_WORKER_PORT=${SPARK_WORKER_2_CONTAINER_PORT}
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=2048m
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER}:${SPARK_MASTER_CONTAINER_PORT}
    ports:
      - ${SPARK_WORKER_2_HOST_PORT}:${SPARK_WORKER_2_CONTAINER_PORT}
      - ${SPARK_WORKER_2_WEB_CONTAINER_PORT}:8081
    volumes:
      - dfs:/var/data
    depends_on:
      - spark-master
    networks:
      static-network:
        ipv4_address: ${SPARK_WORKER_2}
networks:
  static-network:
    ipam:
      config:
        - subnet: ${CONTAINERS_NETWORK_SUBNET}

docker-compose.yml

docker-compose.yml is the default filename Docker Compose looks for when you run it.

The descriptor above starts an Apache Spark cluster composed of

  • one master
  • two workers

If you want more than two workers, create a new worker section by copying one of the two existing ones (remember to give it a unique name and to add the corresponding variables to the environment file above).
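For instance, a third worker could look like the sketch below; it assumes you have also added SPARK_WORKER_3, SPARK_WORKER_3_CONTAINER_PORT, SPARK_WORKER_3_HOST_PORT and SPARK_WORKER_3_WEB_CONTAINER_PORT to the “.env” file:

  spark-worker-3:
    image: apache/spark:${SPARK_TAG}
    environment:
      - SPARK_WORKER_DIR=/opt/spark/work-dir
      - SPARK_WORKER_PORT=${SPARK_WORKER_3_CONTAINER_PORT}
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=2048m
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER}:${SPARK_MASTER_CONTAINER_PORT}
    ports:
      - ${SPARK_WORKER_3_HOST_PORT}:${SPARK_WORKER_3_CONTAINER_PORT}
      - ${SPARK_WORKER_3_WEB_CONTAINER_PORT}:8081
    volumes:
      - dfs:/var/data
    depends_on:
      - spark-master
    networks:
      static-network:
        ipv4_address: ${SPARK_WORKER_3}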

The cluster “emulates” a distributed filesystem (dfs, in the example) using a shared volume across the three nodes. 
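Once the cluster is up (see below), you can check the shared volume with a couple of exec commands: a file created under /var/data on one node is visible on the others. The -u root flag is only there because the volume may not be writable by the image’s default user.

> docker-compose exec -u root spark-master touch /var/data/hello.txt
> docker-compose exec spark-worker-1 ls /var/data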

There’s not much more to say. Open a terminal, cd into the directory where the two files above are located, and run:

> docker-compose up

After a few moments (depending on your machine’s resources), your cluster is ready.
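As a quick smoke test, you can submit the SparkPi example that ships with the image to the standalone master. Treat it as a sketch: the examples jar name depends on the Spark and Scala versions of the image, and spark://172.28.1.1:7077 is the SPARK_MASTER address and port from the sample “.env” above.

> docker-compose exec spark-master /opt/spark/bin/spark-submit \
    --master spark://172.28.1.1:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar 100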

The master and the workers also provide a web console. Hosts and ports depend on what you indicated in the environment file: the master console, for instance, answers on the host port you mapped to the container’s port 8080 (SPARK_MASTER_HTTP_HOST_PORT).

Open it and you will see the workers we defined in the descriptor, properly registered. Enjoy!
