Spin up an Apache Spark cluster: how-to in 5 minutes

The purpose of this flash tip is to give the reader the information required to spin up an Apache Spark cluster composed of

  • One Master
  • One or more workers

The cluster can be used for several purposes, such as experimenting with Apache Spark or building an integration test suite.

Prerequisites

There are only two requirements for executing what we will describe in the next sections: Docker and Docker Compose.

Once they are installed, run the following commands to make sure everything is working.
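Something like the following should print a version string without errors (the exact Compose command depends on whether you installed the standalone docker-compose binary or the Compose plugin for the Docker CLI):

> docker --version
> docker-compose --version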

Environment variables

It’s usually a good idea to isolate context variables in a dedicated file, in this case called “.env”, placed beside the main file (docker-compose.yml).
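As an illustration, here is what the “.env” file could look like for the docker-compose.yml shown below. Every value (image tag, ports, subnet and addresses) is just an example; adapt it to your environment.

# Spark image tag (apache/spark on Docker Hub)
SPARK_TAG=3.5.1

# Static network and container addresses
CONTAINERS_NETWORK_SUBNET=172.28.0.0/16
SPARK_MASTER=172.28.1.1
SPARK_WORKER_1=172.28.1.2
SPARK_WORKER_2=172.28.1.3

# Master ports (RPC and web UI)
SPARK_MASTER_CONTAINER_PORT=7077
SPARK_MASTER_HOST_PORT=7077
SPARK_MASTER_HTTP_HOST_PORT=8080

# Worker RPC ports and worker web UI ports exposed on the host
SPARK_WORKER_1_CONTAINER_PORT=7078
SPARK_WORKER_1_HOST_PORT=7078
SPARK_WORKER_1_WEB_CONTAINER_PORT=8081
SPARK_WORKER_2_CONTAINER_PORT=7079
SPARK_WORKER_2_HOST_PORT=7079
SPARK_WORKER_2_WEB_CONTAINER_PORT=8082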

version: "3.6"
volumes:
  # Shared volume used to "emulate" a distributed filesystem across the nodes
  dfs:
    name: "dfs"
    driver: local
services:
  spark-master:
    image: apache/spark:${SPARK_TAG}
    ports:
      # Master web UI (8080 inside the container) and master RPC port
      - ${SPARK_MASTER_HTTP_HOST_PORT}:8080
      - ${SPARK_MASTER_HOST_PORT}:${SPARK_MASTER_CONTAINER_PORT}
    # Logs go to the container's stdout (docker-compose logs spark-master)
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    volumes:
      - dfs:/var/data
    networks:
      static-network:
        ipv4_address: ${SPARK_MASTER}
  spark-worker-1:
    image: apache/spark:${SPARK_TAG}
    environment:
      - SPARK_WORKER_DIR=/opt/spark/work-dir
      - SPARK_WORKER_PORT=${SPARK_WORKER_1_CONTAINER_PORT}
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=2048m
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER}:${SPARK_MASTER_CONTAINER_PORT}
    ports:
      # Worker RPC port and worker web UI (8081 inside the container)
      - ${SPARK_WORKER_1_HOST_PORT}:${SPARK_WORKER_1_CONTAINER_PORT}
      - ${SPARK_WORKER_1_WEB_CONTAINER_PORT}:8081
    volumes:
      - dfs:/var/data
    depends_on:
      - spark-master
    networks:
      static-network:
        ipv4_address: ${SPARK_WORKER_1}
  spark-worker-2:
    image: apache/spark:${SPARK_TAG}
    environment:
      - SPARK_WORKER_DIR=/opt/spark/work-dir
      - SPARK_WORKER_PORT=${SPARK_WORKER_2_CONTAINER_PORT}
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=2048m
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER}:${SPARK_MASTER_CONTAINER_PORT}
    ports:
      - ${SPARK_WORKER_2_HOST_PORT}:${SPARK_WORKER_2_CONTAINER_PORT}
      - ${SPARK_WORKER_2_WEB_CONTAINER_PORT}:8081
    volumes:
      - dfs:/var/data
    depends_on:
      - spark-master
    networks:
      static-network:
        ipv4_address: ${SPARK_WORKER_2}
networks:
  static-network:
    ipam:
      config:
        - subnet: ${CONTAINERS_NETWORK_SUBNET}

docker-compose.yml

docker-compose.yml is the default filename Docker Compose looks for when you run it.

The descriptor above starts an Apache Spark cluster composed of

  • one master
  • two workers

If you want more than two workers, create a new worker section by copying one of the two existing ones (remember to give it a unique name and to add the corresponding variables to the environment file above).
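For instance, a third worker could look like the sketch below; it assumes you have also added SPARK_WORKER_3, SPARK_WORKER_3_CONTAINER_PORT, SPARK_WORKER_3_HOST_PORT and SPARK_WORKER_3_WEB_CONTAINER_PORT to the “.env” file:

  spark-worker-3:
    image: apache/spark:${SPARK_TAG}
    environment:
      - SPARK_WORKER_DIR=/opt/spark/work-dir
      - SPARK_WORKER_PORT=${SPARK_WORKER_3_CONTAINER_PORT}
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=2048m
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://${SPARK_MASTER}:${SPARK_MASTER_CONTAINER_PORT}
    ports:
      - ${SPARK_WORKER_3_HOST_PORT}:${SPARK_WORKER_3_CONTAINER_PORT}
      - ${SPARK_WORKER_3_WEB_CONTAINER_PORT}:8081
    volumes:
      - dfs:/var/data
    depends_on:
      - spark-master
    networks:
      static-network:
        ipv4_address: ${SPARK_WORKER_3}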

The cluster “emulates” a distributed filesystem (dfs, in the example) using a shared volume across the three nodes. 
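Once the cluster is up (see below), you can check the shared volume with a couple of exec commands: a file created under /var/data on one node is visible on the others. The -u root flag is only there because the volume may not be writable by the image’s default user.

> docker-compose exec -u root spark-master touch /var/data/hello.txt
> docker-compose exec spark-worker-1 ls /var/data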

There’s not much more to say. Open a terminal, cd into the directory where the two files above are located, and run:

> docker-compose up

After a few moments (depending on your machine’s resources), your cluster is ready.
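As a quick smoke test, you can submit the SparkPi example that ships with the image to the standalone master. Treat it as a sketch: the examples jar name depends on the Spark and Scala versions of the image, and spark://172.28.1.1:7077 is the SPARK_MASTER address and port from the sample “.env” above.

> docker-compose exec spark-master /opt/spark/bin/spark-submit \
    --master spark://172.28.1.1:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar 100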

The master and the workers also provide a web console. Hosts and ports depend on what you indicated in the environment file: the master console, for instance, answers on the host port you mapped to the container’s port 8080 (SPARK_MASTER_HTTP_HOST_PORT).

Open it and you will see the workers we defined in the descriptor, properly registered. Enjoy!
