Testing Spark Applications with Mesos

Written on 2019-07-04

This post walks through an example of running a Spark cluster with the Mesos cluster manager on macOS. Upcoming posts will explore other setups, including clusters running the standalone cluster manager and the YARN cluster manager.

Describing the Mesos Architecture

The Mesos architecture is arguably more similar to the standalone architecture than to the YARN architecture, because its essential components are just a master and workers. In the Mesos architecture, the master schedules worker resources for applications that need them, and the workers launch executors for those applications, which execute tasks.

Mesos can run Docker containers, so it can essentially run any application that can be packaged in a Docker container, including Spark applications.
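
For example, Spark on Mesos can launch its executors inside a Docker image via the spark.mesos.executor.docker.image property; the image name below is a placeholder:

    spark.mesos.executor.docker.image=example/spark:2.4.6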

Generally, Mesos is more powerful than the Spark standalone cluster manager. It can run applications other than Spark, such as Java, Scala, and Python applications, and it can schedule more than CPU and RAM, including disk space and network ports.

Spark applications running on Mesos consist of two components: a scheduler and an executor. The scheduler accepts or rejects the CPU and RAM resources offered by the Mesos master. When an offer is accepted, the master directs its workers to launch Mesos executors, and those executors run tasks as requested by the scheduler.
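
In Spark 2.x, this Mesos-specific scheduler runs in coarse-grained mode by default: it launches one long-running Mesos task per Spark executor and holds those resources for the lifetime of the application. This behavior is controlled by a single property in spark-defaults.conf:

    spark.mesos.coarse=true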

Comparing Mesos and Standalone Architectures

Resource scheduling is the most distinct difference between the Mesos and standalone architectures. A standalone cluster manager assigns resources to applications automatically, whereas a Mesos cluster manager offers resources to applications, and each application can accept or refuse those offers.
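
Since the application decides what to accept, it should also bound what it takes: on Mesos, the Spark scheduler keeps accepting offered cores only until spark.cores.max is reached. A sketch with illustrative values for spark-defaults.conf:

    spark.cores.max=2
    spark.executor.memory=512m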

Setting up Mesos

  1. Install Mesos

    $ brew install mesos
  2. Install the ZooKeeper dependency

    $ brew install zookeeper
  3. Optionally, assign IP addresses for masters in /etc/mesos/zk:

    zk://192.0.2.1:2181,192.0.2.2:2181/mesos
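
To sanity-check the installation before continuing (assuming Homebrew's default layout):

    $ mesos-master --version
    $ zkServer start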

Setting up Spark

  1. Download Spark 2.4.6
  2. Add the path for Spark in .bash_profile:

    export SPARK_HOME=./spark-2.4.6-bin-hadoop2.7
  3. Create the file ./conf/spark-defaults.conf:

    spark.master=mesos://localhost:5050
    spark.executor.memory=512m
    spark.eventLog.enabled=true
    spark.eventLog.dir=./tmp/spark-events/
    spark.history.fs.logDirectory=./tmp/spark-events/
    spark.driver.memory=5g
  4. Create a Spark application:

    # test.py
    import os
    from pyspark import SparkContext

    file = os.path.expanduser("~/data.txt")  # path of data; ~ must be expanded manually
    sc = SparkContext(appName='myapp')  # master URL comes from spark-defaults.conf or spark-submit
    data = sc.textFile(file).cache()
    num_a = data.filter(lambda s: 'a' in s).count()
    print(num_a)
    sc.stop()
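
Before involving Mesos, the application can be sanity-checked locally; the local[2] master runs it on two local threads, overriding spark.master from spark-defaults.conf:

    $ ./bin/spark-submit --master "local[2]" test.py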

Launching Mesos Daemons

Homebrew does not install Linux-style service scripts, so the Mesos daemons are launched directly:

  1. Start the master:

    $ sudo mesos-master --ip=127.0.0.1 --work_dir=/tmp/mesos
  2. Start a worker:

    $ sudo mesos-slave --master=127.0.0.1:5050 --work_dir=/tmp/mesos
  3. Stop the master or worker by interrupting it with Ctrl-C, or by killing its process:

    $ sudo pkill mesos-master
    $ sudo pkill mesos-slave
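
To confirm that the master is up and the worker has registered with it, open the Mesos web UI with the macOS open command:

    $ open http://localhost:5050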

Launching Spark Daemons

  1. Start a master daemon in standalone mode

    $ ./sbin/start-master.sh
  2. Start a worker daemon

    $ ./sbin/start-slave.sh spark://localhost:7077
  3. Create the event log directory and start a history daemon

    $ mkdir -p ./tmp/spark-events
    $ ./sbin/start-history-server.sh
  4. Start a Spark application

    $ ./bin/spark-submit \
    --master mesos://localhost:5050 \
    test.py
  5. Stop the daemons

    $ ./sbin/stop-master.sh
    $ ./sbin/stop-slave.sh
    $ ./sbin/stop-history-server.sh
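
Resource settings from spark-defaults.conf can also be overridden per submission; a sketch with illustrative values:

    $ ./bin/spark-submit \
    --master mesos://localhost:5050 \
    --conf spark.executor.memory=512m \
    --conf spark.cores.max=2 \
    test.py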

Accessing Web UI for Daemons

Each daemon provides its own web UI. By default, the Mesos master serves its web UI on port 5050; Mesos workers take on different ports and can be reached via the master web UI. The standalone Spark master serves its UI on port 8080, each running application serves one on port 4040, and the history server can be accessed on port 18080. The table below summarizes the default ports.

Daemon             Port
Mesos Master       5050
Spark Master       8080
Spark Application  4040
Spark History      18080
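
For example, once a run completes, its event logs can be browsed in the history server UI:

    $ open http://localhost:18080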

Launching Applications in Client Mode

  1. Mesos workers offer their resources to the master
  2. The Mesos scheduler registers with the Mesos master

    • The Mesos master is located in the cluster
    • The Mesos scheduler is Spark's Mesos-specific scheduler
    • The Mesos scheduler runs in the driver
  3. The Mesos master offers available resources to the Mesos scheduler

    • This happens continuously: offers are sent for as long as the master is alive
  4. The Mesos scheduler accepts some of the resources
  5. The Mesos scheduler sends metadata about the resources to the Mesos master

    • This metadata describes the accepted resources and the tasks that will run on them
  6. The Mesos master asks the workers to start the tasks with the specified resources
  7. The Mesos workers launch Mesos executors
  8. The Mesos executors launch Spark executors, which run the application's tasks
  9. The Spark executors communicate with the Spark driver
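
In spark-submit terms, client mode is the default deploy mode, so the submission shown earlier already follows this sequence; it can also be requested explicitly:

    $ ./bin/spark-submit \
    --master mesos://localhost:5050 \
    --deploy-mode client \
    test.py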

[Figure: MesosClient, a diagram of launching an application in client mode]
