Testing Spark Applications in Standalone Mode
This post walks through an example of a cluster running in standalone mode. In the coming posts, we'll explore other examples, including clusters running the YARN and Mesos cluster managers.
Table of Contents
- Setting up a SparkSession
- Launching Daemons
- Accessing Web UI for Daemons
- Caveat about PySpark Applications
- Launching Applications in Client Mode
- Launching Applications in Cluster Mode
Setting up a SparkSession
- Download Spark 2.4.6
- Create the file ./conf/spark-defaults.conf:
spark.master=spark://localhost:7077
spark.eventLog.enabled=true
spark.eventLog.dir=./tmp/spark-events/
spark.history.fs.logDirectory=./tmp/spark-events/
spark.driver.memory=5g
- Create a Spark application:
# test.py
from pyspark import SparkContext

file = "~/data.txt"  # path of data
masterurl = 'spark://localhost:7077'

sc = SparkContext(masterurl, 'myapp')
data = sc.textFile(file).cache()
num_a = data.filter(lambda s: 'a' in s).count()
print(num_a)
sc.stop()
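The section title mentions a SparkSession; in Spark 2.x a SparkSession wraps the SparkContext, and the settings from spark-defaults.conf can also be supplied programmatically through its builder. Below is a minimal sketch of an equivalent application written that way. It is illustrative only (spark-defaults.conf already covers these settings), and the data path is the same placeholder used above.
# test_session.py -- SparkSession-based sketch, equivalent to test.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("myapp")
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.dir", "./tmp/spark-events/")  # directory must already exist
         .config("spark.driver.memory", "5g")
         .getOrCreate())

data = spark.sparkContext.textFile("~/data.txt").cache()  # same placeholder path as test.py
num_a = data.filter(lambda s: 'a' in s).count()
print(num_a)

spark.stop()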
Launching Daemons
- Start a master daemon in standalone mode
$ ./sbin/start-master.sh
- Start a worker daemon
$ ./sbin/start-slave.sh spark://localhost:7077
- Start a history daemon
$ ./sbin/start-history-server.sh
- Start a Spark application
$ ./bin/spark-submit \
--master spark://localhost:7077 \
test.py
- Stop the daemons
$ ./sbin/stop-master.sh
$ ./sbin/stop-slave.sh
$ ./sbin/stop-history-server.sh
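For repeated test runs, it can be handy to script this sequence. The sketch below is a hypothetical helper (not an official Spark utility) that wraps the commands above; it assumes it is run from the Spark 2.4.6 root directory and that spark-submit runs in client mode, so it blocks until the application finishes.
# run_standalone.py -- hypothetical helper that wraps the commands above
import subprocess

def run(cmd):
    # echo the command, then run it and fail fast on a non-zero exit code
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["./sbin/start-master.sh"])
run(["./sbin/start-slave.sh", "spark://localhost:7077"])
run(["./sbin/start-history-server.sh"])

# client mode: this call returns once the application has finished
run(["./bin/spark-submit", "--master", "spark://localhost:7077", "test.py"])

run(["./sbin/stop-master.sh"])
run(["./sbin/stop-slave.sh"])
run(["./sbin/stop-history-server.sh"])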
Accessing Web UI for Daemons
Spark provides a web UI for each initialized daemon. By default, Spark creates a web UI for the master on port 8080. The workers can take on different ports and can be accessed via the master web UI. The history server can be accessed on port 18080 by default. The table below summarizes the default locations for each web UI.
| Daemon | Port |
|---|---|
| Master | 8080 |
| Worker | 8081 |
| History | 18080 |
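These UIs can also be polled from a test script. The sketch below assumes the standalone master serves its state as JSON at /json on the web UI port, which is the case in the 2.4 line; treat the endpoint and field names as assumptions and fall back to the HTML page at the same address if they differ in your build.
# check_master.py -- quick check that the master is up and has workers
import json
from urllib.request import urlopen

with urlopen("http://localhost:8080/json") as resp:  # master web UI, JSON view
    state = json.load(resp)

print("status :", state.get("status"))
print("workers:", len(state.get("workers", [])))
for app in state.get("activeapps", []):
    print("running app:", app.get("name"))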
Caveat about PySpark Applications
Notice that launching an application in client mode doesn't appear to register a driver in the master's web UI. This doesn't mean a driver isn't launched in client mode. The driver is still launched within the spark-submit process; the master's web UI simply omits driver information when the application runs in client mode.
So, we may want to launch an application in cluster mode instead. However, running an application in cluster mode gives us the following error:
$ ./bin/spark-submit \
--master spark://localhost:7077 \
--deploy-mode cluster \
test.py
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is currently not supported for python applications on standalone clusters.
As of Spark 2.4.6, we can't run Python applications in cluster mode on a standalone cluster manager. This is a good opportunity for us to experiment with other resource managers in the next post. For now, we will run JavaSparkPi.java, found in the examples directory.
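To follow along in cluster mode, the bundled Java example can be submitted instead of test.py. The sketch below uses the same subprocess approach as earlier; the examples jar name matches the prebuilt 2.4.6 distribution but may differ in your build, and the trailing "100" is just an arbitrary number of slices for the pi estimate.
# submit_javasparkpi.py -- submit the bundled Java example in cluster mode
import subprocess

subprocess.run([
    "./bin/spark-submit",
    "--master", "spark://localhost:7077",
    "--deploy-mode", "cluster",
    "--class", "org.apache.spark.examples.JavaSparkPi",
    "./examples/jars/spark-examples_2.11-2.4.6.jar",  # jar name may vary by build
    "100",                                            # number of slices (optional argument)
], check=True)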
Launching Applications in Client Mode
In a previous post, we defined the components associated with a driver program and cluster, and illustrated the interaction between a driver program and the cluster components. Specifically, we defined this interaction for applications launched in client mode. Now, we can execute an application and verify these steps using the logs.
Note, the timestamps and logged messages were slightly modified for clarity. However, the order and substance of each message remain the same.
14:43:01 INFO SparkContext: Submitted application: Spark Pi
14:43:02 INFO Utils: Successfully started service 'sparkDriver'
14:43:03 INFO StandaloneAppClient: Connecting to master
14:43:04 INFO StandaloneSchedulerBackend: Connected to Spark cluster
14:43:05 INFO Master: Registered app Spark Pi
14:43:06 INFO Master: Launching executor on worker
14:43:07 INFO Worker: Asked to launch executor
14:43:08 INFO ExecutorRunner: Launched
14:43:09 INFO StandaloneAppClient: Executor added on worker
14:43:10 INFO StandaloneSchedulerBackend: Granted executor ID
14:43:11 INFO StandaloneAppClient: Executor is now RUNNING
14:43:12 INFO SparkContext: Starting job
...
14:43:13 INFO DAGScheduler: Job finished
14:43:14 INFO StandaloneSchedulerBackend: Shutting down all executors
14:43:15 INFO Worker: Asked to kill executor
14:43:16 INFO ExecutorRunner: Killing process!
14:43:17 INFO Master: Removing app
14:43:18 INFO SparkContext: Successfully stopped SparkContext
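Because spark.eventLog.enabled is set, the same application lifecycle is also captured as an event log under ./tmp/spark-events/, in the JSON-lines format the history server reads. The sketch below prints the event sequence from the most recent log file, assuming at least one application has finished and written its log there.
# read_eventlog.py -- print the event sequence from the latest event log
import glob
import json
import os

logs = glob.glob("./tmp/spark-events/*")  # one file per application (app ID as name)
latest = max(logs, key=os.path.getmtime)  # most recently written log; raises if empty

with open(latest) as f:
    for line in f:
        event = json.loads(line)
        print(event.get("Event"))  # e.g. SparkListenerApplicationStart, SparkListenerJobEnd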
Launching Applications in Cluster Mode
In a previous post, we defined the interaction between a driver program and cluster components when applications are launched in cluster mode. Now, we can execute an application in cluster mode and verify these steps using the logs.
Note, the timestamps and logged messages were slightly modified for clarity. However, the order and substance of each message remain the same.
14:43:01 INFO Master: Driver submitted
14:43:02 INFO Master: Launching driver
14:43:03 INFO Worker: Asked to launch driver
14:43:04 INFO DriverRunner: Launched
14:43:05 INFO Utils: Successfully started service 'driverClient'
14:43:06 INFO ClientEndpoint: Driver successfully submitted
14:43:07 INFO SparkContext: Submitted application: Spark Pi
14:43:08 INFO Utils: Successfully started service 'sparkDriver'
14:43:10 INFO StandaloneAppClient: Connecting to master
14:43:11 INFO StandaloneSchedulerBackend: Connected to Spark cluster
14:43:12 INFO Master: Registered app Spark Pi
14:43:13 INFO Master: Launching executor on worker
14:43:14 INFO Worker: Asked to launch executor
14:43:15 INFO ExecutorRunner: Launched
14:43:16 INFO StandaloneAppClient: Executor added on worker
14:43:17 INFO StandaloneSchedulerBackend: Granted executor ID
14:43:18 INFO StandaloneAppClient: Executor is now RUNNING
14:43:19 INFO SparkContext: Starting job
...
14:43:20 INFO DAGScheduler: Job finished
14:43:21 INFO StandaloneSchedulerBackend: Shutting down all executors
14:43:22 INFO Worker: Asked to kill executor
14:43:23 INFO ExecutorRunner: Killing process!
14:43:24 INFO Worker: Driver exited successfully
14:43:25 INFO Master: Removing app
14:43:26 INFO SparkContext: Successfully stopped SparkContext
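One difference worth confirming: in cluster mode the driver itself is registered with the master, so it shows up in the master's state alongside the executors. The sketch below reuses the JSON view from earlier; the "activedrivers" field and its keys are taken from the 2.4 line and should be treated as assumptions if your build differs.
# check_drivers.py -- list drivers known to the standalone master
import json
from urllib.request import urlopen

with urlopen("http://localhost:8080/json") as resp:  # master web UI, JSON view
    state = json.load(resp)

for driver in state.get("activedrivers", []):
    print("driver:", driver.get("id"), driver.get("state"))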