Spark RDD Fundamentals
This post provides a high-level introduction to the RDD object in the Spark API. In the coming posts, we'll dive deeper into other, more general objects in the Spark API. Then, we'll explore lower-level concepts, including Spark internals.
Table of Contents
- Resilient Distributed Datasets
- Defining a DAGScheduler
- Types of Transformations
- Lifecycle of a Spark Program
Resilient Distributed Datasets
Most likely, we've all worked with pandas DataFrames before. They're in-memory, single-server data structures that offer many user-friendly functions for data processing. Spark provides a functionally similar data structure that can be used across multiple servers: the resilient distributed dataset, or RDD for short. In short, an RDD is an in-memory data structure that is distributed across many servers within a Spark cluster.
Roughly, we can think of an RDD as a distributed version of a pandas DataFrame. I'm making this comparison because RDDs offer many pandas-like functions focused on data processing. These functions are called Transformations and Actions. Specifically, transformations create a new dataset from an existing one, whereas actions return non-dataset values, which generally relate to some aggregation.
Transformations are lazy, meaning an RDD isn't computed until it receives an action. Actions always return values to the driver program. Lazy evaluation gives Spark a performance boost, but it can become a problem if users repeatedly recompute the same transformation. As a result, Spark allows us to persist an RDD in memory using the persist method. To summarize, transformations and actions have the following properties:
- Transformations are lazy by default
- Actions aren't lazy
- Transformations return a new RDD
- Actions return an aggregated value of the RDD
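To make this concrete, here's a minimal PySpark sketch (assuming a local SparkContext created just for illustration; the values in the comments are only examples) that applies lazy transformations, persists an intermediate RDD, and then triggers computation with actions:

```python
from pyspark import SparkContext

# Assumes no SparkContext is running yet; in the pyspark shell, use the provided `sc` instead.
sc = SparkContext("local[*]", "rdd-basics")

numbers = sc.parallelize(range(1, 11))  # create an RDD from a Python collection

# Transformations are lazy: nothing is computed yet, we just get new RDDs back.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Persist the intermediate RDD so repeated actions don't recompute it.
evens.persist()

# Actions trigger computation and return plain values to the driver.
print(evens.count())    # 5
print(evens.collect())  # [4, 16, 36, 64, 100]
```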
As stated previously, an RDD receives a huge boost in performance by keeping data in memory. However, Spark also supports persisting data to disk, as well as to external databases. To summarize, an RDD has the following properties:
- Distributed
- Fault-tolerant
- Flexible functions such as map(func)
- Optionally cached in memory on the executors' JVMs
- Parallelizable using sc.parallelize(data)
- Immutable (more on this later)
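As a rough sketch of these persistence options (reusing the context from the previous example), we can ask Spark to keep an RDD in memory and spill partitions to disk when they don't fit:

```python
from pyspark import SparkContext
from pyspark.storagelevel import StorageLevel

sc = SparkContext.getOrCreate()  # reuses the context created above (or the shell's sc)

rdd = sc.parallelize(range(1_000_000), numSlices=8)  # explicitly request 8 partitions

# Keep the data in memory, spilling partitions to disk if they don't fit.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.getNumPartitions())  # 8
print(rdd.sum())               # action: forces computation and caching
```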
Defining a DAGScheduler
Recall, a transformation is a special type of RDD method that returns another RDD object. Again, these methods aren't computed until the RDD receives an action, and RDD objects themselves are immutable. Since RDD objects can't change once they are created, Spark creates a new object called a DAG (directed acyclic graph) when an action is called. In a DAG, each node is an RDD partition, and each edge is a transformation.
Spark breaks the RDD into smaller chunks of data called partitions. In Spark, a partition is a chunk of data on a node within the cluster. At a high level, Spark breaks transformations and actions into Tasks, which are mapped to partitions. Essentially, a Task represents a unit of work on a partition of a distributed dataset.
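As a rough illustration of how partitions map to units of work (the exact partition boundaries shown in the comments are just an example), we can inspect the chunks with glom and apply a function partition by partition with mapPartitions:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the context created above (or the shell's sc)

rdd = sc.parallelize(range(12), numSlices=4)

# glom() groups the elements of each partition into a list, so we can see the chunks
# that individual tasks operate on.
print(rdd.glom().collect())
# e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

# mapPartitions runs a function once per partition; each call corresponds to the
# work a single task performs on its chunk of the data.
partition_sums = rdd.mapPartitions(lambda chunk: [sum(chunk)])
print(partition_sums.collect())  # e.g. [3, 12, 21, 30]
```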
Assuming no sequential dependencies, Tasks are executed in parallel on partitions. Thus, the number of partitions making up an RDD should roughly equal the number of CPU cores in the cluster. Theoretically, increasing the number of partitions increases the amount of parallelism in the system, assuming there are available CPU cores. If an RDD can't fit its entire dataset into memory, the remaining data will be stored to and read from disk.
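Here's a small sketch of how partition counts relate to available parallelism; defaultParallelism typically reflects the cores available to the application, and repartition lets us spread data across more partitions at the cost of a shuffle:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the context created above (or the shell's sc)

# defaultParallelism is typically the number of cores available to the application.
print(sc.defaultParallelism)

rdd = sc.parallelize(range(100))
print(rdd.getNumPartitions())   # defaults to sc.defaultParallelism for in-memory collections

# repartition() redistributes the data across more (or fewer) partitions, which can
# increase parallelism when idle cores are available; note that it triggers a shuffle.
wider = rdd.repartition(16)
print(wider.getNumPartitions())  # 16
```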
Now, let's return to our previous discussion about the DAG object. When an action is called, each DAG is submitted to a DAGScheduler object for execution. A DAGScheduler organizes operations into Stages, and a Stage is organized into Tasks. Each Task is scheduled separately; it represents a unit of work on a partition of an RDD, and is executed as a thread in an executor's JVM. The DAGScheduler returns a TaskSet object, which is passed to a TaskScheduler. The TaskScheduler launches the tasks via the cluster manager.
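We can't call the DAGScheduler directly from user code, but toDebugString gives a rough picture of the lineage Spark turns into a DAG of stages; in its output, indentation changes mark shuffle (stage) boundaries. A small sketch, reusing the same setup as above:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the context created above (or the shell's sc)

rdd = (sc.parallelize(range(100), numSlices=4)
         .map(lambda x: (x % 10, x))             # narrow transformation
         .reduceByKey(lambda a, b: a + b))       # wide transformation: forces a stage boundary

# toDebugString() describes the RDD lineage that the DAGScheduler will split into stages.
print(rdd.toDebugString().decode("utf-8"))
```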
Multiple tasks can be executed in parallel within any stage. Two stages can also run in parallel if neither depends on the other, meaning tasks from one stage can execute alongside tasks from a separate stage. Refer to this post for an illustration of how Tasks and stages run in parallel.
The number of tasks in a stage is equal to the number of partitions. Each wide transformation introduces a new stage boundary, so the number of stages grows with the number of wide transformations. For examples that may help illustrate these concepts, refer to my next post.
Types of Transformations
There are two types of transformations that can be applied to RDDs: narrow transformations and wide transformations. Narrow transformations are those where each input partition contributes to at most one output partition, so they can be computed entirely within a single stage. These include transformations like map, filter, etc. Wide transformations are those where each input partition may contribute to many output partitions, which requires redistributing data across the cluster. In Spark, this redistribution is called shuffling.
Shuffling regroups data between partitions and is necessary whenever a computation needs information from other partitions. As a result, wide transformations are more expensive than narrow transformations. For example, the map transformation doesn't require shuffling, since it applies element-wise operations within each partition; an element in one partition doesn't need any information from other partitions, which allows consecutive narrow transformations to be chained together, a technique called pipelining. On the other hand, the groupByKey wide transformation needs information from every partition. Additionally, a narrow transformation keeps its results in memory, whereas a wide transformation writes its shuffle results to disk. This post defines optimized shuffling algorithms in detail.
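The following sketch contrasts the two kinds of transformations: the narrow map and filter steps are pipelined within one stage, while groupByKey forces a shuffle into a new stage (the printed result and its order are just illustrative):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the context created above (or the shell's sc)

words = sc.parallelize(["spark", "rdd", "spark", "dag", "rdd", "spark"], numSlices=3)

# Narrow transformations: each output partition depends on a single input partition,
# so map and filter are pipelined together in one stage with no shuffle.
pairs = words.map(lambda w: (w, 1)).filter(lambda kv: len(kv[0]) >= 3)

# Wide transformation: grouping by key needs matching keys from every partition,
# so Spark shuffles the data and starts a new stage.
counts = pairs.groupByKey().mapValues(lambda ones: sum(ones))

print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('dag', 1)]
```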
Lifecycle of a Spark Program
Now, we have a high-level understanding of the core Spark data structures. This lecture defines the lifecycle of a Spark program in detail. Generally, a common lifecycle of a Spark program looks like the following (a short sketch follows the list):
- Create some input RDDs from external data
- Lazily transform them to define new RDDs using transformations
- Ask Spark to cache() any intermediate RDDs that will be reused
- Launch actions to start parallel computation
- Spark optimizes and executes its computations
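Putting the lifecycle together, here's a hedged end-to-end sketch; the file path data/access.log is hypothetical and only used for illustration:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the context created above (or the shell's sc)

# 1. Create an input RDD from external data (hypothetical path, for illustration only).
lines = sc.textFile("data/access.log")

# 2. Lazily define new RDDs using transformations.
errors = lines.filter(lambda line: "ERROR" in line)

# 3. Cache an intermediate RDD that multiple actions will reuse.
errors.cache()

# 4. Launch actions to start parallel computation.
print(errors.count())
print(errors.take(5))

# 5. Spark optimizes the DAG and executes the computation across the cluster.
```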