Spark-Developer-Certification-Study-Guide(Python)

Spark Developer Certification - Comprehensive Study Guide (Python)

What is this?

While studying for the Spark certification exam and going through various resources available online, I thought it'd be worthwhile to put together a comprehensive knowledge dump that covers the entire syllabus end-to-end, serving as a Study Guide for myself and hopefully others.

Note that I drew inspiration from the Spark Code Review guide - but whereas that covers only a subset of the coding aspects, I aimed for this to be a more comprehensive, one-stop resource geared towards passing the exam.

Awesome Resources/References used throughout this guide

References

Resources

PySpark Cheatsheet


1. Spark Architecture Components

Candidates are expected to be familiar with the following architectural components and their relationship to each other:

Spark Basic Architecture

A cluster, or group of machines, pools the resources of many machines together, allowing us to use all of the cumulative resources as if they were a single computer. A group of machines on its own is not powerful, though; you need a framework to coordinate work across them. Spark is an engine tailor-made for exactly this: managing and coordinating the execution of tasks on data across a cluster of computers.

The cluster of machines that Spark will leverage to execute tasks is managed by a cluster manager such as Spark’s Standalone cluster manager, YARN (Yet Another Resource Negotiator), or Mesos. We then submit Spark Applications to these cluster managers, which grant resources to our application so that we can complete our work.
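As a quick, hedged illustration of how an application attaches to a cluster manager from PySpark (the app name and master values below are just examples; on a real cluster the master is normally supplied by spark-submit rather than hard-coded):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run inside a single JVM on this machine,
# using up to 4 worker threads - handy for studying and testing.
spark = (SparkSession.builder
         .appName("study-guide-example")   # example name, not prescribed by the exam
         .master("local[4]")
         .getOrCreate())

# On a real cluster, the master is usually passed by spark-submit instead,
# e.g. --master yarn, --master spark://<host>:7077, or --master mesos://<host>:5050,
# and the cluster manager then grants resources to the application.
```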

Spark Applications

Spark Basic Architecture

Spark Applications consist of a driver process and a set of executor processes. In the illustration above, our driver is on the left and four executors are on the right.

What is a JVM?

The JVM manages system memory and provides a portable execution environment for Java-based applications.

Technical definition: The JVM is the specification for a software program that executes code and provides the runtime environment for that code.
Everyday definition: The JVM is how we run our Java programs. We configure the JVM's settings and then rely on it to manage program resources during execution.

The Java Virtual Machine (JVM) is a program whose purpose is to execute other programs.
JVM

The JVM has two primary functions:

  1. To allow Java programs to run on any device or operating system (known as the "Write once, run anywhere" principle)
  2. To manage and optimize program memory

JVM view of the Spark Cluster: Drivers, Executors, Slots & Tasks

The Spark runtime architecture leverages JVMs:

Spark Physical Cluster, slots

And a slightly more detailed view:

Spark Physical Cluster, slots

Elements of a Spark application are in blue boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots are in white boxes.

Responsibilities of the client process component

The client process starts the driver program. For example, the client process can be a spark-submit script for running applications, a spark-shell script, or a custom application using the Spark API (such as the Databricks GUI - Graphical User Interface). The client process prepares the classpath and all configuration options for the Spark application. It also passes application arguments, if any, to the application running inside the driver.
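To make that concrete, here is a hedged sketch of a tiny driver program; the file name my_app.py, the input path, and the submit command in the comment are all hypothetical, but the flow (client process submits, arguments reach the code running inside the driver) is the one described above:

```python
# my_app.py - hypothetical driver program, launched by a client process such as:
#   spark-submit --master yarn my_app.py /data/input.csv
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    input_path = sys.argv[1]  # application argument passed through by the client process
    spark = SparkSession.builder.appName("client-process-demo").getOrCreate()

    df = spark.read.csv(input_path, header=True)
    print(df.count())   # the count is computed by executors; the result returns to the driver

    spark.stop()
```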

1A) Driver

The driver orchestrates and monitors execution of a Spark application. There’s always one driver per Spark application. You can think of the driver as a wrapper around the application.

The driver process runs our main() function, sits on a node in the cluster, and is responsible for:

  1. Maintaining information about the Spark Application
  2. Responding to a user’s program or input
  3. Requesting memory and CPU resources from cluster managers
  4. Breaking application logic into stages and tasks
  5. Sending tasks to executors
  6. Collecting the results from the executors

The driver process is absolutely essential - it’s the heart of a Spark Application and maintains all relevant information during the lifetime of the application.
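As a hedged sketch of the driver's role (assuming an existing SparkSession named spark), note how nothing runs on the executors until an action forces the driver to break the plan into stages and tasks and then collect the results:

```python
# Building the plan happens on the driver - no cluster work is triggered yet.
df = spark.range(0, 1_000_000)
doubled = df.selectExpr("id * 2 AS doubled")   # still only a logical plan

# The action below makes the driver split the job into stages/tasks, send the
# tasks to executor slots, and gather the (small) result back to the driver.
preview = doubled.limit(5).collect()
print(preview)
```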

  • The Driver is the JVM in which our application runs.
  • The secret to Spark's awesome performance is parallelism:
    • Scaling vertically (i.e. making a single computer more powerful by adding physical hardware) is limited by the finite amount of RAM, threads and CPU speed a single machine can offer, since a motherboard only has so many physical slots - whether in a data center or on a desktop.
    • Scaling horizontally (i.e. adding more identical machines to the Cluster) means we can simply keep adding new "nodes" to the cluster almost endlessly, because a data center can interconnect a practically unlimited number of machines.
  • We parallelize at two levels:
    • The first level of parallelization is the Executor - a JVM running on a node, typically one executor instance per node.
    • The second level of parallelization is the Slot - the number of which is determined by the number of cores and CPUs of each node/executor (see the short sketch after this list).
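A small sketch of how you can peek at these two levels from PySpark (assuming an existing SparkSession named spark; the config keys are standard Spark properties, but the values depend entirely on your cluster and may be unset in local mode):

```python
sc = spark.sparkContext

# Total parallelism Spark expects to have available (roughly executors x slots).
print(sc.defaultParallelism)

# Executor-level sizing is driven by configs like these; they may be "not set"
# when running in local mode.
print(spark.conf.get("spark.executor.instances", "not set"))
print(spark.conf.get("spark.executor.cores", "not set"))
```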

1B) Executor

The executors are responsible for actually executing the work that the driver assigns them. This means each executor is responsible for only two things:

  1. Executing code assigned to it by the driver
  2. Reporting the state of the computation on that executor back to the driver node (see the short sketch below)
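For instance, in the hedged sketch below (assuming an existing SparkSession named spark), the lambda is shipped by the driver to the executors, each of which runs it on its partitions and reports the partial results back:

```python
# Split the numbers 0..99 into 4 partitions -> up to 4 tasks can run in parallel.
rdd = spark.sparkContext.parallelize(range(100), numSlices=4)

# The lambda is serialized on the driver and executed by the executors,
# one task per partition; only the final sum travels back to the driver.
total = rdd.map(lambda x: x * x).sum()
print(total)   # 328350
```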

1C) Cores/Slots/Threads

  • Each Executor has a number of Slots to which the Driver can assign parallelized Tasks.
    • So for example:
      • If we have 3 identical home desktops (nodes) hooked up together in a LAN (like through your home router), each with i7 processors (8 cores), then that's a 3 node Cluster:
        • 1 Driver node
        • 2 Executor nodes
      • The 8 cores per Executor node mean 8 Slots, so the Driver can assign each executor up to 8 Tasks
        • The idea is that an i7 CPU Core is manufactured by Intel such that it can execute its own Task independently of the other Cores, so 8 Cores = 8 Slots = 8 Tasks in parallel

For example, the diagram below shows 2-core Executor nodes:

Spark Physical Cluster, tasks

  • The JVM is naturally multithreaded, but a single JVM, such as our Driver, has a finite upper limit on how much work it can do in parallel.
  • By creating Tasks, the Driver can assign units of work to Slots on each Executor for parallel execution.
  • Additionally, the Driver must also decide how to partition the data so that it can be distributed for parallel processing (see below).
    • Consequently, the Driver assigns a Partition of data to each Task - this way, each Task knows which piece of data it is to process.
    • Once started, each Task fetches its assigned Partition of data from the original data source (e.g. an Azure Storage Account) - see the sketch after this list.
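A hedged sketch of the task/partition relationship (assuming an existing SparkSession named spark; the partition counts shown are illustrative):

```python
df = spark.range(0, 1_000_000)

# Each partition becomes one task when an action runs; the driver assigns
# every task (and therefore its partition) to a free slot on some executor.
print(df.rdd.getNumPartitions())

# Changing the partition count changes how many tasks the next stage produces.
print(df.repartition(16).rdd.getNumPartitions())   # 16 partitions -> 16 tasks
```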

Note relating to Tasks, Slots and Cores

You can set the number of task slots to a value two or three times (i.e. to a multiple of) the number of CPU cores. Although these task slots are often referred to as CPU cores in Spark, they’re implemented as threads that run on a physical core's hardware threads and don’t need to correspond to the number of physical CPU cores on the machine (since different CPU manufacturers architect multi-threaded chips differently).

In other words:

  • Virtually all modern processors have multiple cores (e.g. 1 CPU = 8 Cores)
  • Most modern processors are multi-threaded (e.g. 1 Core = 2 Threads, so 8 Cores = 16 Threads)
  • A Spark Task runs in a Slot, and 1 Thread can work on 1 Task at a time. To make use of all the threads on the CPU, we set the number of Slots to a multiple of the number of Cores (which translates into multiple Threads).
    • This way, the Driver breaks a given command (e.g. DO STUFF FROM massive_table) down into Tasks and Partitions that are tailor-made for our particular Cluster configuration (say 4 nodes - 1 driver and 3 executors, 8 cores per node, 2 threads per core). By using the Cluster at maximum efficiency like this (utilizing all available threads), we get the massive command executed as fast as possible (with this Cluster: 3 * 8 * 2 Threads --> 48 Tasks and 48 Partitions, i.e. 1 Partition per Task).
    • If we don't do this - say all the data sits in a single Partition - then even with a 100-executor cluster the entire burden falls on 1 executor while the other 99 sit idle, i.e. slow execution.
    • Or say we instead assign 49 Tasks and 49 Partitions: the first pass executes 48 Tasks in parallel across the executors' slots (say in 10 minutes), and the 1 remaining Task then runs on a single slot for another 10 minutes while the other 47 slots sit idle - so the whole job takes double the time, 20 minutes. This is an inefficient use of our available resources, and is fixed by setting the number of tasks/partitions to a multiple of the number of slots we have (in this setup: 48, 96, etc.) - see the sketch below.
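A hedged sketch of putting that sizing rule into practice (assuming an existing SparkSession named spark and a table registered in the catalog as massive_table; the grouping column is hypothetical):

```python
sc = spark.sparkContext
slots = sc.defaultParallelism          # e.g. 48 for the 3-executor cluster above

df = spark.table("massive_table")      # assumes the table is registered in the catalog

# Keep every slot busy on each pass by using an exact multiple of the slot count.
df = df.repartition(slots * 2)

# Shuffles (groupBy, join, ...) get their own partition count - align it too.
spark.conf.set("spark.sql.shuffle.partitions", slots * 2)
result = df.groupBy("some_column").count()   # "some_column" is a hypothetical column
```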