%md
# Spark Developer Certification - Comprehensive Study Guide (python)
## What is this? <br>
While studying for the [Spark certification exam](https://academy.databricks.com/exams) and going through various resources available online, [I](https://www.linkedin.com/in/mdrakiburrahman/) thought it'd be worthwhile to put together a comprehensive knowledge dump that covers the entire syllabus end-to-end, serving as a Study Guide for myself and hopefully others. <br>
Note that I used inspiration from the [Spark Code Review guide](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3249544772526824/521888783067570/7123846766950497/latest.html) - but whereas that covers a subset of the coding aspects only, I aimed for this to be more of a *comprehensive, one stop resource geared towards passing the exam*.
%md
# Awesome Resources/References used throughout this guide <br>
## References
- **Spark Code Review used for inspiration**: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3249544772526824/521888783067570/7123846766950497/latest.html
- **Syllabus**: https://academy.databricks.com/exam/crt020-python
- **Data Scientists Guide to Apache Spark**: https://drive.google.com/open?id=17KMSllwgMvQ8cuvTbcwznnLB6-4Oen9T
- **JVM Overview**: https://www.javaworld.com/article/3272244/what-is-the-jvm-introducing-the-java-virtual-machine.html
- **Spark Runtime Architecture Overview:** https://freecontent.manning.com/running-spark-an-overview-of-sparks-runtime-architecture
- **Spark Application Overview:** https://docs.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_spark_apps.html
- **Spark Architecture Overview:** http://queirozf.com/entries/spark-architecture-overview-clusters-jobs-stages-tasks-etc
- **Mastering Apache Spark:** https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-scheduler-Stage.html
- **Manually create DFs:** https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393
- **PySpark SQL docs:** https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
- **Introduction to DataFrames:** https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html
- **PySpark UDFs:** https://changhsinlee.com/pyspark-udf/
- **ORC File:** https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
- **SQL Server Stored Procedures from Databricks:** https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
- **Repartition vs Coalesce:** https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
- **Partitioning by Columns:** https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-dynamic-partition-inserts.html
- **Bucketing:** https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html
- **PySpark GroupBy and Aggregate Functions:** https://hendra-herviawan.github.io/pyspark-groupby-and-aggregate-functions.html
- **Spark Quickstart:** https://spark.apache.org/docs/latest/quick-start.html
- **Spark Caching - 1:** https://unraveldata.com/to-cache-or-not-to-cache/
- **Spark Caching - 2:** https://stackoverflow.com/questions/45558868/where-does-df-cache-is-stored
- **Spark Caching - 3:** https://changhsinlee.com/pyspark-dataframe-basics/
- **Spark Caching - 4:** http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
- **Spark Caching - 5:** https://www.tutorialspoint.com/pyspark/pyspark_storagelevel.htm
- **Spark Caching - 6:** https://spark.apache.org/docs/2.1.2/api/python/_modules/pyspark/storagelevel.html
- **Spark SQL functions examples:** https://spark.apache.org/docs/2.3.0/api/sql/index.html
- **Spark Built-in Higher Order Functions Examples:** https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html
- **Spark SQL Timestamp conversion:** https://docs.databricks.com/_static/notebooks/timestamp-conversion.html
- **RegEx Tutorial:** https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
- **Rank VS Dense Rank:** https://stackoverflow.com/questions/44968912/difference-in-dense-rank-and-row-number-in-spark
- **SparkSQL Windows:** https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
- **Spark Certification Study Guide:** https://github.com/vivek-bombatkar/Databricks-Apache-Spark-2X-Certified-Developer
## Resources
#### PySpark Cheatsheet
<br>
%md
# 1. Spark Architecture Components
Candidates are expected to be familiar with the following architectural components and their relationship to each other:
%md
### Spark Basic Architecture
A *cluster*, or group of machines, pools the resources of many machines together, allowing us to use all of the cumulative
resources as if they were one. A group of machines sitting somewhere on its own is not powerful, though; you need a framework to coordinate
work across them. **Spark** is a tailor-made engine for exactly this, managing and coordinating the execution of tasks on data across a
cluster of computers. <br>
The cluster of machines that Spark leverages to execute tasks is managed by a cluster manager like Spark's
Standalone cluster manager, _**YARN**_ (**Y**et **A**nother **R**esource **N**egotiator), or [_**Mesos**_](http://mesos.apache.org/). We then submit Spark Applications to these cluster managers, which
grant resources to our application so that we can complete our work. <br>
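As a quick, hedged illustration of how a PySpark application attaches to a cluster manager (the app name and master URL below are placeholders - on Databricks, or when launching via `spark-submit`, the master is usually supplied for you):

```python
from pyspark.sql import SparkSession

# Minimal sketch: build (or get) the SparkSession for this application.
# "local[4]" runs everything in one JVM; on a real cluster this would be
# e.g. "yarn" or "spark://host:7077", typically set by spark-submit or the platform.
spark = (
    SparkSession.builder
    .appName("architecture-demo")      # hypothetical app name
    .master("local[4]")
    .getOrCreate()
)

print(spark.sparkContext.master)       # which cluster manager / master this app targets
```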
### Spark Applications
<br>
Spark Applications consist of a **driver** process and a set of **executor** processes. In the illustration above, our driver is on the left and the four executors are on the right. <br>
### What is a JVM?
*The JVM manages system memory and provides a portable execution environment for Java-based applications* <br>
**Technical definition:** The JVM is the specification for a software program that executes code and provides the runtime environment for that code. <br>
**Everyday definition:** The JVM is how we run our Java programs. We configure the JVM's settings and then rely on it to manage program resources during execution. <br>
The **Java Virtual Machine (JVM)** is a program whose purpose is to execute other programs. <br>

The JVM has **two primary functions**:
1. To allow Java programs to run on any device or operating system (known as the _"Write once, run anywhere"_ principle)
2. To manage and optimize program memory <br>
### JVM view of the Spark Cluster: *Drivers, Executors, Slots & Tasks*
The Spark runtime architecture leverages JVMs:<br><br>
<br>
And a slightly more detailed view:<br><br>
<br>
_Elements of a Spark application are in blue boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots are in white boxes._
#### Responsibilities of the client process component
The **client** process starts the **driver** program. For example, the client process can be a `spark-submit` script for running applications, a `spark-shell` script, or a custom application using the Spark API (like this Databricks **GUI** - **G**raphical **U**ser **I**nterface). The client process prepares the classpath and all configuration options for the Spark application. It also passes application arguments, if any, to the application running inside the **driver**.
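For example, when an application is launched with `spark-submit`, the arguments supplied by the client land in the driver program. A minimal sketch (the file name and argument below are made up for illustration):

```python
# example_app.py - a hypothetical PySpark driver program, launched by a client process:
#   spark-submit example_app.py /path/to/input
import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("client-process-demo").getOrCreate()

    # Application arguments passed along by the client process arrive via sys.argv
    input_path = sys.argv[1] if len(sys.argv) > 1 else "(no argument supplied)"
    print("Argument passed from the client:", input_path)

    # Configuration options prepared by the client are visible to the driver
    for key, value in spark.sparkContext.getConf().getAll():
        print(key, "=", value)

    spark.stop()
```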
%md
The **driver** orchestrates and monitors execution of a Spark application. There’s always **one driver per Spark application**. You can think of the driver as a wrapper around the application.<br>
The **driver** process runs our `main()` function, sits on a node in the cluster, and is responsible for: <br>
1. Maintaining information about the Spark Application
2. Responding to a user’s program or input
3. Requesting memory and CPU resources from cluster managers
4. Breaking application logic into stages and tasks
5. Sending tasks to executors
6. Collecting the results from the executors
The driver process is absolutely essential - it’s the heart of a Spark Application and
maintains all relevant information during the lifetime of the application.
* The **Driver** is the JVM in which our application runs.
* The secret to Spark's awesome performance is parallelism:
* Scaling **vertically** (_i.e. making a single computer more powerful by adding physical hardware_) is limited by finite RAM, thread counts and CPU speeds, since motherboards in Data Centers/Desktops have a limited number of physical slots.
* Scaling **horizontally** (_i.e. throwing more identical machines into the Cluster_) means we can simply keep adding new "nodes" to the cluster, because a Data Center can theoretically interconnect an almost unlimited number of machines
* We parallelize at two levels:
* The first level of parallelization is the **Executor** - a JVM running on a node, typically, **one executor instance per node**.
* The second level of parallelization is the **Slot** - the number of which is **determined by the number of cores and CPUs of each node/executor** (see the sketch below).
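Tying the driver's responsibilities and the two levels of parallelism to code: in this minimal sketch the driver only defines the work, Spark splits it into tasks that run in executor slots, and the aggregated result comes back to the driver (the numbers are purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-demo").getOrCreate()
sc = spark.sparkContext

# Roughly the total number of slots available to this application,
# i.e. how many tasks Spark will try to run in parallel by default.
print("Default parallelism (slots):", sc.defaultParallelism)

# The driver only *defines* the computation here ...
rdd = sc.parallelize(range(1_000_000), numSlices=sc.defaultParallelism)
squared_sum = rdd.map(lambda x: x * x).sum()   # ... executors do the work as tasks

# ... and the driver receives the aggregated result back.
print("Sum of squares:", squared_sum)
```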
%md
The **executors** are responsible for actually executing the work that the **driver** assigns them (a small sketch follows the list below). This means each
executor is responsible for only two things:<br>
1. Executing code assigned to it by the driver
2. Reporting the state of the computation, on that executor, back to the driver node
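A minimal sketch of that division of labour: the function below is serialized by the driver and shipped to the executors, which run it on their partitions and send the results back (the host-name tagging is only there to make the executor-side execution visible):

```python
import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("executor-demo").getOrCreate()
sc = spark.sparkContext

def tag_with_host(partition):
    # This runs on the executors, so gethostname() reports the executor's host
    # (on a single-node/local setup it will match the driver's host).
    host = socket.gethostname()
    return [(host, value) for value in partition]

# The driver assigns the work; each executor processes its own partitions
# and the results are reported back to the driver via collect().
results = sc.parallelize(range(8), numSlices=4).mapPartitions(tag_with_host).collect()
print(results)
```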
%md
* Each **Executor** has a number of **Slots** to which parallelized **Tasks** can be assigned by the **Driver**.
* So for example:
* If we have **3** identical home desktops (*nodes*) hooked up together in a LAN (like through your home router), each with i7 processors (**8** cores), then that's a **3** node Cluster:
* **1** Driver node
* **2** Executor nodes
* The **8 cores per Executor node** means **8 Slots**, meaning the driver can assign each executor up to **8 Tasks**
* The idea is that an i7 CPU Core is manufactured by Intel such that it is capable of executing its own Task independently of the other Cores, so **8 Cores = 8 Slots = 8 Tasks in parallel**<br>
_For example: the diagram below is showing 2 Core Executor nodes:_ <br><br>
<br><br>
* The JVM is naturally multithreaded, but a single JVM, such as our **Driver**, has a finite upper limit.
* By creating **Tasks**, the **Driver** can assign units of work to **Slots** on each **Executor** for parallel execution.
* Additionally, the **Driver** must also decide how to partition the data so that it can be distributed for parallel processing (see below).
* Consequently, the **Driver** is assigning a **Partition** of data to each task - in this way each **Task** knows which piece of data it is to process.
* Once started, each **Task** will fetch from the original data source (e.g. An Azure Storage Account) the **Partition** of data assigned to it.
#### Note relating to Tasks, Slots and Cores
You can set the number of task slots to a value two or three times (**i.e. to a multiple of**) the number of CPU cores. Although these task slots are often referred to as CPU cores in Spark, they’re implemented as **threads** that work on a **physical core's thread** and don’t need to correspond to the number of physical CPU cores on the machine (since different CPU manufacturers can architect multi-threaded chips differently). <br>
In other words:
* All processors of today have multiple cores (e.g. 1 CPU = 8 Cores)
* Most processors of today are multi-threaded (e.g. 1 Core = 2 Threads, 8 cores = 16 Threads)
* A Spark **Task** runs on a **Slot**. **1 Thread** is capable of doing **1 Task** at a time. To make use of all our threads on the CPU, we cleverly assign the **number of Slots** to correspond to a **multiple of the number of Cores** (which translates to multiple Threads).
* _By doing this_, the Driver breaks down a given command (`DO STUFF FROM massive_table`) into **Tasks** and **Partitions** that are tailor-made to fit our particular Cluster Configuration (say _4 nodes - 1 driver and 3 executors, 8 cores per node, 2 threads per core_). By using our Cluster at maximum efficiency like this (utilizing all available threads), we get our massive command executed as fast as possible (given our Cluster in this case, _3\*8\*2 Threads --> **48** Tasks, **48** Partitions_ - i.e. **1** Partition per Task)
* _Say we don't do this_: even with a 100 executor cluster, the entire burden could fall on 1 executor while the other 99 sit idle - i.e. slow execution.
* _Or say we instead foolishly assign **49** Tasks and **49** Partitions_: the first pass would execute **48** Tasks in parallel across the executor cores (say in **10 minutes**), then the **1** remaining Task in the next pass would execute on **1** core for another **10 minutes**, while the remaining **47** cores sit idle - meaning the whole job takes double the time at **20 minutes**. This is obviously an inefficient use of our available resources, and can be avoided by setting the number of tasks/partitions to a multiple of the number of cores we have (in this setup - 48, 96, etc.; see the sketch below).
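The same reasoning expressed in code: check how many slots the application has, then size the number of partitions to a multiple of that figure (the DataFrame and the multiplier below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()
sc = spark.sparkContext

slots = sc.defaultParallelism            # e.g. 48 on the 3-executor cluster described above
df = spark.range(10_000_000)             # stand-in for "massive_table"

print("Slots available:  ", slots)
print("Partitions before:", df.rdd.getNumPartitions())

# Repartition to a multiple of the available slots so that every pass
# through the data keeps all cores busy (no near-idle final pass).
df = df.repartition(slots * 2)
print("Partitions after: ", df.rdd.getNumPartitions())
```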