Getting Started with PySpark for Big Data Analytics using Jupyter Notebooks and Jupyter Docker Stacks

An updated version of this popular post is published in Towards Data Science: Getting Started with Data Analytics using Jupyter Notebooks, PySpark, and Docker

Introduction

Search results courtesy GoogleTrends (https://trends.google.com)

Learning popular programming languages and frameworks, such as Python, Scala, R, Apache Hadoop, Apache Spark, and Apache Kafka, requires the use of multiple complex technologies. Installing, configuring, and managing these technologies often demands an advanced level of familiarity with Linux, distributed systems, cloud- and container-based platforms, databases, and data-streaming applications. These barriers may prove a deterrent to students, mathematicians, statisticians, and data scientists.

Search results courtesy GoogleTrends (https://trends.google.com)

Driven by the explosive growth of these technologies and the need to train individuals, many commercial enterprises are lowering the barriers to entry, making it easier to get started. The three major cloud providers, AWS, Azure, and Google Cloud, all have multiple Big Data-, AI-, and ML-as-a-Service offerings.

Similarly, many open-source projects are also lowering the barriers to entry into these technologies. An excellent example of an open-source project working on this challenge is Project Jupyter. Similar to the Spark Notebook and Apache Zeppelin projects, Jupyter Notebooks enable data-driven, interactive, and collaborative data analytics with Julia, Scala, Python, R, and SQL.

This post will demonstrate the creation of a containerized development environment, using Jupyter Docker Stacks. The environment will be suited for learning and developing applications for Apache Spark, using the Python, Scala, and R programming languages. This post is not intended to be a tutorial on Spark, PySpark, or Jupyter Notebooks.

Featured Technologies


Jupyter Notebooks

Search results courtesy GoogleTrends (https://trends.google.com)

Jupyter Docker Stacks

Apache Spark

Spark’s polyglot programming model allows users to write applications quickly in Scala, Java, Python, R, and SQL. Spark includes libraries for Spark SQL (DataFrames and Datasets), MLlib (machine learning), GraphX (graph processing), and Spark Streaming (DStreams). You can run Spark using its standalone cluster mode, on Amazon EC2, Apache Hadoop YARN, Mesos, or Kubernetes.

PySpark
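PySpark is the Python API for Spark. As a point of reference, a minimal PySpark application might look something like the following sketch (illustrative only; this example is not part of the demonstration project):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for DataFrame and SQL APIs
spark = SparkSession.builder \
    .appName('pyspark_sketch') \
    .getOrCreate()

# Build a small DataFrame and query it with Spark SQL
df = spark.createDataFrame(
    [('bread', 2), ('coffee', 5), ('tea', 3)],
    ['item', 'quantity'])
df.createOrReplaceTempView('orders')

spark.sql('SELECT item, quantity FROM orders ORDER BY quantity DESC').show()

spark.stop()

The same code can run interactively in a notebook cell, be executed with the python command, or be submitted to a cluster with spark-submit.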

Docker


Docker Swarm

PostgreSQL

Demonstration

Architecture


Source Code

git clone \
--branch master --single-branch --depth 1 --no-tags \
https://github.com/garystafford/pyspark-setup-demo.git

Source code samples are displayed as GitHub Gists, which may not display correctly on some mobile and social media browsers.

Deploy Docker Stack

The Jupyter container’s working directory is set on line 10 of the stack.yml file, working_dir: /home/$USER/work. The local, bind-mounted working directory is $PWD/work. This path is bind-mounted to the working directory in the Jupyter container, on line 24 of the stack.yml file, $PWD/work:/home/$USER/work. The PWD environment variable assumes you are working on Linux or macOS (the equivalent on Windows is CD).

By default, the user within the Jupyter container is jovyan. Optionally, I have chosen to override that user with my own local host’s user account, as shown on line 16 of the stack.yml file, NB_USER: $USER. I have used the macOS host’s USER environment variable value (equivalent to USERNAME on Windows). There are many options for configuring the Jupyter container, detailed here. Several of those options are shown on lines 12-18 of the stack.yml file (gist).
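For illustration only, an abbreviated stack.yml along these lines might contain a service definition similar to the sketch below. The exact contents, line numbers, and option values of the project’s stack.yml will differ; refer to the gist and the repository for the real file.

# illustrative excerpt only - not the project's actual stack.yml
version: "3.7"
services:
  pyspark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work      # Jupyter container working directory
    environment:
      NB_USER: $USER                   # override the default jovyan user
      CHOWN_HOME: "yes"                # one of several optional settings
    user: root                         # required for the NB_USER override to take effect
    volumes:
      - $PWD/work:/home/$USER/work     # bind-mount the local work directory
    networks:
      - pyspark-net
networks:
  pyspark-net:
    driver: overlay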

Assuming you have a recent version of Docker installed on your local development machine, and running in swarm mode, standing up the stack is as easy as running the following command from the root directory of the project:

docker stack deploy -c stack.yml pyspark

The Docker stack consists of a new overlay network, pyspark-net, and three containers. To confirm the stack deployed, you can run the following command:

docker stack ps pyspark --no-trunc

Note that the jupyter/all-spark-notebook Docker image is quite large. Depending on your Internet connection, if this is the first time you have pulled this image, the stack may take several minutes to enter a running state.

To access the Jupyter Notebook application, you need to obtain the Jupyter URL and access token (read more here). This information is output in the Jupyter container log, which can be accessed with the following command:

docker logs $(docker ps | grep pyspark_pyspark | awk '{print $NF}')

Using the URL and token shown in the log output, you will be able to access the Jupyter web-based user interface on localhost port 8888. Once there, from the Jupyter dashboard landing page, you should see all the files in the project’s work/ directory.

Note the types of files you are able to create from the dashboard, including Python 3, R, Scala (using Toree or spylon-kernel), and text. You can also open a Jupyter Terminal or create a new folder.


Running Python Scripts

Run the script from within the Jupyter container, from a Jupyter Terminal window:

python ./01_simple_script.py

You should observe the following output.


Kaggle Datasets

For this demonstration, I chose the ‘Transactions from a Bakery’ dataset from Kaggle.


The dataset contains 21,294 rows, each with four columns of data. Although certainly nowhere near ‘big data’, the dataset is large enough to test out the Jupyter container functionality (gist).
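As a rough sketch of what loading the dataset into Spark involves (assuming the CSV file is named BreadBasket_DMS.csv and sits in the work/ directory; the project’s actual script may differ), PySpark can read the file directly into a DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('bakery_csv').getOrCreate()

# Read the Kaggle CSV into a DataFrame, inferring the schema from the data
df_bakery = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('BreadBasket_DMS.csv')

df_bakery.printSchema()
print('Row count:', df_bakery.count())
df_bakery.show(10, truncate=False)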

Submitting Spark Jobs

Run the script directly from a Jupyter Terminal window:

python ./02_bakery_dataframes.py

An example of the output of the Spark job is shown below. At the time of this post, the latest jupyter/all-spark-notebook Docker image runs Spark 2.4.3, Scala 2.11.12, and Java 1.8.0_191 (OpenJDK).


More typically, you would submit the Spark job, using the spark-submit command. Use a Jupyter Terminal window to run the following command:

$SPARK_HOME/bin/spark-submit 02_bakery_dataframes.py

Below, we see the beginning of the output from Spark, using the spark-submit command.


Below, we see the scheduled tasks executing and the output of the print statement, displaying the top 10 rows of bakery data.


Interacting with Databases

To demonstrate the flexibility of the Jupyter Docker Stacks to work with databases, I have added PostgreSQL to the Docker Stack. We can read and write data from the Jupyter container to the PostgreSQL instance, running in a separate container.

To begin, we will run a Python script that executes a set of SQL statements to create our database schema and add some test data to a new database table. To do so, we will need to install the psycopg2 package into the Jupyter container. You can use the docker exec command from your terminal. Alternatively, since your user has administrative access within the Jupyter container, you can install Python packages from a Jupyter Terminal window. Both pip and conda are available to install packages, see details here.

Run the following command to install psycopg2:

# using pip
docker exec -it \
$(docker ps | grep pyspark_pyspark | awk '{print $NF}') \
pip install psycopg2-binary

This package gives Python the ability to interact with PostgreSQL. The included Python script, 03_load_sql.py, will execute a set of SQL statements, contained in the SQL file bakery_sample.sql, against the PostgreSQL container instance.
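As a rough idea of how such a loader can be written with psycopg2 (a hedged sketch, not the actual 03_load_sql.py; the real connection values are defined in stack.yml):

import psycopg2

# Assumed connection settings; substitute the credentials from stack.yml
conn = psycopg2.connect(
    host='postgres',          # service name of the PostgreSQL container
    port=5432,
    dbname='demo',
    user='postgres',
    password='change_me')     # placeholder - use the password from stack.yml

with conn, conn.cursor() as cursor:
    # Read bakery_sample.sql and execute its statements against the database
    with open('bakery_sample.sql', 'r') as sql_file:
        cursor.execute(sql_file.read())

conn.close()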

To execute the script, run the following command:

python ./03_load_sql.py

This should result in the following output, if successful.


To confirm the SQL script’s success, I have included Adminer. Adminer (formerly phpMinAdmin) is a full-featured database management tool written in PHP. Adminer natively recognizes PostgreSQL, MySQL, SQLite, and MongoDB, among other database engines.

Adminer should be available on localhost port 8080. The password credentials, shown below, are available in the stack.yml file. The server name, postgres, is the name of the PostgreSQL container. This is the domain name the Jupyter container will use to communicate with the PostgreSQL container.


Connecting to the demo database with Adminer, we should see the bakery_basket table. The table should contain three rows of data, as shown below.


Developing Jupyter Notebooks

To see the power of Jupyter Notebooks, I have written a basic notebook document, 04_pyspark_demo_notebook.ipynb. The document performs some typical PySpark functions, such as loading data from a CSV file and from the PostgreSQL database, performing some basic data analytics with Spark SQL, graphing the data using BokehJS, and finally, saving data back to the database, as well as to the popular Apache Parquet file format. Below we see the notebook document, using the Jupyter Notebook user interface.
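As a rough sketch of the PostgreSQL and Parquet portions of such a notebook (assuming a PostgreSQL JDBC driver JAR, here called postgresql-42.2.8.jar, is available to Spark; the actual notebook code and connection values may differ):

from pyspark.sql import SparkSession

# Make the PostgreSQL JDBC driver available on the Spark classpath
spark = SparkSession.builder \
    .appName('bakery_notebook_sketch') \
    .config('spark.jars', 'postgresql-42.2.8.jar') \
    .getOrCreate()

# Assumed JDBC connection properties; actual values come from stack.yml
url = 'jdbc:postgresql://postgres:5432/demo'
properties = {
    'driver': 'org.postgresql.Driver',
    'user': 'postgres',
    'password': 'change_me',   # placeholder - use the password from stack.yml
}

# Load the bakery_basket table from PostgreSQL into a DataFrame
df_sql = spark.read.jdbc(url=url, table='bakery_basket', properties=properties)
df_sql.show(10)

# Save the results to the Apache Parquet format
df_sql.write.parquet('output/bakery_parquet', mode='overwrite')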


PostgreSQL Driver

PyCharm


As mentioned earlier, a key feature of Jupyter Notebooks is their ability to save the output from each cell as part of the notebook document. Below, we see the notebook document on GitHub, with the output saved as part of the document. Not only can you distribute the notebook document, but you can also preserve and share the output from each cell.


Using Additional Packages

Shown below, we use Plotly to construct a bar chart of daily bakery items sold for the year 2017 based on the Kaggle dataset. The chart uses SciPy and NumPy to construct a linear fit (regression) and plot a line of best fit to the bakery data. The chart also uses SciPy’s Savitzky-Golay Filter to plot the second line, illustrating a smoothing of our bakery data.
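A hedged sketch of this kind of chart, using synthetic stand-in data rather than the actual aggregated bakery totals, might look something like the following:

import numpy as np
import plotly.graph_objects as go
from scipy import signal

# Synthetic stand-in for the aggregated count of bakery items sold per day
rng = np.random.default_rng(42)
days = np.arange(180)
daily_totals = 120 + 0.2 * days + rng.normal(0, 15, size=days.size)

# Linear fit (regression) through the daily totals
slope, intercept = np.polyfit(days, daily_totals, 1)
best_fit = slope * days + intercept

# Savitzky-Golay filter to smooth the daily totals
smoothed = signal.savgol_filter(daily_totals, window_length=23, polyorder=3)

fig = go.Figure([
    go.Bar(x=days, y=daily_totals, name='Items Sold'),
    go.Scatter(x=days, y=best_fit, name='Linear Fit'),
    go.Scatter(x=days, y=smoothed, name='Savitzky-Golay Smoothing'),
])
fig.show()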


Plotly also provides Chart Studio, an online chart maker, which Plotly describes as the world’s most sophisticated editor for creating D3.js and WebGL charts. Shown below, we have the ability to enhance, stylize, and share our bakery data visualization using the free version of Chart Studio Cloud.


nbviewer


Monitoring Spark Jobs


We can review details of each stage of the Spark job, including a visualization of the DAG, which Spark constructs as part of the job execution plan, using the DAG Scheduler.


We can also review the timing of each event, occurring as part of the stages of the Spark job.


We can also use the Spark interface to review and confirm the runtime environment, including versions of Java, Scala, and Spark, as well as packages available on the Java classpath.


Spark Performance


We can use the docker stats command to examine the container’s CPU and memory metrics:

docker stats \
--format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

Below, we see the stats from the stack’s three containers immediately after being deployed, showing little or no activity. Here, Docker has been allocated 2 CPUs, 3 GB of RAM, and 2 GB of swap space from the host machine.


Compare the stats above with the same three containers while the example notebook document is running on Spark. CPU usage spikes, but memory usage appears to stay within an acceptable range.


Linux top and htop

docker exec -it \
$(docker ps | grep _pyspark | awk '{print $NF}') \
top -o %CPU

With top, we can observe the individual performance of each process running in the Jupyter container.


Lastly, htop, an interactive process viewer for Unix, can be installed into the container and run with the following set of bash commands, from a Jupyter Terminal window or using docker exec:

docker exec -it \
$(docker ps | grep _pyspark | awk '{print $NF}') \
sh -c "apt-get update && apt-get install htop && htop --sort-key PERCENT_CPU"

With htop, we can observe individual CPU activity. The two CPUs at the top left of the htop window are the two CPUs assigned to Docker. We get insight into the way Docker is using each CPU, as well as other basic performance metrics, like memory and swap.


Assuming your development machine host has them available, it is easy to allocate more compute resources to Docker if required. However, in my opinion, this stack is optimized for development and learning, using reasonably sized datasets for data analysis and ML. It should not be necessary to allocate excessive resources to Docker, possibly starving your host machine’s own compute capabilities.


Conclusion

Originally published at programmaticponderings.com on November 20, 2018. The post and project code were updated on September 28, 2019.

All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.

