Getting Started with PySpark for Big Data Analytics using Jupyter Notebooks and Jupyter Docker Stacks

Gary A. Stafford
16 min read · Nov 22, 2018

An updated version of this popular post is published in Towards Data Science: Getting Started with Data Analytics using Jupyter Notebooks, PySpark, and Docker

Introduction

There is little question that big data analytics, data science, artificial intelligence (AI), and machine learning (ML), a subcategory of AI, have all experienced a tremendous surge in popularity over the last few years. Behind the hype curves and marketing buzz, these technologies are having a significant influence on many aspects of our modern lives. Due to their popularity and potential benefits, academic institutions and commercial enterprises are rushing to train large numbers of Data Scientists and ML and AI Engineers.

Search results courtesy of Google Trends (https://trends.google.com)

Learning popular languages and platforms, such as Python, Scala, R, Apache Hadoop, Apache Spark, and Apache Kafka, requires working with multiple complex technologies. Installing, configuring, and managing these technologies often demands an advanced level of familiarity with Linux, distributed systems, cloud- and container-based platforms, databases, and data-streaming applications. These barriers may prove a deterrent to Students, Mathematicians, Statisticians, and Data Scientists.
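This is precisely the friction that Jupyter Docker Stacks aim to remove. As a minimal sketch, and not necessarily the exact setup used later in this post, a single Docker command can start a container with Jupyter, Python, R, Scala, and Apache Spark preinstalled, using the project's jupyter/all-spark-notebook image:

# Minimal sketch: run the Jupyter Docker Stacks all-spark-notebook image,
# exposing the Jupyter server on port 8888 (image name per the official project)
docker run -it --rm -p 8888:8888 jupyter/all-spark-notebook

On startup, the Jupyter server logs a tokenized URL (e.g., http://localhost:8888/?token=…) to open in a browser.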
