Get Started with PySpark and Jupyter Notebook

Apache Spark is a must-have for anyone who loves Big Data. In a few words, Spark is a fast and powerful framework that provides an API to perform massively distributed processing over resilient distributed datasets.

Jupyter Notebook is a popular application that enables you to edit, run and share Python code from a web browser. It allows you to modify and re-execute parts of your code in a very flexible way, which makes Jupyter a great tool for testing and prototyping programs.

Install Jupyter Notebook

Please refer to my previous blog post, “Remote Access to IPython Notebooks via SSH”.

Install PySpark

Refer to “Get Started with PySpark and Jupyter Notebook in 3 Minutes”.

Before installing PySpark, make sure you have Java 8 or higher installed on your computer. Of course, you will also need Python.

First of all, visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly. Extract the archive and move it to your /opt folder:

$ tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
$ mv spark-2.2.0-bin-hadoop2.7 /opt/spark-2.2.0

Create a symbolic link:

$ ln -s /opt/spark-2.2.0 /opt/spark

This way, you will be able to install multiple Spark versions side by side and switch between them by updating the symbolic link.

Finally, tell your shell (bash, zsh, etc.) where to find Spark. To do so, configure your $PATH by adding the following lines to your ~/.bashrc (or ~/.zshrc) file:

export PYTHON_HOME=/opt/python2.7
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PYTHON_HOME/bin:$PATH

PySpark in Jupyter

Configure the PySpark driver to use Jupyter Notebook: running pyspark will then automatically open a Jupyter Notebook.

Update the PySpark driver environment variables by adding these lines to your ~/.bashrc (or ~/.zshrc) file:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --allow-root --no-browser --port=8889"

Restart your terminal and launch PySpark again:

$ pyspark --master local[20]

Now, this command should start a Jupyter Notebook in your web browser.
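To verify that everything is wired up correctly, you can run a quick sanity check in the first notebook cell. Below is a minimal sketch (the sample size is arbitrary) that uses the sc SparkContext provided by the PySpark shell to estimate Pi:

import random

num_samples = 1000000

def inside(_):
    # Draw a random point in the unit square and test whether it
    # falls inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x * x + y * y < 1

# Distribute the samples across the local[20] cores and count the hits.
count = sc.parallelize(range(num_samples)).filter(inside).count()
print(4.0 * count / num_samples)

The result should be roughly 3.14, and while the job is running you can watch its progress in the Spark UI on port 4040.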

Remote Access

On the local machine, start an SSH tunnel to access Jupyter (local port 8888 is forwarded to port 8889 on the remote host, where the notebook server listens):

local_user@local_host$ ssh -N -f -L localhost:8888:localhost:8889 remote_user@remote_host

On the local machine, start a second SSH tunnel to access the Spark UI and monitor Spark jobs (port 4040):

local_user@local_host$ ssh -N -f -L localhost:4040:localhost:4040 remote_user@remote_host

Spark API

The notebook started by pyspark comes with several objects already defined. Evaluating sc in a cell shows the SparkContext summary: a link to the Spark UI, Version v2.2.0, Master local[20], and AppName PySparkShell. The SQL entry points sqlContext, sql, spark, and sqlCtx are also available for working with DataFrames and SQL queries.
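
As a quick illustration, here is a minimal sketch of how these objects can be used; the people view and the sample rows are hypothetical examples, not part of the shell setup:

# Build a small DataFrame through the spark SparkSession.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],   # hypothetical sample rows
    ["name", "age"],
)
df.show()

# Register it as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()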