Enabling and using PySpark with Jupyter and Anaconda

I remember it took me some time to get this configured when I first started trying out Jupyter and Spark. Hopefully this is helpful for others. This works for Hadoop 2.6.0-CDH5.9.1 and Spark 1.6.0 using Python 2.7 or Python 3; for other versions, adjust the paths accordingly. Basically, you just need to tell Spark four things:

  • The location of your (Ana)conda installation
  • The location of your Jupyter installation and its configuration
  • The location of your Python installation
  • Resources your Spark executors need

Type the following from your bash terminal (if you are using Cloudera, this would be your edge node; otherwise, any server where you can run the jupyter notebook and pyspark commands).

#Specify your (Ana)conda installation path if you haven't already done so.
#Change the (Ana)conda path accordingly if you are using Python 3;
#usually it's Anaconda3-X.X.X/bin.
export PATH=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/:$PATH
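#Optional sanity check: confirm the shell now resolves python and jupyter
#from the Anaconda parcel before continuing.
which python jupyter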
 
#Assuming you have already installed Jupyter in your (Ana)conda installation,
#specify the path to the PySpark driver.
#Change the (Ana)conda path accordingly if you are using Python 3;
#usually it's Anaconda3-X.X.X/bin.
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/jupyter
 
#Specify the IP and port configuration for your Jupyter notebook.
#The IP address should be that of a server where you can run the "jupyter notebook" command successfully.
#Make sure the port number you pick isn't a privileged port (a port number below 1024)
#or a port that is already in use.
#See the end of this post for a command that lists the port numbers currently in use.
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='x.x.x.x' --NotebookApp.port=8667"
 
#Specify your Python configuration.
#Change the (Ana)conda path accordingly if you are using Python 3;
#usually it's Anaconda3-X.X.X/bin.
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/python
 
#Specify your Spark master URL and executor configuration.
#Replace the master URL with your own (the default port for a standalone Spark master is 7077).
#Adjust the number of executors, memory, and default parallelism based on what your cluster has available.
MASTER=spark://master:8088 pyspark --num-executors 2 --executor-memory=6G --conf spark.default.parallelism=15
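
If your cluster runs Spark on YARN rather than a standalone master (the usual setup on CDH), you can launch against YARN instead. This is a sketch using Spark 1.6's yarn-client master string; the executor numbers are illustrative only, not a recommendation:

#Alternative: run the same notebook-backed session on YARN in client mode
pyspark --master yarn-client --num-executors 2 --executor-memory=6G --conf spark.default.parallelism=15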

It’s probably easiest to just create an executable file for the above commands and run it whenever you need Jupyter with Spark. To do so, paste the above commands into a text file (e.g. runjupyspark.sh) and make it executable:

chmod +x runjupyspark.sh
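
For reference, the complete script might look like this (a sketch that assumes the same Anaconda parcel path, notebook IP/port placeholders, and master URL used above; substitute your own values):

#!/bin/bash
#runjupyspark.sh - start a Jupyter notebook server backed by PySpark

#(Ana)conda installation path
export PATH=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/:$PATH

#Use Jupyter as the PySpark driver and configure the notebook server
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='x.x.x.x' --NotebookApp.port=8667"

#Python interpreter for the Spark executors
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/python

#Launch PySpark with the Jupyter driver
MASTER=spark://master:8088 pyspark --num-executors 2 --executor-memory=6G --conf spark.default.parallelism=15

Then launch it with:

./runjupyspark.sh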

Once you execute the commands above (or the executable), you should see the Jupyter notebook server telling you how to access it. The message should be something similar to:
The Jupyter Notebook is running at: http://x.x.x.x:xxxx/
Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Launch your web browser and access the URL. Try creating a new notebook.

[Image: Creating a new Jupyter notebook using PySpark]

Then try executing this command in a cell (the parentheses keep it valid in both Python 2 and Python 3):

print(sc.version)

It should print the version of Spark you are running.
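
As a further sanity check that work is actually being distributed to the executors (not just that the driver is up), you can run a tiny job in another cell; this sketch assumes the default SparkContext sc that PySpark creates for you:

# Distribute the numbers 0-99, square them on the executors, and sum the results
squares = sc.parallelize(range(100)).map(lambda x: x * x)
print(squares.sum())  # should print 328350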

[Image: Running Spark using a Jupyter notebook]

If you are having trouble choosing an open port, use the following command to list all the ports that are currently in use. You should NOT use any of these ports.

netstat -tlp | grep LISTEN | awk '{print $4}' | grep -o ":[[:digit:]]\{1,\}" | tr -d ":" | sort -n
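
If netstat isn't available (some newer distributions ship iproute2 instead), a roughly equivalent pipeline using ss might look like this:

#List listening TCP port numbers using ss instead of netstat
ss -tln | awk 'NR>1 {print $4}' | grep -o "[[:digit:]]\{1,\}$" | sort -n | uniq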
