I remember it took me some time to get this configured when I first started trying out Jupyter and Spark, so hopefully this is helpful for others. This works for Hadoop 2.6.0-CDH5.9.1 and Spark 1.6.0, using Python 2.7 and Python 3. For other versions, you need to adjust the paths accordingly. Basically, you just need to tell Spark four things:
- The location of your (Ana)conda installation
- The location of your Jupyter installation and its configuration
- The location of your Python installation
- Resources your Spark executors need
Type the following in your bash terminal. (If you are using Cloudera, this would be your edge node; if not, it is a server where you can run the jupyter notebook and pyspark commands.)
# Specify your (Ana)conda installation path if you haven't already done so.
# Change the (Ana)conda path accordingly if you are using Python 3;
# usually it's Anaconda3-X.X.X/bin.
export PATH=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/:$PATH

# Assuming you have already installed Jupyter in your (Ana)conda installation,
# specify the path to the PySpark driver (again, change the (Ana)conda path
# accordingly if you are using Python 3).
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/jupyter

# Specify the IP and port configuration for your Jupyter notebook.
# The IP address should be that of a server where you can run the
# "jupyter notebook" command successfully.
# Make sure the port number you pick isn't a privileged port (a port number
# less than 1024) or a port that is already in use.
# See below for a command that lists the port numbers currently in use.
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='x.x.x.x' --NotebookApp.port=8667"

# Specify your Python installation (same note about the (Ana)conda path
# for Python 3 as above).
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/python

# Specify your Spark master URL and executor configuration.
# Adjust the number of executors, memory, and default parallelism to your
# needs based on the availability of your cluster, and adjust the master URL
# to match your cluster (a standalone Spark master typically listens on
# port 7077).
MASTER=spark://master:8088 pyspark --num-executors 2 --executor-memory=6G --conf spark.default.parallelism=15
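Before launching anything, it's worth a quick sanity check that the exports took effect; for example (using the paths from the snippet above):

echo $PYSPARK_DRIVER_PYTHON
# should print /opt/cloudera/parcels/Anaconda2-4.1.2/bin/jupyter
which python
# should resolve to the (Ana)conda bin directory, not /usr/bin/python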
It’s probably easiest to create an executable file for the above commands and run that whenever you need Jupyter with Spark. To do so, paste the above commands into a text file (e.g. runjupyspark.sh) and make it executable:
chmod +x runjupyspark.sh
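For reference, the resulting runjupyspark.sh would look something like this sketch, with the elided middle being the remaining exports from above:

#!/bin/bash
export PATH=/opt/cloudera/parcels/Anaconda2-4.1.2/bin/:$PATH
# ... the other export lines from above go here ...
MASTER=spark://master:8088 pyspark --num-executors 2 --executor-memory=6G --conf spark.default.parallelism=15

Then launch it with ./runjupyspark.sh whenever you need it.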
Once you execute the commands above (or the executable), the Jupyter notebook server should tell you how to access it. The message should look something like:
The Jupyter Notebook is running at: http://x.x.x.x:xxxx/
Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Launch your web browser and access the URL. Try creating a new notebook and executing this command in a cell:
print(sc.version)
It should tell you the version of Spark you are running. (Written as a function call, print works under both Python 2 and Python 3.)
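To confirm that jobs actually run on your executors (and not just that the SparkContext exists), you can also try a small computation in another cell. A minimal sketch, using the sc that pyspark creates for you:

# Distribute the numbers 0..999 across the executors and sum them
sc.parallelize(range(1000)).sum()   # should return 499500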
If you are having trouble choosing an open port, the following bash one-liner lists all the ports that are currently in use. You should NOT use any of these ports.
netstat -tlnp | grep LISTEN | awk '{print $4}' | grep -o ":[[:digit:]]\{1,\}" | tr -d ":" | sort -nu
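Conversely, if you just want to check the one port you plan to use (say the 8667 from the example above), something like this works:

netstat -tln | grep ':8667 ' || echo 'port 8667 looks free'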