{"id":548,"date":"2018-02-09T15:18:17","date_gmt":"2018-02-09T20:18:17","guid":{"rendered":"http:\/\/digitallibraryworld.com\/?p=548"},"modified":"2018-03-07T11:15:46","modified_gmt":"2018-03-07T16:15:46","slug":"enabling-and-using-pyspark-with-jupyter-and-anaconda","status":"publish","type":"post","link":"https:\/\/heisbudi.com\/?p=548","title":{"rendered":"Enabling and using Pyspark with Jupyter and Anaconda"},"content":{"rendered":"<p>I remember it took me some time to get this configured when I first started trying out Jupyter and Spark. Hopefully this is helpful for others. This works for Hadoop 2.6.0-CDH5.9.1 and Spark 1.6.0 using Python 2.7 or Python 3. For other versions, adjust the paths accordingly. Basically, you just need to tell Spark four things:<\/p>\n<ul>\n<li>The location of your (Ana)conda installation<\/li>\n<li>The location of your Jupyter installation and its configuration<\/li>\n<li>The location of your Python installation<\/li>\n<li>Resources your Spark executors need<\/li>\n<\/ul>\n<p>Type the following from your bash terminal (if you are using Cloudera, this would be your edge node; otherwise, any server where you can run the <em>jupyter notebook<\/em> and <em>pyspark<\/em> commands).<\/p>\n<pre lang=\"bash\">#Add your (Ana)conda installation to PATH if you haven't already done so.\r\n#Change the (Ana)conda path accordingly if you are using Python 3;\r\n#usually it's Anaconda3-X.X.X\/bin\r\nexport PATH=\/opt\/cloudera\/parcels\/Anaconda2-4.1.2\/bin\/:$PATH\r\n\r\n#Assuming you have already installed Jupyter in your (Ana)conda installation,\r\n#specify the path to the PySpark driver.\r\n#Change the (Ana)conda path accordingly if you are using Python 3;\r\n#usually it's Anaconda3-X.X.X\/bin\r\nexport PYSPARK_DRIVER_PYTHON=\/opt\/cloudera\/parcels\/Anaconda2-4.1.2\/bin\/jupyter\r\n\r\n#Specify the IP and port configuration for your Jupyter notebook. 
\r\n#The IP address should be the IP of a server where you can run the \"jupyter notebook\" command successfully.\r\n#Make sure the port number you are using isn't a privileged port (port number less than 1024)\r\n#or a port that is currently in use.\r\n#See below for a command that lists the port numbers currently in use.\r\nexport PYSPARK_DRIVER_PYTHON_OPTS=\"notebook --NotebookApp.open_browser=False --NotebookApp.ip='x.x.x.x' --NotebookApp.port=8667\"\r\n\r\n#Specify your Python configuration.\r\n#Change the (Ana)conda path accordingly if you are using Python 3;\r\n#usually it's Anaconda3-X.X.X\/bin\r\nexport PYSPARK_PYTHON=\/opt\/cloudera\/parcels\/Anaconda2-4.1.2\/bin\/python\r\n\r\n#Specify your Spark master URL and executor configuration.\r\n#Adjust the number of executors, memory, and default parallelism based on the resources available in your cluster.\r\nMASTER=spark:\/\/master:8088 pyspark --num-executors 2 --executor-memory=6G --conf spark.default.parallelism=15<\/pre>\n<p>If you regularly need to run Jupyter with Spark, it&#8217;s probably easiest to create an executable file for the above commands and run that instead. To do so, paste the commands above into a text file (e.g. runjupyspark.sh) and make it executable:<\/p>\n<pre lang=\"bash\">chmod +x runjupyspark.sh<\/pre>\n<p>Once you execute the commands above (or the executable), the Jupyter notebook server should tell you how to access it. The message should be something similar to:<br \/>\n<strong>The Jupyter Notebook is running at: http:\/\/x.x.x.x:xxxx\/<\/strong><br \/>\n<strong>Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).<\/strong><\/p>\n<p>Launch your web browser, and access the URL. 
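<\/p>\n<p>If your browser can&#8217;t reach that URL, a quick sanity check is to ask the notebook server for an HTTP status code directly from the server&#8217;s shell (a minimal sketch; substitute the actual IP and port from the startup message for x.x.x.x and 8667):<\/p>\n<pre lang=\"bash\">#Print only the HTTP status code of the notebook server's landing page\r\ncurl -s -o \/dev\/null -w \"%{http_code}\" http:\/\/x.x.x.x:8667\/<\/pre>\n<p>Any status code back (e.g. 200 or 302) means the server is up and listening; no response at all usually points to a firewall rule or a wrong IP\/port combination.<\/p>\n<p>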
Try creating a new notebook.<\/p>\n<figure id=\"attachment_555\" aria-describedby=\"caption-attachment-555\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/digitallibraryworld.com\/wp-content\/uploads\/2018\/02\/spark_jupyter_notebook.png\"><img loading=\"lazy\" class=\"size-medium wp-image-555\" alt=\"Creating new Jupyter Notebook using PySpark\" src=\"http:\/\/digitallibraryworld.com\/wp-content\/uploads\/2018\/02\/spark_jupyter_notebook-300x132.png\" width=\"300\" height=\"132\" srcset=\"https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/spark_jupyter_notebook-300x132.png 300w, https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/spark_jupyter_notebook.png 359w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-555\" class=\"wp-caption-text\">Creating new Jupyter Notebook using PySpark<\/figcaption><\/figure>\n<p>Try executing this command in a cell (the parenthesized form works in both Python 2 and Python 3):<\/p>\n<pre lang=\"python\">print(sc.version)<\/pre>\n<p>It should print the version of Spark you are running.<\/p>\n<figure id=\"attachment_557\" aria-describedby=\"caption-attachment-557\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/digitallibraryworld.com\/wp-content\/uploads\/2018\/02\/running-spark.png\"><img loading=\"lazy\" class=\"size-medium wp-image-557\" alt=\"Running Spark using Jupyter Notebook\" src=\"http:\/\/digitallibraryworld.com\/wp-content\/uploads\/2018\/02\/running-spark-300x64.png\" width=\"300\" height=\"64\" srcset=\"https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/running-spark-300x64.png 300w, https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/running-spark-1024x219.png 1024w, https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/running-spark-570x121.png 570w, https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/running-spark-770x164.png 770w, https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/running-spark-940x201.png 940w, 
https:\/\/heisbudi.com\/wp-content\/uploads\/2018\/02\/running-spark.png 1047w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-557\" class=\"wp-caption-text\">Running Spark using Jupyter Notebook<\/figcaption><\/figure>\n<p>If you are having trouble choosing an open port, use the following bash one-liner to list all the ports that are currently in use. Do NOT use any of these ports.<\/p>\n<pre lang=\"bash\">netstat -tlp | grep LISTEN | awk '{print $4}' | grep -o \":[[:digit:]]\\{1,\\}\" | tr -d \":\" | sort -n<\/pre>","protected":false},"excerpt":{"rendered":"<p>I remember it took me some time to get this configured when I first started trying out Jupyter and Spark. Hopefully this is helpful for others. This works for Hadoop 2.6.0-CDH5.9.1 and Spark 1.6.0 using Python 2.7 or Python 3. 
For other versions, adjust the paths accordingly. Basically, you just need to tell Spark four things: <a class=\"read-more\" href=\"https:\/\/heisbudi.com\/?p=548\">[&hellip;]<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[34,26,33],"tags":[37,36,35],"_links":{"self":[{"href":"https:\/\/heisbudi.com\/index.php?rest_route=\/wp\/v2\/posts\/548"}],"collection":[{"href":"https:\/\/heisbudi.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/heisbudi.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/heisbudi.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/heisbudi.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=548"}],"version-history":[{"count":9,"href":"https:\/\/heisbudi.com\/index.php?rest_route=\/wp\/v2\/posts\/548\/revisions"}],"predecessor-version":[{"id":571,"href":"https:\/\/heisbudi.com\/index.php?rest_route=\/wp\/v2\/posts\/548\/revisions\/571"}],"wp:attachment":[{"href":"https:\/\/heisbudi.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=548"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/heisbudi.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=548"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/heisbudi.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=548"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}