Installing Standalone Zeppelin for EMR
This is a supplement to the AWS blog post on the same subject. In our scenario, we have a standalone server running Zeppelin and an EMR cluster, all running in a VPC without internet access. These are the steps we took to make this work. The necessary files were obtained on an internet-connected machine and introduced via an S3 bucket.
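As a sketch of that staging workflow (the bucket name and the installers/ prefix are placeholders assumed for illustration):
# On the internet-connected machine: stage the installers in S3
aws s3 cp zeppelin-0.8.0-bin-all.tgz s3://<YOUR_S3_BUCKET>/installers/
aws s3 cp spark-2.3.1-bin-hadoop2.7.tgz s3://<YOUR_S3_BUCKET>/installers/
# On the Zeppelin instance, once the AWS CLI is installed: pull them down
aws s3 cp s3://<YOUR_S3_BUCKET>/installers/zeppelin-0.8.0-bin-all.tgz .
aws s3 cp s3://<YOUR_S3_BUCKET>/installers/spark-2.3.1-bin-hadoop2.7.tgz .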
- Launch a new EC2 Linux (Red Hat) instance; we'll call this the Zeppelin instance.
- Attach a security group that allows all traffic from itself; call it zeppelin.
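If you prefer the CLI, a self-referencing group looks something like this (the VPC ID and the group ID it returns are placeholders):
# Create the group in your VPC; note the group ID it returns
aws ec2 create-security-group --group-name zeppelin --description "Zeppelin" --vpc-id <YOUR_VPC_ID>
# Allow all traffic between members of the group
aws ec2 authorize-security-group-ingress --group-id <ZEPPELIN_SG_ID> --protocol -1 --source-group <ZEPPELIN_SG_ID>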
- Install the AWS CLI. Use these instructions. Here's a summary:
# Install pip for the current user
curl -O https://bootstrap.pypa.io/get-pip.py
python get-pip.py --user
# Put the user-level bin directory on the PATH
export PATH=~/.local/bin:$PATH
source ~/.bash_profile
# Install the AWS CLI for the current user
pip install awscli --upgrade --user
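The CLI also needs credentials to reach S3; we're assuming either an IAM instance role with S3 access or access keys configured by hand. For the latter:
# Enter access keys and default region when prompted (skip if using an instance role)
aws configure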
- Install the JDK 8 development package:
sudo yum install java-1.8.0-openjdk-devel.x86_64
- This will appear at /etc/alternatives/java_sdk_openjdk
- Create a new directory: /home/ec2-user/zeppelin-notebook
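For example:
mkdir -p /home/ec2-user/zeppelin-notebook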
- Download Zeppelin from Apache: http://apache.mirrors.tds.net/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-all.tgz
- Extract the contents:
tar -zxvf zeppelin-0.8.0-bin-all.tgz
- To make things simpler later on, we're going to move this new directory to /home/ec2-user/zeppelin:
mv zeppelin-0.8.0-bin-all /home/ec2-user/zeppelin
- Download Spark from Apache
http://www.gtlib.gatech.edu/pub/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
- Extract the contents and move the resulting directory:
tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
mkdir -p /home/ec2-user/prereqs
mv spark-2.3.1-bin-hadoop2.7 /home/ec2-user/prereqs/spark
- Go to EMR and launch a new cluster (a CLI sketch of the same configuration follows this list)
- Go to Advanced Options
- Select Hadoop, Hive, and Spark
- Add 2 Custom JAR type Steps:
  Name: Hadoopconf
  JAR location: command-runner.jar
  Arguments: aws s3 cp /etc/hadoop/conf/ s3://<YOUR_S3_BUCKET>/hadoopconf --recursive
  Action on Failure: Continue

  Name: hive-site
  JAR location: command-runner.jar
  Arguments: aws s3 cp /etc/hive/conf/hive-site.xml s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml
  Action on Failure: Continue
- Leave the rest as default. I prefer to put the cluster in the same subnet as my Zeppelin instance.
- Be sure to attach the zeppelin security group we created above as an Additional Security Group.
- Launch cluster
- Go to Steps and wait for them to complete
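For reference, here's a rough CLI equivalent of the console configuration above, as a sketch rather than a tested command. The release label, instance type and count, key name, and the placeholder IDs are all assumptions; pick a release whose Spark version matches the 2.3.1 build we downloaded:
aws emr create-cluster \
  --name "zeppelin-backend" \
  --release-label emr-5.16.0 \
  --applications Name=Hadoop Name=Hive Name=Spark \
  --use-default-roles \
  --instance-type m4.large --instance-count 3 \
  --ec2-attributes KeyName=<YOUR_KEY>,SubnetId=<YOUR_SUBNET_ID>,AdditionalMasterSecurityGroups=<ZEPPELIN_SG_ID>,AdditionalSlaveSecurityGroups=<ZEPPELIN_SG_ID> \
  --steps Type=CUSTOM_JAR,Name=Hadoopconf,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[aws,s3,cp,/etc/hadoop/conf/,s3://<YOUR_S3_BUCKET>/hadoopconf,--recursive] \
          Type=CUSTOM_JAR,Name=hive-site,Jar=command-runner.jar,ActionOnFailure=CONTINUE,Args=[aws,s3,cp,/etc/hive/conf/hive-site.xml,s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml]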
- Go back to Zeppelin Instance
- Download hadoopconf from the S3 location used in the Hadoopconf step above
aws s3 sync s3://<YOUR_S3_BUCKET>/hadoopconf /home/ec2-user/hadoopconf
- Download hive-site from S3 to the Zeppelin conf directory (aws s3 sync only works on prefixes, so use cp for a single file)
aws s3 cp s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml /home/ec2-user/zeppelin/conf/hive-site.xml
- Make a copy of /home/ec2-user/zeppelin/conf/zeppelin-env.sh.template as zeppelin-env.sh
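For example:
cp /home/ec2-user/zeppelin/conf/zeppelin-env.sh.template /home/ec2-user/zeppelin/conf/zeppelin-env.sh
The same cp pattern applies to the other .template files below.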
- Add the following to the top of the file:
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
export MASTER=yarn-client
export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
export ZEPPELIN_NOTEBOOK_DIR=/home/ec2-user/zeppelin-notebook
export SPARK_HOME=/home/ec2-user/prereqs/spark
- Make a copy of /home/ec2-user/zeppelin/conf/zeppelin-site.xml.template as zeppelin-site.xml
- Edit the following entry:
<property>
  <name>zeppelin.notebook.dir</name>
  <value>/home/ec2-user/zeppelin-notebook</value>
</property>
- Make a copy of /home/ec2-user/prereqs/spark/conf/spark-env.sh.template as spark-env.sh
- Add the following to the top:
export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
- Start it from the Zeppelin directory:
cd /home/ec2-user/zeppelin
sudo bin/zeppelin-daemon.sh start
- Tail the log file you find here (/home/ec2-user/zeppelin/logs) and wait for Zeppelin to start
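The exact file name includes your user and hostname, so a wildcard is the easy way to find it:
tail -f /home/ec2-user/zeppelin/logs/zeppelin-*.log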
- To check the status of the daemon (you can also use stop or restart in place of status):
sudo bin/zeppelin-daemon.sh status
- While it's starting, let's download jersey-client (a Spark dependency). A newer version would probably work, but I'm going to go with the one in the documentation:
wget http://central.maven.org/maven2/com/sun/jersey/jersey-client/1.13/jersey-client-1.13.jar
- Put this at /dependencies/jersey-client-1.13.jar
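One way to do that; /dependencies sits at the filesystem root, so creating it needs sudo:
sudo mkdir -p /dependencies
sudo mv jersey-client-1.13.jar /dependencies/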
- Log into Zeppelin (it's on port 8080)
- Go to Interpreter
- Go to Spark and click Edit
- Add the jersey-client we downloaded as a Dependency, using the full path /dependencies/jersey-client-1.13.jar
- Click Save
- Now we're going to download the sample note here
- Also download this csv file and move it to your own S3 bucket location.
- Go back home
- Click Import Note
- Select the sample note we downloaded above
- When the note comes up, update the csv file location to your own S3 bucket location where you moved the file earlier
- Run it