Wednesday, August 1, 2018

Installing Zeppelin

Installing Standalone Zeppelin for EMR

This is a supplement to the AWS blog post on the same subject. In our scenario, we have a standalone server running Zeppelin and an EMR cluster, all running in a VPC with no internet access. These are the steps we took to make this happen. The necessary files were obtained on an internet-connected machine and introduced via an S3 bucket.
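
Here's a minimal sketch of that staging pattern. The bucket name and file are placeholders, and it assumes the Zeppelin instance has an IAM role (or credentials) that can read the bucket, plus an S3 VPC endpoint so it can reach S3 without internet access:

    # On the internet-connected machine: download, then push to S3
    wget https://example.org/some-installer.tgz
    aws s3 cp some-installer.tgz s3://<YOUR_S3_BUCKET>/staging/some-installer.tgz

    # On the Zeppelin instance inside the VPC: pull it back down
    aws s3 cp s3://<YOUR_S3_BUCKET>/staging/some-installer.tgz .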

  1. Launch a new EC2 Linux (Red Hat) instance; we'll call this the Zeppelin instance.
  2. Attach a security group that allows all traffic from itself; call it zeppelin (see the CLI sketch after this list).
  3. Install the AWS CLI. Use these instructions. Here's a summary:
    curl -O https://bootstrap.pypa.io/get-pip.py
    python get-pip.py --user
    export PATH=~/.local/bin:$PATH   # also add this line to ~/.bash_profile
    source ~/.bash_profile
    pip install awscli --upgrade --user

  4. Install the JDK 8 development package:

    sudo yum install -y java-1.8.0-openjdk-devel.x86_64
  5. The JDK will appear at /etc/alternatives/java_sdk_openjdk
  6. Create a new directory for the notebooks:
    mkdir /home/ec2-user/zeppelin-notebook
  7. Download Zeppelin from Apache:
    wget http://apache.mirrors.tds.net/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-all.tgz
  8. Extract the contents:
    tar -zxvf zeppelin-0.8.0-bin-all.tgz
  9. To make things simpler later on, we're going to move this new directory to /home/ec2-user/zeppelin:
    mv zeppelin-0.8.0-bin-all /home/ec2-user/zeppelin
  10. Download Spark from Apache:
    wget http://www.gtlib.gatech.edu/pub/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
  11. Extract the contents and move the extracted directory (not the tarball) into place:
    tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
    mkdir -p /home/ec2-user/prereqs
    mv spark-2.3.1-bin-hadoop2.7 /home/ec2-user/prereqs/spark
  12. Go to EMR and launch a new cluster (the two steps below can also be added from the CLI; see the sketch after this list)
    1. Go to Advanced Options
    2. Select Hadoop, Hive, and Spark
    3. Add 2 Custom JAR type Steps that copy the cluster configuration to S3:
      Name: Hadoopconf
      JAR location: command-runner.jar
      Arguments: aws s3 cp /etc/hadoop/conf/ s3://<YOUR_S3_BUCKET>/hadoopconf --recursive
      Action on failure: Continue

      Name: hive-site
      JAR location: command-runner.jar
      Arguments: aws s3 cp /etc/hive/conf/hive-site.xml s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml
      Action on failure: Continue
      
  13. Leave the rest as default. I prefer to put the cluster in the same subnet as my Zeppelin instance.
  14. Be sure to attach the zeppelin security group we created above as an Additional Security Group (a quick connectivity check appears after this list).
  15. Launch cluster
  16. Go to Steps and wait for them to complete
  17. Go back to Zeppelin Instance
  18. Download hadoopconf from the S3 location used in step 12
    aws s3 sync s3://<YOUR_S3_BUCKET>/hadoopconf /home/ec2-user/hadoopconf
    
  19. Download hive-site.xml from S3 into the Zeppelin conf directory (cp, not sync, since it's a single file)
    aws s3 cp s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml /home/ec2-user/zeppelin/conf/hive-site.xml
    
  20. Make a copy of /home/ec2-user/zeppelin/conf/zeppelin-env.sh.template as zeppelin-env.sh:
    cp /home/ec2-user/zeppelin/conf/zeppelin-env.sh.template /home/ec2-user/zeppelin/conf/zeppelin-env.sh
  21. Add the following to the top of the file:
    export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
    export MASTER=yarn-client
    export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
    export ZEPPELIN_NOTEBOOK_DIR=/home/ec2-user/zeppelin-notebook
    export SPARK_HOME=/home/ec2-user/prereqs/spark
    
  22. Make a copy of /home/ec2-user/zeppelin/conf/zeppelin-site.xml.template as zeppelin-site.xml:
    cp /home/ec2-user/zeppelin/conf/zeppelin-site.xml.template /home/ec2-user/zeppelin/conf/zeppelin-site.xml
  23. Edit the following entry so it points at the notebook directory:
    <property>
      <name>zeppelin.notebook.dir</name>
      <value>/home/ec2-user/zeppelin-notebook</value>
    </property>
    
  24. Make a copy of /home/ec2-user/prereqs/spark/conf/spark-env.sh.template as spark-env.sh:
    cp /home/ec2-user/prereqs/spark/conf/spark-env.sh.template /home/ec2-user/prereqs/spark/conf/spark-env.sh
  25. Add the following to the top (a Spark-on-YARN smoke test using this configuration appears after this list):
    export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
    export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
    
  26. Start Zeppelin from its home directory:
    cd /home/ec2-user/zeppelin
    sudo bin/zeppelin-daemon.sh start
    
  27. Tail the log file in /home/ec2-user/zeppelin/logs and wait for Zeppelin to start:
    tail -f /home/ec2-user/zeppelin/logs/zeppelin-*.log
  28. To check the status of the daemon (you can also use stop or restart in place of status; a REST-based check appears after this list):
    sudo bin/zeppelin-daemon.sh status
    
  29. While it's starting, let's download jersey-client (a Spark dependency). A newer version would probably work, but I'm going with the one in the documentation:
    wget http://central.maven.org/maven2/com/sun/jersey/jersey-client/1.13/jersey-client-1.13.jar
    
  30. Put it at /dependencies/jersey-client-1.13.jar:
    sudo mkdir -p /dependencies
    sudo mv jersey-client-1.13.jar /dependencies/
  31. Log into Zeppelin (it's on port 8080)
  32. Go to Interpreter
  33. Go to Spark and click Edit
  34. Add the jersey-client JAR we downloaded as a Dependency, using the full path /dependencies/jersey-client-1.13.jar
  35. Click Save
  36. Now we're going to download the sample note linked in the AWS blog post
  37. Also download the CSV file linked there and move it to your own S3 bucket location.
  38. Go back home
  39. Click import Note
  40. Select the sample note we downloaded in step 36
  41. When the note comes up, update the CSV file location to the S3 path where you placed the file in step 37
  42. Run it
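
For step 2, here's a sketch of creating the self-referencing zeppelin security group with the AWS CLI. The VPC ID is a placeholder, and the group ID returned by the first command is reused as its own traffic source:

    # Create the group in the same VPC as the Zeppelin instance (note the GroupId it returns)
    aws ec2 create-security-group --group-name zeppelin \
        --description "Zeppelin and EMR intra-group traffic" --vpc-id <YOUR_VPC_ID>

    # Allow all traffic from members of the same group
    aws ec2 authorize-security-group-ingress --group-id <GROUP_ID> \
        --protocol all --source-group <GROUP_ID>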
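
The two Custom JAR steps from step 12 can also be added to a running cluster from the CLI. The cluster ID and bucket are placeholders:

    aws emr add-steps --cluster-id <YOUR_CLUSTER_ID> --steps \
        'Type=CUSTOM_JAR,Name=Hadoopconf,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,/etc/hadoop/conf/,s3://<YOUR_S3_BUCKET>/hadoopconf,--recursive]' \
        'Type=CUSTOM_JAR,Name=hive-site,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,/etc/hive/conf/hive-site.xml,s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml]'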
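
For step 14, a quick way to confirm the shared security group works: from the Zeppelin instance, check that the EMR master's YARN ResourceManager port is reachable (8032 is the default; the hostname is a placeholder):

    # install nc first if needed: sudo yum install -y nmap-ncat
    nc -zv <EMR_MASTER_PRIVATE_DNS> 8032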
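
For step 25, before starting Zeppelin you can sanity-check that Spark can reach YARN with this configuration. This is a sketch assuming the SparkPi example JAR that ships with the Spark 2.3.1 binary distribution:

    cd /home/ec2-user/prereqs/spark
    bin/spark-submit --master yarn --deploy-mode client \
        --class org.apache.spark.examples.SparkPi \
        examples/jars/spark-examples_2.11-2.3.1.jar 10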
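
For step 28, once the daemon reports it is running, you can also confirm the server is answering by hitting Zeppelin's REST API from the instance itself:

    curl http://localhost:8080/api/version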
