Wednesday, August 1, 2018

Installing Zeppelin

Installing Standalone Zeppelin for EMR

This is a supplement to the AWS blog post on the same subject. In our scenario, we have a standalone server running Zeppelin and an EMR cluster, all running in a VPC with no internet access. These are the steps we took to make this happen. The necessary files were obtained on an internet-connected machine and introduced via an S3 bucket.
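
Here's a minimal sketch of that staging pattern. The bucket name and file are placeholders, and it assumes the Zeppelin instance has an IAM role (or credentials) that can read the bucket, plus an S3 VPC endpoint so it can reach S3 without internet access:

    # On the internet-connected machine: download, then push to S3
    wget https://example.org/some-installer.tgz
    aws s3 cp some-installer.tgz s3://<YOUR_S3_BUCKET>/staging/some-installer.tgz

    # On the Zeppelin instance inside the VPC: pull it back down
    aws s3 cp s3://<YOUR_S3_BUCKET>/staging/some-installer.tgz .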

  1. Launch a new EC2 Linux (Red Hat) instance; we'll call this the Zeppelin instance.
  2. Attach a security group that allows all traffic from itself; call it zeppelin (see the CLI sketch after this list).
  3. Install the AWS CLI. Use these instructions. Here's a summary:
    curl -O https://bootstrap.pypa.io/get-pip.py
    python get-pip.py --user
    export PATH=~/.local/bin:$PATH   # also add this line to ~/.bash_profile
    source ~/.bash_profile
    pip install awscli --upgrade --user

  4. Install the JDK 8 development package:

    sudo yum install -y java-1.8.0-openjdk-devel.x86_64
  5. The JDK will appear at /etc/alternatives/java_sdk_openjdk
  6. Create a new directory for the notebooks:
    mkdir /home/ec2-user/zeppelin-notebook
  7. Download Zeppelin from Apache:
    wget http://apache.mirrors.tds.net/zeppelin/zeppelin-0.8.0/zeppelin-0.8.0-bin-all.tgz
  8. Extract the contents:
    tar -zxvf zeppelin-0.8.0-bin-all.tgz
  9. To make things simpler later on, we're going to move this new directory to /home/ec2-user/zeppelin:
    mv zeppelin-0.8.0-bin-all /home/ec2-user/zeppelin
  10. Download Spark from Apache:
    wget http://www.gtlib.gatech.edu/pub/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
  11. Extract the contents and move the extracted directory (not the tarball) into place:
    tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
    mkdir -p /home/ec2-user/prereqs
    mv spark-2.3.1-bin-hadoop2.7 /home/ec2-user/prereqs/spark
  12. Go to EMR and launch a new cluster (the two steps below can also be added from the CLI; see the sketch after this list)
    1. Go to Advanced Options
    2. Select Hadoop, Hive, and Spark
    3. Add 2 Custom JAR type Steps that copy the cluster configuration to S3:
      Name: Hadoopconf
      JAR location: command-runner.jar
      Arguments: aws s3 cp /etc/hadoop/conf/ s3://<YOUR_S3_BUCKET>/hadoopconf --recursive
      Action on failure: Continue

      Name: hive-site
      JAR location: command-runner.jar
      Arguments: aws s3 cp /etc/hive/conf/hive-site.xml s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml
      Action on failure: Continue
      
  13. Leave the rest as default. I prefer to put the cluster in the same subnet as my Zeppelin instance.
  14. Be sure to attach the zeppelin security group we created above as an Additional Security Group (a quick connectivity check appears after this list).
  15. Launch cluster
  16. Go to Steps and wait for them to complete
  17. Go back to Zeppelin Instance
  18. Download hadoopconf from the S3 location used in step 12
    aws s3 sync s3://<YOUR_S3_BUCKET>/hadoopconf /home/ec2-user/hadoopconf
    
  19. Download hive-site.xml from S3 into the Zeppelin conf directory (cp, not sync, since it's a single file)
    aws s3 cp s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml /home/ec2-user/zeppelin/conf/hive-site.xml
    
  20. Make a copy of /home/ec2-user/zeppelin/conf/zeppelin-env.sh.template as zeppelin-env.sh:
    cp /home/ec2-user/zeppelin/conf/zeppelin-env.sh.template /home/ec2-user/zeppelin/conf/zeppelin-env.sh
  21. Add the following to the top of the file:
    export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
    export MASTER=yarn-client
    export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
    export ZEPPELIN_NOTEBOOK_DIR=/home/ec2-user/zeppelin-notebook
    export SPARK_HOME=/home/ec2-user/prereqs/spark
    
  22. Make a copy of /home/ec2-user/zeppelin/conf/zeppelin-site.xml.template as zeppelin-site.xml:
    cp /home/ec2-user/zeppelin/conf/zeppelin-site.xml.template /home/ec2-user/zeppelin/conf/zeppelin-site.xml
  23. Edit the following entry so it points at the notebook directory:
    <property>
      <name>zeppelin.notebook.dir</name>
      <value>/home/ec2-user/zeppelin-notebook</value>
    </property>
    
  24. Make a copy of /home/ec2-user/prereqs/spark/conf/spark-env.sh.template as spark-env.sh:
    cp /home/ec2-user/prereqs/spark/conf/spark-env.sh.template /home/ec2-user/prereqs/spark/conf/spark-env.sh
  25. Add the following to the top (a Spark-on-YARN smoke test using this configuration appears after this list):
    export HADOOP_CONF_DIR=/home/ec2-user/hadoopconf
    export JAVA_HOME=/etc/alternatives/java_sdk_openjdk
    
  26. Start Zeppelin from its home directory:
    cd /home/ec2-user/zeppelin
    sudo bin/zeppelin-daemon.sh start
    
  27. Tail the log file in /home/ec2-user/zeppelin/logs and wait for Zeppelin to start:
    tail -f /home/ec2-user/zeppelin/logs/zeppelin-*.log
  28. To check the status of the daemon (you can also use stop or restart in place of status; a REST-based check appears after this list):
    sudo bin/zeppelin-daemon.sh status
    
  29. While it's starting, let's download jersey-client (a Spark dependency). A newer version would probably work, but I'm going with the one in the documentation:
    wget http://central.maven.org/maven2/com/sun/jersey/jersey-client/1.13/jersey-client-1.13.jar
    
  30. Put it at /dependencies/jersey-client-1.13.jar:
    sudo mkdir -p /dependencies
    sudo mv jersey-client-1.13.jar /dependencies/
  31. Log into Zeppelin (it's on port 8080)
  32. Go to Interpreter
  33. Go to Spark and click Edit
  34. Add the jersey-client JAR we downloaded as a Dependency, using the full path /dependencies/jersey-client-1.13.jar
  35. Click Save
  36. Now we're going to download the sample note linked in the AWS blog post
  37. Also download the CSV file linked there and move it to your own S3 bucket location.
  38. Go back home
  39. Click import Note
  40. Select the sample note we downloaded in step 36
  41. When the note comes up, update the CSV file location to the S3 path where you placed the file in step 37
  42. Run it
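
For step 2, here's a sketch of creating the self-referencing zeppelin security group with the AWS CLI. The VPC ID is a placeholder, and the group ID returned by the first command is reused as its own traffic source:

    # Create the group in the same VPC as the Zeppelin instance (note the GroupId it returns)
    aws ec2 create-security-group --group-name zeppelin \
        --description "Zeppelin and EMR intra-group traffic" --vpc-id <YOUR_VPC_ID>

    # Allow all traffic from members of the same group
    aws ec2 authorize-security-group-ingress --group-id <GROUP_ID> \
        --protocol all --source-group <GROUP_ID>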
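
The two Custom JAR steps from step 12 can also be added to a running cluster from the CLI. The cluster ID and bucket are placeholders:

    aws emr add-steps --cluster-id <YOUR_CLUSTER_ID> --steps \
        'Type=CUSTOM_JAR,Name=Hadoopconf,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,/etc/hadoop/conf/,s3://<YOUR_S3_BUCKET>/hadoopconf,--recursive]' \
        'Type=CUSTOM_JAR,Name=hive-site,ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=[aws,s3,cp,/etc/hive/conf/hive-site.xml,s3://<YOUR_S3_BUCKET>/hiveconf/hive-site.xml]'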
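
For step 14, a quick way to confirm the shared security group works: from the Zeppelin instance, check that the EMR master's YARN ResourceManager port is reachable (8032 is the default; the hostname is a placeholder):

    # install nc first if needed: sudo yum install -y nmap-ncat
    nc -zv <EMR_MASTER_PRIVATE_DNS> 8032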
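
For step 25, before starting Zeppelin you can sanity-check that Spark can reach YARN with this configuration. This is a sketch assuming the SparkPi example JAR that ships with the Spark 2.3.1 binary distribution:

    cd /home/ec2-user/prereqs/spark
    bin/spark-submit --master yarn --deploy-mode client \
        --class org.apache.spark.examples.SparkPi \
        examples/jars/spark-examples_2.11-2.3.1.jar 10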
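
For step 28, once the daemon reports it is running, you can also confirm the server is answering by hitting Zeppelin's REST API from the instance itself:

    curl http://localhost:8080/api/version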
