Apache Zeppelin is an open, web-based notebook for interactive data analytics. This multi-language, multipurpose, multi-user notebook is used for data ingestion, exploration, visualization and analytics, and for sharing and collaborating on work built on Spark and Hadoop. You can write code in Python, Scala, Hive, SparkSQL, Markdown, Shell and R. Zeppelin can pull data from sources such as MongoDB, Solr and Oracle, and the data pulled from these sources can then be analyzed with tools such as Apache Spark.
What Zeppelin does
Because Zeppelin is an interactive browser-based notebook, it lets data analysts, data scientists, developers and engineers be more productive: they can develop, organize, execute and share data code and visualize the results without having to deal with cluster details or the command line. Notebooks also let users collaborate and execute large workflows.
Several notebooks are available for Spark, and IPython remains a popular choice and one of the best-known examples of a notebook for data science.
Installing Zeppelin is easy when you are not setting it up for multiple users on Amazon Web Services (AWS). If you do run it on AWS, you need to follow a few steps to set it up. Although you can install Zeppelin on both Linux and Windows, Linux is the recommended choice because, besides being much lighter than Windows, it has more documentation.
Before setting up Zeppelin, make sure you have the following resources:
• An installed Amazon Web Services (AWS) command line interface.
• A key pair in the region where you will launch.
• An S3 bucket for Zeppelin notebook storage.
• Permissions to create EC2 instances, EMR clusters and S3 buckets.
Once you have what you need, it’s time to start setting up. Below are the steps to follow to set up Zeppelin.
1. EMR Cluster
The first step in setting up Zeppelin is to create an Amazon EMR cluster. Go to the EMR console on Amazon and choose “Create cluster”, then choose “Go to advanced options”.
On the advanced options page, enter the following:
Select the following applications: Spark 1.5.2, Hive 1.0.0 and Hadoop 2.6.0, and make sure Hue and Pig are deselected.
Go to Add steps and choose Custom JAR as the step type.
After that, choose Configure, then enter:
Arguments: file:///usr/lib/hadoop/bin/hadoop fs -copyFromLocal /etc/hadoop/conf/ s3://<YOUR_S3_BUCKET>/hadoopconf
JAR location: s3://elasticmapreduce/libs/script-runner/script-runner.jar
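For readers who prefer to script this step, the same settings can be written down as a step definition. The sketch below only builds the dictionary. It assumes the destination is an S3 URI of the form s3://<YOUR_S3_BUCKET>/hadoopconf, and the step name and the ActionOnFailure value are illustrative choices that do not come from the console walkthrough.

```python
def build_copy_conf_step(bucket: str) -> dict:
    """Return a Custom JAR step that copies /etc/hadoop/conf/ to the given bucket."""
    return {
        "Name": "Copy Hadoop conf to S3",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",     # assumption: keep the cluster running if the copy fails
        "HadoopJarStep": {
            # JAR location from the step above
            "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
            # The Arguments field, split into one list item per argument
            "Args": [
                "file:///usr/lib/hadoop/bin/hadoop",
                "fs",
                "-copyFromLocal",
                "/etc/hadoop/conf/",
                f"s3://{bucket}/hadoopconf",  # assumed S3 URI form of the destination
            ],
        },
    }
```

A dictionary of this shape is what the boto3 EMR client accepts in the Steps list of run_job_flow or add_job_flow_steps, if you choose to create the cluster from code rather than from the console.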
After you finish setting up the EMR cluster, it is time to launch Zeppelin on Amazon EC2. EC2 provides virtual servers in the cloud that are easy to manage, which makes it a convenient place to host a notebook server.
Make sure all your other settings are at their default values, then select your key pair on the security options page. Then create a Zeppelin notebook server by going to Actions and clicking Create notebook server. Provisioning takes a few minutes. When the status changes to Waiting, your cluster is ready.
3. Create Stack
For the steps that follow, you need to know your cluster's subnet ID, VPC ID, public DNS, and the master and core security groups. The public DNS, security groups and VPC ID are shown on the Instances page of the EC2 console.
The launch continues by creating an EC2 instance that runs Apache Zeppelin from a CloudFormation template.
In the CloudFormation console, choose Create Stack, select the option to specify an Amazon S3 template URL, and enter: http://ift.tt/1mJD4EF.
Click ‘Next’, and on the next page give your stack a name and enter the following parameters:
• EMRSlaveSecurityGroup: security group of the EMR core and task nodes
• EMRMasterSecurityGroup: security group of the EMR master node
• Instance Type: m3.xlarge
• Key Name: choose your key pair
• S3HadoopConfFolder: replace <mybucket> with an S3 bucket from your account
• S3HiveConfFolder: replace <mybucket> with an S3 bucket from your account
• SSH Location: a CIDR block that is allowed to connect to the Zeppelin instance using SSH
• ZeppelinSubnetId: the subnet where your EMR cluster launched
• ZeppelinVPCId: the VPC where your EMR cluster launched
• Go to Next.
• Then specify a tag for your instance. This is optional.
• Again, go to Next.
• Review all your choices thoroughly and make sure the IAM acknowledgement is checked.
• Go to Create.
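If you would rather keep the stack parameters listed above in a script than retype them in the console, they can be collected as key/value pairs. The sketch below is only an illustration. Every value is a placeholder, and the key spellings InstanceType, KeyName and SSHLocation are assumptions about how the template names those fields, so check them against the template before using them.

```python
def build_stack_parameters(values: dict) -> list:
    """Turn a plain dict into a list of ParameterKey/ParameterValue pairs."""
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in values.items()
    ]


# Placeholder values only; replace each one with details from your own account.
zeppelin_stack_values = {
    "EMRSlaveSecurityGroup": "<core-and-task-security-group>",
    "EMRMasterSecurityGroup": "<master-security-group>",
    "InstanceType": "m3.xlarge",                     # assumed key spelling
    "KeyName": "<your-key-pair>",                    # assumed key spelling
    "S3HadoopConfFolder": "<your-hadoop-conf-folder>",
    "S3HiveConfFolder": "<your-hive-conf-folder>",
    "SSHLocation": "<cidr-block-allowed-to-ssh>",    # assumed key spelling
    "ZeppelinSubnetId": "<subnet-of-your-emr-cluster>",
    "ZeppelinVPCId": "<vpc-of-your-emr-cluster>",
}

stack_parameters = build_stack_parameters(zeppelin_stack_values)
```

A list in this form is what the boto3 CloudFormation client expects in the Parameters argument of create_stack, alongside StackName, TemplateURL and, because the console asks for an IAM acknowledgement, most likely a Capabilities entry such as CAPABILITY_IAM.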
Stack creation takes a few minutes, because it has to provision the EC2 instance and install Zeppelin and its prerequisites. You can use the waiting time to open the S3 console and create a folder for Zeppelin users and a subfolder for notebooks. If you have not already done so, also create a bucket in the S3 console for your Zeppelin notebook storage.
Go back to the CloudFormation console and wait until the status changes to CREATE_COMPLETE. At that point your EC2 instance is ready and you can go on to the next step.
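Instead of refreshing the console, the wait for CREATE_COMPLETE can be expressed as a small polling loop. This is a generic sketch, not part of the original walkthrough. The get_status callable is a hypothetical hook that you would supply yourself, and the interval and attempt limits are arbitrary defaults.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    target: str = "CREATE_COMPLETE",
    interval_seconds: float = 30.0,
    max_attempts: int = 40,
) -> bool:
    """Poll get_status() until it returns target; give up after max_attempts polls."""
    for attempt in range(max_attempts):
        if get_status() == target:
            return True
        # Sleep between polls, but not after the final one
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    return False
```

With boto3, get_status could wrap a describe_stacks call and return the StackStatus field of the first stack. The CloudFormation client also ships a stack_create_complete waiter that does much the same job. Note that this sketch does not stop early on failure states such as ROLLBACK_COMPLETE; it simply runs out of attempts.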
To view your EC2 instance, open the EC2 console. You will need its IP address and security group for the subsequent steps, which configure your EMR cluster to accept traffic from your Zeppelin instance.
Follow the next steps:
• Select your cluster in the EMR console and go to the Cluster Details page.
• Under the security groups for the master, select the default ElasticMapReduce-master security group.
• On the security group page, choose Inbound, Edit, Add Rule and All TCP. Then click on Custom IP and enter the security group name of the EC2 Zeppelin instance.
• Repeat all the above steps for the core and task security groups.
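The inbound rule described in the steps above (all TCP ports, with the Zeppelin instance's security group as the source) can also be written as data. The sketch below only builds the request dictionaries. The group IDs are placeholders you would look up in the EC2 console, and the function names are made up for this example.

```python
def build_all_tcp_rule(source_group_id: str) -> dict:
    """One inbound permission: all TCP ports, allowed from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


def build_ingress_requests(emr_group_ids: list, zeppelin_group_id: str) -> list:
    """One request per EMR security group (master, then core and task)."""
    return [
        {
            "GroupId": group_id,
            "IpPermissions": [build_all_tcp_rule(zeppelin_group_id)],
        }
        for group_id in emr_group_ids
    ]
```

Each returned dictionary matches the keyword arguments of the boto3 EC2 client's authorize_security_group_ingress call, so a script could loop over the list and pass each one with `**request`.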
When you have finished the above steps, your Zeppelin instance configuration is complete.
5. SSH Access
Next, you need an SSH key to connect to the instance you launched. This is easier if you already have a .pem file. If you do not have one, create a key pair named Zeppelin and download its .pem file. Protect this file, because you will need it for everything you do from this point onwards. If you already have a key, choose the existing key pair when launching instances.
Note the IP address when your instance starts, then change the .pem file permissions by typing chmod 600 Zeppelin.pem in the Linux or macOS terminal, so that only your user can read the file.
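The effect of chmod 600 can be reproduced and checked from Python on Linux or macOS. This is a small sketch using only the standard library. The function name is invented for the example, and you would pass it the path to your own .pem file.

```python
import os
import stat


def restrict_key_file(path: str) -> int:
    """Give the owner read/write access only (mode 600) and return the resulting mode bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # same as 0o600
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling restrict_key_file("Zeppelin.pem") on a POSIX system should return the integer 0o600, which confirms that group and other users have no access to the key.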
Connect to your Zeppelin EC2 instance using SSH. If you are using PuTTY, follow the instructions in the Connecting to Your Linux Instance from Windows Using PuTTY topic.
Now it is time to install Zeppelin. In your browser, go to the Zeppelin download page and find the binary package with all interpreters. Copy the link under your suggested mirror and go back to the terminal where you have your SSH session. To fetch a copy of Zeppelin, type wget followed by the copied link, then press Return and wait for the download.
After the download is complete, untar the Zeppelin archive and press Return again.
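The download and untar steps can also be done with Python's standard library, which may be handy on a machine without wget. This is a sketch only. The mirror link is whatever you copied from the download page, and the function names are made up for this example.

```python
import tarfile
import urllib.request


def download(url: str, dest_path: str) -> str:
    """Fetch the archive at url (the mirror link you copied) and save it to dest_path."""
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


def untar(archive_path: str, dest_dir: str = ".") -> list:
    """Extract the archive into dest_dir and return its top-level entry names."""
    # The default read mode detects gzip and other compression automatically
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return sorted({name.split("/")[0] for name in names})
```

Only extract archives you trust, such as one downloaded from an official mirror, because extractall writes whatever paths the archive contains.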
7. Start Zeppelin
Once your SSH session is set up and Zeppelin is unpacked, you can start your Zeppelin instance from the Zeppelin directory by typing sudo bin/zeppelin-daemon.sh start. Then switch to a browser window on your client and test your instance by going to http://<yourZeppelinInstanceIP>:8080/#/
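If the page does not load, a quick scripted check can tell you whether the instance answers on port 8080 at all. The sketch below builds the same URL as above and tries a single request. The helper names and the five-second timeout are arbitrary choices for illustration.

```python
import urllib.error
import urllib.request


def zeppelin_url(instance_ip: str, port: int = 8080) -> str:
    """Build the notebook URL for a Zeppelin instance."""
    return f"http://{instance_ip}:{port}/#/"


def is_reachable(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the server answers with HTTP 200, and False on errors or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A False result usually points to the daemon not running yet or to a security group that does not allow your client's IP address on port 8080.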
On the Zeppelin home page, choose Import Note, then enter the location of the Zeppelin Tutorial JSON file: http://ift.tt/1OQssAL
Creating a Zeppelin instance is quite complex and may call for an expert. However complex the setup is, Zeppelin's many capabilities make it a must-have that makes work easier.