Getting started with Cloudera Hadoop on vSphere

This past week, my buddy Paudie and I have been neck-deep in Cloudera/Hadoop, with a view to getting it successfully deployed on vSphere. The purpose of this was solely a learning exercise, to try to understand what operational considerations need to be taking into account when running Hadoop on top vSphere. These operational considerations range from items such as maintenance mode, rack awareness, high availability, replication and protection of the data. Both Cloudera/Hadoop and vSphere offers ways to do all of this, so the longer term objective is to figure out whether or not these features are compatible, and whether one mechanism is preferred over the other, and so on. However, before we could get started on this, we needed to get Cloudera/Hadoop deployed. Oh, the fun we had! This post is basically to give you enough details to deploy Cloudera/Hadoop in virtual machines running on top of vSphere using “single-user mode” (to avoid using root). This is purely for evaluation purposes; this is not for production use. You should refer to Cloudera documentation if you want to deploy an environment for production use on top of vSphere – this is not what I am showing you here.

Why Cloudera?

Well, there is one reason that we chose the Cloudera distribution, and that was for the Cloudera Manager. This is an interface that allows you to create clusters, add and remove nodes, enable and disable services, and so on. Operationally, this is most important, as Cloudera Manager allows you to spin up new services (HBase, Impala, Spark, etc) with just a few clicks. It has lots of good managing and monitoring options built-in as well, so that is why we went with the Cloudera distribution.

Base VM Template

For our testing, we chose Centos Linux distribution (we actually tried 7.3 and 7.4). We created a template VM with 2 vCPUs and 12GB of memory (this may not be best practices, but this is what we chose). As well as a 40GB base disk for the OS, we added an additional 250GB disk for the HDFS (Hadoop filesystem). We also placed this second disk on its own PVSCSI controller on the VM. Please note that this sizing was decided upon with “trial and error” as we had some difficulty tracking down recommended sizes. Once the VM was built from a virtual hardware perspective, we went ahead and installed the Centos OS. A few tips during the deploy:

As we are going to do a single user deploy, add a user called cloudera-scm during the OS install. This user will be used for our single-user Cloudera deployment.
Ensure that this user is a superuser by adding it to the wheel group to allow it to “sudo”. This step is also available during the installation.
Ensure that networking is enabled/turned on, which it isn’t by default on Centos. Again, this is an option during the install.
Ensure that DNS is working, both for forward and reverse lookups.

Centos Guest Configuration

Once the OS is installed, there are a number of additional configuration steps needed.

Remove SELinux. Cloudera will not deploy with SELinux enabled, which it is by default. You can disable it by editing /etc/selinux/config and setting SELINUX=disabled. This requires a reboot.
Disable the firewall using the commands:
- systemctl stop firewalld
- systemctl disable firewalld
Centos 7.4 also has a NAT interface which Cloudera will report as incorrectly configured. We simply removed it as follows:
- systemctl disable libvirtd.service
Cloudera requires that the aggressiveness of VM swapping be tuned to a low value of 10. To make the setting persist through reboots, add the entry vm.swappiness=10 to /etc/sysctl.conf
Many Linux distros, including Centos, have a feature called Transparent Huge Pages, which interacts poorly with Hadoop workloads and can seriously degrade performance. The recommendation is to disable the feature by editing /etc/rc.local and adding the following entries to make the changes persist through reboots:
- echo never > /sys/kernel/mm/transparent_hugepage/defrag
- echo never > /sys/kernel/mm/transparent_hugepage/enabled
To make working in single user mode, easier, we recommend modifying the %wheel entry via the visudo command, and allow sudo commands to be run without passwords:
- Comment out the entry: %wheel ALL=(ALL) ALL: ALL
- Add the entry: %wheel ALL=(ALL) NOPASSWD: ALL
Format the 250GB disk drive. With fdisk, build a single primary partition that covers the whole of the disk. The filesystem format we chose was XFS.
- fdisk /dev/sdb, (n)ew partition 1, whole of the disk
- mkfs.xfs /dev/sdb1
Mount the newly created filesystem on /data/1, and mount it on reboots. Use the noatime option.
- mkdir -p /data/1/
- vi /etc/fstab and add the line to the end of the file:
  - /dev/sdb1 /data/1 xfs defaults,noatime 1 2
Verify that the filesystem is mounted after reboots by using mount -a.
Change ownership and privileges of the mount point as follows:
- chown -R cloudera-scm:cloudera-scm /data/1/

Once this setup is complete, you can convert the VM to a template and deploy out 7 VMs.

VMs required for deployment

A total of 7 VMs are needed. It is recommended to use Guest OS customization so that each VM gets its own ID and hostname.

4 x data nodes VMs
2 x name nodes VMs
1 x Cloudera Manager VMs

Once the VMs are deployed, verify its hostname, and that DNS is working both forward and reverse lookups. If your distro does not include the nslookup command for verifying DNS, it can be installed with the command:

sudo yum install bind-utils

Preparing the Cloudera Manager VM

Since we will be doing this deployment in single user mode, a number of directories need to be created in advance on the Cloudera Manager VM. Here is the list of directories, along with the appropriate commands to create them and changed ownership. This is the drawback with doing single user mode, in that all of these have to be created manually.

sudo mkdir -p /var/lib/cloudera-host-monitor
sudo chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-host-monitor
sudo mkdir -p /var/lib/cloudera-service-monitor
sudo chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-service-monitor
sudo mkdir -p /var/log/cloudera-scm-eventserver/
sudo chown -R cloudera-scm:cloudera-scm /var/log/cloudera-scm-eventserver/
sudo mkdir -p /var/log/cloudera-scm-navigator/
sudo chown -R cloudera-scm:cloudera-scm /var/log/cloudera-scm-navigator/
sudo mkdir -p /var/lib/cloudera-scm-navigator/temp
sudo chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-scm-navigator/temp
sudo mkdir -p /var/lib/cloudera-scm-navigator/solr
sudo chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-scm-navigator/solr
sudo mkdir -p /var/lib/cloudera-scm-eventserver
sudo chown -R cloudera-scm:cloudera-scm /var/lib/cloudera-scm-eventserver

Preparing the data nodes and name nodes

Again, because we plan to do this in single user mode, necessary directories also need to be created on the data nodes and name nodes. My initial plan is to deploy HDFS (Hadoop filesystem) and YARN (Yet Another Resource Negotiator) which also provides the MapReduce part of Hadoop. If I wanted to deploy other services, other directories would need to be created. I created a short script on my Cloudera Manager VM to create the directories on my other 6 nodes:

for i in cor-datanode-01 cor-datanode-02 cor-datanode-03 cor-datanode-04 cor-namenode-01 cor-namenode-02
do
ssh cloudera-scm@${i} sudo mkdir -p /var/lib/hadoop-httpfs
ssh cloudera-scm@${i} sudo mkdir -p /var/lib/hadoop-hdfs
ssh cloudera-scm@${i} sudo mkdir -p /var/log/hadoop-hdfs
ssh cloudera-scm@${i} sudo mkdir -p /var/lib/hadoop-hdfs/audit
ssh cloudera-scm@${i} sudo mkdir -p /var/lib/hadoop-yarn
ssh cloudera-scm@${i} sudo mkdir -p /var/log/hadoop-yarn
ssh cloudera-scm@${i} sudo mkdir -p /var/log/hadoop-mapreduce
ssh cloudera-scm@${i} sudo mkdir -p /var/run/hdfs-sockets
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/lib/hadoop-hdfs
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/lib/hadoop-httpfs
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/log/hadoop-hdfs
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/lib/hadoop-hdfs/audit
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/lib/hadoop-yarn
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/log/hadoop-yarn
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/log/hadoop-mapreduce
ssh cloudera-scm@${i} sudo chown -R cloudera-scm:cloudera-scm /var/run/hdfs-sockets/
done

Note that the /var/run/hdfs-sockets file is removed on reboot. You may want to add the mkdir and chown commands for this file into /etc/rc.local on each node so that the file is recreated on reboot.

One drawback of this script is that I have to provide all the passwords each time. A better approach might be to use keys which will allow you to ssh and run ssh commands without a password. To do this, generate your keys on the Cloudera Manager VM using the command ssh-keygen, and click-through the default responses. Then copy the keys to each host as follows:

for i in cor-datanode-01 cor-datanode-02 cor-datanode-03 cor-datanode-04 cor-namenode-01 cor-namenode-02
do
sudo ssh-copy-id -i ~/.ssh/id_rsa.pub cloudera-scm@${i}
done

This will allow the above script to work without requesting passwords. Remember the folders above have only been created for HDFS and YARN services. More folders will need to be created if you want to enable additional services. This is why single-user mode is not recommended for anything other than proof-of-concepts and “kicking the tires”, so to speak.

Cloudera Deployment

Now we are ready to go with the Cloudera Deployment.

The next step is to pull down the binaries and start the installation. You need to be logged in as user cloudera-scm on your Cloudera Manager VM. You also need to have the wget utility installed. If it is not installed, simply run sudo yum install wget. Then run the following command as per the Cloudera documentation: wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin. When the binary is downloaded to the Cloudera Manager VM, change the permissions as follows: chmod u+x cloudera-manager-installer.bin and then run the command: sudo ./cloudera-manager-installer.bin. The following shortened video of the process (reduced to 30 seconds) will demonstrate what happens at this stage of the install. These videos do not include any voice-over – they simply display the steps involved rather than including a bunch of screenshots.

The next step of the process is as per the final directive in the previous install. You need to point a browser at the IP address of the Cloudera Manager, port 7180. Then log in with credential admin/admin, and then proceed to select the VMs that will become your name nodes (primary and backup) and data nodes (x4). I chose 4 data nodes as by default HDFS replicates each block 3 times, this gives me some availability in the event of loosing a node. In this example, I omitted the Cloudera Manager from the list of nodes. However, you should also include the Cloudera Manager, as per the instructions in the installer. Note that at this point, it is only CDH that is being deployed. I have deliberately chosen not to deploy services at this point. We will do those next. You will also see that this is where single-user mode is selected. The shortened video (reduced to 4 minutes from ~25 minutes ) also captures a failed attempt to deploy, but it also shows how a retry can recover the situation. Note that this is a trial license deployment.

The final step above is a host inspection to ensure that all of the nodes were deployed successfully and are healthy. Any issues discovered during the inspection should be addressed before proceeding with deploying services.

If the inspection passes all tests, we are now onto our final step which is to add some services to our cluster. As I said, I would add HDFS and YARN only. This video shows you the various steps involved in deploying out these two services from Cloudera Manager. It is quite straight forward, and one of the reasons we chose Cloudera. We already pre-created the necessary directories on all of the nodes for these two services earlier on. This video is in real-time and lasts approx 4 minutes. I did not wait for the YARN service to start, only HDFS.

The main points from the service deployment is the fact that we have chosen to use an internal database rather than an external database. Again, this is simply because we are doing a proof-of-concept. For anyone doing a production deployment, you will need an external database. The service roll out also displays all the advanced settings and folders/directories required by the service, which we have already created above.

If everything deploys successfully, you should see a Cloudera Manager landing page similar to the following:

Conclusion

So as you can see, there is quite a bit of work to get done in the background before enabling any big data like services. However, once the deployment is completes, Cloudera Manager makes it very easy to add additional services, such as Impala, HBase, and so on. The difficulty with single-user mode however, as mentioned earlier, is that there are usually a number of folders that needs to be manually created in order for this to work. In fact, we’d recommend not doing a single-mode deployment due to these issues.

Now, our next step is to examine what this means operationally for Cloudera running on top of vSphere (and later on vSAN). Check back from time to time, and I’ll report on our findings.

Question for the Experts

Has anyone tried to deploy Cloudera Enterprise with single user mode before? We tried to take this deployment a step further by enabling HA via Cloudera Manager, but it constantly failed to enable. It seems that it was trying to do something as the hdfs user, rather than the single user and constantly hit “Failed to initialize Shared Edits Directory of NameNode”. Is this a limitation of single-user mode? Has anyone successfully done it? Once we moved away from single-user mode, we were able to deploy and enable HA successfully.