Hadoop Administration, Troubleshooting, and Security Video Tutorial


12.1 Hadoop Administration, Troubleshooting, and Security

Hello and welcome to the Big Data and Hadoop Developer course offered by Simplilearn. This lesson will focus on Hadoop administration, troubleshooting, and security.

12.2 Objectives

After completing this lesson, you will be able to: •List the commands used in Hadoop programming •Explain the different configurations of Hadoop cluster •Identify the different parameters for performance monitoring and tuning •Explain the configuration of security parameters in Hadoop

12.3 Typical Hadoop Core Cluster

A typical Hadoop Core cluster is composed of machines that run a set of cooperating server processes. Machines in the cluster are not required to be homogeneous. If the machines have similar processing power, memory, and disk bandwidth, the cluster administration becomes easier. In such a case, only one set of configuration files and runtime environments needs to be maintained and distributed.

12.4 Load Balancer

Hadoop has to balance the data load across the DataNodes of the cluster as users and applications write data. This balancing of the data load is performed using the balancer tool. Use ‘start-balancer.sh’ to start the balancer and ‘stop-balancer.sh’ to stop it.
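As a sketch, the balancer is typically invoked as shown below. The -threshold value is an illustrative assumption, not a value from the lesson; it sets how far, in percent, a DataNode's utilization may deviate from the cluster average. Since these scripts need a running cluster, the live commands are shown as comments and only the command line we would issue is built here:

```shell
# Illustrative only -- requires a running HDFS cluster:
# start-balancer.sh -threshold 5   # rebalance until each DataNode is within 5% of average utilization
# stop-balancer.sh                 # stop a balancer that is still running

# Make the chosen threshold explicit and show the command we would run:
THRESHOLD=5
echo "start-balancer.sh -threshold ${THRESHOLD}"
```

A lower threshold yields a more evenly balanced cluster but makes the balancer run longer.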

12.5 Commands Used in Hadoop Programming

The Application Master is expected to run on the machine on which the scripts are executed. The Hadoop Core servers load their configurations from the files in the configuration directory of the Hadoop Core installation. Let’s discuss some of the commands used in Hadoop programming: slaves.sh runs its arguments on each of the hosts listed in the conf/slaves file; start-mapred.sh starts the Hadoop MapReduce servers, that is, the Application Master and Node Managers; stop-mapred.sh stops the Hadoop MapReduce servers, the Application Master, and the Node Managers.
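These control scripts need a configured cluster with SSH access to the hosts in conf/slaves, so the following is only a sketch; the 'uptime' argument is an illustrative assumption:

```shell
# Illustrative only -- these scripts require a configured Hadoop cluster:
# slaves.sh uptime    # run 'uptime' on every host listed in conf/slaves
# start-mapred.sh     # start the MapReduce daemons across the cluster
# stop-mapred.sh      # stop them again

# Build the slaves.sh invocation we would issue:
CMD="uptime"
echo "slaves.sh ${CMD}"
```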

12.6 Different Configuration Files of Hadoop Cluster

Configuration files are responsible for configuring the system for a specific task. Following are the configuration files of a Hadoop cluster:
•hadoop-env.sh sets Hadoop environment settings such as the Java path and security settings
•core-site.xml defines the NameNode and the HDFS temporary directory
•mapred-site.xml defines the number of reducers, mappers, and other settings related to MapReduce operations
•masters specifies the Secondary NameNode in a clustered environment
•slaves specifies the DataNodes in a clustered environment

12.7 Properties of hadoop-default.xml

hadoop-default.xml is used for setting up the parameters that maintain consistency in the Hadoop cluster with respect to distributed computing. The properties defined through hadoop-default.xml fall into six groups: Global, Logging, I/O, File system, MapReduce, and IPC properties. Click each property to know more.

12.8 Hadoop Cluster–Critical Parameters

Global properties refer to the settings that must be maintained throughout the cluster. Logging properties refer to the settings related to log generation and maintenance. I/O properties relate to the input and output operations to and from an HDFS cluster. File system properties relate to the input and output files during job execution. MapReduce properties refer to the settings related to proper job execution such as the number of mappers. IPC properties refer to the settings related to inter-process communication.

12.9 Hadoop DFS Operation–Critical Parameters

Let’s now look at the critical parameters that must be configured for any Hadoop cluster and DFS operation. The three critical parameters are as follows. The hadoop.tmp.dir parameter specifies the temporary directory used by both the local file system and HDFS. The fs.default.name parameter specifies the NameNode machine’s hostname and port number. The mapred.job.tracker parameter defines the host and port on which the MapReduce Application Master runs.
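As a sketch, these three parameters might appear in the configuration files as follows. The hostname ‘namenode’, the ports, and the temporary path are placeholders for illustration, not values from the lesson:

```xml
<!-- core-site.xml (hostname, port, and path are placeholders) -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:9000</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>namenode:9001</value>
</property>
```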

12.10 Port Numbers for Individual Hadoop Services

The table shows the individual port numbers for specific services that can be accessed via the NameNode IP. Please note that these ports may vary in different commercial distributions.

12.11 Performance Monitoring

The performance of a cluster needs to be monitored to ensure that resources are properly allocated and de-allocated for optimum utilization and that no resources sit idle. The Hadoop framework provides several APIs that allow external agents to provide monitoring services for the Hadoop Core services. Some agents used for performance monitoring are JMX, Nagios, Ganglia, Chukwa, and FailMon.

12.12 Performance Tuning

Performance tuning is a method that helps a specific job run faster and better by making the resources participate actively in that job. The factors considered during performance tuning are network bandwidth, disk throughput, CPU overhead, and memory.

12.13 Parameters of Performance Tuning

Performance tuning is done using the following parameters:
•dfs.datanode.handler.count sets the number of server threads for the DataNode
•dfs.datanode.du.reserved reserves space, in bytes per volume, for non-HDFS use
•dfs.replication sets the replication factor
•fs.checkpoint.dir specifies the directory on the Secondary NameNode’s local file system where the temporary images and edits are stored and merged when a checkpoint is needed
•mapred.local.dir.minspacestart stops scheduling new job tasks when the free space in the local task directories falls below this value
•dfs.block.size changes the block size; the default is 64MB
•dfs.name.edits.dir determines the exact location in the local file system where the NameNode stores its transaction, or edits, file
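A minimal hdfs-site.xml sketch touching a few of these parameters might look like this. The values are illustrative choices taken from the optimization demo later in this lesson, not general recommendations; tune them per cluster:

```xml
<!-- hdfs-site.xml (illustrative values) -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128MB in bytes, up from the 64MB default -->
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>5</value>
</property>
```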

12.14 Troubleshooting and Log Observation

Logs are important to administrators when troubleshooting the Hadoop cluster. Remember the following points during troubleshooting and log observation:
•Logs are named in the hadoop-username-service-machinename format. An example is hadoop-sl000-datanode-DNode1.log.
•Always check the logs first when troubleshooting.
•Check the Java exceptions and error messages in case of errors during MapReduce job execution.
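A minimal sketch of this kind of log triage: grep a DataNode log for error messages and Java exceptions. The log excerpt below is a made-up sample written to a temporary path; real logs live under the Hadoop installation's logs directory:

```shell
# Create a small sample DataNode log (fabricated content, for illustration only):
cat > /tmp/hadoop-sl000-datanode-DNode1.log <<'EOF'
2015-06-01 10:02:11 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG
2015-06-01 10:02:12 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs
2015-06-01 10:02:12 INFO  org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG
EOF

# Show only the lines that explain why the service failed:
grep -E 'ERROR|Exception' /tmp/hadoop-sl000-datanode-DNode1.log
```

Here the surviving line points directly at the cause (incompatible namespace IDs), which is exactly the failure walked through in the demo that follows.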

12.15 Apache Ambari

Apache Ambari is an open-source operational framework that enables system administrators to provision, manage, and monitor a Hadoop cluster, as well as integrate Hadoop with enterprise operational tools.

12.16 Key Features of Apache Ambari

Following are a few key features of Apache Ambari:
•Wizard-driven installation of Hadoop across any number of hosts
•API-driven installation of Hadoop via Ambari Blueprints for automated provisioning
•Granular control of Hadoop service and component lifecycles
•Management of Hadoop service configurations, plus advanced job diagnostic and visualization tools
•Robust RESTful APIs for customization and integration with enterprise systems

12.17 Business Scenario

Olivia is the EVP of IT operations at Nutri Worldwide, Inc., which has started using Hadoop predominantly for data processing and analysis. Few employees in the company have experience with Hadoop; however, the company needs to start using it. This has resulted in some common errors, such as slower response times, preventing a smooth workflow. Olivia wants to prevent such events in the future and make Hadoop scalable, organized, and effective in her organization.

12.18 Troubleshooting a Missing DataNode Issue Demo 01

First, create a missing-DataNode issue so that you can troubleshoot it. Use the command shown on the screen to reformat the NameNode. Press Enter to continue. At the re-format question, type uppercase Y and press Enter to continue. Type clear and press Enter. The format is successfully performed.

Use the command shown on the screen to start all the services. Press Enter. All the Hadoop services have successfully started. Use the jps command to check the status of the Hadoop services. Press Enter. You will observe that the DataNode service is missing. You may face this issue while upgrading or downgrading the hardware of the cluster. Press Enter.

The best way to understand the reason for such issues is to read the log file. Use the command shown on the screen to open the log file for the DataNode. Press Enter. You will see a large amount of data displayed; observe the highlighted part, which shows the reason the DataNode service did not start: the namespace IDs of the NameNode and DataNode do not match. As a technician, you need to note down the NameNode namespace ID, that is, 1861898000. Type clear and press Enter.

Use the command shown on the screen to open the location and rewrite the namespace ID for the DataNode. Press Enter. Delete the old namespace ID, write the namespace ID as 1861898000, and save the file. Use the command shown on the screen to stop the service. Press Enter. Use the command shown on the screen to start the service again. Press Enter. Let’s verify whether the DataNode is active: type jps and press Enter. You will now see that the DataNode service is successfully restored.
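The namespace-ID repair narrated above can be sketched as follows, simulated here on a local mock copy of the DataNode's VERSION file. The real file lives under the DataNode's data directory (typically dfs.data.dir/current/VERSION); the path, old ID, and layoutVersion below are made up for illustration, while 1861898000 is the NameNode namespace ID from the demo's log:

```shell
# Simulate the DataNode's storage directory with a mock VERSION file:
mkdir -p /tmp/dfs-data/current
cat > /tmp/dfs-data/current/VERSION <<'EOF'
namespaceID=1234567890
storageType=DATA_NODE
layoutVersion=-32
EOF

# Rewrite the DataNode's namespaceID to match the NameNode's (taken from the log):
sed -i 's/^namespaceID=.*/namespaceID=1861898000/' /tmp/dfs-data/current/VERSION

# Confirm the change:
grep '^namespaceID=' /tmp/dfs-data/current/VERSION   # → namespaceID=1861898000
```

On a real cluster you would stop the DataNode, edit its VERSION file the same way, and start the service again, as the demo shows.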

12.19 Optimizing a Hadoop Cluster Demo 02

Let’s create a chunk of data and perform sorting. Type the command shown on the screen to create the chunk and press Enter. A MapReduce operation is performed to generate the data. Let’s check the generated data in the GUI: click Browse the file system, click the data link, and then click the demoinput link. Observe that 500MB of data has been generated and the default block size is 64MB. Click Go back to DFS home.

Now let’s perform a sorting operation on this data. Type the command shown on the screen to perform the sort operation and press Enter. This starts the MapReduce operation that performs terasort. You can check the job status in the MapReduce GUI; note the address. Click the terasort job to see the job status and completion time. This page shows that the operation completed in 3 minutes and 30 seconds.

Let’s try to perform some optimization. Press Enter. Open hdfs-site.xml: type the command shown on the screen and press Enter. Set the hdfs-site.xml parameters: dfs.replication as 2, dfs.block.size as 128MB, dfs.namenode.handler.count as 20, and dfs.datanode.handler.count as 5. Press Enter. Open mapred-site.xml: type the command shown on the screen and press Enter. Set the mapred-site.xml parameters now, and press Enter once you have set all the values.

You need to delete the demo output and input files. The command to delete the demo output file is shown on the screen; press Enter. The command to delete the demo input file is shown on the screen. Stop the Hadoop services and start them again: the command to stop the Hadoop services is stop-all.sh; press Enter. The command to start them is start-all.sh; press Enter. Ensure that all services are active using the jps command; press Enter. Type clear and press Enter.

Let’s re-create the data using teragen. The command is shown on the screen; press Enter. The MapReduce operation will start generating the data file. Press Enter. Let’s now perform terasort on the generated data. The command is shown on the screen; press Enter. Let’s check the data in the GUI: click the data link, then click the demoinput file. You will observe that the block size is now 128MB. Let’s check the MapReduce job status: click the second job to find the job execution time. Because we run this example in pseudo-distributed mode, the time taken is longer than the previous run; however, if you optimize a real cluster this way, the execution time will decrease, giving more throughput. Thus we have successfully performed the optimization process.
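The teragen and terasort invocations referred to above can be sketched as follows. The jar name, HDFS paths, and row count are assumptions consistent with the ~500MB figure in the demo (teragen writes 100-byte rows, so 5,000,000 rows is about 500MB); they require a running cluster, so the live commands appear as comments:

```shell
# Illustrative only -- requires a running cluster and the Hadoop examples jar:
# hadoop jar hadoop-examples.jar teragen 5000000 /data/demoinput    # generate ~500MB of rows
# hadoop jar hadoop-examples.jar terasort /data/demoinput /data/demooutput
# hadoop fs -rmr /data/demooutput    # remove old output before re-running

# teragen rows are 100 bytes each, so compute the expected data size in MB:
ROWS=5000000
SIZE_MB=$((ROWS * 100 / 1024 / 1024))
echo "teragen ${ROWS} rows -> approx ${SIZE_MB}MB"
```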

12.20 Hadoop Security—Kerberos

So far, we have discussed Hadoop configuration and troubleshooting. Let us now discuss Hadoop security in detail. Hadoop relies on Kerberos for secure authentication. Kerberos is a third-party authentication mechanism in which users and services rely on a Kerberos server for authentication. The Kerberos server, also known as the Key Distribution Center or KDC, has three parts:
•Database
•Authentication Server
•Ticket Granting Server
The database holds the users and services, known as principals, and their respective Kerberos passwords. The Authentication Server or AS performs the initial authentication and issues a Ticket Granting Ticket or TGT. The Ticket Granting Server or TGS issues subsequent service tickets based on the initial TGT.

12.21 Kerberos—Authentication Mechanism

The steps of the Kerberos authentication mechanism are as follows. Step 1: A user principal requests authentication from the AS. Step 2: The AS returns a TGT that is encrypted using the user principal's Kerberos password. Step 3: The user principal decrypts the TGT locally using its Kerberos password. A service principal uses a special file called a keytab, which contains its authentication credentials, so that it can decrypt the TGT without providing a password each time.
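From the command line, the steps above correspond to standard MIT Kerberos client commands, sketched below. The realm, principal names, and keytab path are assumptions for illustration; the commands need a reachable KDC, so they are shown as comments:

```shell
# Illustrative only -- requires a configured KDC (realm and names are placeholders):
# kinit alice@EXAMPLE.COM                # steps 1-3: request, receive, and decrypt a TGT with a password
# kinit -kt /etc/security/keytab/nn.service.keytab nn/namenode.example.com@EXAMPLE.COM
#                                        # a service principal authenticates via its keytab instead
# klist                                  # inspect the cached TGT and service tickets

# Build the service-principal kinit invocation we would issue:
PRINCIPAL="nn/namenode.example.com@EXAMPLE.COM"
echo "kinit -kt /etc/security/keytab/nn.service.keytab ${PRINCIPAL}"
```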

12.22 Kerberos Configuration—Steps

The key steps for Kerberos configuration in a Hadoop cluster include:
•Installing the Key Distribution Center or KDC
•Configuring the KDC
•Creating the Kerberos database
•Setting up the first user principal for the administrator
•Starting Kerberos
•Creating service principals for the NameNode, DataNode, Application Master, and Node Manager
•Installing the Java Cryptography Extension or JCE Unlimited Strength Jurisdiction Policy File on all machines
•Creating a mapping between service principals and UNIX usernames
•Adding information to the three main service configuration files: core-site.xml, hdfs-site.xml, and mapred-site.xml
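The database-and-principal steps above can be sketched as an MIT Kerberos admin session. The realm, hostnames, and keytab path are placeholder assumptions, and the commands must run on the KDC host, so they are shown as comments:

```shell
# Illustrative only -- run on the KDC host (realm and hostnames are placeholders):
# kdb5_util create -s                                       # create the Kerberos database
# kadmin.local -q "addprinc admin/admin@EXAMPLE.COM"        # first admin (user) principal
# kadmin.local -q "addprinc -randkey nn/namenode.example.com@EXAMPLE.COM"    # NameNode service principal
# kadmin.local -q "addprinc -randkey dn/datanode1.example.com@EXAMPLE.COM"   # DataNode service principal
# kadmin.local -q "ktadd -k /etc/security/keytab/nn.service.keytab nn/namenode.example.com@EXAMPLE.COM"

# Build one of the kadmin queries we would issue:
REALM="EXAMPLE.COM"
echo "addprinc -randkey nn/namenode.example.com@${REALM}"
```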

12.23 Data Confidentiality

Hadoop also provides the following mechanisms for maintaining data confidentiality in its cluster. Data encryption on RPC secures data transfer between Hadoop services and clients; to activate it, set hadoop.rpc.protection to ‘privacy’ in core-site.xml. Data encryption on block data transfer secures the DataNode transfer protocol; to activate it, set dfs.encrypt.data.transfer to ‘true’ in hdfs-site.xml. Data encryption on HTTP protects data transfer between the web console and clients using SSL or HTTPS. Click the URL shown to refer to your distribution’s security guide on how to activate these mechanisms.
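The two encryption switches named above can be sketched as the following property entries, using the property names, values, and file names stated in this lesson:

```xml
<!-- core-site.xml: encrypt RPC traffic between clients and Hadoop services -->
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt the DataNode block data transfer protocol -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```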

12.25 Quiz

The following are a few questions to test your understanding of the concepts discussed here.

12.28 Summary

Let us summarize the topics covered in this lesson: •Hadoop can be optimized based on infrastructure and available resources. •Hadoop is an open-source application, and the support provided for complicated optimization is less. •Optimization is performed through .xml files. •Logs are the best medium through which an administrator can understand a problem and troubleshoot it accordingly. •Hadoop relies on the Kerberos-based security mechanism.

12.29 Thank you

With this, we conclude the last lesson of the Big Data and Hadoop Developer course. Thank you and happy learning!

12.26 Case Study

Scenario: XY Networks provides network security support to many organizations. It has system-generated log files that are critical for security analysis and monitoring. These files are growing in size, and the company is running out of storage space. It also uses an expensive and obsolete backup mechanism for these files. The company was given an estimate of 5 million dollars to upgrade its storage and backup mechanism. Its IT team suggests that storage costs can be reduced by 90 percent by using Hadoop. A cluster of more than 100 machines is required to set up and maintain Hadoop and other ecosystem products. The IT team has heard of Ambari, which can help monitor the cluster. Click Analysis to know the company’s next move.

Analysis: The IT team researches Ambari and finds that it can be used to monitor Hadoop and other ecosystem tools such as Hive, HBase, and Oozie. It also interacts with machine-monitoring tools like Nagios and Ganglia, helps add new machines or remove machines for maintenance, and alerts administrators in case of a resource outage. Some advantages of using Ambari are:
1.A single web-based dashboard for all the tools.
2.Provisioning resources and machines from any place.
3.Health checks of all servers.
4.Easy configuration.
Click Solution for the steps to install Ambari to monitor the clusters.

12.27 Case Study - Demo

Solution: Perform the following steps to set up a three-node Hadoop cluster with Hadoop, Hive, Pig, HBase, Oozie, Sqoop, Flume, and Spark, and install Ambari to monitor the cluster:
1.Check the dashboard for all the tools installed.
2.Check the services and their status.
3.Check the host machines installed.
4.Check alerts in case of a server malfunction.
5.Check resource usage such as memory, disk, and network.