Hadoop Developer - ZooKeeper, Sqoop, and Flume Video Tutorial

10.1 ZooKeeper, Sqoop, and Flume

Hello and welcome to lesson 11 of the Big Data and Hadoop Developer course offered by Simplilearn. This lesson will focus on ZooKeeper, Sqoop, and Flume.

10.2 Objectives

After completing this lesson, you will be able to:
• Explain ZooKeeper and its role
• List the challenges faced in distributed processing
• Install and configure ZooKeeper
• Explain the concept of Sqoop
• Install and configure Sqoop
• Explain the concept of Flume
• Configure and run Flume

10.3 Why ZooKeeper

Apache Hadoop, when introduced, became a popular technology for the management of Big Data. Consequently, different services like HBase, Storm, and Kafka were added as part of a Hadoop cluster. With increasing services and computing nodes, there was a need for integrated management. This is where ZooKeeper comes in: it coordinates these services and supports high availability of data processing. Consider a distributed system with multiple servers, which are responsible for holding data and performing operations on that data. How do you determine the servers that are alive and operating at any given moment? Or, how do you determine the servers available to process a build in a distributed build system?

10.4 What is ZooKeeper

Contrary to popular belief, ZooKeeper is not used to store application data; it stores coordination nodes and checks whether clients are available. It is a fast, highly available, fault-tolerant, distributed coordination service. With ZooKeeper, you can build reliable, distributed data structures for group membership, leader election, coordinated workflow, and configuration services, as well as generalized distributed data structures such as locks, queues, barriers, and latches.

10.5 Features of ZooKeeper

Some of the salient features of ZooKeeper are as follows: ZooKeeper provides a simple and high-performance kernel for building complex clients. It also provides distributed coordination services for distributed applications. ZooKeeper follows FIFO, the First-In-First-Out approach, when it comes to job execution. It allows for synchronization, serialization, and coordination of nodes in a Hadoop cluster. It comes with a pipelined architecture to achieve a wait-free approach. Further, ZooKeeper manages problems by using built-in algorithms for deadlock detection and prevention. It applies a multi-processing approach to reduce wait time for process execution. In addition, ZooKeeper allows for distributed processing and is thus compatible with services related to MapReduce.

10.6 Challenges Faced in Distributed Applications

There are certain challenges faced in distributed applications. Coordination of nodes in a cluster is error-prone. Race conditions in job execution and synchronization, and the detection and prevention of deadlocks, are challenging. There is a chance of partial failure of job execution, resulting in the restarting of all tasks. Further, there are inconsistencies due to hardware or service failures.

10.7 Coordination

Some of the key points related to coordination are as follows:
Group membership: It refers to the introduction of a new node in the cluster to perform synchronization and distributed jobs.
Leader election: In a distributed environment, leader election is the selection of a leader node in a cluster to enable job allocation and monitoring.
Dynamic configuration: It refers to flexibility in configuring a node according to the type of job to be executed.
Critical sections: These are the areas in distributed processing used to store the global variables required to perform job executions.
Status monitoring: This refers to monitoring the status of nodes in the cluster, such as process heartbeats, RAM synchronization, and others.
Queuing: This refers to the generation of a queuing mechanism to perform multiple jobs at a stipulated time.

10.8 Goals and Uses of ZooKeeper

The goals and uses of ZooKeeper are described here. ZooKeeper has a number of goals: serialization ensures the avoidance of delay in read and write operations; reliability ensures that an update applied by a user in the cluster persists until it is overwritten by another update; atomicity does not allow partial results, so any user update either succeeds or fails completely; and a simple Application Programming Interface, or API, provides an interface for development and implementation. There are many uses of ZooKeeper. Configuration ensures the nodes in the cluster are in sync with each other and with the NameNode server. Message queuing enables communication with the nodes present in the cluster. Notification refers to the process of notifying the NameNode of any failure in the cluster so that the specific task can be restarted on another node. Further, synchronization ensures all nodes in the cluster are in sync with each other and that the services are working well.

10.9 ZooKeeper Entities

ZooKeeper comprises three entities: leader, follower, and observer. Leader, as a system, is responsible for initiating the process and ensuring that the nodes in the cluster are in sync with the process that is executed. Only one leader can exist in a cluster. Follower is the system that obeys the leader, accepts the job or the messages from the leader, and performs them. There can be several followers in a cluster. Observer is a system that observes the nodes to ensure efficiency and job completion. The observer helps the leader to assign specific types of jobs to different nodes ensuring that the busy nodes do not receive multiple jobs.

10.10 ZooKeeper Data Model

ZooKeeper has a hierarchical namespace, where each node is called a Znode. An example is shown on the image, where the tree diagram represents the namespace. The tree follows a top-down approach where '/' is the root, and App1 and App2 reside under the root. The path to access db under App1 is /App1/db, which is called the hierarchical path. Similarly, for accessing conf under App1 the path is /App1/conf, and under App2 the path is /App2/conf.

10.11 Znode

Znode is an in-memory data node in the ZooKeeper data service. It has a hierarchical namespace and follows UNIX-like notation for paths. The screen further examines the contents related to Znode.
Types of Znodes: There are two types of Znodes, regular Znodes and ephemeral Znodes.
Flag of Znode: The sequential flag is the only flag of a Znode; it is responsible for accessing data sequentially.
Features of Znode: The watch mechanism is one of the features of a Znode. It receives notifications from nodes that reside in the cluster and enables one-time triggers for job execution. The timeout mechanism is another feature of a Znode and applies to sessions, where a session is the connection between a client and a server. The timeout mechanism permits allocation of resources for a limited time period. A Znode is not designed for data storage; it stores metadata or configuration in the leader system, along with information such as timestamps and version numbers.
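
As a quick illustration (not part of the original demo), these Znode types can be created from the ZooKeeper command-line client zkCli.sh; the paths and data below are hypothetical, and the syntax shown is the ZooKeeper 3.4 style:

# regular (persistent) znode
create /app1 "config-data"
# ephemeral znode, removed automatically when the creating client's session ends
create -e /app1/worker1 "alive"
# sequential flag: ZooKeeper appends a monotonically increasing counter to the name
create -s /app1/task- "job"
# lists, for example, worker1 and task-0000000000
ls /app1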

10.12 Client API Functions

Client APIs have built-in functions such as create, delete, exists, getData, getChildren, setData, and sync for performing operations. The parameters for each function are as follows: path, data, and flag for the create function; path and version for the delete function; path and watch for the exists, getData, and getChildren functions; path, data, and version for the setData function; and path for the sync function. ZooKeeper performs these operations automatically; however, a user can change the behavior of ZooKeeper by using any of these functions. Each function is available in two versions: synchronous and asynchronous.
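
For reference, these client API functions map closely to zkCli.sh commands. The following sketch assumes a hypothetical /demo path and ZooKeeper 3.4-style syntax; in the CLI, the exists call is exposed as the stat command:

create /demo "v1"        # create(path, data, flag)
stat /demo               # exists(path, watch)
get /demo                # getData(path, watch)
set /demo "v2"           # setData(path, data, version)
ls /demo                 # getChildren(path, watch)
sync /demo               # sync(path)
delete /demo             # delete(path, version)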

10.13 Recipe 1—Cluster Management

Recipes are the guidelines for using ZooKeeper to implement higher-order functions. The following is a recipe for cluster management used in cloud environments:
• For each client host i, where i equals 1 to N:
• Watch on /members
• Create /members/host_i as an ephemeral node
• A node joining or leaving the cluster generates an alert
• Keep updating /members/host_i periodically for node status changes, for example, changes in load, memory, and CPU.
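
A minimal zkCli.sh sketch of this recipe, with hypothetical host names, is shown below; in the 3.4-style CLI a trailing 'true' sets a watch (newer versions use -w instead):

create /members ""                     # parent node for the group
ls /members true                       # watch for members joining or leaving the cluster
create -e /members/host_1 "cpu=0.4"    # each host registers itself as an ephemeral child
set /members/host_1 "cpu=0.9"          # periodically update status such as load, memory, and CPU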

10.14 Recipe 2—Leader Election

The following is an example of a recipe for leader election:
• All participants of the election process create an ephemeral-sequential node on the same election path.
• The node with the smallest sequence number is the leader.
• Each follower node listens to the node with the next lower sequence number.
• When the leader is removed, go to the election path and find a new leader; the node with the lowest sequence number becomes the leader.
• When the session expires, check the election state, and go to election if needed.
The given image illustrates this example in detail. Please spend some time to go through the image.
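
A rough zkCli.sh sketch of these election steps, using a hypothetical /election path and 3.4-style watch syntax:

create -s -e /election/n_ "host1"      # e.g. /election/n_0000000000, the current leader
create -s -e /election/n_ "host2"      # e.g. /election/n_0000000001, a follower
ls /election                           # the lowest sequence number is the leader
stat /election/n_0000000000 true       # the follower watches the node with the next lower number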

10.15 Recipe 3—Distributed Exclusive Lock

The following is a recipe for the distributed exclusive lock function, assuming there are N web crawler clients trying to acquire a lock on links data.
• Clients create an ephemeral, sequential Znode under the path given on screen.
• Clients request a list of children for the lock Znode, that is, locknode.
• The client with the lowest ID, according to natural ordering, holds the lock.
• Other clients set watches on the Znode with the ID immediately preceding their own. They periodically check for the lock in case of notification.
• The client wishing to release the lock deletes the node, which triggers the next client in line to acquire the lock.

10.16 Business Scenario

Tim Burnet is the AVP of IT-infra ops at Nutri Worldwide, Inc. He anticipates that Olivia Tyler, the EVP of IT operations, will ask him to work on a high-performance coordination service for distributed applications as part of his current project. Tim knows that he must use ZooKeeper for this task. He wants to be prepared, so he decides to install ZooKeeper.

10.17 View ZooKeeper Nodes Using CLI Demo 1

In this demo, you will see how to view ZooKeeper nodes using the ZooKeeper command-line interface (CLI). First, find the ZooKeeper CLI, zkCli.sh, using the Linux find command; you will be able to observe the path listed on the screen. Change to the directory containing zkCli.sh and launch it against the localhost IP address 127.0.0.1 and port 2181. Certain info messages appear on the screen. Enter the help command once the ZooKeeper prompt appears. List the nodes using the 'ls /' command. List one of the nodes using 'ls /<node name>', for example, 'ls /hbase-unsecure'. You can also get information about the node using 'get /hbase-unsecure'. Observe the listed attributes, such as the number of child nodes.
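
Reconstructed from the narration, the commands used in this demo look roughly like the following; the directory containing zkCli.sh depends on your distribution and is a placeholder here:

find / -name "zkCli.sh" 2>/dev/null    # locate the ZooKeeper CLI
cd /path/to/zookeeper/bin              # use the directory reported by find
./zkCli.sh -server 127.0.0.1:2181      # connect to the local ZooKeeper server
help
ls /
ls /hbase-unsecure
get /hbase-unsecure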

10.18 Why Sqoop

As companies across industries moved from structured relational databases such as MySQL, Teradata, and Netezza to Hadoop, there were concerns about the ease of transitioning existing databases. It was challenging to load bulk data into Hadoop or to access it from MapReduce. Users had to consider data consistency, production system resource consumption, and data preparation. Data transfer using scripts was both time consuming and inefficient, and direct access of data from external systems was also complicated. This was solved with the introduction of Sqoop. Sqoop allows smooth import and export of data between structured databases and Hadoop. Along with Oozie, Sqoop helps in scheduling and automating import and export tasks.

10.19 What is Sqoop

Sqoop, an Apache Hadoop Ecosystem project, is a command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table, or a free-form SQL query. Imports can also be used to populate tables in Hive or HBase. Exports can be used to put data from Hadoop into a relational database.

10.20 Sqoop—Real-life Connect

The online marketer Coupon.com uses Sqoop to exchange data between Hadoop and the IBM Netezza data warehouse appliance. The organization can query its structured databases and transfer the results into Hadoop using Sqoop. The Apollo Group, an education company, also uses Sqoop to extract data from databases as well as to inject the results from Hadoop jobs back into relational databases.

10.21 Sqoop and Its Uses

Sqoop is required when data is moved from a Relational Database (RDB) to Hadoop or vice versa. A Relational Database, or RDB, refers to any data store in a structured format; databases in MySQL or Oracle are examples of RDBs. When moving data from a Relational Database to Hadoop, users must consider the consistency of data, the consumption of production system resources, and the preparation of data for provisioning the downstream pipeline. When moving data between Hadoop and a Relational Database, users must also keep in mind that directly accessing data residing on external systems from within a MapReduce framework complicates applications and exposes the production system to excessive loads originating from cluster nodes. Hence, Sqoop is required in both scenarios.

10.22 Sqoop and Its Uses (contd.)

The following is a summary of Sqoop processing. Sqoop runs in a Hadoop cluster. It imports data from an RDB or a NoSQL DB to Hadoop. It has access to the Hadoop core, which helps it use mappers to slice the incoming data and place it in HDFS. It exports data back into the RDB, ensuring that the schema of the data in the database is maintained. This screen describes how Sqoop performs its execution. First, the dataset being transferred is divided into partitions. Next, a map-only job is launched, with individual mappers responsible for transferring a slice of the dataset. Lastly, each record of the data is handled in a type-safe manner, as Sqoop uses metadata to infer the data types.

10.23 Benefits of Sqoop

Use the command shown on the image to import data from a MySQL database using Sqoop, where sl000 is the database name and auth is the table name.
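
The exact command from the slide is not reproduced in the transcript; a typical Sqoop import for this database and table would look roughly like the following, with the JDBC URL and credentials as placeholders:

sqoop import \
  --connect jdbc:mysql://localhost/sl000 \
  --username root -P \
  --table auth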

10.24 Sqoop Processing

The process of the Sqoop import is summarized in this screen:
• Sqoop introspects the database to gather the necessary metadata for the data being imported.
• A map-only Hadoop job is submitted to the cluster by Sqoop.
• The map-only job performs the data transfer using the metadata captured in step one.

10.25 Sqoop Execution—Process

The imported data is saved in a directory on HDFS based on the table being imported. Users can specify an alternative directory where the files should be populated. By default, these files contain comma-delimited fields, with new lines separating the different records. Users can also override the format in which data is copied by explicitly specifying the field separator and record terminator characters. Further, users can easily import data in the Avro data format by specifying the option '--as-avrodatafile' with the import command. Sqoop supports different data formats for importing data and provides several options for tuning the import operation.
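
For example, a hedged sketch of an Avro import and of overriding the delimiters (connection details and paths are placeholders):

# import the table as Avro data files
sqoop import --connect jdbc:mysql://localhost/sl000 --username root -P \
  --table auth --as-avrodatafile --target-dir /user/hduser/auth_avro

# override the default field and record delimiters
sqoop import --connect jdbc:mysql://localhost/sl000 --username root -P \
  --table auth --fields-terminated-by '\t' --lines-terminated-by '\n'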

10.26 Importing Data Using Sqoop

The processes of importing data to Hive and HBase are given on the screen.
Importing data to Hive: First, Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition. Next, using Hive import, Sqoop converts the data from the native datatypes in the external datastore into the corresponding types within Hive. Further, Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported contains newline or other Hive delimiter characters, Sqoop allows the removal of such characters, and the data is then correctly populated for consumption in Hive. Lastly, after the import is completed, the user can operate on the table just like any other table in Hive.
Importing data to HBase: When data is imported to HBase, Sqoop can populate the data in a particular column family in an HBase table. The HBase table and the column family settings are required to import a table to HBase. Data imported to HBase is converted to its string representation and inserted as UTF-8 bytes. Use the commands shown on the screen to import data to HBase:
• Connect to the database using the first command.
• Specify parameters such as the username, password, and table name using the second command.
• Create an HBase table with the column family as specified in MySQL using the third command.
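
The commands referenced on the screen are not included in the transcript; typical forms of the Hive and HBase imports look roughly like this (connection details, table names, and the column family are placeholders):

# import into Hive: creates the Hive table and loads the data
sqoop import --connect jdbc:mysql://localhost/sl000 --username root -P \
  --table auth --hive-import --hive-table auth --hive-drop-import-delims

# import into HBase: the target table and column family must be specified
sqoop import --connect jdbc:mysql://localhost/sl000 --username root -P \
  --table auth --hbase-table auth --column-family info --hbase-create-table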

10.30 Exporting Data from Hadoop Using Sqoop

Use the command shown on the image to export data from Hadoop using Sqoop. The export proceeds in a number of steps. First, Sqoop introspects the database for metadata. Next, it transfers the data from HDFS to the database: Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions to ensure optimal throughput and minimal resource utilization.
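
A typical form of the export command, with the JDBC URL, credentials, and HDFS directory as placeholders, is:

sqoop export \
  --connect jdbc:mysql://localhost/sl000 \
  --username root -P \
  --table auth \
  --export-dir /user/hduser/auth_data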

10.31 Sqoop Connectors

The different types of Sqoop connectors are the generic JDBC connector, the default Sqoop connectors, and fast-path connectors. The generic JDBC connector can be used to connect to any database that is accessible via JDBC. The default Sqoop connectors are designed for specific databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Fast-path connectors use database-specific batch tools to transfer data with high throughput, for example, for MySQL and PostgreSQL databases.

10.32 Sample Sqoop Commands

This screen lists various common Sqoop commands. The first command on the screen imports the data from the MySQL table sqoop_demo to an HDFS directory. Note the -m 1 option, which ensures there is only one mapper output. In the second command, note the condition id greater than two, which restricts the data to be imported. You can also specify a specific SQL query, as shown in the third command on the screen, which uses -e 'select * from sqoop_demo where id = 13'. The next command on the screen shows an export. Please note the single and double hyphens before driver, connect, username, and password. There are some more commands listed on the screen for your reference. Please go through them for better understanding.
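
Since the commands themselves are only described verbally, the following is a hedged reconstruction of what they typically look like; the JDBC URL, credentials, and HDFS paths are placeholders:

# 1. Import the table sqoop_demo into an HDFS directory using a single mapper.
sqoop import --connect jdbc:mysql://localhost/test --username root --password hadoop \
  --table sqoop_demo --target-dir /user/hduser/sqoop_demo -m 1

# 2. Import only the rows that satisfy a condition (id greater than two).
sqoop import --connect jdbc:mysql://localhost/test --username root --password hadoop \
  --table sqoop_demo --where "id > 2" --target-dir /user/hduser/sqoop_demo_filtered -m 1

# 3. Import the result of a free-form SQL query; Sqoop requires the $CONDITIONS token.
sqoop import --connect jdbc:mysql://localhost/test --username root --password hadoop \
  -e "select * from sqoop_demo where id = 13 and \$CONDITIONS" \
  --target-dir /user/hduser/sqoop_demo_query -m 1

# 4. Export an HDFS directory back into a database table via the generic JDBC driver.
sqoop export --driver com.mysql.jdbc.Driver \
  --connect jdbc:mysql://localhost/test --username root --password hadoop \
  --table sqoop_demo --export-dir /user/hduser/sqoop_demo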

10.33 Business Scenario

Tim Burnet, the AVP of IT-infra ops at Nutri Worldwide, Inc., is working on an ongoing data model restructuring exercise for the employee directory application. He wants employee details data from an existing RDBMS. He plans to use Sqoop to import the data into Hadoop and export it back to the RDBMS. Currently, Sqoop is not installed on his system, so Tim plans to install Sqoop and then import and export this bulk data.

10.34 Install Sqoop Demo 2

In this demo, you will learn how to install Sqoop. Visit the website sqoop.apache.org and click the nearby mirror link. Select a mirror link to download Sqoop. Select version 1.4.4, and ensure you select the version that supports Hadoop 1.0. Right-click and copy the sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz link. Go to the Ubuntu server and type wget followed by the copied link to download Sqoop, then press Enter. Untar the file to extract it: type tar -xvzf sqoop and press Tab. Copy the folder to the location /usr/local/sqoop: type sudo cp -r sq, complete the name, and press Enter. Set up .bashrc: type sudo vi $HOME/.bashrc and press Enter, then set the highlighted part to ensure Sqoop has all the dependencies it requires. Update the bash prompt by typing the exec bash command and pressing Enter. To perform a demo on Sqoop, you need a database server; in this example, you will install MySQL server. Type the command shown on the screen and press Enter. Once MySQL is installed, type the command shown on the screen and press Enter to start MySQL. Create a database named sl by typing the command shown on the screen and pressing Enter. Type the command use sl and press Enter so that MySQL uses this as the current database. Create a table called authentication, which has columns named username and password: type the command shown on the screen and press Enter. To insert data into the table, type the command shown on the screen and press Enter. Ensure that the data is inserted successfully using the command mentioned on the screen, and press Enter. The next step is to download the database driver that Sqoop will use. This can be done by visiting www.mysql.com/downloads/ and clicking the Download from MySQL Developer Zone link. You will be taken to the dev.mysql.com/downloads/ landing page. To download MySQL Connectors, click the Download link. You will be taken to the dev.mysql.com/downloads/connector/ landing page. Select Connector/J to download the required distribution. Select the Platform Independent option from the Select Platform drop-down list. Click the Download button next to the Platform-Independent, Architecture-Independent, Compressed TAR Archive connector. Click the 'No thanks, just start my download.' link. A Download File Info pop-up box appears; click the Start Download button. Ensure that the downloaded file is copied into /usr/local/sqoop/lib and press Enter. To change the ownership of the folder /usr/local/sqoop, type the command shown on the screen and press Enter. Change the ownership for /usr/local/sqoop/lib by typing the command shown on the screen and pressing Enter. You have successfully installed and configured Sqoop.
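
The individual commands are shown only in the video; as a rough recreation under the same assumptions (Sqoop 1.4.4 on Ubuntu with MySQL, and placeholder paths, URLs, and credentials), the installation amounts to something like:

# download and unpack Sqoop (the mirror URL is illustrative)
wget http://archive.apache.org/dist/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
tar -xvzf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz
sudo cp -r sqoop-1.4.4.bin__hadoop-1.0.0 /usr/local/sqoop

# lines added to $HOME/.bashrc, followed by exec bash to reload the shell
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

# install MySQL and create the demo database and table (SQL typed at the mysql prompt)
sudo apt-get install mysql-server
mysql -u root -p
#   CREATE DATABASE sl;
#   USE sl;
#   CREATE TABLE authentication (username VARCHAR(50), password VARCHAR(50));
#   INSERT INTO authentication VALUES ('user1', 'secret');
#   SELECT * FROM authentication;

# place the MySQL Connector/J jar where Sqoop can find it, then fix ownership
sudo cp mysql-connector-java-*-bin.jar /usr/local/sqoop/lib/
sudo chown -R $USER /usr/local/sqoop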

10.35 Import Data on Sqoop Using MySQL Database Demo 3

In this demo, you will learn how to import data into Hadoop from a MySQL database using Sqoop. Let us look at ingesting data from a relational database, MySQL, into HDFS using Sqoop. To do this, first launch MySQL and create some data. List the databases in MySQL using the 'show databases' command. Use the test database in this case. Create a table named mysql_data with id as an integer primary key and name as varchar(50). Once the table is created, insert a row in the table with a value like 1, USA. To verify whether the row has been inserted successfully, list the contents of the table using a select query and view the contents on the screen. Insert another row in the table with a value like 2, India, and verify again using the select query. Exit MySQL. Now, run the Sqoop utility. Enter the sqoop command as shown on the screen and execute it. You will see some log messages on the screen; in the background, Sqoop invokes MapReduce programs. Finally, you will get a confirmation and some job metrics. Check the data that has been imported into HDFS. Observe that two or more files are created. Cat the contents of each part file. Now, run Sqoop with one mapper: execute the command with -m 1. You will get a confirmation with job metrics on completion. List the contents of the HDFS directory. You will observe only one output file because you ran only one mapper. Cat the contents to view the data. You will observe that all the records have been imported successfully into one file in HDFS.
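
A rough recreation of the commands described above (the database, HDFS paths, and credentials are placeholders):

mysql -u root -p
#   SHOW DATABASES;
#   USE test;
#   CREATE TABLE mysql_data (id INT PRIMARY KEY, name VARCHAR(50));
#   INSERT INTO mysql_data VALUES (1, 'USA');
#   INSERT INTO mysql_data VALUES (2, 'India');
#   SELECT * FROM mysql_data;
#   EXIT;

# import with the default number of mappers, then inspect the part files
sqoop import --connect jdbc:mysql://localhost/test --username root -P \
  --table mysql_data --target-dir /user/hduser/mysql_data
hadoop fs -ls /user/hduser/mysql_data
hadoop fs -cat /user/hduser/mysql_data/part-m-*

# re-run with a single mapper to get one output file
sqoop import --connect jdbc:mysql://localhost/test --username root -P \
  --table mysql_data --target-dir /user/hduser/mysql_data_m1 -m 1
hadoop fs -cat /user/hduser/mysql_data_m1/part-m-00000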

10.36 Export Data Using Sqoop from Hadoop Demo 4

This demo will show you how to export data from Hadoop using Sqoop. Let us now export data from HDFS to a MySQL database using Sqoop. Launch MySQL by typing mysql at the command prompt. Type use test to change the database. Create a table named exported_table with id as an integer primary key and name as a varchar column. Run select * on exported_table to check the existing data; you will see zero results, indicating the table has no data. You should already have created the data file named export_data; you can verify it using the command shown on the screen. You can list the contents of the file using the command shown on the screen. You will notice that the dataset contains two rows: the first row is laptop and the second row is desktop. Copy the file to HDFS using the command shown on the screen. Let us now use the Sqoop export command to export the data from HDFS to the MySQL database. Enter the JDBC URL of the MySQL database and mention the table into which the data has to be exported. Mention the HDFS file or directory from which the data has to be exported to the MySQL table, and press Enter. You will notice the MapReduce percentages: Map is 100 percent, whereas Reduce is 0 percent. This indicates that reducers do not run as part of the Sqoop job. Launch MySQL again, type use test to change the database, and run select * on exported_table to check the data. You will notice that the two records have been exported from HDFS to the MySQL database.
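
A rough recreation of this export flow, with placeholder paths and credentials:

mysql -u root -p
#   USE test;
#   CREATE TABLE exported_table (id INT PRIMARY KEY, name VARCHAR(50));
#   SELECT * FROM exported_table;   -- empty before the export

# the local file export_data holds the two comma-delimited input rows, e.g. 1,laptop and 2,desktop
cat export_data
hadoop fs -put export_data /user/hduser/export_data

# export the HDFS data into the MySQL table; only mappers run, no reducers
sqoop export --connect jdbc:mysql://localhost/test --username root -P \
  --table exported_table --export-dir /user/hduser/export_data

# verify the exported rows
mysql -u root -p -e "SELECT * FROM test.exported_table;"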

10.37 Why Flume

Consider a scenario where a company has thousands of services running on different servers in a cluster, producing many large logs that should be analyzed together. The issue is determining how to send the logs to a setup that has Hadoop. The channel or method used for sending them must be reliable, scalable, extensible, and manageable. To solve this problem, a log aggregation tool called Flume can be used. Apache Sqoop and Flume are the tools used to gather data from different sources and load it into HDFS. Sqoop in Hadoop is used to extract structured data from databases like Teradata, Oracle, and so on, whereas Flume in Hadoop gathers data stored in different sources and deals with unstructured data.

10.38 Apache Flume—Introduction

Apache Flume is a distributed and reliable service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, which is robust and fault-tolerant. The diagram depicts how the log data is collected from a web server and is sent to HDFS using the source, channel, and sink components. A Flume source consumes events delivered to it by an external source like a web server. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. One example is the file channel, which is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS or forwards it to the Flume source of the next Flume agent or next hop in the flow.

10.39 Flume Model

The Flume model comprises three entities: an agent, a processor, and a collector. The agent is responsible for receiving data from an application. Running flume agents will ensure data is continuously ingested into Hadoop. The processor component is responsible for performing intermediate processing of jobs. The collector component is responsible for writing data to the permanent storage (HDFS).

10.40 Flume—Goals

Flume works towards attaining certain key goals for data ingestion. Flume aims to ensure reliability, as it has tunable failure recovery modes. It also aims to achieve scalability by providing a horizontally scalable data path that can be used to form a topology of agents. For extensibility, Flume leverages plug-in architecture for extending modules. For manageability, Flume provides a centralized data flow management interface, hence making it convenient to manage across nodes.

10.41 Scalability in Flume

Flume is scalable. It has a horizontally scalable data path, which helps in achieving load balance in case of higher load in the production environment. The given image depicts an example of horizontal scalability. Assume there are two agents and two collectors. If one collector goes down, the two agents fail over to the remaining collector.

10.42 Flume—Sample Use Cases

Flume can be used for a variety of use cases. It can be used for collecting logs from nodes in a Hadoop cluster; for collecting logs from services such as httpd and mail; and for process monitoring. Further, Flume has also been widely used for collecting impressions from custom applications for an advertisement network like Meebo.

10.43 Business Scenario

As part of the Nutri Worldwide, Inc. web infrastructure, logs are being maintained on web servers. Tim wants to put these logs in HDFS for analyzing click stream data and processing analytics on the data. Tim intends to use Flume to constantly ingest the web log data to Hadoop.

10.44 Configure and Run Flume Agents Demo 5

In this demo, you will learn how to configure and run flume agents.
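
The transcript does not include the demo steps. As a sketch under assumed names and paths (agent a1, a web-server log at /var/log/httpd/access_log, and a local HDFS at hdfs://localhost:9000), a Flume agent that tails the log and writes it to HDFS could be configured and started like this:

# weblog.conf: one exec source, one file channel, one HDFS sink
cat > weblog.conf <<'EOF'
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = file

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
EOF

# start the agent, then verify that events arrive in HDFS
flume-ng agent --conf conf --conf-file weblog.conf --name a1 -Dflume.root.logger=INFO,console
hadoop fs -ls /flume/weblogs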

10.46 Quiz

The following are a few questions to test your understanding of the concepts discussed.

10.53 Summary

Let us summarize the topics covered in this lesson:
• ZooKeeper provides a simple and high-performance kernel for building more complex clients. It has three basic entities: leader, follower, and observer. The watch mechanism is used to deliver notifications to followers and observers.
• Sqoop is a tool designed to transfer data between Hadoop and relational databases, including MySQL, MS SQL, and PostgreSQL. It allows the import of data from an RDB, such as MySQL, SQL Server, or Oracle, into HDFS, and it allows the reverse as well.
• Apache Flume is a distributed data collection service that gathers the flow of data from its source and aggregates the data to a sink. Flume provides a reliable and scalable agent mode to ingest data into HDFS.

10.54 Conclusion

This concludes the lesson ‘ZooKeeper, Sqoop, and Flume.’ The next lesson will cover ‘Ecosystem and its Components.’

10.47 Case Study—ZooKeeper

Scenario: XY Networks provides network security support to many organizations. It has set up a cluster of servers and created an in-house application that monitors these clusters. It faces multiple problems in coordinating these servers, and there are numerous partial failures and race conditions. Users complain that the application's behavior is unpredictable. The team found that handling distributed coordination in their software is complicated and would take 200 man-days of effort to fix. Instead, they want to explore ZooKeeper, which is part of the Apache Foundation and provides a library of recipes for distributed process coordination. Click Analysis to know the company's next move.
Analysis: The IT team researched Apache ZooKeeper. It has very expressive APIs and well-tested mechanisms for handling distributed process coordination. Some of the recipes provided by ZooKeeper are:
1. Handling partial failures
2. Leader selection
3. Avoiding deadlocks
4. Avoiding race conditions
Click Solution for the steps to explore the nodes and watches in ZooKeeper.

10.48 Case Study—ZooKeeper—Demo

Solution: Perform the following steps to explore the nodes and watches in ZooKeeper:
1. Open a ZooKeeper session on one node.
2. Create an ephemeral znode.
3. Create a few sequential ephemeral znodes under the node.
4. List the znodes.
5. Retrieve the znode data.
6. Open another ZooKeeper session.
7. Create a watch on the ephemeral nodes from the second session.
8. Delete one of the nodes in the first session, and view the alerts in the second session.
9. Quit the first session, and view the alerts in the second session.

10.49 Case Study—Sqoop

Scenario: XY Invest provides investment advice to high-net-worth individuals and maintains stock market data for many exchanges. It has stock market data files that are critical for analysis and monitoring. It chose to implement Hadoop for storing stock market data files and decided to use Hive for analysis on Hadoop, for the ease of use of the company's data scientists. Its current data is in an RDBMS, which is used as a transaction database, and has to be transferred to Hadoop and Hive. The company's IT team has heard that Sqoop is a popular tool for ingesting RDBMS data into Hadoop and Hive. Click Analysis to know the company's next move.
Analysis: The IT team researches Sqoop and finds that it is efficient in copying large amounts of data from an RDBMS into Hadoop. Sqoop is also effective in exporting aggregated data from Hadoop back to an RDBMS for other analysis. The advantages of using Sqoop for data ingestion are:
1. It can connect to different types of databases.
2. It can read the table schema and convert data types into Hadoop and Hive types.
3. It can create tables in Hive from the database schema.
4. It uses Hadoop map-only jobs for transferring data and uses four mappers by default.
5. It uses the primary key of the table for splitting data across the mappers.
6. It also generates a Java class file for the table with getter and setter methods.
7. It is capable of copying data in the popular Parquet format.
Click Solution for the steps to explore Sqoop.

10.50 Case Study—Sqoop—Demo

Solution: Perform the following steps to explore Sqoop:
1. Connect to the MySQL database with a JDBC connection from Sqoop.
2. List the tables in the MySQL database.
3. Copy the data from the MySQL table to Hadoop.
4. Copy the data in Parquet format.
5. Use Sqoop to create the table in Hive, and copy the data.
6. Check the data in the Hive table.

10.51 Case Study—Flume

Scenario: XY Networks provides network security support to many organizations. It has set up a cluster of servers and created an in-house application that monitors these clusters. It collects system log files from numerous customers in real time and stores these files in Hadoop for analytics processing. A highly scalable and fault-tolerant system is required for gathering these log files. The company's IT team has heard that Flume and Sqoop are popular tools for data ingestion into Hadoop. Since their data consists of unstructured log files, Flume is the tool of choice for their business need. Click Analysis to know the company's next move.
Analysis: The IT team researches Flume and finds that it has a source-sink architecture with loosely connected sources and sinks. Channels are used to carry data from sources to sinks. The advantages of using Flume for data ingestion are:
1. It can process millions of log messages per minute.
2. It is horizontally scalable; new machines can be added to the cluster to increase capacity.
3. It provides real-time processing of log files.
4. It is highly fault-tolerant. If the sink is down, messages accumulate in the channel and are delivered once the sink functions again.
Click Solution for the steps to explore Flume.

10.52 Case Study—Flume—Demo

Solution: Perform the following steps to explore Flume:
1. Create a properties file that defines the source, sink, and channel.
2. Make the source extract data from a log file and the sink store it in HDFS.
3. Run Flume with this properties file.
4. Add data to the log file.
5. Check whether the data is copied to HDFS by Flume.
6. As more data is added to the log file, Flume picks it up and adds it to HDFS.
Click Next to view the demo.

