Hive, HBase and Hadoop Ecosystem Components Tutorial

1 Hive, HBase and Hadoop Ecosystem Components

Let us summarize the topics covered in the previous lesson:
• Big Data has three characteristics, namely, variety, velocity, and volume.
• Hadoop HDFS and Hadoop MapReduce are the core components of Hadoop.
• One of the key features of MapReduce is that the map output is a set of key/value pairs, which are grouped and sorted by key.
• TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.
• Pig is a high-level data flow scripting language. It uses HDFS for storing and retrieving data.
In this lesson, we will focus on Hive, HBase, and components of the Hadoop ecosystem.

2 Objectives

By the end of this lesson, you will be able to:
• Describe the basics of Hive
• Explain HBase and Cloudera
• Discuss the commercial distributions of Hadoop
• Explain the components of the Hadoop ecosystem
In the next screen, we will focus on an introduction to the concept of Hive.

3 Hive—Introduction

Hive is defined as a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large data sets stored in Hadoop. Following are the facts related to Hive:
• It provides an SQL-like (read as S-Q-L like) language called HiveQL or HQL (read as H-Q-L). Due to its SQL-like interface, Hive is a popular choice for Hadoop analytics.
• It provides massive scale-out and fault tolerance capabilities for data storage and processing on commodity hardware.
• Relying on MapReduce for execution, Hive is batch-oriented and has high latency for query execution.
In the next screen, we will discuss the key characteristics of Hive.

4 Hive—Characteristics

Hive is a system for managing and querying the data stored in Hadoop by imposing a structured, table-like view on it. It uses MapReduce for the execution of its queries and scripts, and the Hadoop Distributed File System or HDFS for storage and retrieval of data. Following are the key principles underlying Hive:
• Hive commands are similar to those of SQL (read as S-Q-L), the standard query language of data warehousing tools. Hence, learning Hive will not be a big challenge for those who are familiar with SQL.
• Hive supports pluggable MapReduce scripts written in the language of your choice, along with rich user-defined data types and user-defined functions.
• Hive has an extensible framework to support different file and data formats.
• Hive performs well because its engine compiles queries into efficient MapReduce jobs, reducing execution time and enabling high throughput.
In the next screen, we will discuss the system architecture and the components of Hive.

5 System Architecture and Components of Hive

The image on the screen shows the architecture of the Hive system. It also illustrates the role of Hive and Hadoop in the development process. In the next screen, we will discuss the basics of Hive Query Language.

6 Basics of Hive Query Language

Hive Query Language or HQL is the query language for the Hive engine. Hive supports basic SQL features such as FROM-clause sub-queries, ANSI joins (equi-joins only), multi-table inserts, multiple GROUP BYs, sampling, and object traversal. HQL also supports pluggable MapReduce scripts through the TRANSFORM command. In the next screen, we will focus on tables in Hive.
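Before moving on, here is a minimal sketch of running a HiveQL equi-join through Hive's JDBC interface (HiveServer2). It is written in Java, the native language of the Hadoop ecosystem; the connection URL, credentials, and the orders and customers tables are illustrative assumptions rather than part of the lesson, and the hive-jdbc driver is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlJoinExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");          // Hive JDBC driver
            String url = "jdbc:hive2://localhost:10000/default";       // hypothetical HiveServer2 endpoint
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // An equi-join in HiveQL; Hive compiles it into MapReduce jobs behind the scenes.
                ResultSet rs = stmt.executeQuery(
                    "SELECT c.name, o.amount "
                  + "FROM orders o JOIN customers c ON (o.customer_id = c.id)");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }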

7 Data Model—Tables

Hive tables are analogous to tables in relational databases. A Hive table logically comprises the data that is stored and the associated metadata. Each table has a corresponding directory in HDFS. There are two types of tables in Hive. They are managed tables and external tables. In the next screen, we will focus on data types in Hive.
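Before moving on, here is a brief illustration of the two table types (the table names and the HDFS location are assumptions): dropping a managed table removes its data along with its metadata, while dropping an external table removes only the metadata and leaves the files in HDFS. The helper below reuses a JDBC Statement opened against HiveServer2 as in the earlier sketch.

    import java.sql.Statement;

    public class HiveTableDdl {
        /** Creates one managed and one external table using an already-open JDBC Statement. */
        public static void createTables(Statement stmt) throws Exception {
            // Managed table: Hive owns the data; DROP TABLE also deletes the table's HDFS directory.
            stmt.execute("CREATE TABLE page_views_managed (user_id STRING, url STRING)");

            // External table: Hive records only metadata; DROP TABLE leaves /data/page_views in HDFS.
            stmt.execute("CREATE EXTERNAL TABLE page_views_external (user_id STRING, url STRING) "
                       + "LOCATION '/data/page_views'");
        }
    }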

8 Data Types in Hive

There are three data types in Hive. They are primitive, complex, and user-defined types. In the next screen, we will discuss serialization and de-serialization.
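Before moving on, the sketch below combines primitive and complex types in a single table definition (the employees table and its columns are assumptions); user-defined types are typically supplied through custom Java code rather than built-in keywords.

    import java.sql.Statement;

    public class HiveDataTypes {
        /** Declares a table that combines primitive types (INT, STRING) with
         *  complex types (ARRAY, MAP, STRUCT). */
        public static void createEmployeesTable(Statement stmt) throws Exception {
            stmt.execute(
                "CREATE TABLE employees ("
              + "  emp_id INT, "                               // primitive
              + "  name STRING, "                              // primitive
              + "  skills ARRAY<STRING>, "                     // complex: ordered list
              + "  ratings MAP<STRING, INT>, "                 // complex: key/value pairs
              + "  address STRUCT<city:STRING, zip:STRING>"    // complex: named fields
              + ")");
        }
    }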

9 Serialization and Deserialization

Serialization takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system. Serialization is used when writing data, for example, through an INSERT-SELECT statement. Deserialization is used at query time to execute SELECT statements. Other facts related to serialization and deserialization are:
• The interface used for performing serialization and deserialization is SerDe (read as Ser-De).
• In some situations, the interface used for deserialization is LazySerDe (read as Lazy-Ser-De).
• This interface allows unstructured data to be converted into structured data due to its flexibility.
• While using this interface, the data is read based on separation by different delimiter characters.
• The SerDe interface is located in the jar file mentioned onscreen.
In the next screen, we will focus on User-Defined Functions and MapReduce scripts.
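As a small, hedged example of the delimiter-based behavior described above (the table name and delimiter are assumptions): declaring ROW FORMAT DELIMITED tells Hive to use its built-in lazy SerDe, which splits each stored line on the given character when the table is read.

    import java.sql.Statement;

    public class HiveSerDeExample {
        /** The ROW FORMAT clause tells Hive how to deserialize rows at query time:
         *  the built-in lazy, delimiter-based SerDe splits each line on commas. */
        public static void createDelimitedTable(Statement stmt) throws Exception {
            stmt.execute(
                "CREATE TABLE web_logs (ip STRING, url STRING, status INT) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
              + "STORED AS TEXTFILE");
        }
    }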

10 UDF/UDAF vs. MapReduce Scripts

The table on the screen compares User-Defined Functions and User-Defined Aggregate Functions (UDF/UDAF) with MapReduce scripts: UDFs are written in Java, while MapReduce scripts can be written in any language. Both UDFs and MapReduce scripts support 1-to-1, n-to-1, and 1-to-n input-to-output mappings. However, UDFs are faster than MapReduce scripts, since the latter spawn new processes for different operations. In the next screen, we will focus on an introduction to the concept of HBase.
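Since the lesson notes that UDFs are written in Java, here is a minimal sketch of a 1-to-1 scalar UDF; the class name, jar name, and function name in the comments are illustrative assumptions.

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    /** A 1-to-1 scalar UDF that lower-cases a string column. It could be registered with:
     *    ADD JAR my-udfs.jar;
     *    CREATE TEMPORARY FUNCTION to_lower AS 'LowerCaseUdf';
     *    SELECT to_lower(name) FROM employees;
     */
    public final class LowerCaseUdf extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;                     // Hive passes NULL values through
            }
            return new Text(input.toString().toLowerCase());
        }
    }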

11 HBase—Introduction

Apache HBase is a distributed, column-oriented database built on top of HDFS. HBase can scale horizontally to thousands of commodity servers and petabytes of data by indexing the storage. Apache HBase is an open-source, distributed, and versioned non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. HBase supports random, real-time CRUD (read as “C-R-U-D”) operations. CRUD stands for Create, Read, Update, and Delete. The goal of HBase is to host very large tables with billions of rows and millions of columns, atop clusters of commodity hardware. In the next screen, we will focus on the key characteristics of HBase.

12 Characteristics of HBase

HBase is a type of NoSQL (read as No S-Q-L) database and is classified as a key-value store. In HBase, a value is identified by a key. Both keys and values are byte arrays, which means binary data can be stored easily. Values are stored in key order and can be accessed quickly by their keys. HBase is a database in which tables have no fixed schema; column families, not columns, are defined at the time of table creation. In the next screen, we will focus on the HBase architecture.
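Before moving on, here is a minimal sketch of the byte-array key/value model using the HBase Java client; it assumes an existing table named users with a column family named info, and that hbase-site.xml is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrudExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table users = connection.getTable(TableName.valueOf("users"))) {

                // Create/Update: row keys, column families, qualifiers, and values are all byte arrays.
                Put put = new Put(Bytes.toBytes("user#1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
                users.put(put);

                // Read: a random, real-time lookup by row key.
                Result row = users.get(new Get(Bytes.toBytes("user#1001")));
                System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }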

13 HBase Architecture

HBase has two types of nodes: Master and RegionServer. There is only one Master node running at a time, whereas there can be one or more RegionServers. The high availability of the Master node is maintained with ZooKeeper. The Master node manages cluster operations such as region assignment, load balancing, and splitting. It is not a part of the read or write path. A RegionServer hosts tables, performs reads, and buffers writes. Clients communicate with RegionServers for reads and writes. A region in HBase is a subset of a table's rows. The Master node detects the status of RegionServers and assigns regions to them. In the next screen, we will compare HBase with Relational Database Management System or RDBMS.

14 HBase vs. RDBMS

HBase provides certain advantages compared to a Relational Database Management System. HBase partitions tables automatically, whereas partitioning in an RDBMS is typically manual. HBase scales linearly and automatically with new nodes, while an RDBMS usually scales vertically by adding more hardware resources. Further, as part of the Hadoop ecosystem, HBase runs on commodity hardware, but an RDBMS relies on expensive servers. HBase has built-in mechanisms for fault tolerance that an RDBMS may or may not have. HBase leverages batch processing with MapReduce distributed processing, whereas an RDBMS relies on multiple threads or processes rather than MapReduce distributed processing. In the next screen, we will focus on an introduction to Cloudera.

15 Cloudera—Introduction

Cloudera is a commercial tool used to deploy Hadoop in an enterprise setup. Following are the salient features of Cloudera: Cloudera uses 100% open-source distribution of Apache Hadoop and related projects such as Apache Pig, Apache Hive, Apache HBase and Apache Sqoop. Cloudera offers the user-friendly Cloudera Manager for system management, Cloudera Navigator for data management, dedicated technical support, and so on. In the next screen, we will explore Cloudera distribution.

16 Cloudera Distribution

Just as Linux is open source yet packaged and distributed by commercial vendors, Cloudera and many other vendors offer Hadoop as a commercial distribution. Cloudera's distribution is known as CDH, or Cloudera Distribution Including Apache Hadoop, and it delivers the core elements of Hadoop. These elements include scalable storage and distributed computing, along with additional components such as a user interface and necessary enterprise capabilities such as security. CDH includes the core elements of Apache Hadoop and several key open-source projects. These projects, when coupled with customer support, management, and governance through a Cloudera Enterprise subscription, can deliver an enterprise data hub. In the next screen, we will focus on Cloudera Manager.

17 Cloudera Manager

Cloudera Manager is used to administer Apache Hadoop. It is used to configure the following, among others: HDFS, the Hive engine, Hue, MapReduce, Oozie, ZooKeeper, Flume, HBase, Cloudera Impala, Cloudera Search, and YARN (read as yarn). In the next screen, we will discuss the Hortonworks Data Platform.

18 Hortonworks Data Platform

Hortonworks Data Platform or HDP enables Enterprise Hadoop with a suite of essential capabilities that serve as the functional definition of any data platform technology. It has a comprehensive set of capabilities aligned to functional areas such as data management, data access, data governance and integration, security, and operations. HDP can be downloaded from the URL mentioned on the screen. In the next screen, we will look at the MapR data platform.

19 MapR Data Platform

The MapR data platform supports more than 20 open source projects. It also supports multiple versions of the individual projects, thereby allowing users to migrate to the latest versions at their own pace. The screen shows all the projects actively supported in the current General Availability or GA version of MapR Distribution for Hadoop—M7. MapR can be downloaded from the URL mentioned on the screen. In the next screen, we will focus on Pivotal HD, another commercial distribution of Hadoop.

20 Pivotal HD

Pivotal HD is a commercially supported, enterprise-capable distribution of Hadoop. It consists of GemFire XD® along with toolsets such as HAWQ (Read as hawk), MADlib (Read as M-A-D-lib), OpenMPI (Read as Open M-P-I), GraphLab, and Spring XD. Pivotal HD can be downloaded from the URL mentioned on the screen. Pivotal HD aims to accelerate data analytics projects, and significantly expands Hadoop’s capabilities. Pivotal GemFire brings real-time analytics to Hadoop, enabling businesses to process and make critical decisions immediately. In the next screen, we will focus on an introduction to the concept of ZooKeeper.

21 Introduction to ZooKeeper

ZooKeeper is an open-source, high-performance coordination service for distributed applications. It offers services such as naming, locks and synchronization, configuration management, and group services. In the next screen, we will discuss the features of ZooKeeper.

22 Features of ZooKeeper

Some of the salient features of ZooKeeper are as follows: ZooKeeper provides a simple and high-performance kernel for building more complex coordination primitives at the client. It also provides distributed coordination services for distributed applications. ZooKeeper follows FIFO, that is, a first-in, first-out approach to job execution. It allows synchronization, serialization, and coordination of nodes in a Hadoop cluster. It uses a pipelined architecture to achieve a wait-free approach. ZooKeeper handles problems by using built-in algorithms for deadlock detection and prevention. It applies a multi-processing approach to reduce the wait time for process execution. ZooKeeper also allows for distributed processing and is thus compatible with services related to MapReduce. In the next screen, we will focus on the goals of ZooKeeper.

23 Goals of ZooKeeper

The goals of ZooKeeper are as follows: Serialization ensures that read and write operations are applied in order, avoiding delays and conflicts. Reliability ensures that once an update is applied in the cluster, it persists. Atomicity does not allow partial results; any user update either succeeds or fails. A simple Application Programming Interface or API provides an interface for development and implementation. In the next screen, we will discuss the typical uses of ZooKeeper.

24 Uses of ZooKeeper

The uses of ZooKeeper are as follows: Configuration refers to ensuring that the nodes in the cluster are in sync with each other and with the NameNode server. Message queue refers to communication with the nodes present in the cluster. Notification refers to the process of notifying the NameNode of any failure that occurs in the cluster, so that the affected task can be restarted on another node. Synchronization refers to ensuring that all the nodes in the cluster are in sync with each other and that the services are up and running. In the next screen, we will focus on what Sqoop is and the reasons why it is used.
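Before moving on, here is a minimal sketch of the configuration use case with the ZooKeeper Java client; the ensemble address, znode path, and value are assumptions, and a production client would also handle cases such as the znode already existing.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Hypothetical ensemble address; session timeout is in milliseconds.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();       // session established
                }
            });
            connected.await();

            // Configuration use case: publish a shared setting under a znode...
            zk.create("/batch-size", "500".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // ...and read it back with a watch, so this client is notified when it changes.
            byte[] value = zk.getData("/batch-size", true, null);
            System.out.println("batch-size = " + new String(value));

            zk.close();
        }
    }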

25 Sqoop—Reasons to Use It

Sqoop is an Apache Hadoop ecosystem project whose responsibility is to perform import and export operations between relational databases such as MySQL (read as My S-Q-L), MSSQL (read as M-S S-Q-L), and Oracle, and HDFS. Listed on the screen are the reasons to use Sqoop:
• SQL servers are deployed worldwide and are the primary means of accepting data from users; nightly processing has been done on SQL servers for years.
• As Hadoop makes its way into enterprises, it is essential to have a mechanism to move data from traditional SQL databases to Hadoop HDFS.
• Transferring the data using hand-written automated scripts is inefficient and time-consuming.
• Traditional databases have reporting, data visualization, and other enterprise applications built in, but handling large data requires an ecosystem like Hadoop.
• Sqoop also satisfies the need to bring processed data from Hadoop HDFS back to applications such as database engines or web services.
In the next screen, we will continue to discuss why Sqoop is needed.

26 Sqoop—Reasons to Use It (contd.)

Sqoop is required when data is moved from a Relational Database or RDB to Hadoop, or vice versa. A Relational Database holds data in a structured format; MySQL and Oracle databases are examples of RDBs. While importing data from a Relational Database into Hadoop, users must consider the consistency of the data, the consumption of production system resources, and the preparation of the data for provisioning downstream pipelines. They must also keep in mind that directly accessing data residing on external systems from within the MapReduce framework complicates applications and exposes the production system to excessive loads originating from cluster nodes. Hence, Sqoop is needed. In the next screen, we will discuss the benefits of using Sqoop.

27 Benefits of Sqoop

The benefits of using Sqoop are as follows:
• It is a tool designed to transfer data from Hadoop to an RDB and vice versa.
• It transforms data in Hadoop with the help of MapReduce or Hive without extra coding.
• It is used to import data from a relational database such as MySQL (read as My S-Q-L), MSSQL (read as M-S S-Q-L), or Oracle into the Hadoop Distributed File System.
• Sqoop can also export the data back to the RDB.
In the next screen, we will focus on the Apache Hadoop ecosystem.
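As a hedged sketch of an import (the JDBC URL, credentials, table, and target directory are placeholders), Sqoop can be driven programmatically by passing it the same arguments the sqoop command line would take; this assumes the Sqoop 1 client library and Hadoop configuration are on the classpath.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // The same arguments a command-line "sqoop import" would take; values are placeholders.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://db.example.com/sales",
                "--username", "etl_user",
                "--password", "secret",
                "--table", "orders",
                "--target-dir", "/data/sales/orders",
                "--num-mappers", "4"
            };
            int exitCode = Sqoop.runTool(importArgs);   // launches MapReduce map tasks to copy the rows
            System.exit(exitCode);
        }
    }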

28 Apache Hadoop Ecosystem

The image on the screen displays the various Hadoop ecosystem components that are part of the Apache Software Foundation projects. Please note that there are many other commercial and open-source offerings apart from the Apache projects mentioned on this screen. The Hadoop ecosystem components have been categorized as follows:
• File system
• Data store
• Serialization
• Job execution
• Work management
• Development
• Operations
• Security
• Data transfer
• Data interactions
• Analytics and intelligence
• Search
• Graph processing
In the next few screens, we will discuss some of the Hadoop ecosystem components. We will start with Apache Oozie in the following screen.

29 Apache Oozie

Apache Oozie is a workflow scheduler system used to manage Hadoop MapReduce jobs. The workflow scheduler gives users the option to prioritize jobs based on their requirements. Apache Oozie executes and monitors workflows in Hadoop and performs periodic scheduling of workflows. Further, Oozie can trigger the execution of workflows based on data availability. It also provides a web console and a command-line interface or CLI. In the next screen, we will focus on an introduction to Mahout.

30 Introduction to Mahout

Mahout is an ecosystem component that is dedicated to machine learning. Machine learning can be performed in three modes, namely, supervised, unsupervised, and semi-supervised. In the next screen, we will focus on the usage of Mahout.

31 Usage of Mahout

Mahout helps in clustering, which is one of the most popular techniques of machine learning. Clustering allows the system to group numerous entities into separate clusters or groups based on certain characteristics or features of the entities. One of the best examples of clustering is seen in the Google News section. In the next screen, we will focus on an introduction to Apache Cassandra.

32 Apache Cassandra

Apache Cassandra is a freely distributed, high-performance, extremely scalable, and fault-tolerant post-relational database. It has the following features:
• It is designed keeping in mind that system or hardware failures can occur.
• Cassandra follows a read- and write-anywhere design, which makes it different from other ecosystem components.
The benefits of Cassandra are that:
• it performs Online Transaction Processing or OLTP operations and Online Analytical Processing or OLAP operations; and
• it helps to modify real-time data and perform data analytics.
In the next screen, we will discuss Apache Spark.

33 Apache Spark

Apache Spark is a fast and general MapReduce-like engine for large-scale data processing. Following are the key advantages of Spark:
Firstly, Spark is fast:
• Spark claims to run programs up to 100 times faster than Hadoop MapReduce in memory, or ten times faster on disk.
• Spark has an advanced DAG (read as D-A-G) execution engine that supports cyclic data flow and in-memory computing.
Secondly, Spark is easy to use:
• It offers support to write applications quickly in Java, Scala, or Python.
• It offers interactive Scala and Python shells.
Thirdly, Spark provides generality:
• It can combine SQL, streaming, and complex analytics.
• It powers a stack of high-level tools including Spark SQL, MLlib (read as M-L-Lib) for machine learning, GraphX (read as Graph X), and Spark Streaming.
Fourthly, Spark is integrated with Hadoop: Spark can run on the YARN (read as one word, ‘yarn’) cluster manager of Hadoop 2 and can read any existing Hadoop data.
In the next screen, we will discuss Apache Ambari.
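Before moving on, here is a minimal word-count sketch using Spark's Java API (assuming the Spark 2.x core library; the input path is a placeholder), illustrating the in-memory, DAG-based processing described above.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // "local[*]" keeps the sketch self-contained; on a Hadoop 2 cluster this would be "yarn".
            SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile("hdfs:///data/sample.txt");   // placeholder path
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);                                // aggregated in memory

            counts.take(10).forEach(pair -> System.out.println(pair._1() + "\t" + pair._2()));
            sc.stop();
        }
    }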

34 Apache Ambari

Apache Ambari is a completely open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari enables system administrators to provision, manage, and monitor a Hadoop cluster, and to integrate Hadoop with enterprise operational tools. In the next screen, we will list the key features of Apache Ambari.

35 Key Features of Apache Ambari

Some of the key features of Apache Ambari are as follows: It has a wizard-driven installation of Hadoop across any number of hosts. Ambari provides API-driven installation of Hadoop via Ambari Blueprints for automated provisioning. It offers granular control of Hadoop service and component lifecycles. It helps in the management of Hadoop service configurations and provides advanced job diagnostic and visualization tools. Ambari has robust RESTful APIs for customization and integration with enterprise systems. In the next screen, we will focus on Kerberos, which ensures Hadoop security.

36 Hadoop Security—Kerberos

Hadoop relies on Kerberos for secure authentication. Kerberos is a third-party authentication mechanism in which users and the services that users wish to access rely on the Kerberos server to authenticate each to the other. The Kerberos server, also known as the Key Distribution Center or KDC, has three parts:
• Principals: a database of the users and services along with their respective Kerberos passwords.
• Authentication Server or AS: performs initial authentication and issues a Ticket Granting Ticket or TGT (read as T-G-T).
• Ticket Granting Server or TGS: issues subsequent service tickets based on the initial TGT.
In the next screen, we will summarize the lesson.
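As a brief sketch of how a Hadoop client authenticates through this flow (the principal and keytab path are placeholders): Hadoop's UserGroupInformation API obtains the TGT from the KDC, and subsequent requests to cluster services use service tickets issued by the TGS.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLoginExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tell the Hadoop client that the cluster expects Kerberos authentication.
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);

            // Initial authentication: obtain a TGT from the KDC using a principal and its keytab.
            UserGroupInformation.loginUserFromKeytab(
                    "etl@EXAMPLE.COM", "/etc/security/keytabs/etl.keytab");

            // Later calls to services such as HDFS use service tickets issued by the TGS.
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Authenticated as: " + UserGroupInformation.getLoginUser());
            System.out.println("Root directory exists: " + fs.exists(new Path("/")));
        }
    }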

37 Summary

Let us summarize the topics covered in this lesson:
• Hive is a data warehouse system facilitating the analysis of large data sets in Hadoop.
• HBase is a distributed, column-oriented database built on top of HDFS.
• Cloudera offers the user-friendly Cloudera Manager for system management.
• Hortonworks Data Platform, MapR data platform, and Pivotal HD are some of the commercial distributions of Hadoop.
• Some of the components of the Hadoop ecosystem are Oozie, Cassandra, and Spark.
Next, we will look at a few questions based on the lessons covered.
