If you have had a look at the Hadoop Ecosystem, you may have noticed the yellow elephant trunk logo that says HIVE, but do you know what Hive is all about and what it does? At a high level, some of Hive's main features include querying and analyzing large datasets stored in HDFS. It supports easy data summarization, ad-hoc queries, and analysis of vast volumes of data stored in various databases and file systems that integrate with Hadoop. In other words, in the world of big data, Hive is huge.
Fig: Hadoop Ecosystem
In this Hive tutorial, let's start by understanding why Hive came into existence.
History of Hive
Hive has a fascinating history related to the world's largest social networking site: Facebook. Facebook adopted the Hadoop framework to manage their big data. If you have read our previous blogs, you would know that big data is nothing but massive amounts of data that cannot be stored, processed, and analyzed by traditional systems.
As we know, Hadoop uses MapReduce to process data. With MapReduce, users were required to write long and extensive Java code. Not all users were well-versed with Java and other coding languages. Users were comfortable with writing queries in SQL (Structured Query Language), and they wanted a language similar to SQL. Enter the HiveQL language. The idea was to incorporate the concepts of tables and columns, just like SQL.
Hive is a data warehouse system that is used to query and analyze large datasets stored in the HDFS. Hive uses a query language called HiveQL, which is similar to SQL.
As seen from the image below, the user first sends out the Hive queries. These queries are converted into MapReduce tasks, and that accesses the Hadoop MapReduce system.
Fig: Hive Process
In the next section of the Hive tutorial, let's now take a look at the architecture of the Hive.
Architecture of Hive
The architecture of the Hive is as shown below. We start with the Hive client, who could be the programmer who is proficient in SQL, to look up the data that is needed.
Fig: Architecture of Hive
The Hive client supports different types of client applications in different languages to perform queries. Thrift is a software framework. The Hive Server is based on Thrift, so it can serve requests from all of the programming languages that support Thrift.
Next, we have the JDBC (Java Database Connectivity) application and Hive JDBC Driver.
The JDBC application is connected through the JDBC Driver. Then we have an ODBC (Open Database Connectivity) application connected through the ODBC Driver. All these client requests are submitted to the Hive server.
In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries. Commands are executed directly in CLI. Up next is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:
- Compiler - The Hive driver passes the query to the compiler, where it is checked and analyzed
- Optimizer - Optimized logical plan in the form of a graph of MapReduce and HDFS tasks is obtained
- Executor - In the final step, the tasks are executed
Metastore is a repository for Hive metadata. It stores metadata for Hive tables, and you can think of this as your schema. This is located on the Apache Derby DB. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS. If you have read our other Hadoop blogs, you'll know that these are on commodity machines and are linearly scalable, which means they're very affordable.
In this Hive tutorial, let's understand how does the data flow in the Hive.
Data Flow in Hive
Data flow in the Hive contains the Hive and Hadoop system. Underneath the user interface, we have driver, compiler, execution engine, and metastore. All of that goes into the MapReduce and the Hadoop file system.
Fig: Data flow in Hive
The data flow in the following sequence:
- We execute a query, which goes into the driver
- Then the driver asks for the plan, which refers to the query execution
- After this, the compiler gets the metadata from the metastore
- The metastore responds with the metadata
- The compiler gathers this information and sends the plan back to the driver
- Now, the driver sends the execution plan to the execution engine
- The execution engine acts as a bridge between the Hive and Hadoop to process the query
- In addition to this, the execution engine also communicates bidirectionally with the metastore to perform various operations, such as create and drop tables
- Finally, we have a bidirectional communication to fetch and send results back to the client
Fig: Data flow in Hive
Hive Data Modeling
That was how data flows in the Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets:
- Tables - Tables in Hive are created the same way it is done in RDBMS
- Partitions - Here, tables are organized into partitions for grouping similar types of data based on the partition key
- Buckets - Data present in partitions can be further divided into buckets for efficient querying
Fig: Hive Data Modelling
Now, as you have understood what is Hive data modeling, let us dive into the Hive data types in this Hive tutorial.
Hive Data Types
Now that you know how data is classified in Hive. Let us look into the different Hive data types. These are classified as primitive and complex data types.
Primitive Data Types:
- Numeric Data types - Data types like integral, float, decimal
- String Data type - Data types like char, string
- Date/ Time Data type - Data types like timestamp, date, interval
- Miscellaneous Data type - Data types like Boolean and binary
Complex Data Types:
- Arrays - A collection of the same entities. The syntax is: array<data_type>
- Maps - A collection of key-value pairs and the syntax is map<primitive_type, data_type>
- Structs - A collection of complex data with comments. Syntax: struct<col_name : data_type [COMMENT col_comment],…..>
- Units - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type,..>
Different Modes of Hive
Next, let us move on to understand the modes Hive operates in. Hive operates in two modes depending on the number and size of data nodes. They are:
- Local Mode - Used when Hadoop has one data node, and the amount of data is small. Here, the processing will be very fast on smaller datasets, which are present in local machines.
- Mapreduce Mode - Used when the data in Hadoop is spread across multiple data nodes. Processing large datasets can be more efficient using this mode.
Difference Between Hive and RDBMS
RDBMS, which stands for Relational Database Management System, also works with tables. But what is the difference between Hive and RDBMS?
Now, let us get to know the features of Hive in this Hive tutorial.
Features of Hive
Now that we have learned about the architecture of the Hive, the different data types of Hive, and Hive data modeling, let us look into the Hive's various features:
- The use of a SQL-like language called HiveQL in Hive is easier than long codes
- In Hive, tables are used that are similar to RDBMS, hence easier to understand
- By using HiveQL, multiple users can simultaneously query data
- Hive supports a variety of data formats
Finally, we will go through a quick Hive demo, which will help us understand how HiveQL works. Before diving into the demo, you can have a quick look at the Hive website, which is hive.apache.org. Hortonworks provides a useful Hive cheat sheet, too. It shows the different HiveQL commands and various data types.
Now, let's run our Hive demo on a Hadoop cluster. We will use the Cloudera QuickStart, which has the Hadoop setup on a single node. Hadoop and Hive are designed to run across a cluster of computers, but here we are talking about a single node. Therefore, we will start with the Cloudera QuickStart.
When you are in Cloudera, you can access the Hive in two ways. The first one is by using Hue, which has more visuals than code. The screen will look like this:
Once you click on Hive, as seen above, you can start writing queries in the query space. The downside of Hue is that it can be slow. Now, we will move on to the Linux terminal window and start writing commands. You have to start by typing the Hive; this will start the shell. Your screen will now look like the following:
Now, type this:
create database office; // We are creating a database called the office
show databases; // Shows the created database
drop database office; // Drops the office database as it is empty
drop database office cascade; // Drops the tables in the database when it is not empty
create database office; // We will recreate the database office
use office; // Sets office as the default database
Then, open another terminal window and type the following:
From the above image, you can see that we already have a file named Employee.csv, which we will be using. The content of the file is as seen above; the data has a header, and below it, all the values are given. These values are comma-separated. You will then type the following:
pwd // Displays the path
gedit Employee.csv // Lets you edit the content of the table and remove any extra spaces
Go back to the Hive shell and enter the following command:
|create table employee // No ";" as we don't want to execute the line
In addition to this, type the following schema for our database. If you already have it ready, you can paste it.
We put a semicolon at the end and run these lines. Then type show tables; you will see the employee displayed. You can also type describe employee; you will see the data and its types described. After this, you can go back to the Linux terminal, and copy the path and enter the following commands:
In the above image, we put content into the table. Then we can type:
select * from employee; // Displays all the content
select count (*) from employee; // Starts the count and finally displays the number of rows
select * from office.employee WHERE Salary>25000; // Displays the below results
Hive also gives you the ability to alter the table and rename it. You can then have a look at the renamed table:
Now, go back to the loading data and see the tables by navigating back to the terminal window. Then, go ahead and type the cat commands to display the data:
We just completed the above operation to join the different datasets.
If we completed the above steps correctly, we should be able to select the data and complete the following steps:
Now, to find some specific information related to these orders, type the following:
You will now have the final result, as displayed below:
This is very helpful if you have to find particular information. This is how the join works, and it is the widespread use of HiveQL. After this, you can also go ahead and perform the drop operation, along with cascade. You can also try out a few functions used in Hive. Let's have a look at a few of them:
hive> SELECT round(2.3) from temp; // Rounds off the value to the nearest highest integer -> 2.3 - 2
hive> SELECT floor(2.3) from temp; // Rounds off any positive or negative decimal value down to the next least integer value -> 2.3 - 2
hive> SELECT ceil(2.3) from temp; // This function is used to get the smallest integer which is greater than, or equal to, the specified numeric expression -> 2.3 - 3
This was the Hive demo as you saw; all the HiveQL queries are very similar to SQL. The commands are easy to understand and perform.
We hope this has helped you gain a better understanding of Apache Hive. You have learned about the importance of Hive, what Hive does, the various data types in Hive, the different modes in which Hive operates, and the differences between Hive and RDBMS. You also learned how Hive works through a short demo.
If you want to learn more about Big Data and Hadoop, enroll in our Caltech Post Graduate Program In Data Science today!