If you have had a look at the Hadoop ecosystem, you may have noticed the yellow elephant logo that says HIVE, but do you know what Hive is all about and what it does? At a high level, Hive is used to query and analyze large datasets stored in HDFS. It supports easy data summarization, ad-hoc queries, and analysis of vast volumes of data stored in the various databases and file systems that integrate with Hadoop. In other words, in the world of big data, Hive is huge.
Fig: Hadoop Ecosystem
In this Hive tutorial, we will talk in-depth about Hive: why it came into existence, how it works, its architecture, data types, data modeling, the modes it operates in, how it differs from an RDBMS, its main features, and finally a short HiveQL demo. Let's start by understanding why Hive came into existence.
Looking forward to becoming a Hadoop Developer? Check out the Big Data Hadoop Certification Training Course and get certified today.
Hive has a fascinating history related to the world's largest social networking site: Facebook. Facebook adopted the Hadoop framework to manage their big data. If you have read our previous blogs, you would know that big data is nothing but massive amounts of data that cannot be stored, processed, and analyzed by traditional systems.
As we know, Hadoop uses MapReduce to process data. With MapReduce, users had to write long and extensive Java code, and not all users were well-versed in Java or other programming languages. Most users were comfortable writing queries in SQL (Structured Query Language) and wanted a language similar to SQL. Enter HiveQL. The idea was to incorporate the concepts of tables and columns, just like SQL.
Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL.
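As a quick illustration of that similarity, here is a minimal, hypothetical HiveQL query; the table and column names (employee, department, salary) are placeholders for this example, not part of any dataset discussed later:

SELECT department, AVG(salary) AS avg_salary
FROM employee
GROUP BY department
ORDER BY avg_salary DESC;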
As seen in the image below, the user first submits Hive queries. These queries are converted into MapReduce jobs, which are then executed on the Hadoop MapReduce system.
Fig: Hive Process
In the next section of this Hive tutorial, let's take a look at the architecture of Hive.
The architecture of Hive is shown below. We start with the Hive client, which could be a programmer proficient in SQL who needs to look up data.
Fig: Architecture of Hive
The Hive client supports client applications written in different languages for submitting queries. Thrift is a software framework for cross-language services; because the Hive Server is based on Thrift, it can serve requests from all of the programming languages that support Thrift.
Next, we have the JDBC (Java Database Connectivity) application and Hive JDBC Driver.
The JDBC application is connected through the JDBC Driver. Then we have an ODBC (Open Database Connectivity) application connected through the ODBC Driver. All these client requests are submitted to the Hive server.
In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries; commands can also be executed directly in the CLI. Up next is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:
1. Compiler - the driver passes the query to the compiler, where it is checked and compiled with the help of the metadata.
2. Optimizer - the optimizer produces an optimized execution plan, typically as a DAG of MapReduce tasks.
3. Executor - the executor runs those tasks in the proper order.
The metastore is a repository for Hive metadata. It stores the metadata for Hive tables, and you can think of this as your schema. By default it is kept in an embedded Apache Derby database. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS. If you have read our other Hadoop blogs, you'll know that Hadoop runs on commodity machines and is linearly scalable, which means it's very affordable.
In this Hive tutorial, let's now understand how data flows in Hive.
Data flow in Hive involves both the Hive and the Hadoop system. Underneath the user interface, we have the driver, compiler, execution engine, and metastore. All of these interact with MapReduce and the Hadoop file system.
Fig: Data flow in Hive
The data flows in the following sequence:
1. The user submits a query through the user interface.
2. The driver passes the query to the compiler to create an execution plan.
3. The compiler requests the required metadata from the metastore, which sends it back.
4. The compiler returns the plan to the driver.
5. The driver submits the plan to the execution engine.
6. The execution engine runs the job on MapReduce, reading from and writing to HDFS, and collects the results.
7. The results are sent back through the driver to the user interface.
That was how data flows in Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets (a small example follows the figure below):
- Tables: Hive tables are similar to RDBMS tables and are made up of rows and columns.
- Partitions: a table can be divided into partitions based on the values of a column, such as date, so that queries only scan the relevant slice of data.
- Buckets: data in a table or partition can be further divided into buckets based on the hash of a column, which makes sampling and joins more efficient.
Fig: Hive Data Modelling
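To make this concrete, here is a minimal, hypothetical table definition that uses all three concepts; the table name and columns (sales, customer_id, amount, country) are placeholders, not part of the original demo:

create table sales (customer_id int, amount double)
partitioned by (country string)
clustered by (customer_id) into 4 buckets
row format delimited fields terminated by ',';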
Now that you have understood Hive data modeling, let us dive into Hive data types. These are classified as primitive and complex data types. Primitive types include the numeric types (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL), STRING, BOOLEAN, and date/time types such as DATE and TIMESTAMP. Complex types include ARRAY, MAP, STRUCT, and UNIONTYPE, which let a single column hold nested or repeated values.
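As a short illustration, a single CREATE TABLE statement can mix primitive and complex types; the table and column names here are hypothetical:

create table employee_profile (
  name string,
  salary double,
  skills array<string>,
  phone_numbers map<string, string>,
  address struct<city:string, zip:int>
);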
Next, let us move on to the modes Hive operates in. Hive operates in two modes, depending on the number of data nodes in Hadoop and the size of the data (the sketch after this list shows how to switch between them):
- Local mode: used when Hadoop runs with a single data node in pseudo-distributed mode and the data is small; processing is fast on such smaller datasets present on the local machine.
- MapReduce mode: the default mode, used when Hadoop has multiple data nodes and the data is distributed across them; large datasets are processed in parallel across the cluster.
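For example, Hive can be told to decide this automatically, or the execution framework can be set explicitly for the current session; these are standard configuration properties, shown here only as a sketch:

SET hive.exec.mode.local.auto=true;    -- let Hive switch small jobs to local mode automatically
SET mapreduce.framework.name=local;    -- force local execution (use "yarn" to go back to cluster mode)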
RDBMS, which stands for Relational Database Management System, also works with tables. But what is the difference between Hive and RDBMS?
Hive | RDBMS
Enforces schema on read; data is not validated when it is loaded | Enforces schema on write; data is validated at load time
Built for batch analytics (OLAP) on largely static data | Built for transactional workloads (OLTP) on frequently changing data
Scales to petabytes of data on clusters of commodity machines | Typically handles gigabytes to terabytes of data
Record-level updates and deletes are limited | Record-level inserts, updates, and deletes are fully supported
Now, let us get to know the features of Hive in this Hive tutorial.
Now that we have learned about the architecture of Hive, its data types, and Hive data modeling, let us look into some of Hive's key features:
- It uses HiveQL, a SQL-like query language, so no deep programming knowledge is required.
- Tables and databases are created first, and data is then loaded into them.
- It is designed for managing and querying structured data.
- It supports partitioning and bucketing to speed up queries.
- It supports a variety of file formats, such as TEXTFILE, SEQUENCEFILE, ORC, and Parquet.
- It is built for analytics (OLAP), not for transaction processing (OLTP).
Finally, we will go through a quick Hive demo, which will help us understand how HiveQL works. Before diving into the demo, you can have a quick look at the Hive website, which is hive.apache.org. Hortonworks provides a useful Hive cheat sheet, too. It shows the different HiveQL commands and various data types.
Now, let's run our Hive demo on a Hadoop cluster. We will use the Cloudera QuickStart, which has the Hadoop setup on a single node. Hadoop and Hive are designed to run across a cluster of computers, but here we are talking about a single node. Therefore, we will start with the Cloudera QuickStart.
When you are in Cloudera, you can access Hive in two ways. The first is through Hue, a browser-based interface that is more visual than the command line. The screen will look like this:
Once you click on Hive, as seen above, you can start writing queries in the query space. The downside of Hue is that it can be slow. Now, we will move on to the Linux terminal window and start writing commands. Start by typing hive; this launches the Hive shell. Your screen will now look like the following:
Now, type this:
create database office;          -- Creates a database called office
show databases;                  -- Lists the databases, including the one we just created
drop database office;            -- Drops the office database (works here because it is empty)
drop database office cascade;    -- CASCADE drops a database even when it still contains tables
create database office;          -- Recreates the database office
use office;                      -- Sets office as the current (default) database
Then, open another terminal window and type the following:
As you can see in the image above, we already have a file named Employee.csv, which we will be using. The file has a header row, and below it all the values are given, separated by commas. You will then type the following:
pwd                  # Displays the current directory path
gedit Employee.csv   # Opens the file so you can inspect it and remove any extra spaces
Go back to the Hive shell and enter the following command:
create table employee    -- no ";" yet, as we don't want to execute the statement until the full schema is typed
In addition to this, type the column definitions (the schema) for our table. If you already have them ready, you can paste them in.
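The exact columns appear only in the original screenshot, so the following is just a plausible sketch of the continuation; the column names (id, name, department, salary), the comma delimiter, and the header-skipping property are assumptions for illustration:

(id int, name string, department string, salary int)
row format delimited
fields terminated by ','
tblproperties ("skip.header.line.count"="1");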
We put a semicolon at the end and run these lines. Then type show tables; and you will see employee listed. You can also type describe employee; to see the columns and their types. After this, go back to the Linux terminal, copy the path shown by pwd, and use it in the following command back in the Hive shell:
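The actual command appears only as a screenshot in the original post; a typical way to load the file, assuming it sits in the cloudera home directory, would be:

load data local inpath '/home/cloudera/Employee.csv' into table employee;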
This puts the content of the file into the table. Then we can type:
select * from employee;                                -- Displays all the rows in the table
select count(*) from employee;                         -- Runs a job to count the rows and displays the total
select * from office.employee where Salary > 25000;    -- Displays only the employees earning more than 25000
Hive also gives you the ability to alter a table and rename it. You can then have a look at the renamed table:
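The exact statement isn't reproduced here, but a rename typically looks like this; the new name employee1 is just an illustration:

alter table employee rename to employee1;
show tables;    -- the table now appears under its new name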
Now, navigate back to the terminal window and use the cat command to display the data files we want to join.
The operations above create tables for these datasets so that they can be joined.
If we completed the above steps correctly, we should be able to select the data and carry out the following steps:
Now, to find some specific information related to these orders, type the following:
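The original query is shown only as a screenshot; a representative join, assuming two tables named customers and orders that share a customer id column, might look something like this (all names here are placeholders):

select c.name, o.order_date, o.amount
from customers c
join orders o on (c.id = o.customer_id)
where o.amount > 1000;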
You will now have the final result, as displayed below:
This is very helpful if you have to find particular information. Joins like this are one of the most widespread uses of HiveQL. After this, you can also go ahead and perform the drop operation, along with cascade. You can also try out a few of the built-in functions Hive provides. Let's have a look at a few of them:
hive> SELECT round(2.3) from temp;   -- Rounds the value to the nearest integer: 2.3 -> 2
hive> SELECT floor(2.3) from temp;   -- Rounds down to the largest integer less than or equal to the value: 2.3 -> 2
hive> SELECT ceil(2.3) from temp;    -- Rounds up to the smallest integer greater than or equal to the value: 2.3 -> 3
That was the Hive demo. As you saw, all the HiveQL queries are very similar to SQL, and the commands are easy to understand and run.
Test your understanding of the Spark and Hive concepts with the Big Data and Hadoop Developer Practice Test. Try answering now!
We hope this has helped you gain a better understanding of Apache Hive. You have learned about the importance of Hive, what Hive does, the various data types in Hive, the different modes in which Hive operates, and the differences between Hive and RDBMS. You also learned how Hive works through a short demo.
If you want to learn more about Big Data and Hadoop, enroll in our Big Data Hadoop Certification Training Course today!
Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.