Hive Tutorial: Working with Data in Hadoop

If you have had a look at the Hadoop ecosystem, you may have noticed the yellow logo that says HIVE, but do you know what Hive is all about and what it does? At a high level, Hive is used to query and analyze large datasets stored in HDFS. It supports easy data summarization, ad-hoc queries, and analysis of vast volumes of data stored in the various databases and file systems that integrate with Hadoop. In other words, in the world of big data, Hive is huge.

Fig: Hadoop Ecosystem

In this Hive tutorial, let's start by understanding why Hive came into existence.


History of Hive

Hive has a fascinating history related to the world's largest social networking site: Facebook. Facebook adopted the Hadoop framework to manage its big data. If you have read our previous blogs, you'll know that big data is nothing but massive amounts of data that cannot be stored, processed, and analyzed by traditional systems.

As we know, Hadoop uses MapReduce to process data. With MapReduce, users were required to write long and extensive Java code. Not all users were well-versed in Java and other coding languages; they were comfortable writing queries in SQL (Structured Query Language), and they wanted a language similar to SQL. Enter HiveQL. The idea was to incorporate the concepts of tables and columns, just like SQL.

Hive is a data warehouse system used to query and analyze large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL.
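
For example, a simple HiveQL query reads almost exactly like SQL; the table and column names below are purely illustrative:

select name, salary from employee where salary > 25000; -- Filters rows just as the equivalent SQL query would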

As seen in the image below, the user first submits Hive queries. These queries are converted into MapReduce jobs, which then run on the Hadoop MapReduce system.

Fig: Hive Process

In the next section of this Hive tutorial, let's take a look at the architecture of Hive.


Architecture of Hive

The architecture of Hive is shown below. We start with the Hive client, which could be a programmer proficient in SQL who needs to look up data.

Fig: Architecture of Hive

The Hive client supports different types of client applications written in different languages. Thrift is a software framework for cross-language services; the Hive Server is based on Thrift, so it can serve requests from any programming language that supports Thrift.

Next, we have the JDBC (Java Database Connectivity) application and Hive JDBC Driver. 

The JDBC application is connected through the JDBC Driver. Then we have an ODBC (Open Database Connectivity) application connected through the ODBC Driver. All these client requests are submitted to the Hive server. 

In addition to the above, we also have the Hive web interface, or GUI, where programmers execute Hive queries; commands can also be executed directly in the CLI. Up next is the Hive driver, which is responsible for all the queries submitted. It performs three steps internally:

  1. Compiler - The Hive driver passes the query to the compiler, where it is parsed, checked, and analyzed
  2. Optimizer - The optimizer produces an optimized logical plan in the form of a graph of MapReduce tasks and HDFS operations
  3. Executor - In the final step, the tasks are executed
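
You can inspect the plan that comes out of these steps with Hive's EXPLAIN command. A minimal sketch, assuming an employee table already exists:

explain select count(*) from employee; -- Prints the stages (map and reduce tasks) Hive will run for this query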

The metastore is a repository for Hive metadata. It stores the metadata for Hive tables, and you can think of it as your schema. By default, it resides in an embedded Apache Derby database. Hive uses the MapReduce framework to process queries. Finally, we have distributed storage, which is HDFS. If you have read our other Hadoop blogs, you'll know that HDFS runs on commodity machines and scales linearly, which makes it very affordable.

In this Hive tutorial, let's now understand how data flows in Hive.

Data Flow in Hive

Data flow in Hive involves both the Hive and Hadoop systems. Beneath the user interface, we have the driver, the compiler, the execution engine, and the metastore. All of these communicate with MapReduce and the Hadoop file system.

Fig: Data flow in Hive

The data flows in the following sequence:

  1. We execute a query, which goes into the driver
  2. Then the driver asks the compiler for a query plan
  3. After this, the compiler gets the necessary metadata from the metastore
  4. The metastore responds with the metadata
  5. The compiler gathers this information and sends the plan back to the driver
  6. Now, the driver sends the execution plan to the execution engine
  7. The execution engine acts as a bridge between Hive and Hadoop to process the query
  8. In addition to this, the execution engine communicates bidirectionally with the metastore to perform various operations, such as creating and dropping tables
  9. Finally, we have bidirectional communication to fetch and send results back to the client

Fig: Data flow in Hive


Hive Data Modeling

That was how data flows in Hive. Let's now take a look at Hive data modeling, which consists of tables, partitions, and buckets:

  1. Tables - Tables in Hive are created the same way they are in an RDBMS
  2. Partitions - Tables are organized into partitions that group similar types of data together, based on a partition key
  3. Buckets - Data present in partitions can be further divided into buckets for efficient querying (a sketch of all three follows this list)
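
As a hedged sketch, a single hypothetical sales table could use all three concepts at once:

create table sales (
  order_id int,
  amount double
)
partitioned by (country string)            -- The partition key groups rows by country
clustered by (order_id) into 8 buckets     -- Buckets subdivide each partition for efficient querying
row format delimited fields terminated by ',';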

Fig: Hive Data Modeling

Now that you have understood Hive data modeling, let us dive into the Hive data types in this Hive tutorial.

Hive Data Types

Now that you know how data is organized in Hive, let us look at the different Hive data types. These are classified as primitive and complex data types.

Primitive Data Types:

  1. Numeric Data types - Data types like integral, float, decimal
  2. String Data type - Data types like char, string
  3. Date/ Time Data type - Data types like timestamp, date, interval
  4. Miscellaneous Data type - Data types like Boolean and binary

Complex Data Types:

  1. Arrays - A collection of elements of the same type. The syntax is: array<data_type>
  2. Maps - A collection of key-value pairs. The syntax is: map<primitive_type, data_type>
  3. Structs - A collection of named fields, each with its own type and an optional comment. Syntax: struct<col_name : data_type [COMMENT col_comment], ...>
  4. Unions - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type, ...> (see the sketch after this list)
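
As an illustrative sketch, a table using all of these complex types could be declared as follows; every name here is hypothetical:

create table contacts (
  name string,
  phones array<string>,                        -- A list of phone numbers
  attributes map<string, string>,              -- Key-value pairs such as 'city' -> 'Tokyo'
  address struct<street:string, zip:int>,      -- Named fields, each with its own type
  misc uniontype<int, string>                  -- Holds a value of exactly one of these types at a time
);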

Different Modes of Hive

Next, let us move on to the modes Hive operates in. Hive operates in two modes, depending on the number of data nodes and the amount of data. They are:

  1. Local Mode - Used when Hadoop has only one data node and the amount of data is small. Processing is very fast on smaller datasets that are present on the local machine (see the setting sketched after this list).
  2. MapReduce Mode - Used when the data in Hadoop is spread across multiple data nodes. Processing large datasets is more efficient in this mode.
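
Hive can also decide this for itself. As a hedged example, the following configuration property tells Hive to run sufficiently small jobs locally instead of submitting them to the cluster:

SET hive.exec.mode.local.auto=true; -- Let Hive automatically use local mode for small queries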

Difference Between Hive and RDBMS 

RDBMS, which stands for Relational Database Management System, also works with tables. But what is the difference between Hive and RDBMS?

Hive:

  • Hive enforces schema on read
  • Hive data size is in petabytes
  • Hive is based on the notion of write once, read many times
  • Hive resembles a traditional database by supporting SQL, but it is not a database; it is a data warehouse
  • Hive is easily scalable at low cost

RDBMS:

  • RDBMS enforces schema on write
  • RDBMS data size is in terabytes
  • RDBMS is based on the notion of read and write many times
  • RDBMS is a type of database management system based on the relational model of data
  • RDBMS is not scalable at low cost

Now, let us get to know the features of Hive in this Hive tutorial.

Features of Hive

Now that we have learned about the architecture of Hive, the different Hive data types, and Hive data modeling, let us look at Hive's various features:

  1. HiveQL, Hive's SQL-like query language, is much easier to use than writing long MapReduce code
  2. Hive uses tables similar to those in an RDBMS, which makes the data easier to understand
  3. Using HiveQL, multiple users can query data simultaneously
  4. Hive supports a variety of data formats (see the sketch after this list)
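
For instance, the same columns can be stored in different formats simply by changing the STORED AS clause; the table names below are illustrative:

create table logs_text (line string) stored as textfile; -- Plain text, the default format
create table logs_orc (line string) stored as orc;       -- Columnar ORC format, better suited to analytics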


Hive Demo

Finally, we will go through a quick Hive demo, which will help us understand how HiveQL works. Before diving into the demo, you can have a quick look at the Hive website, which is hive.apache.org. Hortonworks provides a useful Hive cheat sheet, too. It shows the different HiveQL commands and various data types. 

Now, let's run our Hive demo on a Hadoop cluster. We will use the Cloudera QuickStart VM, which has Hadoop set up on a single node. Hadoop and Hive are designed to run across a cluster of computers, but here we are working with a single node. Therefore, we will start with the Cloudera QuickStart.

When you are in Cloudera, you can access Hive in two ways. The first is through Hue, which is more visual than the command line. The screen will look like this:

Hive Demo

Once you click on Hive, as seen above, you can start writing queries in the query space. The downside of Hue is that it can be slow. Now, we will move to the Linux terminal window and start typing commands. Start by typing hive; this launches the Hive shell. Your screen will now look like the following:

Hive Demo

Now, type this:

create database office; -- Creates a database called office

show databases; -- Lists the existing databases

drop database office; -- Drops the office database, since it is empty

drop database office cascade; -- The cascade keyword also drops a database's tables when it is not empty

create database office; -- Recreates the office database

use office; -- Sets office as the default database

Then, open another terminal window and type the following:

Hive Demo

From the above image, you can see that we already have a file named Employee.csv, which we will be using. The content of the file is as seen above: the data has a header row, and below it, all the values are given, separated by commas. You will then type the following:

pwd # Displays the current working directory path

gedit Employee.csv # Opens the file so you can edit its contents and remove any extra spaces

Go back to the Hive shell and enter the following command:

create table employee -- No ";" yet, as we don't want to execute the statement until the schema is complete

In addition to this, type the following schema for our database. If you already have it ready, you can paste it.

Hive Demo
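
The exact schema appears only in the screenshot above; it would look roughly like the following sketch, where the column names are assumptions based on this demo and the row format matches the comma-separated Employee.csv file:

create table employee (
  name string,
  salary int
)
row format delimited
fields terminated by ','; -- Tells Hive the file's columns are separated by commas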

We put a semicolon at the end and run these lines. Then type show tables; and you will see employee displayed. You can also type describe employee; to see the columns and their types. After this, go back to the Linux terminal, copy the path, and enter the following commands:

Hive Demo

In the above image, we put the contents of the file into the table.
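
The loading command itself appears only in the screenshot; loading a local CSV into a Hive table typically looks like this sketch, where the path is illustrative:

load data local inpath '/home/cloudera/Employee.csv' into table employee; -- Copies the file's rows into the table

With the data loaded, we can then type: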

select * from employee; -- Displays all the rows

select count(*) from employee; -- Runs a count job and displays the number of rows

select * from office.employee WHERE Salary > 25000; -- Displays the results shown below

Hive Demo

Hive also gives you the ability to alter the table and rename it. You can then have a look at the renamed table:

Hive Demo
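
The commands behind that screenshot follow this general pattern; the new table name here is illustrative:

alter table employee rename to employee1; -- Renames the table
show tables; -- Confirms the new name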

Now, navigate back to the terminal window to load more data and look at the tables. Then, go ahead and use the cat command to display the data:

Hive Demo

We just completed the above operation so that we can join the different datasets.

Hive Demo
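
The join commands themselves appear only in the screenshots; as a sketch, a HiveQL join over two hypothetical tables, orders and order_items, looks like this:

select o.order_id, o.order_date, i.item_name
from orders o
join order_items i on (o.order_id = i.order_id); -- Combines rows from both tables on the shared key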

If we completed the above steps correctly, we should be able to select the data and complete the following steps:

Hive Demo

Now, to find some specific information related to these orders, type the following:

Hive Demo

You will now have the final result, as displayed below:

Hive Demo

This is very helpful when you have to find particular information. This is how a join works, and joins are one of the most widespread uses of HiveQL. After this, you can also go ahead and perform the drop operation, along with cascade. You can also try out a few of Hive's built-in functions. Let's have a look at three of them:

hive> SELECT round(2.3) from temp; -- Rounds the value to the nearest integer: 2.3 -> 2

hive> SELECT floor(2.3) from temp; -- Returns the largest integer less than or equal to the value: 2.3 -> 2

hive> SELECT ceil(2.3) from temp; -- Returns the smallest integer greater than or equal to the value: 2.3 -> 3

That was the Hive demo. As you saw, all the HiveQL queries are very similar to SQL, and the commands are easy to understand and run.


Conclusion

We hope this has helped you gain a better understanding of Apache Hive. You have learned about the importance of Hive, what Hive does, the various data types in Hive, the different modes in which Hive operates, and the differences between Hive and RDBMS. You also learned how Hive works through a short demo.

If you want to learn more about Big Data and Hadoop, enroll in our Caltech Post Graduate Program In Data Science today!

About the Author

Shruti M

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
