Advanced Hive Concepts and Data File Partitioning Tutorial

7.1 Introduction

Hello and welcome to the seventh lesson of the Big Data Hadoop and Spark Developer course offered by Simplilearn. This lesson covers an overview of the partitioning features of Hive, which are used to improve the performance of SQL queries. You will also learn about the Hive Query Language and how it can be extended to improve query performance. After completing this lesson, you will be able to:

- Improve query performance with the concepts of data file partitioning
- Define the Hive Query Language, or HiveQL
- Describe ways in which HiveQL can be extended

Overview of Data File Partitioning

In the first topic of the lesson, you will get an overview of data file partitioning. In the previous lessons, you learned that by default, all files in a data set are stored in a single Hadoop Distributed File System (HDFS) directory. Hive is considered a tool of choice for performing queries on large datasets, especially those that require full table scans, and it offers advanced partitioning features. Data file partitioning is very useful for pruning data during queries, which reduces query times. There are many instances where users need to filter the data on specific column values. Using the partitioning feature of Hive, which subdivides the data, users can identify the columns on which to organize the data. Analysis can then be performed on only the relevant subset of data, resulting in greatly improved performance of Hive queries. In partitioned tables, subdirectories are created under the table's data directory for each unique value of a partition column. You will learn more about the partitioning features in the subsequent screens.

Let's begin with an example of a non-partitioned table. In non-partitioned tables, by default, all queries have to scan all files in the directory. This means that Hive must read all the files in a table's data directory, which can be a slow and expensive process, especially when the tables are large. In this example, there is a State column created in Hive. The requirement is to convert the table to a state-wise partition, so that separate partitions are created for separate states. The customer details are to be partitioned by state for fast retrieval of the subset of data pertaining to the customer category. Remember that you can perform the same queries in Impala as well. Next, you will see how this table is partitioned state-wise, so that a full scan of the entire table is not required.

Here is an example of a partitioned table. This example shows how the previously non-partitioned table is now partitioned. You can see that the state column is no longer included in the CREATE TABLE definition, but it is included in the partition definition. Partitions are horizontal slices of data that allow larger sets of data to be separated into more manageable chunks. This essentially means that you can use partitioning to store data in separate files by state, as shown in the example. At the time of table creation, partitions are defined using the PARTITIONED BY clause, with a list of column definitions for partitioning. A partition column is a "virtual column" whose data is not actually stored in the data files.

In this demonstration, you type a command that creates an external table partitioned by state. The partition column is of string type, and the table will be stored in the simplilearn/accounts_by_state directory in HDFS. Executing the command creates the partitioned table.
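The CREATE TABLE commands in this topic appear only on screen and are not reproduced in the transcript. The following is a minimal sketch of what the narration describes; the data column names and delimiter are assumptions, while the table name, partition column, and HDFS location follow the demo (the later DESCRIBE demo calls the table accounts_by_state).

    -- Non-partitioned version: every query scans all files in the directory
    CREATE EXTERNAL TABLE accounts (
        acct_num   STRING,
        first_name STRING,
        last_name  STRING,
        state      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/simplilearn/accounts';

    -- Partitioned version: state moves out of the column list and into
    -- the PARTITIONED BY clause, where it becomes a virtual column
    CREATE EXTERNAL TABLE accounts_by_state (
        acct_num   STRING,
        first_name STRING,
        last_name  STRING
    )
    PARTITIONED BY (state STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/simplilearn/accounts_by_state';

With this layout, Hive stores each state's rows in their own subdirectory, such as /simplilearn/accounts_by_state/state=CA, so a query that filters on state reads only that subdirectory.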
This brings you to the end of this demonstration. You have now learned how to create a partitioned table in Hive.

In the next demonstration, you will view the details of the partitioned table named accounts_by_state, which was created in the previous demo. Type DESCRIBE accounts_by_state and execute it. You can now see all the table details along with the partition column details; the partition column behaves like a virtual column. This brings you to the end of this demonstration. You have now learned how to view the details of a partitioned table.

Now that you know what partitioning is, let's understand how you can insert data into partitioned tables. Data insertion into partitioned tables can be done in two ways, or modes:

- Static partitioning
- Dynamic partitioning

You will learn more about these concepts in the subsequent screens, and a sketch of both modes follows below.

Let's begin with static partitioning. In static partitioning mode, you insert the data files individually into a partitioned table. You can create new partitions as needed, and define the new partitions using the ADD PARTITION clause. While loading data, you need to specify which partition to store the data in; this means that with each load, you specify the partition column value. You can add a partition to the table and move the data file into the partition of the table. As you can see in this example, you can add a partition for each new day of account data.

With dynamic partitioning, partitions are created automatically at load time. New partitions can be created dynamically from existing data, and they are created based on the value of the last column in the query. If the partition does not already exist, it will be created; if it does exist, it will be overwritten by the OVERWRITE keyword, as shown in the example. Dynamic partitioning is suitable when you have a large amount of data stored in a table. Keep in mind that dynamic partitioning is disabled in Hive by default, to prevent accidental partition creation. To use dynamic partitioning, you need to enable it with these settings:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
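The load commands shown on screen are not included in the transcript. Here is a minimal sketch of both insertion modes, assuming the accounts and accounts_by_state tables from the earlier sketch; the partition values and file path are hypothetical.

    -- Static partitioning: name the partition explicitly with each load
    ALTER TABLE accounts_by_state ADD PARTITION (state = 'CA');
    LOAD DATA INPATH '/simplilearn/incoming/accounts_CA.csv'
    INTO TABLE accounts_by_state PARTITION (state = 'CA');

    -- Dynamic partitioning: Hive derives the partition from the last
    -- column of the SELECT and creates missing partitions automatically
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    INSERT OVERWRITE TABLE accounts_by_state PARTITION (state)
    SELECT acct_num, first_name, last_name, state
    FROM accounts;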

7.3 Overview of the Hive Query Language

Now that you have learned about the two modes of data insertion into partitioned tables, let's take a look at some commands that are supported on Hive partitioned tables, which allow you to view and delete partitions. You can view the partitions of a partitioned table using the SHOW PARTITIONS command, as illustrated in the image. And how do you delete partitions? To drop partitions, use the ALTER command, as shown in the image. The ALTER command can also be used to add or change partitions.

So, when should you use partitioning, and when should you avoid it? Use partitioning for tables when:

- Reading the entire data set takes too long.
- Queries almost always filter on the partition columns.
- There are a reasonable number of different values for the partition columns.

Avoid partitioning when:

- The candidate partition columns have too many unique values.
- Dynamic partitioning could create a very large number of partitions; be cautious with it, and try to keep a table to fewer than 20,000 partitions.

You have seen that partitioning improves query performance by segregating Hive table data into multiple files, but only when there is a limited number of partitions. However, there may be instances where partitioning a table would result in a very large number of partitions. This is where the concept of bucketing comes in. Bucketing is an optimization technique similar to partitioning. You can use bucketing when you need to run queries on columns that have a huge number of distinct values, which makes it difficult to create partitions. So what do buckets do? They distribute the data load into a user-defined set of clusters by calculating the hash code of the key mentioned in the query. In the bucketing syntax shown here, the data is distributed into 100 buckets according to the hash of user_id; the processor first calculates the hash of the user_id in the query and then looks in only the corresponding bucket. (A sketch of these partitioning and bucketing commands appears at the end of this topic's introduction, below.)

Overview of Hive Query Language

This is the second topic of the lesson. In the previous topic, you looked at the different concepts involved in table partitioning in Hive. You are aware that Hive is a system for managing and querying structured data. In this topic, let's look at the Hive Query Language, or HiveQL, the important Hive principle called extensibility, and the ways in which HiveQL can be extended.

Let's begin with HiveQL. It is the SQL-like query language for Hive, used to process and analyze structured data in a Metastore; an example of a HiveQL query is included in the sketch below. An important principle of HiveQL is extensibility. HiveQL can be extended in multiple ways:

- Pluggable user-defined functions
- Pluggable MapReduce scripts
- Pluggable user-defined types
- Pluggable data formats

You will learn more about user-defined functions and MapReduce scripts in the subsequent screens. User-defined types and data formats are outside the scope of the lesson.

Let's begin with user-defined functions, or UDFs. Hive has the ability to define a function. UDFs provide a way of extending the functionality of Hive with a function, written in Java, that can be evaluated in HiveQL statements. All UDFs extend the Hive UDF class. A UDF subclass then needs to implement one or more methods named evaluate, which will be called by Hive. Evaluate should never be a void method; however, it can return null if required.
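Before moving on to the UDF code itself, here is a minimal sketch of the partitioning and bucketing commands described above. The accounts_by_state table follows the earlier sketch; the user_actions table in the bucketing example is hypothetical, as the narration names only the user_id column and the bucket count.

    -- View and drop partitions
    SHOW PARTITIONS accounts_by_state;
    ALTER TABLE accounts_by_state DROP PARTITION (state = 'CA');

    -- Bucketing: hash user_id into 100 buckets instead of creating
    -- one partition per user
    CREATE TABLE user_actions (
        user_id BIGINT,
        action  STRING
    )
    CLUSTERED BY (user_id) INTO 100 BUCKETS;

    -- A simple HiveQL query against the partitioned table
    SELECT first_name, last_name
    FROM accounts_by_state
    WHERE state = 'NY';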
On the screen is the code used to create a user-defined function by extending the UDF class. After compiling the UDF, you must include it in the Hive classpath. Next is the code used to register the class: once Hive has started, you can use the newly defined function in a query statement after registering it, as the final code sample on the screen shows. (A hedged sketch of the registration and usage steps appears at the end of this topic.) You have learned that writing your own functions in Java creates your own UDFs. Hive also provides built-in functions that can spare you from creating your own UDFs. Let's take a look at these built-in functions. The categories are Mathematical, Collection, Type conversion, Date, Conditional, and String. Consider the examples provided for each category of built-in functions:

- Mathematical: round, floor, and so on
- Collection: size, map_keys, and so on
- Type conversion: cast
- Date: year, datediff, and so on
- Conditional: if, case, and coalesce
- String: length, reverse, and so on

Take a look at some other functions in Hive, such as aggregate functions and table-generating functions. Aggregate functions produce their output from a full set of data. Their implementation is more complex than that of a plain UDF: the user must implement a few more methods, although the overall format is similar. For this reason, Hive provides many built-in User-Defined Aggregate Functions, or UDAFs.

Normal user-defined functions, such as concat, take in a single input row and give out a single output row. In contrast, table-generating functions transform a single input row into multiple output rows. Shown here is a lateral view, which is used in conjunction with table-generating functions; an SQL script using a lateral view is displayed on the screen. Consider a base table named pageAds. It contains two columns: pageid, which is the name of the page, and adid_list, which is an array of the ads appearing on the page. A lateral view with explode can be used to convert the adid_list into separate rows using the given query.

Let's take a look at the MapReduce scripts that help extend HiveQL. MapReduce scripts are written in scripting languages such as Python, and users can plug their own custom mappers and reducers into the data stream. To run a custom mapper script and reducer script, the user can issue a command that uses the TRANSFORM clause to embed the mapper and reducer scripts. In the script shown on the screen, the key-value pairs are converted to strings and delimited by tabs before being fed to the user script, by default. The script's strip method returns a copy of the string with whitespace characters removed from its beginning and end, and its split method returns a list of the string's fields, using the tab character as the separator.

Now that you have learned about UDFs, UDAFs, and MapReduce scripts, let's compare user-defined and user-defined aggregate functions with MapReduce scripts:

- Language: user-defined functions are written in Java, while MapReduce scripts can be written in any language.
- Input to output: both support 1-to-1, n-to-1, and 1-to-n mappings.
- Speed: user-defined functions are much faster than MapReduce scripts, since the latter spawn new processes for different operations.
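The code samples in this topic appear only on screen. Here is a minimal sketch of the HiveQL side of each step, under stated assumptions: the JAR path, the com.example.hive.udf.Lower class, and the to_lower function name are hypothetical; the lateral view uses the pageAds table described above; and the TRANSFORM example assumes a Python script named weekday_mapper.py and an input table user_log with userid and ts columns, none of which are named in the transcript.

    -- Register a compiled Java UDF (hypothetical JAR and class names),
    -- then use it in a query
    ADD JAR /tmp/my_udfs.jar;
    CREATE TEMPORARY FUNCTION to_lower AS 'com.example.hive.udf.Lower';
    SELECT to_lower(first_name) FROM accounts_by_state;

    -- Table-generating function: explode adid_list into one row per ad
    SELECT pageid, adid
    FROM pageAds
    LATERAL VIEW explode(adid_list) adTable AS adid;

    -- MapReduce script: stream rows through a custom Python mapper
    -- (script and table names are assumptions)
    ADD FILE weekday_mapper.py;
    SELECT TRANSFORM (userid, ts)
        USING 'python weekday_mapper.py'
        AS (userid, weekday)
    FROM user_log;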

7.5 Key Takeaways

Now let's summarize what we learned in this lesson:

- Partitions are horizontal slices of data that allow larger sets of data to be separated into more manageable chunks.
- In static partitioning mode, you insert the data files individually into a partitioned table; dynamic partitioning is suitable when you have a large amount of data stored in a table.
- Use the SHOW PARTITIONS command to view partitions; to delete or add partitions, use the ALTER command.
- Use partitioning when reading the entire data set takes too long, queries almost always filter on the partition columns, and there are a reasonable number of different values for the partition columns.
- HiveQL is the SQL-like query language for Hive, used to process and analyze structured data in a Metastore.
- HiveQL can be extended with the help of user-defined functions, MapReduce scripts, user-defined types, and data formats.
