Querying with Hive and Impala Tutorial
2.1 Querying with Hive and Impala
Hello and welcome to the second lesson of the Impala—An Open Source SQL Engine for Hadoop course offered by Simplilearn. This lesson provides an overview of querying with Hive and Impala. Let us explore the objectives of this lesson in the next screen.
After completing this lesson you will be able to: • Discuss the SQL, DDL, and DML(pronounce D-D-L and D-M-L) statements of Impala • Explain how to query data using impala SQL • Recall how to use different SQL statements to perform CRUD operations in Impala • Explain how to load data into Impala tables • Differentiate between SQL statements in Hive and Impala Let us begin with the SQL language statements in the next screen.
2.3 SQL Language Statements
Impala uses Structure Query Language or SQL to perform a database query. It follows the SQL-92 syntax (pronounced as sequel ninety two sintax) version of the language. Also, Impala is compatible with the Hive Query Language or HiveQL (Pronounced as Hive Q-L) to query the data stored in Hadoop Distributed File System, HDFS, or HBase. To record information related to table structure and properties, both Impala and Hive use the same metadata store. Therefore, Impala can access tables either defined by using the native Impala command that is CREATE TABLE, or the tables that are created in Hive using data definition language or DDL. Similar to HiveQL, Impala supports most of the data manipulation or DML statements and clauses, such as JOIN, AGGREGATE, DISTINCT, UNION ALL, ORDER BY, LIMIT, and uncorrelated sub-query in the FROM clause. In addition, it also supports INSERT INTO and INSERT OVERWRITE. Let us discuss about DDL statements in the next screen.
2.4 DDL Statements
Data Definition Language or DDL is a standard for the statements and commands that help create, modify, and remove objects such as tables, indexes, and users from a database. Impala DDL statements are almost similar to the SQL statements used by people in the relational databases or data warehouse products. Typically, these statements start with the keywords such as CREATE, DROP, or ALTER. In Impala, statements such as ALTER Table, Alter View; Compute Stats; Create database; Create function; CREATE Role; CREATE Table; CREATE View; Drop Table; DROP function; DROP role; GRANT and REVOKE can be used. Let us learn about DML statements in the next screen.
2.5 DML Statements
Data Manipulation Language or DML is a subset of the SQL statements that helps to modify the data stored in tables. Impala supports only a small set of DML statements, such as the INSERT statement and the Load Data statement. The ‘INSERT statement’ in Impala is optimized to insert a large amount of data in a single statement, thereby making an effective use of the multi-megabyte HDFS blocks. To insert one or a few rows at a time INSERT into table VALUES statement be used. Whereas, the ‘LOAD DATA statement’ is used to move existing data files into an Impala table directory so that, they are immediately available for Impala queries. Let us understand the method of creating databases in the next screen.
2.6 CREATE DATABASE
In Impala, you can create two types of databases. They are the logical construct and the physical construct. A logical construct in Impala groups related tables, views, and functions under same names. Whereas, the physical construct is denoted by a tree directory in HDFS. Objects such as tables, partitions, and data files are all stored under this directory. After creating the database, you can use it with the help of the Database command. Note that, a database created in Impala can also be used in Hive. Let us understand the ‘CREATE Table’ command in the next screen.
2.7 CREATE TABLE
In Impala, a table can be created by using the ‘CREATE Table’ command. Let us understand the general syntax of creating a table in Impala with the help of the commands shown on the screen. The ‘PARTITIONED BY’ clause partitions data files based on one or more specified columns values. When the STORED AS clause is used the underlying data files formats are identified, by default. When no ‘STORED AS’ clause is used, the Impala table data files are saved as text files. Typically, for an external table, you add a ‘LOCATION’ clause to specify the path to the HDFS directory where Impala reads and writes files for the table. External table will be explained in detail later in the lesson. Let us look at some examples of the ‘CREATE Table’ command in the next screen.
2.8 CREATE TABLE - Examples
Some examples of ‘CREATE table’ command are as follows: The first command shown on the screen helps to create a managed table in the Impala. For the managed table, data has to be loaded explicitly. The second command helps to create a Partition managed table, by specifying the ‘’PARTITIONED BY’ clause in the create SQL statement. Third command on the screen helps to create an external table whose data is in the HDFS directory. The path of the directory is /user/hive/tpcds/customer_address (Pronunciation guide: slash user slash hive slash t-p-c-d-s slash customer underscore address). You need not load data explicitly for the external tables, data will be automatically loaded into this table based on the location clause in the CREATE Table command. Let us look at the two types of tables, that is, Internal and External in the next screen.
2.9 Internal and External Tables
Two types of tables can be created in Impala. They are—Managed or Internal table and External table. For Managed or Internal tables, Impala manages a directory in HDFS and stores the data files. Impala also manages the underlying data files, and deletes the related data files when a table is dropped. External Tables point at existing data files that potentially reside in HDFS out of Impala directories. You need to issue the ‘REFRESH’ command when adding or replacing data using HDFS Operation. In addition, in case of External table, the DROP TABLE statement does not physically remove the underlying data. Let us understand the method of Loading Data to an Impala Table in the next screen.
2.10 Loading Data into Impala Table
For the Managed or Internal table, data needs to be loaded by executing the ‘load data inpath’ command. The general syntax of loading data into managed impala table is given on the screen. In addition, the load command can be used to load the data into Partitioned table. An example of command used to load data into a partitioned table is given on the screen. Let us understand how to alter the properties of impala table using the ‘ALTER TABLE Statement’ in the next screen.
2.11 ALTER TABLE
An ALTER TABLE statement is used to change the structure or properties of an existing Impala table such as: rename table, add new column, drop existing column, change column name, replace column data type and add or drop partition. ALTER TABLE in Impala is a logical function that updates the table metadata in the common metastore shared with Hive. Let us discuss about the Drop Table statement in the next screen.
2.12 DROP TABLE
The Drop SQL table statement allows the deletion of an Impala table. On using this statement, the associated data files for the table and the HDFS directory is removed by default. However, if the table is created with the EXTERNAL clause, then Impala does not delete the files and directories. The general syntax of DROP TABLE command is given on the screen. Note that, before issuing a DROP DATABASE statement, use a DROP TABLE statement and remove all the tables and data files in that database. Let us learn about the ‘DROP DATABASE’ command in the next screen.
2.13 DROP DATABASE
Drop Database command is used to delete a database. However, before using the Drop Database command, use a combination of DROP TABLE; DROP VIEW; ALTER TABLE; and ALTER VIEW statements, to delete all the tables and views in the database or move them to other databases. The general syntax of DROP DATABASE command is given on the screen. Let us discuss the Describe Statement in the next screen.
2.14 DESCRIBE Statement
A table metadata, such as column names and data values can be viewed by executing the Describe statement. The ‘DESCRIBE FORMATTED’ variation shows further information in a format that the Apache Hive users are familiar with. This additional information includes minor details, such as: the table type is Internal or external, date of creation, file format, and data location in HDFS. It also shows if the object is a table or a view. If the object is a view, it shows query text from the view definition. The general syntax of Describe Statement is given on the screen. Let us understand the Explain Statement in the next screen.
2.15 EXPLAIN Statement
Explain Statement’ returns a plan to execute a statement. You can review the query plan generated by the Explain statement to check whether the query is performing efficiently or not. You can also alter the query and/or (pronounced as and or) the schema, if they are not performing as expected. Some performance enhancement steps that can be initiated are as follows: • Change the tests in the WHERE clause, • Add hints to make join operations more efficient, • Introduce sub queries, • Change the order of tables in a join operation, • Add or change partitioning for a table, and • Collect column statistics and/or (pronounced as and or) table statistics in Hive An example of EXPLAIN statement is given on the screen. The EXPLAIN Statement notifies if table or column statistics are missing from any table involved in the query. These statistics are important for optimizing queries involving large tables or multi-table joins. Let us learn about the ‘Show Table’ Statement in the next screen.
2.16 SHOW TABLE Statement
You can see all the tables and views present inside a database by executing the Show Table command. Now the question arises ‘what happens when the authorization in the database is enabled?’ When the authorization is enabled, the SHOW table command displays only those objects for which you have privilege. Let us learn about the Insert Statement in the next screen.
2.17 INSERT Statement
INSERT statement is used to insert data into tables and partitions. The two types of INSERT statements are as follows: The INSERT INTO syntax appends data to a table. In this syntax, the existing data files are left untouched whereas, the inserted data is put into one or more new data files. The other INSERT statement that is the INSERT OVERWRITE syntax, replaces the data in the table. The overwritten data files are deleted immediately thus, they do not go through the HDFS trash mechanism. The general syntax of using the INSERT Statement is given on the screen. Let us look at a few examples of the Insert Statement in the next screen.
2.18 INSERT Statement - Examples
Some examples of ‘INSERT statement’ are as follows: First statement is for inserting a row into table called sample. The second Insert statement is for inserting two rows in the table. The third Insert command is for inserting data in Partition. Let us learn about the Select Statement in the next screen.
2.19 SELECT Statement
The SELECT statement performs queries, retrieves data from tables and produces result sets consisting of rows and columns. In a Select statement, you can use following clauses: ● JOIN, ORDER BY, GROUP BY, HAVING, LIMIT, OFFSET, UNION, and WITH clause In this statement, DISTINCT Operator, subqueries, and hints like; SHUFFLE or NONSHUFFLE, can be included. Typically, the Insert Statement in Impala for copying data from one table to another ends with a SELECT statement. Let us discuss the Impala data type in the next screen.
2.20 Data Type
The correct data type for the columns must be chosen, while defining the table schema. The different data types supported in Impala are: BIGINT Data type: It is an 8-byte integer data type. BOOLEAN Data type: It is a data type that represents a single true or false choice. Note that, Impala generally does not convert a data type to BOOLEAN values automatically. To convert to BOOLEAN, you must use an explicit call to the CAST (pronounce as cast) function. Next data type is, CHAR Data type or CDH 5.2 or higher only: It is a fixed-length character type, which is followed by trailing spaces to achieve the specified length. If the values are longer than the specified length, then Impala truncates the trailing characters, if any. The DECIMAL Data type or CDH 5.1 or higher only is a numeric data type with fixed scale and precision. This data type is suitable for financial and other arithmetic calculations. The DOUBLE Data type is a floating-point data type with double precision and 8-byte memory. Let us continue discussing the remaining data types in the next screen.
2.21 Data Type (contd.)
The other data types chosen while defining the table schema are: FLOAT Data is a single precision floating-point data type that contains 4-byte on the disk. The other data type is, INT (pronounce as intt) Data Type: It is an integer data type that contains 4-byte on the disk. REAL Data Type: It is an alias for the DOUBLE data type. Next is, SMALLINT (pronounce as small intt) Data Type: It is a 2-byte integer data type. Last data type is the STRING Data Type: It is a data type represents varying character set. Let us learn about the SQL Operators in the next screen.
SQL operators are the comparison functions used within the WHERE clauses of SELECT statements. Following operators are supported in Impala. The ‘BETWEEN Operator’ is used in a ‘WHERE clause’ to compare an expression to a lower and upper bound. • Comparison Operators are used in WHERE clause to compare and check equality and sort the column data types sequence. The IN operator compares an argument value to a set of values. It returns TRUE, if the argument finds any matches in the set. • The IS NULL operator, and the IS NOT NULL operator in a WHERE clause, test whether a specified value is NULL or not. • The LIKE Operator is used for STRING data. It has the basic wildcard capability and uses ‘_’ (Pronounce as underscore) to find a single character and the % (pronounce as Percentage) to find multiple characters. • The Regular Expression Operator in the WHERE Clause tests if a value matches a regular expression. • The RLIKE (pronounce R like) Operator is a synonym for Regular Expression. Let us discuss the built-in Functions in the next screen.
Impala supports several built-in functions. They are: mathematical, type conversion, date and time, conditional, string, and aggregate functions. These functions let you perform mathematical calculations, string manipulation, date calculations, and any data transformation tasks directly in SELECT statements. These built-in functions let a SQL query return quick results with formatting, calculation, and type conversions applied. Let us look at the Query-specific SQL statements in the next screen.
2.24 CREATE VIEW in Impala
The CREATE VIEW statement is used to simplify the complicated queries. CREATE VIEW is just a logical construct that does not metalize on the disk. The syntax to create view in Impala is given on the screen. By using CREATE VIEW, you can query data from VIEW the same way you query data from an Impala table. These queries can be executed from applications, scripts, or interactive queries in the Impala-shell. CREATE VIEW in a SQL statement simplifies complicated queries that involve joins between multiple tables, complicated expressions, and SQL syntax that are difficult to understand and debug. The statement can also be used to hide the underlying table and column names. Note that, to modify the created view, ALTER VIEW statement is used. Let us discuss the differences between Hive and Impala Query Syntax in the next screen.
2.25 Hive and Impala Query Syntax
In some cases, Impala SQL and HiveQL use similar SQL statements and clause names. However, the semantics of Impala SQL statements are different from HiveQL. For query hints, Impala uses different syntax and names, such as, [SHUFFLE] and [NOSHUFFLE]. (Pronounced as Shuffle and no shuffle) On the other hand, HIVE does not use syntax. • Impala does not expose MapReduce (Pronounced as map reduce) specific features of SORT BY, DISTRIBUTE BY, or CLUSTER BY. Hive exposes MAPReduce specific features. • For Impala queries, FROM clause is not mandatory. Whereas, for Hive queries, FROM clause is mandatory. Impala supports only a limited set of implicit casts. This helps to avoid undesired results from unexpected casting behavior. • Impala does not implicit cast between string and numeric or Boolean values. It always uses CAST (pronounce cast) for these conversions. Whereas, hive implicit casts between string and numeric or Boolean values • Impala performs implicit casts among the numeric types, when converting a smaller or less precise type to a larger or more precise one. For example, Impala implicitly converts a SMALLINT (pronounce small intt) to a BIGINT (pronounce big intt)or FLOAT. However, to convert from DOUBLE to FLOAT or INT (pronounce intt)to TINYINT (pronounce tiny intt) Impala requires a call to CAST(pronounce cast) in the query. On the other hand, Hive does not performs implicit casts among the numeric types. • Impala also performs implicit casts from string to timestamp. Impala has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime (pronounce from underscore unix time) format string. Whereas, Hive does not perform implicit casts from string to timestamp. In the next screen, we will look at a demo on using Impala Shell for DDL and DDML SQL statements.
2.26 Demo - Using Impala Shell for DDL and DDML SQL Statements
In this demo, you will learn to use impala-shell to execute SQL statements.
2.27 Demo - Using Impala Shell for DDL and DDML SQL Statements (contd.)
Let’s start with DROP STATEMENT. Execute “DROP database simplilearn_test” to drop the ‘simplilearn_test’ (pronounce simplilearn underscore test) database. Verify deletion of the database by using SQL statement “Show databases”. After dropping existing database, create a new database ‘’simplilearn_test” (pronounce simplilearn underscore test) by executing SQL statement shown on the screen. To go to the created database, use SQL statement shown on the screen. Inside the new database, create a table named “logs” by executing SQL command shown on the screen. After creating the table, use “show tables” SQL to verify, whether the table got created or not. Now, create a PARTITION Table by executing SQL statement shown on the screen. Use “show tables” SQL to verify, whether the table got created or not. Now, let’s see how to create an external table. To create an external table use the SQL statement shown on the screen. In this statement, you have an EXTERNAL keyword before the table and you can optionally provide a location clause at the end of the SQL statement. Use “show tables” SQL to verify, whether the table got create or not. Execute “show tables” command to verify the number of tables in simplilearn_test database. Let’s see how to load data into an Internal Impala table. Execute “Load data inpath ‘/user/hive/serverLog’ (pronounce slash user slash hive slash server log) into table logs” Now, query data from internal Impala table by using SQL statement shown on the screen. To insert the data into impala table, use ‘INSERT’ SQL Statement shown on the screen. Query the data from Impala table to verify whether data got inserted or not, using the SQL statement shown on the screen. Then, execute the statement shown on the screen to show the query with ‘where clause’ Now, query data from this external table by executing SQL statement shown on the screen. Use the SQL statement shown on the screen to fetch some specific columns such as, ca_zip (pronounce c-a underscore zip), ca_country (pronounce c-a underscore country), ca_street_name (pronounce c-a underscore street underscore name) for city LONDON. In this case, since city name is stored as London hence this query will not retrieve any data. Let’s get the data for city “Bangalore”. You will see that this query has fetched one row from the database. Change the metadata of a table by using ‘ALTER TABLE’ SQL statement. You can rename an existing table by executing “ALTER TABLE logs rename to logs_new” SQL statement. Verify whether table got renamed or not by using “show tables” command. You can also add more columns to the existing table by using the SQL statement shown on the screen. Use “describe logs_new” (pronounce describe logs underscore new) statement, to add two new columns. We can also use ‘ALTER’ SQL statement shown on the screen, to change the data type. This will change the name of the field from field5 to logdata, and type as, String. Execute “describe logs_new” (pronounce describe logs underscore new) statement to verify the schema of a table. Execute the statement shown on the screen to show that, column name has changed after ALTER command. Let’s use EXPLAIN command to see the SQL execution plan generated by impala server. Execute the SQL statement shown on the screen, to show how query is going to be executed internally by Impala SQL Engine. The output of EXPLAIN statement which was executed in the previous step is highlighted on the screen. Here, you can see the memory requirement is 32MB, it is HDFS scan and it is reading single partition to retrieve the data. Use simplilearn_test (pronounce simplilearn underscore test) statement, to go to simplilearn test database. Use ‘show tables’ command to list down all the tables. Use SQL statement shown on the screen to create VIEW on Impala table, to simplify SQL and sometime for hiding columns to be exposed. Use “show tables” command to list down the Views as well. Select data from this created logveiw and type ‘”select * logview” (pronounce select asterisk logview) Use DESCRIBE statement to see the schema of VIEW. Execute “describe logveiw” statement. Use SQL statement “DROP VIEW logview” to drop this view. Execute “show tables” command to list down all the tables in the simplilearn test database. Go to Impala shell by executing the statement shown on the screen. Execute “show databases” command to list down all the available databases. Now, select “simplilearn_test” database by executing “use simplilearn_test”command. Execute “show tables” command to list down all the tables in the simplilearn test database. Use SQL statement “DROP table logs” to drop table logs in the “simplilearn_test” database. Use SQL statement “DROP table logPartitoned” to drop partition table from “simplilearn_test” database. This concludes demo on using Impala-shell to execute SQL statements. Let us now proceed to the quiz section.
A few questions will be presented in the following screens. Select the correct option and click submit to see the feedback.
Let us summarize the topics covered in this lesson: • Impala uses three types of language statements: SQL, DDL, and DML. • The two types of databases created in Impala are: Logical construct and Physical construct. • The CREATE Table and the ALTER Table commands are used to create a table and change the structure or properties of an existing table respectively. • The built-in functions helps in performing mathematical calculations, such as: string manipulation, and date calculations. • The Create View statements help to simplify complicated queries. • The semantics used in Impala SQL queries are different from HiveQL
This concludes the lesson ‘Querying with Hive and Impala’. The next lesson will focus on the ‘Data Storage and File Format’.
About the On-Demand Webinar
About the Webinar