Working with Impala Tutorial

4.1 Working with Impala

Hello and welcome to the fourth lesson of the Impala, an Open Source SQL (Pronounce as Sequel) Engine for Hadoop course offered by Simplilearn. This lesson provides an introduction to working with Impala. Let us discuss the objectives of this lesson in the next screen.

4.2 Objectives

After completing this lesson, you will be able to describe the Impala architecture and explain the functions of its three main components. You will also be able to describe the complete flow of a SQL (Pronounce as Sequel) query execution in Impala. You will then be able to provide an overview of using user-defined functions in Impala. Finally, you will be able to list the factors that improve Impala performance. Let us begin with understanding the Impala architecture in the next screen.

4.3 Impala Architecture

Impala server is a SQL query execution engine for Hadoop. Impala is designed as a massively parallel processing or MPP (M-P-P) engine for a distributed cluster environment. It consists of various daemon (pronounce as demon) processes that run on specific hosts within your Hadoop cluster. The three main components of Impala are the Impala daemon, the Impala statestore (state-store), and the Impala catalog service, represented by the daemons impalad (impala-dee), statestored (statestore-dee), and catalogd (catalog-dee) respectively. The diagram on screen depicts the internal architecture of Impala. In the next screen, let us understand the impalad (impala-dee) daemon process.
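To see these components on a running cluster, you can check for the three daemon processes and their diagnostic web interfaces. The following is a minimal sketch, assuming shell access to a cluster node and the default web UI ports, which a deployment may override:

    # Check that the three Impala daemon processes are running on this node
    ps -ef | grep -E 'impalad|statestored|catalogd' | grep -v grep

    # Each daemon exposes a diagnostic web UI (assumed default ports; these
    # are configurable): impalad on 25000, statestored on 25010, catalogd on 25020
    curl http://localhost:25000/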

4.4 Impala Daemon

The core component of Impala is the daemon process that runs on each node of the Impala cluster as the physical impalad process. The impalad process reads and writes data files and handles queries sent from clients such as the impala-shell command, Hue (Hadoop User Experience), JDBC (J-D-B-C), or ODBC (O-D-B-C). It logically divides a query into smaller parallel query fragments and distributes them to different nodes in the Impala cluster. The node whose Impala daemon receives the query serves as the coordinator node for that query. The other Impala daemons transmit their intermediate query results back to the coordinator node, which constructs the final query output. When you run experiments using the impala-shell command, you may connect to the same Impala daemon process every time for convenience. In the next screen, let us understand the statestored (statestore-dee) daemon process.
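As an illustration of the coordinator role, whichever Impala daemon you connect to becomes the coordinator for the queries submitted in that session. A minimal sketch, assuming the default impala-shell port 21000 and hypothetical host names:

    # Connect to a specific Impala daemon; that node coordinates your queries
    impala-shell -i node1.example.com:21000

    # Connecting to a different daemon makes that node the coordinator instead
    impala-shell -i node2.example.com:21000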

4.5 Impala Statestore

The Impala daemons and the statestore are in continuous communication, which allows the statestore to identify the nodes that are healthy and capable of accepting new work. The statestore component then relays this information to the daemons. The name of the Impala statestore daemon process is statestored. In the next screen, we will learn about the catalogd (catalog-dee) daemon process.

4.6 Impala Catalog Service

The catalog service component of Impala broadcasts metadata changes made by Impala SQL statements to all the nodes in a cluster. This service is physically represented by the catalogd daemon process, and only one node in the cluster needs to run it. Because catalog requests pass through the statestored daemon, the statestored and catalogd services run on the same node. Catalogd sends messages to statestored when an Impala node in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement is processed through Impala. The catalog service is a relatively new component that was introduced in Impala version 1.2. In earlier versions, if you issued the CREATE DATABASE, DROP DATABASE, CREATE TABLE, ALTER TABLE, or DROP TABLE commands on one node, you needed to issue the INVALIDATE METADATA command on the other nodes before making a query request there; otherwise, the changes to schema objects would not have been picked up. The catalog service component eliminates the need for the REFRESH and INVALIDATE METADATA statements when the metadata changes are made by Impala statements. However, when you perform tasks such as creating a table or loading data through Hive, you still need to issue these statements before executing a query. Let us take a look at the query execution flow in Impala in the next screen.
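For example, when tables are created or data is loaded through Hive rather than through Impala, you would still run statements like the following in the Impala shell; the table names here are purely illustrative:

    -- Make a table created through Hive visible to Impala
    invalidate metadata new_hive_table;

    -- Pick up data files newly loaded into an existing table through Hive
    refresh existing_table;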

4.7 Query Execution Flow in Impala

The query execution process works in the following manner. The client sends a query to impalad (impala-dee) using HiveQL (Hive-Q-L) via the Thrift Application Program Interface or API (A-P-I). The frontend planner generates a query plan tree using the metadata information. The backend coordinator then sends execution requests to all the query execution engines, and each backend query execution engine executes its query fragment; HDFS scan, aggregation, and merge are examples of such fragments. Note that the statestore notifies impalad when the cluster state changes. Let us next discuss user-defined functions in Impala.
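If you want to inspect the query plan tree that the frontend planner generates, the EXPLAIN statement prints it without executing the query. A minimal sketch with an illustrative table name:

    -- Show the plan fragments (for example, HDFS scan, aggregation, merge)
    -- that Impala would distribute across the cluster
    explain select year, count(*) from sales group by year;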

4.8 User-Defined Functions

In an Impala query, a user-defined function or UDF (U-D-F) lets you code your own application logic for processing column values. Different types of UDFs accept different numbers of input values and produce different numbers of output values. The most general kind, typically referred to by the abbreviation UDF, accepts a single input value and produces a single output value for each row. A user-defined aggregate function or UDAF takes a group of values and produces a single output value; it is generally used for summarizing the values in a group of rows. Currently, Impala supports user-defined functions and user-defined aggregate functions, but it does not support other categories of UDFs such as user-defined table functions. Let us next understand how to run UDFs written for Hive in Impala.
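The distinction is easiest to see with built-in functions that behave the same way. In the sketch below, length acts like a scalar UDF, returning one value per input row, while count acts like a UDAF, returning one value per group of rows; the table and column names are illustrative:

    -- Scalar behavior: one output value for each input row
    select name, length(name) from customers;

    -- Aggregate behavior: one output value for each group of rows
    select city, count(*) from customers group by city;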

4.9 Hive UDFs with Impala

In Impala 1.1, the Hive shell was required to use UDFs. From Impala 1.2 onward, Impala can run both UDFs written in C++ (C Plus Plus) and Hive UDFs written in Java. For enhanced performance, Impala supports the writing of UDFs in the native C++ language rather than in Java. Impala can run Java-based UDFs written for Hive without any changes if they fulfill two conditions. First, the parameters and return value must use Impala-supported data types; currently, Impala does not support Hive UDFs that accept or return the TIMESTAMP data type. Second, the return type must be a "writable" type such as Text or IntWritable and not a Java primitive type such as String or int; otherwise, the UDF will return NULL. Impala does not support Hive UDAFs and UDTFs. Typically, a Java UDF executes more slowly in Impala than the equivalent native UDF written in C++. In the next screen, we will look at a demo on UDFs.
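Assuming a Hive UDF that meets both conditions and has been packaged into a jar stored in HDFS, it can be registered in Impala with a CREATE FUNCTION statement whose SYMBOL clause names the Java class. The jar path and class name below are placeholders, not values from the lesson:

    create function hive_upper(string) returns string
      location '/user/hive/udfs/hive-udfs.jar'
      symbol='com.example.hive.UpperUdf';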

4.10 Demo - UDF in Impala

In this demo, you will learn to use a user-defined function written for Impala.

4.11 Demo - UDF in Impala (contd.)

Copy the jar file containing your UDF class into the HDFS directory /user/hive/ (slash user slash hive slash) by executing the command shown on screen. Once the jar is copied into HDFS, open the Impala shell by typing the impala-shell command. In the Impala shell, switch to the Simplilearn test database by typing the command use simplilearn_test (use simplilearn underscore test). In this database, create a function named encrypt by executing the statement create function encrypt(string) (create function encrypt open bracket string close bracket), which returns a string and points to the jar location in HDFS. Execute the show functions command to confirm that the encrypt function is listed. Use the encrypt function by executing the SQL statement select encrypt("SIMPLILEARN"); (select encrypt simplilearn). You see that SIMPLILEARN has been encrypted by this function. This concludes the demo on using user-defined functions written for Impala. Let us explore how to improve Impala performance in the next screen.
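The following sketch consolidates the steps of this demo into one sequence. The jar file name, the class name in the SYMBOL clause, and the exact function signature are assumptions, since the demo only shows them on screen:

    # Copy the jar containing the UDF class into HDFS (jar name assumed)
    hdfs dfs -put encrypt-udf.jar /user/hive/

    # Open the Impala shell
    impala-shell

    use simplilearn_test;
    create function encrypt(string) returns string
      location '/user/hive/encrypt-udf.jar'
      symbol='com.simplilearn.EncryptUdf';
    show functions;
    select encrypt("SIMPLILEARN");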

4.12 Improving Impala Performance

The four factors that help improve Impala performance are partitioning Impala tables, performance considerations for join queries, collecting table and column statistics, and controlling Impala resource usage. Let us discuss each of these. Partitioning is a technique that physically divides the data based on the different values in frequently queried columns. This allows queries to skip reading a large percentage of the data in a table. Joins are the main class of queries that you can improve at the SQL level; you do not have to change physical factors such as the file format or the hardware configuration to improve their performance. Let us move to the next factor, collecting table statistics and column statistics. You can gather table and column statistics using the COMPUTE STATS (compute stats) statement. This helps Impala automatically optimize the performance of join queries without requiring any changes to the SQL query statements. Finally, let us discuss controlling Impala resource usage. The more memory Impala can utilize, the better its query performance. In a cluster running other kinds of workloads, make adjustments to ensure that all Hadoop components have enough memory to perform well, and then let Impala utilize as much of the remaining memory as possible. Let us now proceed to the quiz section.
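As a brief illustration of the partitioning and statistics factors, the sketch below creates a partitioned table and gathers statistics on it; the table and column names are illustrative:

    -- Partition a frequently queried table so queries can skip irrelevant data
    create table sales (id bigint, amount double)
      partitioned by (year int, month int)
      stored as parquet;

    -- Gather table and column statistics so Impala can optimize join queries
    compute stats sales;
    show table stats sales;
    show column stats sales;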

4.13 Quiz

A few questions will be presented in the following screens. Select the correct option and click submit to see the feedback.

4.14 Summary

Let us summarize what we learned in this lesson. The three important components of the Impala architecture are the Impala daemon, the Impala statestore, and the Impala catalog service. User-defined functions and user-defined aggregate functions in Java and native C++ can extend the functionality of Impala SQL statements. The recommended performance optimization techniques for Impala are partitioning Impala tables, performance considerations for joins, collecting table and column statistics, and controlling Impala resource usage.

4.15 Conclusion

This concludes the lesson Working with Impala. Thank you and happy learning!
