Welcome to the third lesson of the Impala Training Course. This lesson provides an introduction to data storage and file format considerations in Impala. Let us discuss the objectives of this lesson.
After completing this lesson, you will be able to:
Describe partitioning of Impala tables
Explain the benefits of partitioning
Describe how file format can affect performance in Impala
List the various file formats that are supported in Impala
Let us begin by looking at how tables are partitioned.
By default, all the data files of an Impala table reside in a single HDFS directory. Partitioning physically divides the data into multiple HDFS sub-directories based on the values of one or more columns, so that a query can read only the sub-directories it needs. It is a key concept in Impala data storage.
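To make the sub-directory idea concrete, here is a minimal sketch of how partition key values can map to directory names. It assumes the `column=value` naming convention that Impala and Hive use for the partition directories they create. The table location and the `year` and `month` columns are hypothetical.

```python
from posixpath import join


def partition_dir(table_dir, partition_values):
    """Return the HDFS sub-directory for one partition.

    partition_values maps each partitioning column to its value, in the
    order the columns were declared, e.g. {"year": 2023, "month": 1}.
    """
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return join(table_dir, *parts)


# Hypothetical table location; real paths depend on the cluster setup.
TABLE_DIR = "/user/hive/warehouse/sales"

print(partition_dir(TABLE_DIR, {"year": 2023, "month": 1}))
print(partition_dir(TABLE_DIR, {"year": 2023, "month": 2}))
```

Each distinct combination of partition key values gets its own sub-directory, and the data files for those rows are stored inside it.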
Partitioning is appropriate for:
tables that contain so much data that reading the entire data set is time-consuming.
tables that are regularly queried with conditions on the partitioning columns.
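The benefit for queries that filter on the partitioning columns can be illustrated with a toy model. This is not Impala's query planner. It only adds up hypothetical partition sizes to show how much less data is read when a condition selects a single partition.

```python
# Hypothetical partition sizes in bytes, keyed by (year, month).
PARTITION_BYTES = {
    (2022, 11): 40_000_000,
    (2022, 12): 45_000_000,
    (2023, 1): 50_000_000,
    (2023, 2): 55_000_000,
}


def bytes_scanned(partitions, predicate=None):
    """Sum the sizes of the partitions whose key satisfies the predicate.

    With no predicate, every partition is read (a full table scan).
    """
    return sum(
        size
        for key, size in partitions.items()
        if predicate is None or predicate(*key)
    )


full_scan = bytes_scanned(PARTITION_BYTES)
pruned = bytes_scanned(
    PARTITION_BYTES, lambda year, month: year == 2023 and month == 1
)

print(f"full scan: {full_scan} bytes")
print(f"year=2023 AND month=1: {pruned} bytes")
```

In this model, a condition on `year` and `month` touches one partition out of four, while a condition on any non-partitioning column would still have to read them all.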
Let us next look at the SQL statements for partitioned tables.
In terms of Impala SQL syntax, partitioning affects three statements:
CREATE TABLE: You define the partitioning columns when you create the table, by including the PARTITIONED BY clause with the names and data types of those columns. The partitioning columns are not repeated in the main list of table columns. Choose columns with reasonable cardinality, so that the table ends up with neither too few nor too many partitions.
ALTER TABLE: With the ALTER TABLE statement, you can add or drop partitions to work with different parts of a huge data set. You can also designate the HDFS directory that holds the data files for a specific partition. When data is partitioned by date values, dropping old partitions lets you "age out" data that is outdated or no longer relevant.
INSERT: When inserting data into a partitioned table, you use the PARTITION clause of the INSERT statement to identify the partitioning columns and, optionally, constant values for them.
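The three statements above can be sketched as follows. The SQL is held in Python strings so that it can be sent through any DB-API style cursor. The `sales` and `staging_sales` tables, their columns, and the HDFS path are hypothetical, and `run_all` is an illustrative helper rather than part of any Impala client library.

```python
# Impala SQL for a hypothetical `sales` table partitioned by year and month.
STATEMENTS = [
    # CREATE TABLE: partitioning columns appear only in PARTITIONED BY.
    """
    CREATE TABLE sales (id BIGINT, amount DOUBLE)
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET
    """,
    # ALTER TABLE: add a partition and designate its HDFS directory.
    """
    ALTER TABLE sales ADD PARTITION (year=2023, month=1)
    LOCATION '/data/sales/2023/01'
    """,
    # INSERT with constant partition key values (static partitioning).
    """
    INSERT INTO sales PARTITION (year=2023, month=1)
    SELECT id, amount FROM staging_sales
    """,
    # INSERT where the partition key values come from the query itself.
    """
    INSERT INTO sales PARTITION (year, month)
    SELECT id, amount, year, month FROM staging_sales
    """,
    # ALTER TABLE: "age out" an old partition.
    "ALTER TABLE sales DROP IF EXISTS PARTITION (year=2019, month=1)",
]


def run_all(cursor, statements=STATEMENTS):
    """Execute each statement on any object that has an execute() method."""
    for sql in statements:
        cursor.execute(sql.strip())
```

Note that the `SELECT` list in the last `INSERT` ends with the partitioning columns, in the same order as the `PARTITION` clause, because their values are taken from the query results.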
Let us next discuss how file format affects performance in Impala.
Impala supports several file formats used in Apache Hadoop. The file format used in an Impala table has a significant impact on its performance.
For example:
Some file formats support compression, which affects the size of the data on disk and, in turn, the amount of I/O and CPU resources needed to read and deserialize it.
These I/O and CPU costs can limit query performance, since querying often involves moving and decompressing data.
To reduce the impact on query performance, data is often compressed. Fewer bytes are then transferred from disk to memory, which shortens the data transfer time at the cost of some CPU work for decompression. File formats in Impala can also be structured. In this case, they may include metadata and built-in compression.
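The trade-off between bytes transferred and CPU spent can be seen in a small experiment. This sketch uses Python's standard `gzip` module on made-up, repetitive CSV rows. It does not involve Impala, and the actual ratio and timing depend on the data and the machine.

```python
import gzip
import time

# Hypothetical, repetitive text rows standing in for a data file.
raw = b"".join(b"%d,store_%d,19.99\n" % (i, i % 10) for i in range(100_000))

# Fewer bytes on disk means fewer bytes to move from disk to memory.
compressed = gzip.compress(raw)

# Reading the data back costs CPU time for decompression.
start = time.perf_counter()
restored = gzip.decompress(compressed)
decompress_seconds = time.perf_counter() - start

print(f"raw bytes:          {len(raw)}")
print(f"compressed bytes:   {len(compressed)}")
print(f"decompression time: {decompress_seconds:.4f} s")
```

Whether compression pays off for a given workload depends on whether the queries are limited more by I/O or by CPU.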
Let us next discuss the various file formats that are supported in Impala.
Impala supports the Parquet, Text, Avro, RCFile, and SequenceFile file formats. A structured file format can include metadata and built-in compression. Impala also supports compression codecs such as Snappy, gzip, deflate, bzip2, and LZO. The following table summarizes these points:
File Type | Format | Compression Codecs |
Parquet | Structured | Snappy, gzip; currently Snappy by default |
Text | Unstructured | LZO, gzip, bzip2, Snappy |
Avro | Structured | Snappy, gzip, deflate, bzip2 |
RCFile | Structured | Snappy, gzip, deflate, bzip2 |
SequenceFile | Structured | Snappy, gzip, deflate, bzip2 |
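In Impala SQL, the file format of a table is chosen with the STORED AS clause of CREATE TABLE. The sketch below maps each file type in the table to its STORED AS keyword and builds the corresponding statement. The table and column names are hypothetical, and `create_table_sql` is an illustrative helper only. Being able to create a table in a given format does not by itself mean Impala can write to it, so check the documentation for your Impala version.

```python
# File type (as named in the table above) -> Impala STORED AS keyword.
STORED_AS = {
    "Parquet": "PARQUET",
    "Text": "TEXTFILE",
    "Avro": "AVRO",
    "RCFile": "RCFILE",
    "SequenceFile": "SEQUENCEFILE",
}


def create_table_sql(table, file_type):
    """Build a CREATE TABLE statement for a hypothetical two-column table."""
    keyword = STORED_AS[file_type]
    return f"CREATE TABLE {table} (id BIGINT, amount DOUBLE) STORED AS {keyword}"


for file_type in STORED_AS:
    print(create_table_sql(f"sales_{file_type.lower()}", file_type))
```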
Let us summarize the topics covered in this lesson:
Partitioning is a technique to physically divide the data in an HDFS directory into multiple HDFS sub-directories.
In terms of Impala SQL syntax, partitioning affects three statements: CREATE TABLE, ALTER TABLE, and INSERT.
The file format used in an Impala table has a significant impact on its performance.
File types and formats supported in Impala include Parquet, Text, Avro, RCFile, and SequenceFile.
This concludes the lesson on Data Storage and File Format. The next lesson will focus on working with Impala.