Apache Pig Tutorial

Welcome to the ninth lesson ‘Apache Pig’ of Big Data Hadoop tutorial which is a part of ‘Big Data Hadoop and Spark Developer Certification course’ offered by Simplilearn. This lesson will focus on Pig, which is the analytics component in the Hadoop ecosystem.

In the next section, we will discuss the objectives of this Pig tutorial.


After completing this lesson, you will be able to:

  • Explain the concept of Pig

  • Understand the types of Data Models supported by Pig

  • Differentiate Pig and SQL

  • Understand the functionalities to perform Pig script operations

  • Understand Various Pig Commands

Introduction to Pig Hadoop

Before 2006, programs were written only on MapReduce using the Java programming language.

Developers had to keep the map, sort, shuffle, and reduce fundamentals in mind while creating a program, even for common operations such as joining, filtering, and so on.

The challenges kept building up while maintaining, optimizing, and extending the code. Consequently, the production time increased. Also, data flow in MapReduce was quite rigid, where the output of one task could serve only as the input of the next.

To overcome these issues, Pig was developed in late 2006 by Yahoo researchers. It later became an Apache open-source project.

Pig is another language, besides Java, in which MapReduce programs can be written.

Let us now understand what Pig is.

What is Pig in Hadoop?

Pig is a scripting platform that runs on Hadoop clusters designed to process and analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.

Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured and unstructured data as input to perform analytics and uses HDFS to store the results.

Pig - Example

Yahoo scientists use grid tools to scan through petabytes of data. Many of them write scripts to test a theory or gain deeper insights; however, in the data factory, data may not be in a standardized state.

This makes Pig a good option as it supports data with partial or unknown schemas and semi or unstructured data.
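As a sketch of this flexibility, consider loading a file whose schema is unknown (the file name here is only illustrative); fields are then referenced positionally and treated as uninterpreted bytes:

```pig
-- No schema declared: fields are accessed by position ($0, $1, ...)
A = LOAD 'data.txt';

-- Keep only the first and third field of each record
B = FOREACH A GENERATE $0, $2;

DUMP B;
```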

Let us now discuss the components of Pig.

Components of Pig

There are two major components of Pig:

  • Pig Latin script language

  • A runtime engine

Pig Latin script language

The Pig Latin script is a procedural data flow language. It contains syntax and commands that can be applied to implement business logic.

Examples of Pig Latin commands are LOAD and STORE.

A runtime engine

The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS to store and retrieve data. It is also used to interact with the Hadoop system (HDFS and MapReduce).

The runtime engine parses, validates, and compiles the script operations into a sequence of MapReduce jobs.

How Pig Works: Stages of Pig Operations

Pig operations can be explained in the following three stages:

Stage 1: Load data and write Pig script

In this stage, data is loaded and the Pig script is written. For example:

A = LOAD 'myfile' AS (x, y, z);

B = FILTER A BY x > 0;

C = GROUP B BY x;

D = FOREACH C GENERATE x, COUNT(B);

STORE D INTO 'output';

Stage 2: Pig Operations

In the second stage, the Pig execution engine parses and checks the script. If the script passes these checks, it is optimized, and a logical and physical plan is generated for execution.

The job is then submitted to Hadoop as a sequence of MapReduce tasks. Pig monitors the status of the job using the Hadoop API and reports to the client.

Stage 3: Execution of the plan

In the final stage, results are dumped to the screen or stored in HDFS, depending on the user command.

Let us now understand a few salient features of Pig.

Salient Features of Pig

Developers and analysts like to use Pig because it offers many features. Some of them are as follows:

  • Provision for step-by-step procedural control and the ability to operate directly over files

  • Schemas that, though optional, can be assigned dynamically

  • Support for User Defined Functions, or UDFs, and for various data types

Let’s now understand the data model in Pig.

Data Model in Pig

As part of its data model, Pig supports four basic types.

  1. Atom: It is a simple atomic value like int, long, double, or string.

  2. Tuple: It is a sequence of fields that can be of any data type.

  3. Bag: It is a collection of tuples of potentially varying structures and can contain duplicates.

  4. Map: It is an associative array.

The key must be a char array, but the value can be of any type. By default, Pig treats undeclared fields as byte arrays, which are collections of uninterpreted bytes.

Pig can infer a field’s type based on the use of operators that expect a certain type of field. It can also use User Defined Functions, or UDFs, with a known or explicitly set return type.

Furthermore, it can infer the field type based on schema information provided by a LOAD function or explicitly declared using an AS clause.

Please note that type conversion is lazy, which means the data type is enforced at the point of execution only.
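As an illustration (the file name and field names are assumptions, not from a real dataset), a schema with explicit types can be declared in the AS clause, while undeclared fields fall back to bytearray until an operator forces a conversion:

```pig
-- Explicit types declared at load time
emp = LOAD 'employees.txt' AS (name:chararray, age:int, salary:double);

-- No types declared: each field defaults to bytearray
raw = LOAD 'employees.txt' AS (name, age, salary);

-- Comparing age with an int makes Pig treat it as numeric,
-- but the conversion is enforced only at execution time (lazy conversion)
older = FILTER raw BY age > 30;
```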

Nested Data Model

Pig Latin has a fully-nestable data model with Atomic values, Tuples, Bags or lists, and Maps. This implies one data type can be nested within another, as shown in the image. Pig Latin Nested Data Model is shown in the following diagram.


The advantage is that this is more natural to programmers than flat Tuples. Also, it avoids expensive joins.
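For illustration, here is a sketch of a nested schema (the file and field names are hypothetical) that combines an atomic value, a tuple, a bag, and a map in a single relation:

```pig
-- One record holds an atom, a nested tuple, a bag of tuples, and a map
students = LOAD 'students.txt' AS (
    name:chararray,
    address:(street:chararray, city:chararray),
    grades:{t:(course:chararray, score:int)},
    details:map[]
);
```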

Now we will look into different execution modes pig works in.

Pig Execution Modes

Pig works in two execution modes: Local and MapReduce.

Local mode

In local mode, the Pig engine takes input from the local (Linux) file system, and the output is stored in the same file system. Pig execution in local mode is explained below.


MapReduce mode

In MapReduce mode, the Pig engine directly interacts and executes in HDFS and MapReduce as shown in the diagram given below.


Let us now look into interactive modes of Pig.

Pig Interactive Modes

The two modes in which a Pig Latin program can be written are Interactive and Batch.

Interactive mode

Interactive mode means coding and executing the script, line by line, as shown in the image given below.


Batch mode

In Batch mode, all scripts are coded in a file with the extension .pig and the file is directly executed as shown in the diagram given below.
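As a minimal sketch of batch mode (the script name, file names, and contents are illustrative), a word-count script saved as wordcount.pig might look like this:

```pig
-- wordcount.pig: count occurrences of each word in an input file
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_output';
```

The file can then be executed in one shot with `pig wordcount.pig` (or `pig -x local wordcount.pig` to run it against the local file system), whereas in interactive mode the same statements would be typed line by line at the Grunt shell.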


Since we have already learned about Hive and Impala, which work on SQL, let’s now see how Pig differs from SQL.

Pig vs. SQL

Given below are some differences between Pig and SQL.





Definition

Pig is a scripting language used to interact with HDFS.

SQL is a query language used to interact with databases residing in the database engine.

Query Style

Pig offers a step-by-step execution style.

SQL offers the single block execution style.


Evaluation

Pig does a lazy evaluation, which means that data is processed only when the STORE or DUMP command is encountered.

SQL offers immediate evaluation of a query.

Pipeline Splits

Pipeline splits are supported in Pig: the output of one operation can feed multiple downstream operations.

In SQL, you need to run the join twice, or materialize it as an intermediate result, to achieve the same effect.

Now that we have gone through the differences between Pig and SQL, let us now understand further with an example.

Pig vs. SQL - Example

The illustration given below is an example to help you understand the SQL command and its Pig equivalent command script.

Track customers in Texas who spend more than $2,000.


SELECT c_id,

SUM(amount) AS CTotal

FROM customers c

JOIN sales s ON c.c_id = s.c_id

WHERE c.city = 'Texas'

GROUP BY c_id

HAVING SUM(amount) > 2000

ORDER BY CTotal DESC;

The SQL query selects c_id and CTotal, which is the sum of the amounts, from the customers table. It joins the sales table on c_id, keeping rows where c.city is Texas.

The rows are then grouped by c_id, only groups whose sum of amounts is greater than 2000 are kept, and the result is ordered in descending order.

customer = LOAD '/data/customer.dat' AS (c_id, name, city);

sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);

salesTX = FILTER customer BY city == 'Texas';

joined = JOIN salesTX BY c_id, sales BY c_id;

grouped = GROUP joined BY salesTX::c_id;

summed = FOREACH grouped GENERATE group, SUM(joined.sales::amount);

spenders = FILTER summed BY $1 > 2000;

sorted = ORDER spenders BY $1 DESC;

DUMP sorted;

Now, examine the same function using Pig.

In Pig, you create two relations, customer and sales, into which you load the equivalent data with its schema. You then filter the customers based on location, for example, Texas.

Both relations are joined on the c_id field. The sum of the amounts for each individual c_id is calculated.

Now, isolate those customers who spend more than $2,000. Later, sort the customers in descending order.

In the next section of this Apache Pig tutorial, let’s look at how to load and store data in the Pig engine using the command console.

Loading and Storing Methods in Pig

To load and store data in the Pig engine, we use the loading and storing methods explained below.


Loading refers to loading relations from files into the Pig buffer. This is done using the keyword LOAD, followed by the name of the file from which the data is to be read, assigned to a variable.


A series of transformation statements processes the data.


Storing refers to writing output to the file system. This is done using the keyword STORE followed by the name of the variable whose data is to be stored along with the location of storage.


You can use the keyword DUMP to display the output on the screen.
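Putting these together, a minimal load-transform-store pipeline (the paths and schema are assumptions for illustration) looks like this:

```pig
-- Load a relation from a file, declaring a schema
sales = LOAD '/data/sales.dat' AS (s_id:int, c_id:int, amount:double);

-- A transformation statement processes the data
big_sales = FILTER sales BY amount > 100.0;

-- Write the result back to the file system ...
STORE big_sales INTO '/data/big_sales';

-- ... or display it on the screen instead
DUMP big_sales;
```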

Pig Script Interpretation

Pig processes Pig Latin statements in the following manner:

  • Pig validates the syntax and semantics of all statements.

  • It type checks with the schema.

  • It verifies references. Pig performs limited optimization before execution.

  • If Pig encounters a DUMP or STORE, it will execute the statements.

A Pig Latin script execution plan consists of logical, optimized logical, physical, and MapReduce plans as shown in the below diagram.


In the next section of this Pig Tutorial, we will learn some of the relations that Big Data and Hadoop Developers execute.

Various Operations Performed by Developers

Some of the operations performed on relations by Big Data and Hadoop developers are:

  • Filtering: Filtering refers to filtering of data based on a conditional clause, such as grade and pay.

  • Transforming: Transforming refers to making data presentable to extract logical data.

  • Grouping: Grouping refers to generating a group of meaningful data.

  • Sorting: Sorting refers to arranging the data in ascending or descending order.

  • Combining: Combining refers to performing a union operation of data stored in the variable.

  • Splitting: Splitting refers to separating the data with a logical meaning.
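The operations above can be sketched in a single hypothetical script (the file name, schema, and thresholds are illustrative):

```pig
emp = LOAD 'emp.dat' AS (name:chararray, dept:chararray, pay:double);

-- Filtering: keep records that satisfy a condition
well_paid = FILTER emp BY pay > 50000.0;

-- Grouping: collect records that share a key
by_dept = GROUP emp BY dept;

-- Transforming: derive new, presentable fields from each group
avg_pay = FOREACH by_dept GENERATE group AS dept, AVG(emp.pay) AS avg_sal;

-- Sorting: arrange in descending order
ranked = ORDER avg_pay BY avg_sal DESC;

-- Splitting: separate the data into two sets by condition
SPLIT emp INTO high IF pay > 50000.0, low IF pay <= 50000.0;

-- Combining: union of the two sets
all_emp = UNION high, low;
```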

In the next section of this Pig tutorial, we will see some Pig commands that are frequently used by analysts.

Pig Commands

Given below are some frequently used Pig commands and their functions:

  • LOAD: Reads data from the file system

  • STORE: Writes data to the file system

  • FOREACH: Applies expressions to each record and outputs one or more records

  • FILTER: Applies a predicate and removes records that do not return true

  • GROUP/COGROUP: Collects records with the same key from one or more inputs

  • JOIN: Joins two or more inputs based on a key

  • ORDER: Sorts records based on a key

  • DISTINCT: Removes duplicate records

  • UNION: Merges data sets

  • SPLIT: Splits data into two or more sets based on filter conditions

  • STREAM: Sends all records through a user-provided binary

  • DUMP: Writes output to stdout

  • LIMIT: Limits the number of records

Getting Datasets for Pig Development

Some of the popular sources from which you can download different datasets for Pig development are:

  • Wikipedia database dumps

  • Open databases from Amazon S3

  • Open databases of national climate data

Let us summarize the topics covered in this lesson:

  • Pig in Hadoop is a high-level data flow scripting language and has two major components: Runtime engine and Pig Latin language.

  • Pig runs in two execution modes: Local and MapReduce.

  • The Pig engine can be installed by downloading it from a mirror linked on the website pig.apache.org.

  • Three prerequisites need to be met before setting up the environment for Pig Latin: ensure that all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded to HDFS.

How about investing your time in Big Data Hadoop and Spark Developer Certification course? Check out our Course Preview now!


This concludes the lesson on Pig. In the next lesson of this tutorial we will focus on the Basics of Apache Spark.
