Apache Pig Tutorial

Welcome to the ninth lesson ‘Apache Pig’ of Big Data Hadoop tutorial which is a part of ‘Big Data Hadoop and Spark Developer Certification course’ offered by Simplilearn. This lesson will focus on Pig, which is the analytics component in the Hadoop ecosystem.

In the next section, we will discuss the objectives of this Pig tutorial.


After completing this lesson, you will be able to:

  • Explain the concept of Pig

  • Understand the types of Data Models supported by Pig

  • Differentiate Pig and SQL

  • Understand the functionalities to perform Pig script operations

  • Understand Various Pig Commands

Introduction to Pig Hadoop

Before 2006, programs were written only on MapReduce using the Java programming language.

Developers had to keep the map, sort, shuffle, and reduce fundamentals in mind while creating a program, even for common operations such as joining, filtering, and so on.

The challenges kept building up while maintaining, optimizing, and extending the code. Consequently, the production time increased. Also, data flow in MapReduce was quite rigid, where the output of one task could serve only as the input of the next.

To overcome these issues, Pig was developed in late 2006 by Yahoo researchers. It later became an Apache open-source project.

Pig is another language, besides Java, in which MapReduce programs can be written.

Let us now understand what Pig is.

What is Pig in Hadoop?

Pig is a scripting platform that runs on Hadoop clusters designed to process and analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.

Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured and unstructured data as input to perform analytics and uses HDFS to store the results.

Pig - Example

Yahoo scientists use grid tools to scan through petabytes of data. Many of them write scripts to test a theory or gain deeper insights; however, in the data factory, data may not be in a standardized state.

This makes Pig a good option as it supports data with partial or unknown schemas and semi or unstructured data.
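As a sketch of this flexibility, consider loading a file whose schema is unknown (the file name here is only illustrative); fields are then referenced positionally and treated as uninterpreted bytes:

```pig
-- No schema declared: fields are accessed by position ($0, $1, ...)
A = LOAD 'data.txt';

-- Keep only the first and third field of each record
B = FOREACH A GENERATE $0, $2;

DUMP B;
```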

Let us now discuss the components of Pig.

Components of Pig

There are two major components of Pig:

  • Pig Latin script language

  • A runtime engine

Pig Latin script language

The Pig Latin script is a procedural data flow language. It contains syntax and commands that can be applied to implement business logic.

Examples of Pig Latin commands are LOAD and STORE.

A runtime engine

The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS to store and retrieve data. It is also used to interact with the Hadoop system (HDFS and MapReduce).

The runtime engine parses, validates, and compiles the script operations into a sequence of MapReduce jobs.

How Pig Works: Stages of Pig Operations

Pig operations can be explained in the following three stages:

Stage 1: Load data and write Pig script

In this stage, data is loaded and the Pig script is written. For example:

A = LOAD 'myfile' AS (x, y, z);

B = FILTER A BY x > 0;

C = GROUP B BY x;

D = FOREACH C GENERATE x, COUNT(B);

STORE D INTO 'output';

Stage 2: Pig Operations

In the second stage, the Pig execution engine parses and checks the script. If the script passes these checks, it is optimized, and a logical and physical plan is generated for execution.

The job is then submitted to Hadoop as a sequence of MapReduce tasks. Pig monitors the status of the job using the Hadoop API and reports to the client.

Stage 3: Execution of the plan

In the final stage, results are dumped to the screen or stored in HDFS, depending on the user command.

Let us now understand a few salient features of Pig.

Salient Features of Pig

Developers and analysts like to use Pig because it offers many features. Some of them are as follows:

  • Provision for step-by-step procedural control and the ability to operate directly over files

  • Schemas that, though optional, can be assigned dynamically

  • Support for User Defined Functions, or UDFs, and for various data types

Let’s now understand the data model in Pig.

Data Model in Pig

As part of its data model, Pig supports four basic types.

  1. Atom: It is a simple atomic value like int, long, double, or string.

  2. Tuple: It is a sequence of fields that can be of any data type.

  3. Bag: It is a collection of tuples of potentially varying structures and can contain duplicates.

  4. Map: It is an associative array.

The key must be a char array, but the value can be of any type. By default, Pig treats undeclared fields as byte arrays, which are collections of uninterpreted bytes.

Pig can infer a field’s type based on the use of operators that expect a certain type of field. It can also use User Defined Functions, or UDFs, with a known or explicitly set return type.

Furthermore, it can infer the field type based on schema information provided by a LOAD function or explicitly declared using an AS clause.

Please note that type conversion is lazy, which means the data type is enforced at the point of execution only.
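As an illustration (the file name and field names are assumptions, not from a real dataset), a schema with explicit types can be declared in the AS clause, while undeclared fields fall back to bytearray until an operator forces a conversion:

```pig
-- Explicit types declared at load time
emp = LOAD 'employees.txt' AS (name:chararray, age:int, salary:double);

-- No types declared: each field defaults to bytearray
raw = LOAD 'employees.txt' AS (name, age, salary);

-- Comparing age with an int makes Pig treat it as numeric,
-- but the conversion is enforced only at execution time (lazy conversion)
older = FILTER raw BY age > 30;
```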

Nested Data Model

Pig Latin has a fully-nestable data model with Atomic values, Tuples, Bags or lists, and Maps. This implies one data type can be nested within another, as shown in the image. Pig Latin Nested Data Model is shown in the following diagram.


The advantage is that this is more natural to programmers than flat Tuples. Also, it avoids expensive joins.
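For illustration, here is a sketch of a nested schema (the file and field names are hypothetical) that combines an atomic value, a tuple, a bag, and a map in a single relation:

```pig
-- One record holds an atom, a nested tuple, a bag of tuples, and a map
students = LOAD 'students.txt' AS (
    name:chararray,
    address:(street:chararray, city:chararray),
    grades:{t:(course:chararray, score:int)},
    details:map[]
);
```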

Now we will look into different execution modes pig works in.

Pig Execution Modes

Pig works in two execution modes: Local and MapReduce.

Local mode

In local mode, the Pig engine takes input from the local (Linux) file system, and the output is stored in the same file system. Pig execution in local mode is explained below.


MapReduce mode

In MapReduce mode, the Pig engine directly interacts and executes in HDFS and MapReduce as shown in the diagram given below.


Let us now look into interactive modes of Pig.

Pig Interactive Modes

The two modes in which a Pig Latin program can be written are Interactive and Batch.

Interactive mode

Interactive mode means coding and executing the script, line by line, as shown in the image given below.


Batch mode

In Batch mode, all scripts are coded in a file with the extension .pig and the file is directly executed as shown in the diagram given below.
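As a minimal sketch of batch mode (the script name, file names, and contents are illustrative), a word-count script saved as wordcount.pig might look like this:

```pig
-- wordcount.pig: count occurrences of each word in an input file
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'wordcount_output';
```

The file can then be executed in one shot with `pig wordcount.pig` (or `pig -x local wordcount.pig` to run it against the local file system), whereas in interactive mode the same statements would be typed line by line at the Grunt shell.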


Since we have already learned about Hive and Impala, which work on SQL, let’s now see how Pig differs from SQL.

Pig vs. SQL

Given below are some differences between Pig and SQL.





Definition

Pig is a scripting language used to interact with HDFS.

SQL is a query language used to interact with databases residing in the database engine.

Query Style

Pig offers a step-by-step execution style.

SQL offers the single block execution style.


Evaluation

Pig does a lazy evaluation, which means that data is processed only when the STORE or DUMP command is encountered.

SQL offers immediate evaluation of a query.

Pipeline Splits

Pipeline splits are supported in Pig: the output of one operation can feed multiple downstream operations.

In SQL, you need to run the join twice, or materialize it as an intermediate result, to achieve the same effect.

Now that we have gone through the differences between Pig and SQL, let us now understand further with an example.

Pig vs. SQL - Example

The illustration given below is an example to help you understand the SQL command and its Pig equivalent command script.

Track customers in Texas who spend more than $2,000.


SELECT c_id,

SUM(amount) AS CTotal

FROM customers c

JOIN sales s ON c.c_id = s.c_id

WHERE c.city = 'Texas'

GROUP BY c_id

HAVING SUM(amount) > 2000

ORDER BY CTotal DESC;

The SQL query selects c_id and CTotal, which is the sum of the amounts, from the customers table. It joins the sales table on c_id, keeping rows where c.city is Texas.

The rows are then grouped by c_id, only groups whose sum of amounts is greater than 2000 are kept, and the result is ordered in descending order.

customer = LOAD '/data/customer.dat' AS (c_id, name, city);

sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);

salesTX = FILTER customer BY city == 'Texas';

joined = JOIN salesTX BY c_id, sales BY c_id;

grouped = GROUP joined BY salesTX::c_id;

summed = FOREACH grouped GENERATE group, SUM(joined.sales::amount);

spenders = FILTER summed BY $1 > 2000;

sorted = ORDER spenders BY $1 DESC;

DUMP sorted;

Now, examine the same function using Pig.

In Pig, you create two relations, customer and sales, into which you load the equivalent data with its schema. You then filter the customers based on location, for example, Texas.

Both relations are joined on the c_id field. The sum of the amounts for each individual c_id is calculated.

Now, isolate those customers who spend more than $2,000. Later, sort the customers in descending order.

In the next section of this Apache Pig tutorial, let’s look at how to load and store data in the Pig engine using the command console.

Loading and Storing Methods in Pig

To load and store data in the Pig engine, we use the loading and storing methods explained below.


Loading refers to loading relations from files into the Pig buffer. This is done using the keyword LOAD, followed by the name of the file from which the data is to be read, assigned to a variable.


A series of transformation statements processes the data.


Storing refers to writing output to the file system. This is done using the keyword STORE followed by the name of the variable whose data is to be stored along with the location of storage.


You can use the keyword DUMP to display the output on the screen.
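Putting these together, a minimal load-transform-store pipeline (the paths and schema are assumptions for illustration) looks like this:

```pig
-- Load a relation from a file, declaring a schema
sales = LOAD '/data/sales.dat' AS (s_id:int, c_id:int, amount:double);

-- A transformation statement processes the data
big_sales = FILTER sales BY amount > 100.0;

-- Write the result back to the file system ...
STORE big_sales INTO '/data/big_sales';

-- ... or display it on the screen instead
DUMP big_sales;
```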

Pig Script Interpretation

Pig processes Pig Latin statements in the following manner:

  • Pig validates the syntax and semantics of all statements.

  • It type checks with the schema.

  • It verifies references. Pig performs limited optimization before execution.

  • If Pig encounters a DUMP or STORE, it will execute the statements.

A Pig Latin script execution plan consists of logical, optimized logical, physical, and MapReduce plans as shown in the below diagram.


In the next section of this Pig Tutorial, we will learn some of the relations that Big Data and Hadoop Developers execute.

Various Operations Performed by Developers

Some of the operations performed on relations by Big Data and Hadoop developers are:

  • Filtering: Filtering refers to filtering of data based on a conditional clause, such as grade and pay.

  • Transforming: Transforming refers to making data presentable to extract logical data.

  • Grouping: Grouping refers to generating a group of meaningful data.

  • Sorting: Sorting refers to arranging the data in ascending or descending order.

  • Combining: Combining refers to performing a union operation of data stored in the variable.

  • Splitting: Splitting refers to separating the data with a logical meaning.
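The operations above can be sketched in a single hypothetical script (the file name, schema, and thresholds are illustrative):

```pig
emp = LOAD 'emp.dat' AS (name:chararray, dept:chararray, pay:double);

-- Filtering: keep records that satisfy a condition
well_paid = FILTER emp BY pay > 50000.0;

-- Grouping: collect records that share a key
by_dept = GROUP emp BY dept;

-- Transforming: derive new, presentable fields from each group
avg_pay = FOREACH by_dept GENERATE group AS dept, AVG(emp.pay) AS avg_sal;

-- Sorting: arrange in descending order
ranked = ORDER avg_pay BY avg_sal DESC;

-- Splitting: separate the data into two sets by condition
SPLIT emp INTO high IF pay > 50000.0, low IF pay <= 50000.0;

-- Combining: union of the two sets
all_emp = UNION high, low;
```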

In the next section of this Pig tutorial, we will see some Pig commands that are frequently used by analysts.

Pig Commands

Given below are some frequently used Pig commands and their functions:

  • LOAD: Reads data from the file system

  • STORE: Writes data to the file system

  • FOREACH: Applies expressions to each record and outputs one or more records

  • FILTER: Applies a predicate and removes records that do not return true

  • GROUP/COGROUP: Collects records with the same key from one or more inputs

  • JOIN: Joins two or more inputs based on a key

  • ORDER: Sorts records based on a key

  • DISTINCT: Removes duplicate records

  • UNION: Merges data sets

  • SPLIT: Splits data into two or more sets based on filter conditions

  • STREAM: Sends all records through a user-provided binary

  • DUMP: Writes output to stdout

  • LIMIT: Limits the number of records

Getting Datasets for Pig Development

Some of the popular sources from which you can download different datasets for Pig development are:

  • Wikipedia database dumps

  • Open databases from Amazon S3

  • Open databases of national climate data

Let us summarize the topics covered in this lesson:

  • Pig in Hadoop is a high-level data flow scripting language and has two major components: Runtime engine and Pig Latin language.

  • Pig runs in two execution modes: Local and MapReduce.

  • The Pig engine can be installed by downloading it from a mirror linked on the website pig.apache.org.

  • Three prerequisites need to be met before setting up the environment for Pig Latin: ensure that all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded to HDFS.

How about investing your time in Big Data Hadoop and Spark Developer Certification course? Check out our Course Preview now!


This concludes the lesson on Pig. In the next lesson of this tutorial we will focus on the Basics of Apache Spark.
