Apache Pig Tutorial

Welcome to the ninth lesson, ‘Apache Pig’, of the Big Data Hadoop tutorial, which is a part of the ‘Big Data Hadoop and Spark Developer Certification course’ offered by Simplilearn. This lesson will focus on Pig, the analytics component of the Hadoop ecosystem.

In the next section, we will discuss the objectives of this Pig tutorial.

Objectives

After completing this lesson, you will be able to:

  • Explain the concept of Pig

  • Describe the types of data models supported by Pig

  • Differentiate between Pig and SQL

  • Explain how Pig script operations are performed

  • Use various Pig commands

Introduction to Pig Hadoop

Before 2006, programs were written only on MapReduce using the Java programming language.

Developers had to keep the map, sort-and-shuffle, and reduce phases in mind while creating a program, even for common operations such as joining and filtering.

The challenges kept building up while maintaining, optimizing, and extending the code, and production time increased as a result. Also, data flow in MapReduce was quite rigid: the output of one task could only be consumed as the input of the next.

To overcome these issues, Pig was developed in late 2006 by Yahoo researchers. It later became an Apache open-source project.

Pig is another language, besides Java, in which MapReduce programs can be written.

Let us now understand what Pig is.

What is Pig in Hadoop?

Pig is a scripting platform that runs on Hadoop clusters and is designed to process and analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.

Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured and unstructured data as input to perform analytics and uses HDFS to store the results.
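As a minimal sketch of such a transformation (the file names and paths here are hypothetical), a classic word count can be written in a few lines of Pig Latin:

```pig
-- Word count sketch; 'input.txt' and 'wordcount_out' are hypothetical paths
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```

The same logic written directly in Java MapReduce would typically take dozens of lines spread across a mapper, a reducer, and a driver class.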

Pig - Example

Yahoo scientists use grid tools to scan through petabytes of data. Many of them write scripts to test a theory or gain deeper insights; however, in the data factory, data may not be in a standardized state.

This makes Pig a good option as it supports data with partial or unknown schemas and semi or unstructured data.

Let us now discuss the components of Pig.

Components of Pig

There are two major components of Pig:

  • Pig Latin script language

  • A runtime engine

Pig Latin script language

The Pig Latin script is a procedural data flow language. It contains syntax and commands that can be applied to implement business logic.

Examples of Pig Latin commands are LOAD and STORE.

A runtime engine

The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS to store and retrieve data, and it is the component through which Pig interacts with the Hadoop system (HDFS and MapReduce).

The runtime engine parses, validates, and compiles the script operations into a sequence of MapReduce jobs.

How Pig Works: Stages of Pig Operations

Pig operations can be explained in the following three stages:

Stage 1: Load data and write Pig script

In this stage, data is loaded and the Pig script is written.

A = LOAD 'myfile' AS (x, y, z);

B = FILTER A BY x > 0;

C = GROUP B BY x;

D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO 'output';

Stage 2: Pig Operations

In the second stage, the Pig execution engine parses and checks the script. If it passes, the script is optimized, and a logical and physical plan is generated for execution.

The job is submitted to Hadoop as a series of MapReduce tasks. Pig monitors the status of the job using the Hadoop API and reports to the client.

Stage 3: Execution of the plan

In the final stage, results are either displayed on the screen (DUMP) or stored in HDFS (STORE), depending on the user command.

Let us now look at a few salient features of Pig.

Salient Features of Pig

Developers and analysts like to use Pig because it offers many features. Some of the features are as follows:

  • Provision for step-by-step procedural control and the ability to operate directly over files

  • Schemas that, though optional, can be assigned dynamically

  • Support for User Defined Functions (UDFs) and various data types

Let us now understand the data model in Pig.

Data Model in Pig

As part of its data model, Pig supports four basic types.

  1. Atom: It is a simple atomic value like int, long, double, or string.

  2. Tuple: It is a sequence of fields that can be of any data type.

  3. Bag: It is a collection of tuples of potentially varying structures and can contain duplicates.

  4. Map: It is an associative array. The key must be a chararray, but the value can be of any type.

By default, Pig treats undeclared fields as byte arrays, which are collections of uninterpreted bytes.

Pig can infer a field’s type based on the use of operators that expect a certain type of field. It can also use User Defined Functions, or UDFs, with a known or explicitly set return type.

Furthermore, it can infer the field type based on schema information provided by a LOAD function or explicitly declared using an AS clause.

Please note that type conversion is lazy, which means the data type is enforced at the point of execution only.
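As an illustrative sketch (the file path and field names are hypothetical), all four types can appear together in a schema declared with an AS clause:

```pig
-- 'students.dat' is a hypothetical input file
students = LOAD '/data/students.dat' AS (
    name:chararray,                            -- atom
    address:(street:chararray, zip:int),       -- tuple
    scores:bag{t:(subject:chararray, mark:int)},  -- bag of tuples
    props:map[]                                -- map: chararray keys, any values
);
```

Fields loaded without such a declaration default to byte arrays, and conversions happen lazily when an operator demands a specific type.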

Nested Data Model

Pig Latin has a fully nestable data model with atomic values, tuples, bags (or lists), and maps. This implies that one data type can be nested within another, as shown in the following diagram.

[Diagram: Pig Latin nested data model]

The advantage is that this is more natural to programmers than flat Tuples. Also, it avoids expensive joins.
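A simple way to see the nested model in action: GROUP does not flatten its result into rows the way a SQL GROUP BY does; it produces one tuple per key containing a bag of the original tuples. (The paths and field names below are hypothetical.)

```pig
sales  = LOAD '/data/sales.dat' AS (s_id:int, c_id:int, amount:double);
byCust = GROUP sales BY c_id;
-- byCust schema: (group:int, sales:{(s_id:int, c_id:int, amount:double)})
totals = FOREACH byCust GENERATE group AS c_id, SUM(sales.amount) AS total;
```

Because the grouped tuples travel together in a bag, an aggregate such as SUM can be computed without a separate join back to the original data.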

Now we will look into the different execution modes Pig works in.

Pig Execution Modes

Pig works in two execution modes: Local and MapReduce.

Local mode

In local mode, the Pig engine takes input from the local (Linux) file system, and the output is stored in the same file system. Pig execution in local mode is shown below.

[Diagram: Pig execution in local mode]

MapReduce mode

In MapReduce mode, the Pig engine directly interacts with and executes on HDFS and MapReduce, as shown in the diagram given below.

[Diagram: Pig execution in MapReduce mode]

Let us now look into interactive modes of Pig.

Pig Interactive Modes

The two modes in which a Pig Latin program can be written are Interactive and Batch.

Interactive mode

Interactive mode means coding and executing the script, line by line, as shown in the image given below.

[Diagram: coding and executing a script line by line in Pig interactive mode]

Batch mode

In Batch mode, all scripts are coded in a file with the extension .pig and the file is directly executed as shown in the diagram given below.
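For example, a batch-mode script might be saved to a file such as adults.pig (the file name and paths are hypothetical) and executed with `pig adults.pig`, or `pig -x local adults.pig` for local mode:

```pig
-- adults.pig: filter a hypothetical user file and store the result
users  = LOAD '/data/users.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
adults = FILTER users BY age >= 18;
STORE adults INTO '/data/adults';
```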

[Diagram: coding all scripts in a .pig file for batch mode]

Since we have already learned about Hive and Impala, which work on SQL, let us now see how Pig differs from SQL.

Pig vs. SQL

Given below are some differences between Pig and SQL.

  • Definition: Pig is a scripting language used to interact with HDFS; SQL is a query language used to interact with databases residing in the database engine.

  • Query style: Pig offers a step-by-step execution style; SQL offers a single-block execution style.

  • Evaluation: Pig performs lazy evaluation, which means data is processed only when a STORE or DUMP command is encountered; SQL evaluates a query immediately.

  • Pipeline splits: Pipeline splits are supported in Pig; in SQL, you must run the query again, or materialize an intermediate result, to feed a second pipeline.

Now that we have gone through the differences between Pig and SQL, let us understand them further with an example.

Pig vs. SQL - Example

The illustration given below is an example to help you understand the SQL command and its Pig equivalent command script.

Track customers in Texas who spend more than $2,000.

SQL

SELECT c.c_id,
SUM(s.amount) AS CTotal
FROM customers c
JOIN sales s ON c.c_id = s.c_id
WHERE c.city = 'Texas'
GROUP BY c.c_id
HAVING SUM(s.amount) > 2000
ORDER BY CTotal DESC;

The SQL query joins the customers and sales tables on c_id and keeps only the customers whose city is Texas. It groups the rows by c_id and computes CTotal, the sum of the amounts for each customer. The HAVING clause retains only the groups whose total exceeds 2,000, and the result is ordered by CTotal in descending order.

Pig

customer = LOAD '/data/customer.dat' AS (c_id, name, city);

sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);

customerTX = FILTER customer BY city == 'Texas';

joined = JOIN customerTX BY c_id, sales BY c_id;

grouped = GROUP joined BY customerTX::c_id;

summed = FOREACH grouped GENERATE group, SUM(joined.sales::amount);

spenders = FILTER summed BY $1 > 2000;

sorted = ORDER spenders BY $1 DESC;

DUMP sorted;

In Pig, you create two relations, customer and sales, by loading the equivalent data along with its schema. You filter the customers by location, here Texas. The two relations are then joined on the c_id field, and the sum of the amounts for each c_id is calculated. A further FILTER isolates the customers who spend more than $2,000, and ORDER sorts them by total spend in descending order.

In the next section of this Apache Pig tutorial, let us look at how to load and store data in the Pig engine using the command console.

Loading and Storing Methods in Pig

To load and store data in the Pig engine, we use the loading and storing methods explained below.

Loading

Loading refers to reading data from the file system into a Pig relation. This is done using the keyword LOAD followed by the name of the relation into which the data is to be loaded, as shown below.

[Diagram: loading method in Pig for loading relations]

A series of transformation statements processes the data.
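As a sketch with hypothetical paths and fields, a LOAD statement can also name a loader function such as PigStorage with a field delimiter, followed by a chain of transformation statements:

```pig
logs = LOAD '/data/access.log' USING PigStorage('\t')
       AS (ip:chararray, ts:chararray, bytes:long);
big  = FILTER logs BY bytes > 1048576;     -- keep responses over 1 MB
byIp = GROUP big BY ip;
hits = FOREACH byIp GENERATE group AS ip, COUNT(big) AS n;
```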

Storing

Storing refers to writing output to the file system. This is done using the keyword STORE followed by the name of the relation whose data is to be stored, along with the storage location.

[Diagram: storing method in Pig for writing output]

You can use the keyword DUMP to display the output on the screen.
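A minimal sketch (the paths are hypothetical) contrasting the two: STORE persists a relation to the file system, while DUMP triggers execution and prints the relation to the screen.

```pig
result = LOAD '/data/result.dat' AS (id:int, total:double);
STORE result INTO '/data/result_out' USING PigStorage(',');
DUMP result;   -- displays the relation instead of writing it to a file
```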

Pig Script Interpretation

Pig processes Pig Latin statements in the following manner:

  • Pig validates the syntax and semantics of all statements.

  • It type-checks the statements against the schema.

  • It verifies references and performs limited optimization before execution.

  • When Pig encounters a DUMP or STORE, it executes the statements.

A Pig Latin script execution plan consists of logical, optimized logical, physical, and MapReduce plans as shown in the below diagram.

[Diagram: a Pig Latin script execution plan]

In the next section of this Pig tutorial, we will learn some of the operations that Big Data and Hadoop developers perform to build relations.

Various Operations Performed by Developers

Some of the operations performed by Big Data and Hadoop developers are:

  • Filtering: Filtering refers to filtering of data based on a conditional clause, such as grade and pay.

  • Transforming: Transforming refers to making data presentable to extract logical data.

  • Grouping: Grouping refers to generating a group of meaningful data.

  • Sorting: Sorting refers to arranging the data in ascending or descending order.

  • Combining: Combining refers to performing a union operation of data stored in the variable.

  • Splitting: Splitting refers to separating the data with a logical meaning.
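The operations above can be sketched in Pig Latin as follows (the employee file, fields, and grade codes are hypothetical):

```pig
emp = LOAD '/data/emp.dat' AS (name:chararray, grade:chararray, pay:double);
highPay  = FILTER emp BY pay > 50000.0;                         -- filtering
names    = FOREACH emp GENERATE UPPER(name) AS name;            -- transforming
byGrade  = GROUP emp BY grade;                                  -- grouping
sorted   = ORDER emp BY pay DESC;                               -- sorting
SPLIT emp INTO junior IF grade == 'J', senior IF grade == 'S';  -- splitting
everyone = UNION junior, senior;                                -- combining
```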

In the next section of this Pig tutorial, we will see some Pig commands that are frequently used by analysts.

Pig Commands

Given below are some frequently used Pig commands and their functions.

  • LOAD: Reads data from the file system

  • STORE: Writes data to the file system

  • FOREACH: Applies expressions to each record and outputs one or more records

  • FILTER: Applies a predicate and removes records that do not return true

  • GROUP/COGROUP: Collects records with the same key from one or more inputs

  • JOIN: Joins two or more inputs based on a key

  • ORDER: Sorts records based on a key

  • DISTINCT: Removes duplicate records

  • UNION: Merges data sets

  • SPLIT: Splits data into two or more sets based on filter conditions

  • STREAM: Sends all records through a user-provided binary

  • DUMP: Writes output to stdout

  • LIMIT: Limits the number of records

Getting Datasets for Pig Development

Given below are some popular URLs from which you can download datasets for Pig development.

  • Books: http://www.gutenberg.org/ (e.g., war_and_peace.text)

  • Wikipedia database: https://dumps.wikimedia.org/enwiki/

  • Open datasets on Amazon S3: https://aws.amazon.com/datasets/

  • Open national climate data: http://cdo.ncdc.noaa.gov/qclcd_ascii

Summary

Let us summarize the topics covered in this lesson:

  • Pig in Hadoop is a high-level data flow scripting language with two major components: a runtime engine and the Pig Latin language.

  • Pig runs in two execution modes: local and MapReduce.

  • The Pig engine can be installed by downloading it from a mirror linked on the website pig.apache.org.

  • Three prerequisites must be met before setting up the environment for Pig Latin: all Hadoop services are running properly, Pig is completely installed and configured, and all required datasets are uploaded to HDFS.


Conclusion

This concludes the lesson on Pig. In the next lesson of this tutorial we will focus on the Basics of Apache Spark.
