Before 2006, programs for Hadoop were written only on MapReduce, using the Java programming language.
Developers had to keep the map, sort, shuffle, and reduce fundamentals in mind while writing a program, even for common operations such as joining and filtering. The challenges kept building up while maintaining, optimizing, and extending the code, and production time increased as a consequence. Data flow in MapReduce was also quite rigid: the output of one task could only be used as the input of another. To overcome these issues, Pig was developed in late 2006 by Yahoo researchers; it later became an Apache open-source project. Pig is another language, besides Java, in which MapReduce programs can be written.
In this Pig tutorial, you will learn:

- What Pig is and its major components
- The stages of Pig operations
- Salient features and the data model of Pig
- Execution and interactive modes of Pig
- How Pig differs from SQL
- How to load and store data in the Pig engine
- Frequently used Pig commands
Pig is a scripting platform that runs on Hadoop clusters and is designed to process and analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.
Programmers can use Pig to write data transformations without knowing Java. Pig uses both structured and unstructured data as input to perform analytics and uses HDFS to store the results.
Yahoo scientists use grid tools to scan through petabytes of data. Many of them write scripts to test a theory or gain deeper insights; however, in the data factory, data may not be in a standardized state. This makes Pig a good option, as it supports data with partial or unknown schemas and semi-structured or unstructured data.
There are two major components of Pig:
Pig Latin script language
The Pig Latin script is a procedural data flow language. It contains syntax and commands that can be applied to implement business logic. Examples of Pig Latin commands are LOAD and STORE.
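For instance, a minimal sketch of a Pig Latin script (the file paths and schema here are assumptions) loads a dataset and writes it back out:

```
-- read a dataset into a relation (hypothetical path and schema)
data = LOAD '/data/input.txt' AS (name, age);
-- write the relation back to the file system
STORE data INTO '/data/output';
```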
A runtime engine
The runtime engine is a compiler that produces sequences of MapReduce programs. It uses HDFS to store and retrieve data. It is also used to interact with the Hadoop system (HDFS and MapReduce).
The runtime engine parses, validates, and compiles the script operations into a sequence of MapReduce jobs.
Pig operations can be explained in the following three stages:
In the first stage, data is loaded and the Pig script is written.
```
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
```
In the second stage, the Pig execution engine parses and checks the script. If it passes, the script is optimized, and logical and physical plans are generated for execution.
The job is submitted to Hadoop as a MapReduce task. Pig monitors the status of the job using the Hadoop API and reports to the client.
In the final stage, results are displayed on the screen or stored in HDFS, depending on the user command.
Let us now understand a few salient features of Pig.
Developers and analysts like to use Pig as it offers many features. Some of the features are as follows:
As part of its data model, Pig supports four basic types: atom, tuple, bag, and map.
In a map, the key must be a chararray, but the value can be of any type. By default, Pig treats undeclared fields as bytearrays, which are collections of uninterpreted bytes. Pig can infer a field's type based on the use of operators that expect a certain type of field. It can also use User Defined Functions (UDFs) with a known or explicitly set return type. Furthermore, it can infer the field type based on schema information provided by a LOAD function or explicitly declared using an AS clause. Please note that type conversion is lazy, which means the data type is enforced only at the point of execution.
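As a brief sketch of how this plays out (the file and field names here are illustrative), types can be declared with an AS clause or left for Pig to infer:

```
-- types declared explicitly with an AS clause
users = LOAD 'users' AS (name:chararray, age:int);

-- with no schema, fields default to bytearray; the comparison below
-- lets Pig infer that the second field ($1) should be treated as a number
raw = LOAD 'logs';
adults = FILTER raw BY $1 > 18;
```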
Pig Latin has a fully nestable data model with atomic values, tuples, bags (or lists), and maps. This implies that one data type can be nested within another, as shown in the following diagram of the Pig Latin nested data model.
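For illustration, a schema along these lines (the relation and field names are assumptions) nests a bag of tuples and a map inside a single tuple:

```
-- each record holds an atom, a bag of (course, score) tuples, and a map
students = LOAD 'students' AS (name:chararray,
                               grades:bag{t:tuple(course:chararray, score:int)},
                               info:map[]);
```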
The advantage is that this is more natural to programmers than flat tuples, and it also avoids expensive joins. Now we will look into the different execution modes Pig works in.
Pig works in two execution modes: Local and MapReduce.
In local mode, the Pig engine takes input from the Linux file system, and the output is stored in the same file system, as explained below.
In MapReduce mode, the Pig engine interacts directly with HDFS and executes on MapReduce, as shown in the diagram given below.
Let us now look into the modes in which Pig programs can be written.
The two modes in which a Pig Latin program can be written are Interactive and Batch.
Interactive mode means coding and executing the script line by line, as shown in the image given below.
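For example, interactive mode uses the Grunt shell, where statements are entered one at a time; a session might look like the following sketch (the file name is an assumption):

```
$ pig -x local
grunt> A = LOAD 'myfile' AS (x, y, z);
grunt> B = FILTER A BY x > 0;
grunt> DUMP B;
```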
In batch mode, all scripts are coded in a file with the extension .pig, and the file is executed directly, as shown in the diagram given below.
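In batch mode, the same statements would live in a .pig file and be submitted in one go, roughly as follows (the script name is an assumption):

```
$ cat myscript.pig
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
STORE B INTO 'output';
$ pig -x mapreduce myscript.pig
```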
Since we have already learned about Hive and Impala, which work on SQL, let us now see how Pig is different from SQL. Given below are some differences between Pig and SQL.
| Difference | Pig | SQL |
|---|---|---|
| Definition | Pig is a scripting language used to interact with HDFS. | SQL is a query language used to interact with databases residing in the database engine. |
| Query Style | Pig offers a step-by-step execution style. | SQL offers a single-block execution style. |
| Evaluation | Pig does lazy evaluation: data is processed only when the STORE or DUMP command is encountered. | SQL offers immediate evaluation of a query. |
| Pipeline Splits | Pipeline splits are supported in Pig. | In SQL, you need to run the join twice for the result to be materialized as an intermediate result. |
Now that we have gone through the differences between Pig and SQL, let us understand them further with an example.
The illustration given below is an example to help you understand the SQL command and its Pig equivalent command script.
Track customers in Texas who spend more than $2,000.
```
SELECT c_id, SUM(amount) AS CTotal
FROM customers c
JOIN sales s ON c.c_id = s.c_id
WHERE c.city = 'Texas'
GROUP BY c_id
HAVING SUM(amount) > 2000
ORDER BY CTotal DESC;
```
The SQL query selects c_id and CTotal, the sum of the amounts, from the customers table joined with the sales table on c_id, where the customer's city is Texas. The rows are grouped by c_id, only groups whose amounts sum to more than 2,000 are kept, and the result is ordered by CTotal in descending order.
Now, examine the same query using Pig.

```
customer = LOAD '/data/customer.dat' AS (c_id, name, city);
sales = LOAD '/data/sales.dat' AS (s_id, c_id, date, amount);
customerTX = FILTER customer BY city == 'Texas';
joined = JOIN customerTX BY c_id, sales BY c_id;
grouped = GROUP joined BY customerTX::c_id;
summed = FOREACH grouped GENERATE group, SUM(joined.sales::amount);
spenders = FILTER summed BY $1 > 2000;
sorted = ORDER spenders BY $1 DESC;
DUMP sorted;
```
In Pig, you create two relations, customer and sales, and load the corresponding data with its schema. You filter the customers by city, in this case Texas. The two relations are then joined on the c_id key, and the sum of the amounts is calculated for each c_id. Next, you isolate the customers who spend more than $2,000 and sort them in descending order.
In the next section of this Apache Pig tutorial, you will learn how to load and store data in the Pig engine using the command console.
In order to load and store data in the Pig engine, we use the loading and storing operations explained below.
Loading refers to reading data from files in the file system into Pig relations. This is done using the keyword LOAD, followed by the name of the variable into which the data is to be loaded, as shown below.
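A minimal sketch (the path, delimiter, and schema are assumptions); PigStorage is Pig's default load/store function and takes the field delimiter as an argument:

```
-- load comma-delimited records into the relation A
A = LOAD '/data/customers.csv' USING PigStorage(',')
    AS (c_id:int, name:chararray, city:chararray);
```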
A series of transformation statements processes the data.
Storing refers to writing output to the file system. This is done using the keyword STORE followed by the name of the variable whose data is to be stored along with the location of storage.
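Continuing the sketch from above (file names and fields assumed), a transformed relation can be persisted like this:

```
-- load, filter to one city, then persist the result to the file system
A = LOAD '/data/customers.csv' USING PigStorage(',')
    AS (c_id:int, name:chararray, city:chararray);
B = FILTER A BY city == 'Texas';
STORE B INTO '/data/output' USING PigStorage(',');
```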
You can use the keyword DUMP to display the output on the screen.
Pig processes Pig Latin statements in the following manner: first, it validates the syntax and semantics of all statements; next, it builds a logical plan for every relation; finally, when it encounters a DUMP or STORE statement, it compiles and executes the plan.
A Pig Latin script execution plan consists of logical, optimized logical, physical, and MapReduce plans, as shown in the diagram below.
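You can inspect these plans yourself with the EXPLAIN operator; here is a small sketch (relation and file names assumed):

```
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
-- prints the logical, physical, and MapReduce execution plans for B
EXPLAIN B;
```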
In the next section of this Pig tutorial, we will look at some Pig commands frequently used by analysts and by Big Data and Hadoop developers. The table below lists these commands and their functions.
| Command | Function |
|---|---|
| LOAD | Reads data from the file system |
| STORE | Writes data to the file system |
| FOREACH | Applies expressions to each record and outputs one or more records |
| FILTER | Applies a predicate and removes records that do not return true |
| GROUP/COGROUP | Collects records with the same key from one or more inputs |
| JOIN | Joins two or more inputs based on a key |
| ORDER | Sorts records based on a key |
| DISTINCT | Removes duplicate records |
| UNION | Merges data sets |
| SPLIT | Splits data into two or more sets based on filter conditions |
| STREAM | Sends all records through a user-provided binary |
| DUMP | Writes output to stdout |
| LIMIT | Limits the number of records |
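As a brief sketch of one of the less common operators above (the file and field names are assumptions), SPLIT partitions a relation by filter conditions:

```
-- route records into two relations based on the amount field
txns = LOAD '/data/transactions.dat' AS (id:int, amount:double);
SPLIT txns INTO small IF amount < 100.0, large IF amount >= 100.0;
DUMP large;
```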
Some popular sources of datasets for Pig development include:

- Books (war_and_peace.text)
- Wikipedia Database
- Open database from Amazon S3 data
- Open database from national climate data
To summarize the tutorial:

- Pig is a scripting platform that runs on Hadoop clusters and is designed to process and analyze large datasets.
- Pig has two major components: the Pig Latin script language and a runtime engine.
- Pig works in two execution modes, local and MapReduce, and Pig Latin programs can be written in interactive or batch mode.
- Unlike SQL, Pig follows a step-by-step execution style and evaluates lazily, processing data only when STORE or DUMP is encountered.
- Data is read with LOAD, written with STORE, and displayed on the screen with DUMP.
To learn more and get an in-depth understanding of Hadoop, you can enroll in the Big Data Engineer Master's Program. This program, in collaboration with IBM, provides online training on the popular skills required for a successful career in data engineering. Master the Hadoop Big Data framework, leverage the functionality of AWS services, and use the database management tool MongoDB to store data.