HDFS Federation	HDFS High Availability
There is no limit on the number of NameNodes, and the NameNodes are independent of each other	There are two related NameNodes: one active and one standby
Each NameNode manages its own namespace and has a dedicated block pool; the DataNodes are shared as common storage	Only the active NameNode serves client requests at any given time; the standby stays idle but keeps its metadata in sync so that it can take over
Provides fault isolation: if one NameNode goes down, the data managed by the other NameNodes is not affected	Requires two separate machines: the active NameNode is configured on one, and the standby NameNode on the other

Identity Mapper	Chain Mapper
The default mapper, used when no mapper is specified in the MapReduce driver class	Runs multiple mappers within a single map task
Implements the identity function, writing every input key-value pair directly to the output	The output of the first mapper becomes the input of the second, the second feeds the third, and so on
Defined in the old MapReduce API (MR1) as org.apache.hadoop.mapred.lib.IdentityMapper	Defined as org.apache.hadoop.mapreduce.lib.chain.ChainMapper
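The difference between the two mappers can be illustrated with a small plain-Python sketch. This is not Hadoop code: `identity_mapper`, `chain_mappers`, `lowercase_mapper`, and `tokenize_mapper` are hypothetical names chosen for illustration, and the sketch only assumes that a mapper is a function taking one key-value pair and emitting zero or more pairs.

```python
from typing import Callable, Iterable, List, Tuple

KV = Tuple[str, str]
Mapper = Callable[[str, str], Iterable[KV]]


def identity_mapper(key: str, value: str) -> Iterable[KV]:
    """Emit the input pair unchanged (the identity function)."""
    yield key, value


def chain_mappers(*mappers: Mapper) -> Mapper:
    """Compose mappers so that each one consumes the previous one's output."""
    def chained(key: str, value: str) -> Iterable[KV]:
        pairs: List[KV] = [(key, value)]
        for mapper in mappers:
            # Feed every pair produced so far into the next mapper in the chain.
            pairs = [out for k, v in pairs for out in mapper(k, v)]
        return pairs
    return chained


# Two hypothetical mappers used only to demonstrate chaining.
def lowercase_mapper(key: str, value: str) -> Iterable[KV]:
    yield key, value.lower()


def tokenize_mapper(key: str, value: str) -> Iterable[KV]:
    for word in value.split():
        yield word, "1"


records = [("0", "Hello Hadoop"), ("13", "hello HDFS")]

# The identity mapper passes records through untouched.
identity_out = [pair for k, v in records for pair in identity_mapper(k, v)]

# The chained mapper lowercases each line, then splits it into (word, "1") pairs.
chain = chain_mappers(lowercase_mapper, tokenize_mapper)
chained_out = [pair for k, v in records for pair in chain(k, v)]
```

In this sketch, `identity_out` equals the input records, while `chained_out` holds one `(word, "1")` pair per lowercased token, all produced in a single pass over the input.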

Map-side join	Reduce-side join
The mapper performs the join	The reducer performs the join
Each input dataset must be divided into the same number of partitions	Easier to implement than a map-side join, because the shuffle and sort phase sends all values with the same key to the same reducer
The input to each map task must be a structured partition, sorted by the join key	The datasets do not need to be structured (partitioned or sorted) in advance
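The two join strategies can be compared with a minimal in-memory simulation. This is a sketch under stated assumptions, not Hadoop code: the function names, the integer keys, the modulo partitioner, and the sample `users` and `orders` data are all hypothetical. The reduce-side version tags records by source and groups them by key (standing in for shuffle and sort); the map-side version assumes both inputs are already partitioned the same way and sorted by key, so each "map task" can merge its pair of partitions directly.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Tag records by source, group by key (the shuffle), then join per key."""
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, l, r) for l, r in product(lefts, rights))
    return joined


def partition(records, num_partitions):
    """Split records by key modulo num_partitions and sort each partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[key % num_partitions].append((key, value))
    return [sorted(p) for p in parts]


def merge_join(left_part, right_part):
    """Join two partitions that are both sorted by key."""
    joined, i, j = [], 0, 0
    while i < len(left_part) and j < len(right_part):
        lk, rk = left_part[i][0], right_part[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand records that share this key.
            j_end = j
            while j_end < len(right_part) and right_part[j_end][0] == lk:
                j_end += 1
            while i < len(left_part) and left_part[i][0] == lk:
                for _, rv in right_part[j:j_end]:
                    joined.append((lk, left_part[i][1], rv))
                i += 1
            j = j_end
    return joined


def map_side_join(left, right, num_partitions=2):
    """Each 'map task' joins one pair of matching, pre-sorted partitions."""
    out = []
    for lp, rp in zip(partition(left, num_partitions), partition(right, num_partitions)):
        out.extend(merge_join(lp, rp))
    return out


users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (1, "lamp"), (4, "cup")]

reduce_result = reduce_side_join(users, orders)
map_result = map_side_join(users, orders)
```

Both functions produce the same joined rows here. The map-side version avoids the grouping step, but only because `partition` has already put both inputs into the same number of sorted partitions, which is the precondition the table describes.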

External Table	Managed Table
Refers to data at an existing location outside the Hive warehouse directory	Also known as an internal table; Hive manages the data and moves it into its warehouse directory by default
When an external table is dropped, Hive deletes only the table's metadata and leaves the data in HDFS unchanged	When a managed table is dropped, both the metadata and the table data are deleted from the Hive warehouse directory

Hive	Pig
Uses a declarative, SQL-like language called HiveQL, typically for reporting	Uses a high-level procedural language called Pig Latin for programming
Operates on the server side of the cluster and works with structured data	Operates on the client side of the cluster and works with both structured and unstructured data
Does not support the Avro file format by default; support can be added through the org.apache.hadoop.hive.serde2.avro package	Supports the Avro file format by default
Developed by Facebook; supports partitioning	Developed by Yahoo; does not support partitioning

Pig	MapReduce
Needs fewer lines of code than MapReduce	Needs more lines of code
A high-level language in which join operations are easy to express	A low-level programming model in which join operations are harder to implement
On execution, Pig Latin scripts are converted internally into one or more MapReduce jobs	MapReduce jobs take more time to compile
Works with all versions of Hadoop	A MapReduce program written for one Hadoop version may not work with other versions

Sqoop	Flume
Works with RDBMS and NoSQL databases to import and export data	Works with streaming data that is generated continuously in the Hadoop environment, such as log files
Data loading is not event-driven	Data loading is completely event-driven
Works with structured data sources, using Sqoop connectors to fetch data from them	Fetches streaming data, such as tweets or log files, from web servers or application servers
Imports data from an RDBMS into HDFS and exports it back to the RDBMS	Data flows from multiple channels into HDFS

Top 80 Hadoop Interview Questions and Answers: Sqoop, Hive, HDFS and more

Table of Contents

Hadoop Interview Questions

HDFS Interview Questions

1. What are the different vendor-specific distributions of Hadoop?

2. What are the different Hadoop configuration files?

3. What are the three modes in which Hadoop can run?

4. What are the differences between regular FileSystem and HDFS?

5. Why is HDFS fault-tolerant?

6. Explain the architecture of HDFS.

NameNode

DataNode

7. What are the two types of metadata that a NameNode server holds?

8. What is the difference between a federation and high availability?

9. If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?

10. How does rack awareness work in HDFS?

11. How can you restart NameNode and all the daemons in Hadoop?

12. Which command will help you find the status of blocks and FileSystem health?

13. What would happen if you store too many small files in a cluster on HDFS?

14. How do you copy data from the local system onto HDFS?

15. When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

16. Is there any way to change the replication of files on HDFS after they are already written to HDFS?

17. Who takes care of replication consistency in a Hadoop cluster and what do under/over replicated blocks mean?

Under-replicated blocks:

Over-replicated blocks:

MapReduce Interview Questions

18. What is the distributed cache in MapReduce?

19. What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

RecordReader

Combiner

Partitioner

20. Why is MapReduce slower in processing data in comparison to other processing frameworks?

21. Is it possible to change the number of mappers to be created in a MapReduce job?

22. Name some Hadoop-specific data types that are used in a MapReduce program.

23. What is speculative execution in Hadoop?

24. How is identity mapper different from chain mapper?

25. What are the major configuration parameters required in a MapReduce program?

26. What do you mean by map-side join and reduce-side join in MapReduce?

27. What is the role of the OutputCommitter class in a MapReduce job?