Cracking the Hadoop Developer Interview
A global study reports that around 84% of organizations have implemented, or plan to implement, a Big Data project, and Google Trends points to the same level of interest. Given the size and intensity of the competition, cracking a Hadoop interview is no longer a walk in the park; it requires focused, concerted effort.
There are a number of different job profiles related to and within the Hadoop domain. In the current article, we shall discuss some of the more common interview questions at the entry-level for both experienced (but new to Hadoop) and fresh aspirants.
What Is The Hadoop Framework?
Ans. Hadoop is an open source framework written in Java and developed under the Apache Software Foundation. It is used to write software applications that process vast amounts of data. The framework runs in parallel on large clusters, which may contain thousands of nodes (individual machines), and it processes data in a reliable and fault-tolerant manner. Hadoop’s core programming model is based on Google’s MapReduce.
Hadoop is a platform that offers both distributed storage and computational capabilities. Hadoop was first conceived to fix a scalability issue that existed in Nutch, an open source crawler and search engine. At the time, Google had published papers that described its novel distributed filesystem, the Google File System (GFS), and MapReduce, a computational framework for parallel processing. The successful implementation of these papers’ concepts in Nutch resulted in its split into two separate projects, the second of which became Hadoop, a first-class Apache project.
What is HDFS?
Filesystems that manage storage across a network of machines are called distributed filesystems. Hadoop comes with a Java-based filesystem called HDFS, which provides scalable and reliable data storage and was designed to span large clusters of commodity servers. HDFS has many similarities with other distributed filesystems, but differs in several respects. One noticeable difference is HDFS's write-once, read-many-times model, which relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access.
Another distinctive attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data than to move the data to the application space. In HDFS, data blocks are distributed across the local drives of all machines in a cluster. HDFS is designed to work with MapReduce, since computation is moved to the data. HDFS runs on a cluster of machines and provides redundancy through block replication.
Hadoop Vs. Traditional System
Hadoop was designed for large, distributed data processing that addresses every file in the database, which is a type of processing that takes time. For tasks where performance isn’t critical, such as running end-of-day reports to review daily transactions, scanning historical data, and performing analytics where a slower time-to-insight is acceptable, Hadoop is ideal.
On the other hand, in cases where organizations rely on time-sensitive data analysis, a traditional database is the better fit. That’s because shorter time-to-insight isn’t about analyzing large unstructured datasets, which Hadoop does so well. It’s about analyzing smaller data sets in real or near-real time, which is what traditional databases are well equipped to do.
An RDBMS works well only when the entity-relationship (ER) model is defined precisely up front, in line with Codd’s 12 rules, so that the database schema can grow in a controlled way. The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through SQL. The Hadoop framework, by contrast, works very well with both structured and unstructured data, and it supports a variety of data formats, such as XML, JSON, and text-based flat files.
What Is MapReduce?
Ans. MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce job usually splits the input data set into independent chunks, which the map tasks process in a completely parallel manner on different nodes. The framework then sorts the outputs of the maps and passes them to the reduce tasks. The reducers produce the final result from this sorted map output.
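The flow described above can be sketched in plain Java. This is only an illustrative model of the map, sort/group, and reduce phases for a word count; it does not use the Hadoop API, and the class and method names (MapReduceSketch, map, shuffle, reduce, wordCount) are assumptions made for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the MapReduce data flow; it does not use the Hadoop API.
class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one input chunk.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sort/group phase: collect the values of each key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse all values for one key into a single total.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence; Hadoop would run the map tasks in parallel.
    static TreeMap<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String chunk : chunks) {
            intermediate.addAll(map(chunk));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(intermediate).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, hadoop=1}
        System.out.println(wordCount(Arrays.asList("big data big", "data hadoop")));
    }
}
```

Each string passed to wordCount stands in for one input chunk; on a real cluster each chunk would be handled by a separate map task.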
How Do Read Operations Function In Hadoop?
Read/write operations in HDFS follow a single-master, multiple-slave architecture, in which the Namenode acts as the master and the Datanodes as slaves. All the metadata is held by the Namenode, and the actual data is stored on the Datanodes.
Source : http://data-flair.training/blogs/hadoop-hdfs-data-read-and-write-operations/
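1. The client opens the file it wishes to read by calling open() on the DistributedFileSystem.
2. DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the blocks in the file.
3. For each block, the namenode returns the addresses of the datanodes that hold a copy of it, and the client then proceeds to read the data from those datanodes.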
4. The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
5. The client calls read() on the stream. Blocks are read in order, each from the closest datanode that holds a replica; because every block is replicated on several datanodes, the read can continue from another replica if one datanode fails.
6. When the client has finished reading, it calls close() on the stream, which closes the connections to the datanodes. During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks.
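The failure handling in the last step can be sketched as follows. This is a toy model under stated assumptions, not the real DFSInputStream: the class ReadFailoverSketch, its methods, and the datanode names are hypothetical, and the replica list is assumed to arrive sorted closest-first.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of replica selection during an HDFS read; not the real DFSInputStream.
class ReadFailoverSketch {
    // Datanodes that failed earlier in this read; they are skipped for later blocks.
    private final Set<String> deadNodes = new HashSet<>();
    // Hypothetical set of datanodes that are unreachable in this simulation.
    private final Set<String> unreachable;

    ReadFailoverSketch(Set<String> unreachable) {
        this.unreachable = unreachable;
    }

    // replicas is assumed to be sorted closest-first for the block being read.
    String pickDatanode(List<String> replicas) {
        for (String node : replicas) {
            if (deadNodes.contains(node)) {
                continue;               // known bad: do not retry it for this block
            }
            if (unreachable.contains(node)) {
                deadNodes.add(node);    // simulated I/O error: remember the failure
                continue;               // and fall through to the next closest replica
            }
            return node;
        }
        throw new IllegalStateException("no live replica for block");
    }

    Set<String> deadNodes() {
        return deadNodes;
    }

    public static void main(String[] args) {
        ReadFailoverSketch reader = new ReadFailoverSketch(new HashSet<>(Arrays.asList("dn1")));
        System.out.println(reader.pickDatanode(Arrays.asList("dn1", "dn2", "dn3"))); // prints dn2
        System.out.println(reader.pickDatanode(Arrays.asList("dn1", "dn3", "dn2"))); // prints dn3
    }
}
```

On the second call dn1 is skipped without being contacted again, which is the "remember failed datanodes" behaviour described above.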
How Does The Write Operation In Hadoop Function?
1. The user asks the HDFS client to write a file. The client creates the file by calling create() on DistributedFileSystem. DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace.
2. The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file.
3. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
4. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
5. As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.
6. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
7. The list of datanodes forms a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline.
8. Similarly, the second datanode stores the packet and forwards it to the third datanode in the pipeline.
9. The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue.
10. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
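The interplay of the data queue, the datanode pipeline, and the ack queue in the steps above can be modelled in a few lines. This is a simplified sketch, not Hadoop's DFSOutputStream: the class WritePipelineSketch and its methods are assumptions for illustration, packets are plain integers, and acknowledgements are passed in as a count.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of the HDFS write-path queues; the real logic lives in DFSOutputStream.
class WritePipelineSketch {
    final Deque<Integer> dataQueue = new ArrayDeque<>();      // packets waiting to be sent
    final Deque<Integer> ackQueue = new ArrayDeque<>();       // packets sent but not yet fully acknowledged
    final List<List<Integer>> datanodes = new ArrayList<>();  // packets stored on each pipeline node

    WritePipelineSketch(int replication) {
        for (int i = 0; i < replication; i++) {
            datanodes.add(new ArrayList<>());
        }
    }

    // Client side: split the data into packets and put them on the data queue.
    void write(int packetCount) {
        for (int p = 0; p < packetCount; p++) {
            dataQueue.addLast(p);
        }
    }

    // Streamer side: send the next packet down the pipeline and park it on the ack queue.
    void send() {
        int packet = dataQueue.removeFirst();
        for (List<Integer> node : datanodes) {
            node.add(packet);   // each node stores the packet, then forwards it to the next
        }
        ackQueue.addLast(packet);
    }

    // Remove the oldest in-flight packet only if every datanode in the pipeline acknowledged it.
    boolean acknowledge(int acksReceived) {
        if (acksReceived < datanodes.size()) {
            return false;       // still waiting on at least one datanode
        }
        ackQueue.removeFirst();
        return true;
    }

    public static void main(String[] args) {
        WritePipelineSketch pipeline = new WritePipelineSketch(3);
        pipeline.write(2);
        pipeline.send();
        System.out.println(pipeline.acknowledge(2)); // prints false: only two of three acks
        System.out.println(pipeline.acknowledge(3)); // prints true: packet leaves the ack queue
    }
}
```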
What Are Edge Or Gateway Nodes?
Edge nodes are the interface between the Hadoop cluster and the outside network. For this reason, they’re sometimes referred to as gateway nodes. Most commonly, edge nodes are used to run client applications and cluster administration tools. Typically, edge nodes are kept separate from the nodes that run Hadoop services such as HDFS and MapReduce, mainly to keep computing resources separate. Edge nodes running within the cluster allow for centralized management of all the Hadoop configuration entries on the cluster nodes, which helps to reduce the amount of administration needed to update the config files.
The fact is, given the limited security within Hadoop itself, even if your Hadoop cluster operates in a local- or wide-area network behind an enterprise firewall, you may want to consider a cluster-specific firewall to more fully protect non-public data that may reside in the cluster. In this deployment model, think of the Hadoop cluster as an island within your IT infrastructure — for every bridge to that island you should consider an edge node for security.
What Is Apache Yarn?
Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. Originally described by Apache as a redesigned resource manager, YARN is the next-generation computation and resource management framework in Apache Hadoop, and was introduced in Hadoop 2 to improve the MapReduce implementation, enabling Hadoop to support more varied processing approaches and a broader array of applications.
Source: Hadoop Definitive Guide book
Difference Between MapReduce 1 & MapReduce 2 (YARN)?
In MapReduce 1, there are two types of daemons that control the job execution process: a jobtracker and one or more tasktrackers.
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker. In MapReduce 1, the jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping, such as maintaining counter totals).
By contrast, in YARN, these responsibilities are handled by separate entities - the resource manager and an application master (one for each MapReduce job). The jobtracker is also responsible for storing job history for completed jobs; in YARN, the equivalent role is that of the timeline server, which stores application history.
How Does Secondary NameNode Complement NameNode?
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.
This information is stored persistently on the local disk in the form of two files: the namespace image, and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, because this information is reconstructed from datanodes when the system starts.
It is also possible to run a secondary namenode, which, despite its name, does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags behind that of the primary, so in the event of total failure of the primary, data loss is almost certain.
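The merge performed by the secondary namenode can be pictured as replaying the edit log onto the namespace image. The sketch below is a deliberately simplified model: the image is a sorted set of paths, edits are strings of the form "ADD path" or "DELETE path", and the class CheckpointSketch and its merge method are assumptions for this example rather than Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model of a checkpoint: replay the edit log onto the namespace image.
class CheckpointSketch {

    // Returns a new image; the inputs are left unchanged.
    static TreeSet<String> merge(TreeSet<String> image, List<String> editLog) {
        TreeSet<String> merged = new TreeSet<>(image);
        for (String edit : editLog) {
            String[] parts = edit.split(" ", 2);
            if (parts[0].equals("ADD")) {
                merged.add(parts[1]);
            } else if (parts[0].equals("DELETE")) {
                merged.remove(parts[1]);
            } else {
                throw new IllegalArgumentException("unknown edit: " + edit);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>(Arrays.asList("/a", "/b"));
        List<String> edits = new ArrayList<>(Arrays.asList("ADD /c", "DELETE /a"));
        TreeSet<String> checkpoint = merge(image, edits);
        edits.clear(); // simplification: after the merge, the applied edits are no longer needed
        System.out.println(checkpoint); // prints [/b, /c]
    }
}
```

The longer the edit log grows, the longer this replay takes, which is why the merge is done periodically rather than only at namenode startup.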
What Is Resource Manager?
In MRv1, the jobtracker acts as resource manager, job scheduler, and history server at once, which limits scalability. In YARN, these responsibilities are split between a separate resource manager, per-application application masters, and a history server. The ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs). This divide-and-conquer approach to resource management and job scheduling makes the service more scalable and more available, while the applications themselves parse and condense data sets in parallel. Thus, the ResourceManager is primarily limited to scheduling, i.e., arbitrating the available resources in the system among the competing applications without concerning itself with per-application state management.
What Is Node Manager?
The YARN equivalent of a tasktracker is a node manager. NodeManagers take instructions from the ResourceManager and manage the resources available on a single node, which is why they are called per-node agents. The NodeManager is the per-machine framework agent that is responsible for containers: it monitors their resource usage and reports it to the ResourceManager. In contrast to the fixed number of slots for map and reduce tasks in MRv1, the NodeManager in MRv2 has a number of dynamically created resource containers. All container processes running on a slave node are initially provisioned, monitored, and tracked by that slave node’s NodeManager daemon.
What Is Application Master?
ApplicationMasters are responsible for negotiating resources with the ResourceManager and for working with NodeManagers to start the containers. The ApplicationMaster communicates with a YARN cluster and handles application execution. Its main tasks are communicating with the ResourceManager to negotiate and allocate resources for future containers and, after container allocation, communicating with NodeManagers to launch application containers on them. The ApplicationMaster is the actual owner of the job. Because the ApplicationMaster is itself launched within a container, which may share a physical host with other containers in a multi-tenant cluster, it cannot make assumptions about things like pre-configured ports that it can listen on.
What Is A Container?
A container is a collection of all the resources necessary to run an application: CPU cores, memory, network bandwidth, and disk space. A deployed container runs as an individual process on a slave node in a Hadoop cluster. A container represents an allocated resource in the cluster. The ResourceManager is the sole authority to allocate any Container to applications. The allocated Container is always on a single node and has a unique ContainerId, and has a specific amount of Resources allocated.
What Is WritableComparable?
WritableComparable is the interface that Hadoop key types implement. It is just a subinterface of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;
public interface WritableComparable<T> extends Writable, Comparable<T> { }
IntWritable, for example, implements the WritableComparable interface. WritableComparator, by contrast, is a general-purpose implementation of RawComparator for WritableComparable classes.
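To make the contract concrete, here is a minimal sketch of a custom key type. So that it compiles without Hadoop on the classpath, it declares a local stand-in interface (WritableComparableLike) with the same three methods a Hadoop key needs: write(), readFields(), and compareTo(). The stand-in, the PairKey class, and the roundTrip helper are all assumptions made for illustration; in a real job the class would implement org.apache.hadoop.io.WritableComparable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableKeySketch {

    // Local stand-in with the same shape as Hadoop's WritableComparable<T>,
    // declared here only so the sketch compiles without Hadoop on the classpath.
    interface WritableComparableLike<T> extends Comparable<T> {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical composite key, ordered by first and then by second.
    static class PairKey implements WritableComparableLike<PairKey> {
        int first;
        int second;

        PairKey() { }  // blank instance that readFields() can populate

        PairKey(int first, int second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();   // fields must be read in the order they were written
            second = in.readInt();
        }

        @Override
        public int compareTo(PairKey other) {
            int byFirst = Integer.compare(first, other.first);
            return byFirst != 0 ? byFirst : Integer.compare(second, other.second);
        }
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    static PairKey roundTrip(PairKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            PairKey copy = new PairKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairKey copy = roundTrip(new PairKey(1950, 22));
        System.out.println(copy.first + " " + copy.second); // prints 1950 22
    }
}
```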
What Is NullWritable?
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to or read from the stream. It is used as a placeholder; anything writing or reading NullWritables will know ahead of time that it will be dealing with that type - for example, in MapReduce, a key or a value can be declared as a NullWritable when you don’t need to use that position, effectively storing a constant empty value. NullWritable can also be useful as a key in a SequenceFile when you want to store a list of values, as opposed to key-value pairs.
What Is The Use Of Context Object?
The Context object allows the mapper to interact with the rest of the Hadoop eco-system. It includes configuration data for the job, as well as interfaces which allow it to deliver output. The context objects are used for emitting key-value pairs. The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
What Is A Mapper?
Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or to many output pairs. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
Explain The Shuffle?
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort — and transfers the map outputs to the reducers as inputs — is known as the shuffle. Input to the Reducer is the sorted output of the mappers. In this phase, the framework fetches the relevant partition of the output of all the mappers, via HTTP.
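A rough model of what the shuffle guarantees is shown below. The partition function uses the same arithmetic as Hadoop's default HashPartitioner (mask the sign bit of the key's hash code, then take the modulus of the number of reduce tasks); everything else, including the class ShuffleSketch and the use of an in-memory TreeMap in place of the real merge sort and HTTP fetch, is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class ShuffleSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Routes each map output key (with an implied value of 1) to its reducer.
    // A TreeMap per reducer keeps that reducer's input sorted by key.
    static List<TreeMap<String, List<Integer>>> shuffle(List<String> mapOutputKeys, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String key : mapOutputKeys) {
            reducers.get(partition(key, numReduceTasks))
                    .computeIfAbsent(key, k -> new ArrayList<>())
                    .add(1);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<TreeMap<String, List<Integer>>> reducers = shuffle(Arrays.asList("c", "a", "b", "a"), 2);
        System.out.println(reducers.get(0)); // prints {b=[1]}
        System.out.println(reducers.get(1)); // prints {a=[1, 1], c=[1]}
    }
}
```

All records with the same key land on the same reducer, and each reducer sees its keys in sorted order, which are the two properties the reduce phase relies on.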
What Are The Core Methods And Techniques The Reducer Uses?
The Reducer reduces and shrinks a set of intermediate values which share a key to a smaller set of values. The API of Reducer is very similar to that of Mapper: there's a run() method that receives a Context containing the job's configuration as well as interfacing methods that return data from the reducer itself back to the framework. The run() method calls setup() once and reduce() once for each key associated with the reduce task, and cleanup() once at the end. Each of these methods can access the job's configuration data by using Context.getConfiguration(). As with Mapper, any or all of these methods can be overridden with custom implementations. If none of these methods are overridden, the default reducer operation is the identity function; values are passed through without further processing.
The heart of Reducer is its reduce() method. This is called once per key; the second argument is an Iterable which returns all the values associated with that key.
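The call order described above (setup() once, reduce() once per key, cleanup() once) can be sketched with a small stand-in class. This is not Hadoop's Reducer: the class ReducerLifecycleSketch is hypothetical, the Context is replaced by a plain output map, and the calls list exists only to make the order visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Stand-in that mimics the call order of a reducer's run() method; not the Hadoop class.
class ReducerLifecycleSketch {
    final List<String> calls = new ArrayList<>();         // records the order of lifecycle calls
    final Map<String, Integer> output = new TreeMap<>();  // plays the role of writing to the context

    void setup() {
        calls.add("setup");
    }

    // Called once per key with all the values that share that key.
    void reduce(String key, Iterable<Integer> values) {
        calls.add("reduce:" + key);
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        output.put(key, sum);
    }

    void cleanup() {
        calls.add("cleanup");
    }

    // setup() once, reduce() for each key in sorted order, cleanup() once at the end.
    void run(TreeMap<String, List<Integer>> groupedInput) {
        setup();
        for (Map.Entry<String, List<Integer>> e : groupedInput.entrySet()) {
            reduce(e.getKey(), e.getValue());
        }
        cleanup();
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> input = new TreeMap<>();
        input.put("b", Arrays.asList(4));
        input.put("a", Arrays.asList(1, 2));
        ReducerLifecycleSketch reducer = new ReducerLifecycleSketch();
        reducer.run(input);
        System.out.println(reducer.calls);  // prints [setup, reduce:a, reduce:b, cleanup]
        System.out.println(reducer.output); // prints {a=3, b=4}
    }
}
```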
What is the use of Combiner?
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. A combiner operates like a Reducer, but only on the subset of the key/value pairs output by a single Mapper. Combiners are essentially mini-reducers that lessen the workload passed on to the reducers. A mapper may emit more than one record per key, and these records would ultimately be aggregated and passed to the reducer in a single reduce() call.
If the records for a key can be combined before they are passed to the reducers, less data has to be shuffled across the network to reach the right reducer, which improves the job's performance. The combiner is an optional class, specified via Job.setCombinerClass(ClassName), that performs local aggregation of the intermediate outputs. The combiner function doesn’t replace the reduce function, but because it can cut down the amount of data shuffled between the mappers and the reducers, it is always worth considering whether you can use a combiner function in your MapReduce job.
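The saving is easy to see in a small model. The sketch below is plain Java with hypothetical names (CombinerSketch, combine, reduce): each mapper's output is a list of words, each standing for a (word, 1) record, and the combiner sums them per mapper before anything is "shuffled".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {

    // Combiner: sum the (word, 1) records of ONE mapper locally, before the shuffle.
    static TreeMap<String, Integer> combine(List<String> wordsFromOneMapper) {
        TreeMap<String, Integer> partial = new TreeMap<>();
        for (String w : wordsFromOneMapper) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer: sum the partial counts arriving from all mappers.
    static TreeMap<String, Integer> reduce(List<TreeMap<String, Integer>> partials) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (TreeMap<String, Integer> p : partials) {
            for (Map.Entry<String, Integer> e : p.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> mapper1 = Arrays.asList("a", "b", "a", "a");
        List<String> mapper2 = Arrays.asList("b", "b", "c");
        TreeMap<String, Integer> c1 = combine(mapper1);
        TreeMap<String, Integer> c2 = combine(mapper2);
        System.out.println(mapper1.size() + mapper2.size()); // prints 7: records shuffled without a combiner
        System.out.println(c1.size() + c2.size());           // prints 4: records shuffled with a combiner
        System.out.println(reduce(Arrays.asList(c1, c2)));   // prints {a=3, b=3, c=1}
    }
}
```

This works because addition is associative and commutative, so summing partial sums gives the same totals. An operation without those properties (a mean, for example) cannot simply reuse the reducer as its combiner.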
What Is HDFS federation?
HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding multiple independent namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /account, say, and a second namenode might handle files under /finance.
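One way to picture the partitioned namespace is as a table that maps path prefixes to namenodes. The sketch below is only an illustration of that idea: the class FederationSketch, its mount and resolve methods, and the namenode names nn1 and nn2 are hypothetical, and the lookup is a plain longest-prefix match rather than Hadoop's own client-side mount table implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FederationSketch {
    // Hypothetical mount table: namespace prefix -> namenode that owns it.
    private final Map<String, String> mounts = new LinkedHashMap<>();

    void mount(String prefix, String namenode) {
        mounts.put(prefix, namenode);
    }

    // Picks the namenode whose prefix is the longest match for the path.
    String resolve(String path) {
        String best = null;
        for (String prefix : mounts.keySet()) {
            boolean matches = path.equals(prefix) || path.startsWith(prefix + "/");
            if (matches && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no namenode mounted for " + path);
        }
        return mounts.get(best);
    }

    public static void main(String[] args) {
        FederationSketch table = new FederationSketch();
        table.mount("/account", "nn1");
        table.mount("/finance", "nn2");
        System.out.println(table.resolve("/account/2023/q1.csv")); // prints nn1
        System.out.println(table.resolve("/finance"));             // prints nn2
    }
}
```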
What Is Split-Brain?
Moving from Hadoop 1 to Hadoop 2 made the NameNode setup more complex. A common problem, which is not unique to Hadoop but affects any distributed system, is the "split-brain" scenario. Split-brain is a situation where two NameNodes each decide they are the active one and both start writing changes to the edit log. To prevent this, the HA configuration maintains a marker in ZooKeeper that clearly states which NameNode is active, and the JournalNodes accept writes only from that node.
What Is Fencing?
Fencing is a technique that addresses the split-brain problem in Hadoop 2. To be absolutely sure that two NameNodes never become active at the same time, fencing is applied during failover. The idea is to force the shutdown of the previously active NameNode before the active state is transferred to a standby; the failover process first waits for some time to confirm that the active NameNode is down.
As a last resort, the previously active namenode can be fenced with a technique rather graphically known as STONITH, or “shoot the other node in the head,” which uses a specialized power distribution unit to forcibly power down the host machine.
What Are Containers?
A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on).
A Container represents an allocated resource in the cluster. The ResourceManager is the sole authority to allocate any Container to applications. The allocated Container is always on a single node and has a unique ContainerId. It has a specific amount of Resource allocated. Typically, an ApplicationMaster receives the Container from the ResourceManager during resource negotiation and then talks to the NodeManager to start/stop containers.
What Do You Mean By Task Instance?
Task instances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (known as a Task Instance), so that a process failure does not take down the entire TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple Task Instance processes running on a slave node, depending on the number of slots configured on the TaskTracker. By default, a new Task Instance JVM process is spawned for each task.
What Is Checksum?
The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. Datanodes are responsible for verifying the data they receive before storing the data and its checksum.
This applies to data that they receive from clients and from other datanodes during replication. When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Maintaining statistics such as these is invaluable in detecting bad disks.
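The verify-on-write and verify-on-read idea can be sketched with the JDK's built-in CRC-32 class. This is an illustration only: the class ChecksumSketch and its methods are made up for the example, and java.util.zip.CRC32 is used as a stand-in for whichever checksum variant a given HDFS version is configured to use.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class ChecksumSketch {

    // Computes the checksum that would be stored alongside the data.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Recomputes the checksum on read and compares it with the stored one.
    static boolean verify(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "hadoop block".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);
        System.out.println(verify(block, stored)); // prints true

        block[0] ^= 1; // simulate a single flipped bit on disk or on the wire
        System.out.println(verify(block, stored)); // prints false
    }
}
```

A checksum only detects corruption; recovering the data still depends on reading another replica of the block.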
What Is Compression?
File compression brings two major benefits to the table: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant. Which compression format you use depends on such considerations as file size, format, and the tools you are using for processing.
As a rule of thumb, use a container file format such as sequence files, Avro datafiles, ORCFiles, or Parquet files, all of which support both compression and splitting; a fast compressor such as LZO, LZ4, or Snappy is generally a good choice with these formats. Alternatively, use a compression format that supports splitting, such as bzip2 (although bzip2 is fairly slow), or one that can be indexed to support splitting, such as LZO.
What Is Serialization/Deserialization?
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
Deserialization is the reverse process of turning a byte stream back into a series of structured objects. Serialization is used in two quite distinct areas of distributed data processing: for inter-process communication and for persistent storage.
What Are Remote Procedure Calls (RPCs)?
In Hadoop, the inter-process communication between nodes in a system is done by using remote procedure calls, i.e., RPCs. The RPC protocol uses serialization to turn the message into a binary stream to be sent to the remote node, which receives and de-serializes the binary stream into the original message.
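An RPC serialization format is expected to be:
1. Compact: A compact format makes the best use of network bandwidth, which is a scarce resource in a data center.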
2. Fast: Very little performance overhead is expected for the serialization and de-serialization processes.
3. Extensible: To adapt to new changes and requirements.
4. Interoperable: The format needs to be designed to support clients that are written in different languages to the server.
How Does The Client Communicate With the HDFS?
The Client communication to HDFS is via the Hadoop HDFS API. Client applications talk to the Name Node whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The Name Node responds to successful requests by returning a list of relevant Data Node servers where the data lives. Client applications can talk directly to a Data Node, once the Name Node has provided the location of the data.
What Are The Restrictions To The Key And Value Class?
The framework has to serialize the key and value classes, so both must implement Hadoop's Writable interface. Keys are also sorted by the framework, so, much like the keys of a sorted Java Map, they must be comparable; key classes therefore implement WritableComparable, which combines Writable with Comparable.
What Is SSH?
SSH (Secure Shell) is a network protocol for secure access to a remote host. It provides its own encrypted transport (it is independent of SSL/TLS) and supports both password and public-key authentication. To work seamlessly, SSH needs to be set up to allow password-less login for the HDFS and YARN users from machines in the cluster. The simplest way to achieve this is to generate a public/private key pair and place it in an NFS location that is shared across the cluster.