With the growing volumes of data being generated, the storage capacity of a machine may often not be enough to store all of it. However, this can be resolved if you store the data across a network of machines. And these networks of filesystems are called distributed filesystems.
This is where Hadoop comes into play and provides a reliable filesystem, commonly known as HDFS (Hadoop Distributed File System). HDFS is a unique design that provides storage for extremely large files.
Now, let us begin our HDFS tutorial by understanding what is HDFS.
What is HDFS?
HDFS is a distributed file system that provides access to data across Hadoop clusters. A cluster is a group of computers that work together. Like other Hadoop-related technologies, HDFS is a key tool that manages and supports analysis of very large volumes; petabytes and zettabytes of data.
Before 2011, storing and retrieving petabytes or zettabytes of data had the following three major challenges: Cost, Speed, Reliability. Traditional file system approximately costs $10,000 to $14,000, per terabyte. Searching and analyzing data was time-consuming and expensive. Also, if search components were saved on different servers, fetching data was difficult. Here’s how HDFS resolves all the three major issues of traditional file systems:
HDFS is open-source software so that it can be used with zero licensing and support costs. It is designed to run on a regular computer.
Large Hadoop clusters can read or write more than a terabyte of data per second. A cluster comprises multiple systems logically interconnected in the same network.
HDFS can easily deliver more than two gigabytes of data per second, per computer to MapReduce, which is a data processing framework of Hadoop.
HDFS copies the data multiple times and distributes the copies to individual nodes. A node is a commodity server which is interconnected through a network device.
HDFS then places at least one copy of data on a different server. In case, any of the data is deleted from any of the nodes; it can be found within the cluster.
A regular file system, like a Linux file system, is different from HDFS with respect to the size of the data. In a regular file system, each block of data is small, usually about 51 bytes. However, in HDFS, each block is 128 Megabytes by default.
A regular file system provides access to large data but may suffer from disk input/output problems mainly due to multiple seek operations.
On the other hand, HDFS can read large quantities of data sequentially after a single seek operation. This makes HDFS unique since all of these operations are performed in a distributed mode.
Let us list the characteristics of HDFS.
Characteristics of HDFS
Below are some characteristics of HDFS:
- HDFS has high fault-tolerance
- HDFS may consist of thousands of server machines. Each machine stores a part of the file system data. HDFS detects faults that can occur on any of the machines and recovers it quickly and automatically.
- HDFS has high throughput
- HDFS is designed to store and scan millions of rows of data and to count or add some subsets of the data. The time required in this process is dependent on the complexities involved.
- It has been designed to support large datasets in batch-style jobs. However, the emphasis is on high throughput of data access rather than low latency.
- HDFS is economical
- HDFS is designed in such a way that it can be built on commodity hardware and heterogeneous platforms, which is low-priced and easily available.
Similar to the example explained in the previous section, HDFS stores files in a number of blocks. Each block is replicated to a few separate computers. The count of replication can be modified by the administrator. Data is divided into 128 Megabytes per block and replicated across local disks of cluster nodes. Metadata controls the physical location of a block and its replication within the cluster. It is stored in NameNode. HDFS is the storage system for both input/output of MapReduce jobs. Let’s understand how HDFS stores files with an example.
How Does HDFS Work?
Here’s how HDFS stores files. Example - A patron gifted a collection of popular books to a college library. The librarian decided to arrange the books on a small rack and then distribute multiple copies of each book on other racks. This way the students could easily pick up a book from any of the racks.
Similarly, HDFS creates multiple copies of a data block and keeps them in separate systems for easy access.
Let’s discuss the HDFS Architecture and Components in the next section.
HDFS Architecture and Components
Broadly, HDFS architecture is known as the master and slave architecture which is shown below.
A master node, that is the NameNode, is responsible for accepting jobs from the clients. Its task is to ensure that the data required for the operation is loaded and segregated into chunks of data blocks.
HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, stored, and replicated in the slave nodes known as the DataNodes as shown in the section below.
The data blocks are then distributed to the DataNode systems within the cluster. This ensures that the replicas of the data are maintained. DataNode serves to read or write requests. It also creates, deletes, and replicates blocks on the instructions from the NameNode. We discussed in the previous topic that it is the metadata that stores the block location and its replication. It is explained in the below diagram.
There is a Secondary NameNode which performs tasks for NameNode and is also considered as a master node. Prior to Hadoop 2.0.0, the NameNode was a Single Point of Failure, or SPOF, in an HDFS cluster.
Each cluster had a single NameNode. In case of an unplanned event, such as a system failure, the cluster would be unavailable until an operator restarted the NameNode.
Also, planned maintenance events, such as software or hardware upgrades on the NameNode system, would result in cluster downtime.
The HDFS High Availability, or HA, feature addresses these problems by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.
This allows a fast failover to a new NameNode in case a system crashes or an administrator initiates a failover for the purpose of a planned maintenance.
In an HA cluster, two separate systems are configured as NameNodes. At any instance, one of the NameNodes is in an Active state, and the other is in a Standby state.
The Active NameNode is responsible for all client operations in the cluster, while the Standby simply acts as a slave, maintaining enough state to provide a fast failover if necessary.
An HDFS cluster can be managed using the following features:
- Quorum-based storage: Quorum-based Storage refers to the HA implementation that uses Quorum Journal Manager, or QJM. During this implementation, the Standby node keeps its state synchronized with the Active node through a group of separate daemons called JournalNodes Daemons are long-running processes that typically start up with the system and listen for requests from the client processes. Each daemon runs in its own Java Virtual Machine (JVM). When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of the JournalNodes. The Standby node reads the edits from the JournalNodes and constantly watches for changes to the edit log. As the Standby node edits, it applies them to its own namespace. In the event of a failover, the Standby ensures that it has read all the edits from the JournalNodes before it promotes itself to the active state. This ensures the namespace state is fully synchronized before a failover occurs.
- Shared storage using Network File System: In shared storage using NFS implementation, the Standby node keeps its state synchronized with the Active node through access to a directory on a shared storage device.
The main components of HDFS are:
- Secondary Namenode
- File system
The NameNode server is the core component of an HDFS cluster. There can be only one NameNode server in an entire cluster. Namenode maintains and executes the file system namespace operation such as opening, closing, and renaming of files and directories, which are present in HDFS.
The namespace image and the edit log stores information of the data and the metadata. NameNode also determines the linking of blocks to DataNodes. Furthermore, the NameNode is a single point of failure. The DataNode is a multiple instance server. There can be several numbers of DataNode servers. The number depends on the type of network and the storage system.
The DataNode servers, stores, and maintains the data blocks. The NameNode Server provisions the data blocks on the basis of the type of job submitted by the client.
DataNode also stores and retrieves the blocks when asked by clients or the NameNode. Furthermore, it reads/writes requests and performs block creation, deletion, and replication of instruction from the NameNode. There can be only one Secondary NameNode server in a cluster. Note that you cannot treat the Secondary NameNode server as a disaster recovery server. However, it partially restores the NameNode server in case of a failure.
The Secondary NameNode server maintains the edit log and namespace image information in sync with the NameNode server. At times, the namespace images from the NameNode server are not updated; therefore, you cannot totally rely on the Secondary NameNode server for the recovery process.
HDFS exposes a file system namespace and allows user data to be stored in files. HDFS has a hierarchical file system with directories and files. The NameNode manages the file system namespace, allowing clients to work with files and directories.
A file system supports operations like create, remove, move, and rename. The NameNode, apart from maintaining the file system namespace, records any change to metadata information.
Now that we have learned about HDFS components, let us see how NameNode works along with other components.
NameNode maintains two persistent files; one a transaction log called an Edit Log and the other, a namespace image called a FsImage. The Edit Log records every change that occurs in the file system metadata such as creating a new file.
The NameNode is a local filesystem that stores the Edit Log. The entire file system namespace including mapping of blocks, files, and file system properties is stored in FsImage. This is also stored in the NameNode local file system.
When new DataNodes join a cluster, metadata loads the blocks that reside on a specific DataNode into its memory at startup. Metadata then periodically loads the data at user-defined or default intervals.
When the NameNode starts up, it retrieves the Edit Log and FsImage from its local file system. It then updates the FsImage with Edit Log information and stores a copy of the FsImage on the file system as a checkpoint.
The metadata size is limited to the RAM available on the NameNode. A large number of small files would require more metadata than a small number of large files. Hence, the in-memory metadata management issue explains why HDFS favors a small number of large files.
If a NameNode runs out of RAM, it will crash, and the applications will not be able to use HDFS until the NameNode is operational again.
Data block split is an important process of HDFS architecture. As discussed earlier, each file is split into one or more blocks stored and replicated in DataNodes.
DataNodes manage names and locations of file blocks. By default, each file block is 128 Megabytes. However, this potentially reduces the amount of parallelism that can be achieved as the number of blocks per file decreases.
Each map task operates on one block, so if tasks are fewer than nodes in the cluster, the jobs will run slowly. However, this issue is lesser when the average MapReduce job involves more files or larger individual files.
Let us look at some of the benefits of the data block approach.
The data block approach provides:
- Simplified replication
It also helps by shielding users from storage sub-system details.
Block Replication Architecture
Block replication refers to creating copies of a block in multiple data nodes. Usually, the data is split into the forms of parts such as part and part one.
HDFS performs block replication on multiple data nodes so that if an error exists on one of the data nodes servers. The job tracker service resubmits the job to another data node server. The job tracker service is present in the name node server.
In the replication method, each file is split into a sequence of blocks. All blocks except the last one in the file are of the same size. Blocks are replicated for fault tolerance.
The block replication factor is usually configured at the cluster level but it can also be configured at the file level.
The name node receives a heartbeat and a block report from each data node in the cluster. The heartbeat denotes that the data node is functioning properly. A block report lists the blocks on a data node.
Data Replication Topology
The topology of the replicas is critical to ensure the reliability of HDFS. Usually, each data is replicated thrice where the suggested replication topology is as follows.
Place the first replica on the same node as that of the client. Place the second replica on a different rack from that of the first replica. Place the third replica on the same rack as that of the second one but on a different node. Let's understand data replication through a simple example.
Data Replication Topology - Example
The diagram illustrates a Hadoop cluster with three racks. A diagram for Replication and Rack Awareness in Hadoop is given below.
Each rack consists of multiple nodes. R1N1 represents node 1 on rack 1. Suppose each rack has eight nodes. The name node decides which data node belongs to which rack. Block 1 which is B1 is first written to node 4 on rack 1.
A copy is then written to a different node on a different rack which is node 5 on rack 2. The third and final copy of the block is written to the same rack of the second copy but to a different node which is rack 2 node 1.
And now that you have learned about data replication topology or placement, let's discuss how a file is stored in HDFS.
How Are Files Stored?
Let's say we have a large data file that is divided into four blocks. Each block is replicated three times as shown in the diagram.
You might recall that the default size of each block is 128 megabytes.
The name node then carries metadata information of all blocks and its distribution. Let's work on an example of HDFS which gives a deep understanding of all the points discussed so far.
HDFS in Action - Example
Suppose you have 2 log files that you want to save from a local file system to the HDFS cluster.
The cluster has 5 data nodes: node A, node B, node C, node D, and node E.
Now the first log is divided into three blocks: b1 b2 and b3 and the other log is divided into two blocks: b4 and b5.
Now the blocks b1 b2 b3 b4 and b5 are distributed to the node A, node B, node C, and no D respectively as shown in the diagram.
Each block will also be replicated three times on the five data nodes. All of the information related to the list of blocks and the replication known as the metadata information of the five blocks will be stored in namenode.
Now suppose the client asks for one log that you have stored. The inquiry goes to the namenode and the client gets the information about this log file as shown in the diagram.
Based on the information from the namenode, the client receives the file information from the respective data nodes. HDFS provides various access mechanisms. A Java API can be used for applications. There is also a Python and AC language wrapper for non-java applications. A web GUI can also be utilized through an HTTP browser. An FS shell is available for executing commands on HDFS.
Let's look at the commands for HDFS in the command-line interface.
HDFS Command Line
Following are a few basic command lines of HDFS:
To copy the file simplilearn.txt from the local disk to the user's directory, type the command line:
$ hdfs dfs -put
This will copy the file to /user/username/simplilearn.txt
To get a directory listing of the user's home directory, type the command line:
$hdfs dfs –ls
To create a directory called testing under the user's home directory, type the command line:
$hdfs dfs –mkdir
To delete the directory testing and all of its components, type the command line:
hdfs dfs -rm -r
Hue File Browser
The file browser in Hue lets you view and manage your HDFS directories and files. Additionally, you can create, move, rename, modify, upload, download, and delete directories and files. You can also view the file contents.
Next Step to Success
To learn more and get an in-depth understanding of Hadoop and you can enroll in the Big Data Hadoop Administrator Certification Training. This course provides online training on the popular skills required for a successful career in data engineering. In addition to this, you can enroll for the Big Data Engineer Program and master the Hadoop Big Data framework, leverage the functionality of AWS services, and use the database management tool MongoDB to store data.