HBase Tutorial

Have you ever wondered how emails are stored and processed? Before the advent of email, RDBMS was used to store data. However, with the rise of massive amounts of semi-structured data like emails, RDBMS could no longer store and process this data efficiently, and that task was taken up by HBase. The Hadoop ecosystem consists of various units dedicated to different roles, and HBase is one of them.

Introduction to HBase

A few decades ago, when the internet wasn’t yet available, the data generated was much smaller in volume and was structured in nature. Structured data means data that has a definite structure and a standard order. This data was stored in relational databases (RDBMS) without any hassle.

With the evolution of the internet came the term Big Data, as huge volumes of structured and semi-structured data started getting generated. Semi-structured data includes emails, JSON, XML, and .csv files, to name a few. Loads of semi-structured data were created across the globe, and storing and processing this data became a major challenge. The solution? Apache HBase. Let’s now have a look at the history of HBase.


HBase History

Back in November 2006, Google released its paper on Bigtable. In February 2007, the HBase prototype was created as a Hadoop contribution. In October 2007, the first usable HBase was released along with Hadoop 0.15.0, and HBase became a subproject of Hadoop in January 2008. HBase 0.18.1, 0.19.0, and 0.20.0 were released between October 2008 and September 2009. Finally, in May 2010, HBase became an Apache top-level project.

What is HBase?

HBase is modeled after Google's Bigtable, which is a distributed storage system for structured data. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Some of the companies that use HBase as their core program are Facebook, Netflix, Yahoo, Adobe, and Twitter. The goal of HBase is to host large tables with billions of rows and millions of columns on top of clusters of commodity hardware.

Why HBase?

  • It can store huge amounts of data in a tabular format for extremely fast reads and writes.
  • HBase is mostly used in a scenario that requires regular, consistent insertion and overwriting of data.

We know that HDFS stores, processes, and manages large amounts of data efficiently.

However, it performs only batch processing, where data is accessed in a sequential manner. This means the entire dataset has to be scanned for even the simplest of jobs. Hence, a solution was required to access, read, or write data at any time, regardless of where it sits in the cluster.

HBase Real Life Connect - Example

You may be aware that Facebook introduced a new Social Inbox integrating email, IM, SMS, text messages, and on-site Facebook messages, and it needs to store over 135 billion messages a month.

Facebook chose HBase because it needed a system that could handle two types of data patterns:

  • An ever-growing dataset that is rarely accessed
  • A smaller, highly volatile dataset

You read what's in your Inbox, and then you rarely look at it again.

Characteristics of HBase

HBase is a type of NoSQL database and is classified as a key-value store. Some characteristics of HBase are:

  • Each value is identified with a key.
  • Both keys and values are byte arrays, which means binary formats can be stored easily.
  • Values are stored in key order.
  • Values can be quickly accessed by their keys.

HBase is a database in which tables have no fixed schema; column families, not columns, are defined at the time of table creation.
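
To make these characteristics concrete, here is a minimal sketch using the HBase Java client. It assumes an HBase 2.x client on the classpath, a reachable cluster, and the 'newtbl' table with the 'knowledge' column family that is created in the demo later in this tutorial; the row 'r3' and the 'geography' qualifier are made up for the example. Both the row key and the value travel as byte arrays, and the column qualifier is supplied at write time; only the column family had to exist when the table was created.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();          // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("newtbl"))) {

            // Keys and values are plain byte arrays; Bytes is just a conversion helper.
            byte[] rowKey = Bytes.toBytes("r3");
            Put put = new Put(rowKey);
            // The family ("knowledge") was defined at table creation;
            // the qualifier ("geography") is defined here, at write time.
            put.addColumn(Bytes.toBytes("knowledge"), Bytes.toBytes("geography"),
                          Bytes.toBytes("rivers"));
            table.put(put);

            // Values are accessed directly by their key.
            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("knowledge"), Bytes.toBytes("geography"))));
        }
    }
}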

Applications of HBase 

There are a number of HBase applications across various industries, from healthcare to e-commerce to sports. For instance:

  1. In the healthcare sector, HBase is used to store genome sequences and the disease history of individuals or of a particular area.
  2. In e-commerce, HBase is used to store logs of customer search history, and it supports analytics and targeted advertising for better business insights.
  3. In sports, HBase is used to store match details and the history of each match, and this data is used to make better predictions.

HBase vs RDBMS

Do HBase and RDBMS sound similar? Here are some of the primary differences between them.

HBase

  • HBase does not have a fixed schema; only column families are defined at the time of table creation
  • It works well with both structured and semi-structured data
  • It can store denormalized data
  • It is built for wide tables that can be scaled horizontally

RDBMS

  • RDBMS has a fixed schema that describes the structure of the tables
  • It works well only with structured data
  • It can store only normalized data
  • It is built for thin tables that are hard to scale



Features of HBase 

HBase has a number of features like:

  1. Scalable: HBase scales data across multiple nodes, as the data is stored in HDFS.
  2. Automatic failure support: A write-ahead log across clusters provides automatic recovery in case of failure.
  3. Consistent reads and writes: HBase provides consistent reads and writes of data.
  4. Java API for client access: HBase provides an easy-to-use Java API for clients.
  5. Block cache and Bloom filters: HBase supports block caching and Bloom filters for high-volume query optimization (see the sketch after this list).
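
As a rough illustration of the last two features, here is a hedged sketch that uses the Java admin API to create a table whose column family has a row-level Bloom filter and the block cache enabled. It assumes an HBase 2.x client; the 'events' table and 'cf' family names are made up for the example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class FeatureConfigSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Column family with a row-level Bloom filter and the block cache enabled,
            // which speeds up point lookups on high-volume tables.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
                    .setColumnFamily(
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                            .setBloomFilterType(BloomType.ROW)
                            .setBlockCacheEnabled(true)
                            .build())
                    .build());
        }
    }
}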

HBase Architecture

The main components of the HBase architecture are described below.

Apache ZooKeeper monitors the system, while the HBase Master (HMaster) handles region assignment and load balancing. Region Servers serve data for reads and writes; they run on the worker machines of the Hadoop cluster. Each Region Server consists of Regions, an HLog, Stores, MemStores, and various files, and all of this sits on top of the HDFS storage system. Let’s now take an in-depth look at each of these architectural components and see how they work together.
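
To see these components from a client's point of view, here is a small, hedged sketch (assuming an HBase 2.x Java client) that asks the cluster for its metrics and prints the active HMaster and the live Region Servers.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterMetrics;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.*;

public class ClusterOverviewSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            ClusterMetrics metrics = admin.getClusterMetrics();
            // The active HMaster, which assigns regions and balances load.
            System.out.println("Active master: " + metrics.getMasterName());
            // The Region Servers, which actually serve reads and writes.
            for (ServerName rs : metrics.getLiveServerMetrics().keySet()) {
                System.out.println("Region server: " + rs);
            }
        }
    }
}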

HBase Architectural Components: Regions

HBase tables are divided horizontally by row key range into "Regions". Regions are assigned to the nodes in the cluster, called "Region Servers", and these servers serve data for reading and writing.
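
The sketch below shows how a table can be created pre-split into regions, so that each region covers a contiguous row key range and can be assigned to a different Region Server. It is a hedged example assuming an HBase 2.x Java client; the 'messages' table, the 'data' family, and the split keys are made up.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Row key boundaries: region 1 holds keys < "g", region 2 holds "g".."m",
            // region 3 holds "m".."t", and region 4 holds keys >= "t".
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("m"), Bytes.toBytes("t")
            };
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("messages"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("data"))
                    .build(),
                splitKeys);
        }
    }
}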

HBase Architectural Components: ZooKeeper

HBase runs in a distributed environment, where the HMaster alone cannot manage everything; this is where ZooKeeper comes into play. ZooKeeper is a distributed coordination service that maintains server state in the cluster. It maintains and tracks which servers are alive and available, and it provides notification of server failures. Here’s how ZooKeeper operates:

1. The active HMaster sends heartbeat signals to ZooKeeper to indicate that it is active.


2. Region Servers send their status to ZooKeeper, indicating that they are ready for read and write operations.

3. An inactive HMaster acts as a backup; if the active HMaster fails, it comes to the rescue.


Now let’s see how these components work together. The active HMaster and the Region Servers connect to ZooKeeper with a session.


ZooKeeper maintains ephemeral nodes for these active sessions, kept alive via heartbeats, to indicate which Region Servers are up and running.

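From a client's perspective, ZooKeeper is also the entry point into the cluster. The following is a minimal, hedged sketch showing that an HBase client is configured with the ZooKeeper quorum rather than with the HMaster's address; the host names are placeholders, and in practice these settings normally come from hbase-site.xml.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZooKeeperQuorumSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Usually picked up from hbase-site.xml; set here only to make the
        // ZooKeeper dependency explicit. Host names are placeholders.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // The connection bootstraps itself through ZooKeeper, which knows the
        // active HMaster and the live Region Servers from their ephemeral znodes.
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected via the ZooKeeper quorum");
        }
    }
}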

Now let’s move on to the next topic and see how HBase handles read and write operations.

HBase Read or Write

There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. When a client reads or writes data to HBase, the following takes place:

The client gets the Region Server that hosts the META table from ZooKeeper. It then queries that META Region Server to find the Region Server responsible for the row key it wants to access, and it caches this information along with the location of the META table.

It then fetches the row from the corresponding Region Server.
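These lookups normally happen invisibly inside the client library, but they can be made explicit with a RegionLocator. The sketch below is hedged (HBase 2.x Java client, reusing the 'newtbl' table and row 'r1' from the demo): it resolves which region, and therefore which Region Server, holds a given row key.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionLookupSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("newtbl"))) {

            // The client resolves the region for row "r1" through the META table;
            // the result is cached, so later reads of nearby keys skip this lookup.
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("r1"));
            System.out.println("Region: " + location.getRegion().getRegionNameAsString());
            System.out.println("Served by: " + location.getServerName());
        }
    }
}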

HBase Meta Table

In HBase, the META table is used to find the Region for a given row key. It is a special HBase catalog table that maintains a list of all the Regions in the system and the Region Servers that host them.

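Because META is itself an HBase table, it can be read like any other. Here is a hedged sketch (assuming an HBase 2.x Java client) that scans hbase:meta and prints its row keys; each row maps a region to the Region Server currently hosting it.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaTableSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table meta = conn.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan())) {

            // Each META row is keyed by table name, region start key, and region id,
            // and points at the Region Server that currently serves that region.
            for (Result region : scanner) {
                System.out.println(Bytes.toString(region.getRow()));
            }
        }
    }
}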

HBase Write Mechanism

The mechanism works in four steps, and here’s how:

1. The Write-Ahead Log (WAL) is a file that stores new data which has not yet been put on permanent storage. It is used for recovery in case of failure. When a client issues a put request, the data is first written to the WAL.


2. MemStore is the write cache that stores new data that has not yet been written to disk. There is one MemStore per column family per region. Once data is written to the WAL, it is then copied to the MemStore.


3. Once the data is placed in MemStore, the client then receives the acknowledgment.


4. HFiles store the rows of data as sorted KeyValues on disk. When the MemStore reaches its threshold, it flushes (commits) the data into an HFile.

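The WAL-then-MemStore sequence above is what a single put goes through on the Region Server. As a hedged sketch (assuming an HBase 2.x Java client and reusing the demo's 'newtbl' table; row 'r3' is made up), the write below explicitly asks for the edit to be synced to the WAL before the acknowledgment in step 3 is returned.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("newtbl"))) {

            Put put = new Put(Bytes.toBytes("r3"));
            put.addColumn(Bytes.toBytes("knowledge"), Bytes.toBytes("sports"),
                          Bytes.toBytes("football"));
            // Sync this edit to the WAL before acknowledging, so it survives a crash
            // even though it lives only in the MemStore until the next flush to an HFile.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);  // the ack arrives after the WAL append and MemStore update
        }
    }
}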

Now that we have covered the theory behind HBase, let’s see how it works through a short demo.

Demo

Before starting the demo, you can navigate to hbase.apache.org to learn more about HBase and go through the HBase reference guide. In this demo, we will work in Oracle VirtualBox with the Cloudera QuickStart VM installed.

You can start by selecting the HBase Master from the Hue interface.

Once you click on the Master, you will get an overview of the region servers, tables, tasks, the ZooKeeper version, and various other software attributes. Then open a terminal window to start the demo. You can zoom in for better visibility while typing inside the terminal window. First, open the HBase shell by typing:

hbase shell //Opens the HBase shell


After a couple of seconds, you’ll be inside the HBase shell where you can type the HBase commands. You can start off by typing the following commands:

list //Lists all the tables present in HBase

create 'newtbl', 'knowledge' //Creates a new table with the column family 'knowledge'

describe 'newtbl' //Checks that the table was created

status 'summary' //Checks the status of HBase

Now that we have created a new table, let’s put some data into it. 

put 'newtbl', 'r1', 'knowledge:sports', 'cricket'

put 'newtbl', 'r1', 'knowledge:science', 'chemistry'

put 'newtbl', 'r1', 'knowledge:science', 'physics'

put 'newtbl', 'r2', 'knowledge:economics', 'macro economics'

put 'newtbl', 'r2', 'knowledge:music', 'pop music'

Let’s now list the contents of the table by typing:

scan 'newtbl'

Note that "chemistry" does not appear in the scan output: HBase returns only the latest version of the cell, which in this case is "physics". Now type the following commands:

is_enabled 'newtbl' //Checks if the table is enabled

disable 'newtbl' //Disables the table

scan 'newtbl' //Lists the contents of the table. Note that this will throw an error because the table is disabled.

Now, before we enable the table again, let's alter it.

alter 'newtbl', 'test_info' //Adds the 'test_info' column family to the table

enable 'newtbl' //Enables the table

describe 'newtbl' //Checks the column families after the alteration

The describe command now lists both the 'knowledge' and 'test_info' column families.

Next, extract the values for a particular row and see how to add new information to a row using the following commands:

get 'newtbl', 'r1' //Extracts the values for r1 in the table

put 'newtbl', 'r1', 'knowledge:economics', 'market economics' //Adds new information to r1 for economics. Note that this will update the table but will not override the information

get 'newtbl', 'r1' //Displays the results for r1

The get command now shows the new 'knowledge:economics' value alongside the existing columns for r1.

You can go back to the Cloudera HBase Master status page and see that the number of user tables is now one. You can click on Details to view the data we fed in. This brings us to the end of this quick demo on HBase.

Conclusion

We hope this tutorial on HBase has helped you gain a better understanding of how HBase works. You learned what HBase is, saw an HBase use case, and explored various applications of HBase. You also saw the differences between HBase and RDBMS, learned about HBase storage and its architectural components, and walked through how HBase works in a short demo.

If you want to learn more about big data and Hadoop, enroll in our Professional Certificate Program In Data Engineering today!

Check out this video https://m.youtube.com/watch?v=V1fXSCASVDc to learn more about HBase. 
