Introduction to NoSQL databases Tutorial

1.1 NoSQL Database Introduction

Hello and welcome to Lesson 1 of the MongoDB Developer course offered by Simplilearn. This lesson provides an introduction to NoSQL (pronounced "no sequel") databases. Let us explore the objectives of this lesson in the next screen.

1.2 Objectives

After completing this lesson, you will be able to:
• Explain what NoSQL databases are
• Explain the purpose of NoSQL databases
• List the benefits of NoSQL databases over traditional RDBMS databases
• Identify various types of NoSQL databases
• List the differences between NoSQL and RDBMS
• Explain MongoDB in relation to the CAP theorem
Let us begin with understanding what NoSQL is in the next screen.

1.3 What is NoSQL?

Traditionally, the software industry has used relational databases to store and manage data persistently. NoSQL, often read as "Not only SQL", is a newer class of databases that has emerged as an alternative to relational databases. Carlo Strozzi (pronounced Stro-jee) introduced the term NoSQL to name his file-based database in 1998. The term later resurfaced as a hashtag chosen for a tech meetup to discuss the new databases. Today, NoSQL refers to all databases and data stores that are not based on Relational Database Management System (RDBMS) principles. It is usually associated with large data sets that are accessed and manipulated on a Web scale. NoSQL does not represent a single product or technology. It represents a group of products and various related data concepts for storage and management. We will continue our discussion on NoSQL in the next screen.

1.4 What is NoSQL? (contd.)

NoSQL does not have a prescriptive definition. The common characteristics of a NoSQL database are as follows:
• It does not use the relational model.
• It runs well on clusters.
• It is mostly open source.
• It is built for the new generation of Web applications.
• It is schema-less.
We will discuss the purpose of using NoSQL in the next screen.

1.5 Why NoSQL?

With the explosion of social media, user-driven content has grown rapidly and has increased the volume and variety of data that is produced, managed, analyzed, and archived. In addition, new sources of data, such as sensors, Global Positioning Systems (GPS), automated trackers, and other monitoring systems, generate huge volumes of data on a regular basis. These large data sets, also called big data, have introduced new challenges and opportunities for data storage, management, analysis, and archival. In addition, data is becoming increasingly semi-structured and sparse. As a result, the suitability of RDBMS databases, which require upfront schema definition and relational references, is being re-examined. To resolve the problems related to large-volume and semi-structured data, a new class of database products has emerged. This class consists of column-based data stores, key-value databases, and document databases. Together, these are called NoSQL. NoSQL covers diverse products, each with its own set of features and value propositions. In the next screen, we will identify the differences between NoSQL and RDBMS.

1.6 Difference Between RDBMS and NoSQL Databases

NoSQL differs from RDBMS in terms of the following features.

Data storage: In an RDBMS, data is stored in a relational model in tabular format with numerous rows and columns. Each row holds information about one item, and the columns hold the values of that item's attributes. For example, a row may describe a single product, and its columns may contain specific information, such as 'Model', 'Date of Manufacture', 'Color', and so on. NoSQL comprises a host of different databases with different data storage models.

Schemas and Flexibility: Each record in an RDBMS follows a fixed schema. The columns are defined and locked before data entry. In addition, each row contains data for each column. Although this format can be modified, doing so typically requires altering the table and may involve taking the database offline. On the other hand, schemas in NoSQL are dynamic. You can add columns at any time. Unlike in an RDBMS, each row need not contain data for each column.

Scalability: An RDBMS typically scales vertically. To handle more data, a bigger server is required, which increases the cost. Although you can scale an RDBMS across multiple servers, it is a challenging and time-consuming process. Scaling is horizontal in NoSQL, which means that you scale across multiple servers. These servers can be cheap commodity hardware or cloud instances, which makes scaling cost-effective compared to vertical scaling. Many NoSQL technologies automatically distribute data across different servers.

Atomicity, Consistency, Isolation, Durability or ACID Compliance: Relational databases are mostly ACID (pronounced as a single word) compliant. However, most NoSQL databases compromise on ACID compliance for performance and scalability.

In the next screen, we will discuss the benefits of NoSQL.
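The automatic distribution of data across servers mentioned under Scalability can be illustrated with a minimal Python sketch. It assumes a simple hash-and-modulo placement rule and hypothetical server names; actual NoSQL products use their own, more elaborate placement strategies, so treat this as an illustration of the idea rather than any product's implementation.

```python
import hashlib

# Hypothetical server names, used for illustration only.
SERVERS = ["server-a", "server-b", "server-c"]

# Each server holds only its own share of the records.
storage = {name: {} for name in SERVERS}


def pick_server(key: str) -> str:
    """Map a key to one server using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]


def put(key: str, value) -> None:
    """Store the value on whichever server the key hashes to."""
    storage[pick_server(key)][key] = value


def get(key: str):
    """Look the key up on the one server that can hold it."""
    return storage[pick_server(key)].get(key)


for i in range(9):
    put(f"user:{i}", {"id": i})

# Show how many records each server ended up with.
print({name: len(records) for name, records in storage.items()})
print(get("user:4"))
```

Adding a server to a plain modulo scheme like this one would move most keys to a different server, which is one reason real systems tend to prefer schemes such as consistent hashing.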

1.7 Benefits of NoSQL

The desired technical characteristics of an enterprise-class NoSQL solution are as follows.

Primary and Analytic Data Source Capability: The first criterion of an enterprise-class NoSQL solution is that it must serve as a primary or active data source that receives data from different business applications. It must also act as a secondary data source or analytic database that supports business intelligence applications. From a business perspective, the NoSQL database must be capable of quickly integrating all types of data, whether structured, semi-structured, or unstructured. In addition, it must be able to execute high-performance queries. Once all the required data is collected in the database, data administrators may want to analyze it in real time and in map/reduce form. An enterprise-class NoSQL database can handle such requests in the same database, without loading the data into a separate analytic database.

Big Data Capability: NoSQL databases are not restricted to working with big data. However, an enterprise-class NoSQL solution can scale to manage large volumes of data, from terabytes to petabytes. In addition to storing large volumes of data, it delivers high performance for data velocity, variety, and complexity.

Continuous Availability or No Single Point of Failure: To be considered enterprise-class, a NoSQL database must offer continuous availability, with no single point of failure. Moreover, rather than relying on a feature provided outside the software, the NoSQL solution must deliver continuous availability inherently. The NoSQL database must include the following key features:
• All nodes in a cluster must be able to serve read requests even if some machines are down.
• It must be capable of easily replicating and segregating data between different physical racks in a data center. This helps avoid outages caused by hardware failures.
• It must be able to support data distribution designs that span multiple data centers, on premises or in the cloud.

Multi-Data Center Capability: Typically, business enterprises own highly distributed databases that are spread across multiple data centers and geographic locales. Data replication is a feature offered by all legacy RDBMS. However, they rarely offer a simple model of data distribution between data centers without performance issues. A simple model means the ability to handle multiple data centers without having to be concerned about where read and write operations occur. A good NoSQL enterprise solution must support multi-data-center deployment and must provide configurable options to maintain a balance between performance and consistency.

Easy Replication for Distributed Location-Independent Capabilities: To prevent data loss from affecting an application, a good NoSQL solution provides strong replication abilities. These include a read-anywhere and write-anywhere capability with full location-independence support. This means you can write data to any node in a cluster, have it replicated on other nodes, and make it available to all users irrespective of their location. In addition, the write capability on any node must ensure that the data is safe in the event of a power failure or any other incident.

No Need for Separate Caching Layer: A good NoSQL solution is capable of using multiple nodes and distributing data among all the participating nodes. Thus, it does not require a separate caching layer to store data. The memory caches of all participating nodes can store data for quick input and output (I/O) access. This eliminates the problem of synchronizing cached data with the persistent database. Thus, it supports simple scalability with fewer management issues.

We will discuss some more benefits of NoSQL in the next screen.

1.8 Benefits of NoSQL (contd.)

Following are a few more benefits of NoSQL.

Cloud-Ready: As the adoption of cloud infrastructure increases, an enterprise-class NoSQL solution must be cloud-ready. A NoSQL database cluster must be able to function in a cloud setting, such as Amazon EC2, and must be able to expand and contract a cluster when necessary. It must also support a hybrid solution where part of the database is hosted within the enterprise premises and another part is hosted in a cloud setting.

High Performance with Linear Scalability: An enterprise-class NoSQL database can enhance performance by adding nodes to a cluster. Typically, the performance of database systems may go down when additional nodes are added to a cluster. However, a good NoSQL solution increases performance for both read and write operations when additional nodes are added. These performance gains are linear in nature.

Flexible Schema Support: An enterprise-class NoSQL database offers a flexible or dynamic schema design to manage all types of data, whether structured, semi-structured, or unstructured. Therefore, the need to have different vendors to support the different data types does not arise. NoSQL databases may support various schema formats, such as columnar/Bigtable and document. Therefore, choosing an appropriate database based on application requirements is a key design decision. The flexible or dynamic schema support ensures that you can make schema changes to a structure without taking the structure offline. This support is critical considering the near-zero downtime and round-the-clock availability expected of business applications.

Support for Key Developer Languages and Platforms: Ideally, an enterprise-class NoSQL solution must support all major operating systems. In addition, it must run on commodity hardware that does not require any tweaks or proprietary add-ons. The NoSQL database must provide client interfaces and drivers for all common developer languages. It must offer Structured Query Language (SQL) or a similar language that helps store and access data in a NoSQL database.

Easy to Implement, Maintain, and Grow: A NoSQL database must be simple but robust. In other words, it must be easy to implement and use and must offer sturdy functionality to handle various enterprise applications. In addition, the NoSQL vendor must supply good management tools that help data professionals perform various administrative tasks, such as adding capacity to a cluster, running utility tasks, and so on. The NoSQL database must allow easy growth without requiring any change to the front end of the business application.

Thriving Open Source Community: For an open source NoSQL database, a vibrant community that makes regular contributions is essential to enhance the core software. Moreover, open source communities generally provide excellent quality assurance (QA) testing. This sometimes reduces the need for software companies to hire, train, and retain a QA team. To encourage a thriving open source community, take part in mailing lists and technical forums, initiate technical discussions, and participate in conferences.

In the next screen, we will discuss the various types of NoSQL databases.

1.9 Types of NoSQL

There are four basic types of NoSQL databases.
1. Key-value database: It has a big hash table of keys and values. Riak (pronounced REE-awk), Tokyo Cabinet, Redis, Memcached (pronounced mem-cached), and Scalaris are examples of key-value stores.
2. Document-based database: It stores documents made up of tagged elements. Examples include MongoDB, CouchDB, OrientDB, and RavenDB.
3. Column-based database: Each storage block contains data from only one column. Examples are Bigtable, Cassandra, HBase, and Hypertable.
4. Graph-based database: It is a network database that uses nodes to represent and store data. Examples are Neo4j, InfoGrid, InfiniteGraph, and FlockDB.
The availability of choices in NoSQL databases has its own advantages and disadvantages. The advantage is that it allows you to choose a design according to your system requirements. However, because you have to make a choice based on requirements, there is always a chance that the chosen database product is not used properly. We will learn about the key-value database in the next screen.

1.10 Key-Value Database

From an Application Programming Interface (API) perspective, a key-value database is the simplest NoSQL database. This database stores every single item as a key with a value. You can get the value for a key, add a value for a key, or delete a key. The value is a blob that the database stores without knowing its content. The responsibility lies with the application to understand what is stored. Typically, key-value databases use primary-key access. Therefore, they generally offer enhanced performance and scalability. Not all key-value databases have the same features. For example, data is not persistent in Memcached, while it is in Riak. These differences matter when implementing certain solutions. For example, suppose you need to implement caching of user preferences. If you store them in Memcached, you may lose all the data when the node goes down and may need to fetch it again from the source database. If you store the same data in Riak, you may not lose data but must consider how to update stale data. It is important to select a key-value database based on your requirements. We will continue with the discussion on key-value databases in the next screen.

1.11 Key-Value Database (contd.)

A key-value store does not have a defined schema. It relies on client-defined semantics for understanding what the values are. A key-value store is simple to build and easy to scale. It also tends to have great performance because the access pattern can be optimized to suit your requirements. The characteristics of a key-value store include the following. Queries: You can query only by key. Even range queries on the key are usually not possible. Schema: Key-value databases have the following schema: the key is a string and the value is a blob. The client determines how to parse the data. Usage: Key-value databases are handy when you need to access data using a key. Key-value databases also suffer from major weaknesses. First, they do not provide traditional database capabilities, such as consistency when multiple transactions are executed simultaneously. These capabilities must be provided by the application itself. Second, as the volume of data increases, maintaining unique keys becomes difficult. To address this issue, you need to use character strings that will remain unique across large sets of data. In the next screen, we will discuss the document database.
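The get, add, and delete operations described above can be sketched in a few lines of Python. This is a toy in-memory store written for illustration; the class name, the key format, and the choice of JSON for encoding the value are assumptions, not the API of Riak, Memcached, or any other product.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """A toy in-memory key-value store: string keys, opaque byte values."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # The store never inspects the value; it is just a blob.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
prefs = {"theme": "dark", "language": "en"}

# The client decides how to encode the value (JSON here, by assumption).
store.put("user:42:prefs", json.dumps(prefs).encode("utf-8"))

# The client is also responsible for decoding what it reads back.
raw = store.get("user:42:prefs")
decoded = json.loads(raw.decode("utf-8"))

store.delete("user:42:prefs")
```

Because the store cannot look inside the blob, a question such as "which users prefer the dark theme" cannot be answered without reading and decoding every value in the application.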

1.12 Document Database

This NoSQL database type is based on the concept of documents. It stores and retrieves documents in formats such as XML, JavaScript Object Notation or JSON (pronounced JAY-sahn), Binary JSON or BSON (pronounced bee-sahn), and so on. These documents are self-describing, hierarchical tree data structures that consist of maps, collections, and scalar values. The stored documents can be similar to each other, but need not be exactly the same. Conceptually, a document database stores a document in the value part of a key-value store. You can therefore think of document databases as key-value stores in which the database can examine the values. In the next screen, we will focus on examples of document databases.
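The idea that a document database can examine the stored values can be sketched in Python with plain dictionaries. The collection, the field names, and the find helper below are hypothetical and exist only for this illustration; they are not the query API of MongoDB or CouchDB.

```python
# Two documents in the same hypothetical "customers" collection.
# They are similar, but they do not share exactly the same fields.
customers = [
    {"_id": 1, "name": "Asha", "address": {"city": "Pune", "zip": "411001"}},
    {"_id": 2, "name": "Ben", "email": "ben@example.com"},
]

_MISSING = object()  # marks a path that does not exist in a document


def lookup(doc, path):
    """Follow a dotted path such as 'address.city' into a nested document."""
    value = doc
    for part in path.split("."):
        if isinstance(value, dict) and part in value:
            value = value[part]
        else:
            return _MISSING
    return value


def find(collection, path, expected):
    """Return the documents whose value at the given path equals expected."""
    return [doc for doc in collection if lookup(doc, path) == expected]


in_pune = find(customers, "address.city", "Pune")
by_email = find(customers, "email", "ben@example.com")
```

Unlike a pure key-value store, the query here looks inside each document, and a document that lacks the queried field is simply skipped.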

1.13 Document Database Example

MongoDB is an example of a document database. It provides a rich query language and familiar constructs, such as databases and indexes, which ease the transition from relational databases. MongoDB is capable of scaling out while retaining many of the most useful features of relational databases, such as secondary indexes, range queries, and sorting. It also has other useful features, such as built-in support for MapReduce-style aggregation and geospatial indexes. Apache CouchDB is a database that uses JSON for documents, JavaScript for MapReduce indexes, and regular HTTP for its API. In the next screen, we will learn about the column-based database.

1.14 Column-Based Database

Column-based databases store data in column families as rows. These rows contain multiple columns associated with a row key. Column families are groups of related data that are accessed together. For example, you may access a customer's profile information at the same time, but not their order history. Each column family is like a container of rows in an RDBMS table, where the key identifies the row. Each row consists of multiple columns. However, the various rows need not have the same columns. Moreover, you can add a column to any row at any time without adding it to other rows. In the next screen, we will continue our discussion on the column-based database.
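The row-key and column-family structure described above can be pictured as nested maps. The following Python sketch assumes a hypothetical customer profile column family and a put_column helper; it only mirrors the data model and says nothing about how a real column-family store lays data out on disk.

```python
# A column family modeled as: row key -> {column name: value}.
customer_profile = {}  # hypothetical column family


def put_column(family, row_key, column, value):
    """Set one column on one row, creating the row if it does not exist."""
    family.setdefault(row_key, {})[column] = value


put_column(customer_profile, "cust-1", "name", "Asha")
put_column(customer_profile, "cust-1", "city", "Pune")
put_column(customer_profile, "cust-2", "name", "Ben")

# Add a column to one row only; other rows are not affected.
put_column(customer_profile, "cust-2", "loyalty_tier", "gold")

for row_key, columns in customer_profile.items():
    print(row_key, sorted(columns))
```

Order history would live in a separate column family keyed by the same row keys, so that reading a profile does not pull in order data.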

1.15 Column-Based Database (contd.)

The goal of a column-based database is to read and write data to and from hard disk storage efficiently so that queries return quickly. In this database, all the values of column one are physically stored together, followed by all the values of column two, and so on. The data is stored in record order, so that the 100th entry for column one and the 100th entry for column two belong to the same input record. This allows you to access individual data elements, such as customer name, as a group in columns, rather than individually row by row. Because the values of a column are stored together, they also compress well, and columnar operations such as MIN, MAX, SUM, COUNT, and AVG can be performed very rapidly. A column-based database management system (DBMS) is self-indexing and can therefore use less disk space than an RDBMS containing the same data. We will continue our discussion on the column-based database in the next screen.
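The difference between the row layout and the column layout can be shown with a small Python sketch. The sample records and field names are made up for illustration, and in-memory lists stand in for the on-disk column storage that a real column-based DBMS would use.

```python
# Row-oriented view: one record per input row.
rows = [
    {"customer": "Asha", "amount": 120},
    {"customer": "Ben", "amount": 80},
    {"customer": "Chen", "amount": 200},
]

# Column-oriented view: one list per column, kept in record order,
# so position i in every list belongs to the same input record.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregates read a single column and never touch the others.
total = sum(columns["amount"])
largest = max(columns["amount"])

# Rebuilding a full record means picking the same position in each column.
third_record = {name: values[2] for name, values in columns.items()}

print(total, largest, third_record)
```

The aggregate only scans the amount list, which is the intuition behind fast columnar MIN, MAX, SUM, COUNT, and AVG operations.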

1.16 Column-Based Database (contd.)

The diagram provided on the screen shows that data is stored in column format rather than row format. The columns of the same column family are stored together in one file on the hard disk. Therefore, this data can be retrieved quickly and efficiently. In the next screen, we will focus on an example of a column-based database.

1.17 Column-Based Database Example

Cassandra is one of the popular column-based databases. Cassandra is fast and easily scalable, with write operations spread across the cluster. The cluster does not have a master node, so any node can handle read and write operations. In the next screen, we will discuss the graph database.

1.18 Graph Database

In a connected world, there are no isolated pieces of information, only rich, connected domains. Therefore, only databases that treat relationships as a core aspect of their data model can efficiently store, process, and query connections. In comparison to other general-purpose databases, a graph database makes relationships readily available for any join-like operation. Because those connections are already persisted, you can traverse millions of connections per second per core. A graph database lets you store data and its relationships with other data in the form of nodes and edges. Each relationship can have a set of properties. Edges have a direction, which carries its own significance, and you can still explore a relationship in both directions. All the nodes in the graph are organized by relationships, which helps you explore interesting and hidden patterns between the nodes. We will continue our discussion on the graph database in the next screen.
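Nodes, directed edges, and edge properties can be sketched in Python with a plain list of edges. The node names, relationship types, and helper functions below are hypothetical, and scanning a list on every lookup is far less efficient than the storage a real graph database uses; the point is only to show the model and a multi-hop traversal.

```python
from collections import deque

# Each edge: (source node, relationship type, target node, properties).
edges = [
    ("alice", "FOLLOWS", "bob", {"since": 2019}),
    ("bob", "FOLLOWS", "carol", {"since": 2021}),
    ("carol", "FOLLOWS", "alice", {"since": 2020}),
    ("alice", "LIKES", "post-1", {}),
]


def outgoing(node, rel):
    """Follow edges in their own direction."""
    return [t for s, r, t, _ in edges if s == node and r == rel]


def incoming(node, rel):
    """Follow the same edges against their direction."""
    return [s for s, r, t, _ in edges if t == node and r == rel]


def reachable(start, rel, max_depth):
    """Breadth-first traversal along one relationship type, up to max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for nxt in outgoing(node, rel):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, depth + 1))
    return found


one_hop = reachable("alice", "FOLLOWS", 1)
two_hops = reachable("alice", "FOLLOWS", 2)
followers_of_alice = incoming("alice", "FOLLOWS")
```

The two-hop query is the kind of question that would need repeated self-joins in a relational schema, while here it is a short walk over stored connections.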

1.19 Graph Database (contd.)

The various available graph databases include Neo4j (pronounced Neo-four-J), InfiniteGraph, OrientDB, and FlockDB. Neo4j is one of the most popular graph databases and is ACID compliant. It is the product of the company Neo Technology. It is Java based but has bindings for other languages, including Ruby and Python. FlockDB was created by Twitter for relationship-related analytics. It supports only single-depth relationships or adjacency lists, which means that you cannot traverse more than one level deep for relationships. In a graph database, the labeled property graph model is used for modeling the data. It is similar to the entity-relationship (ER) model used in RDBMS. The property graph contains connected entities, such as nodes, which can hold any number of attributes or key-value pairs. In the next screen, we will discuss the Consistency, Availability, and Partition tolerance (CAP) theorem.

1.20 CAP Theorem

In a distributed system, the following three properties are important.
• Consistency: Each client must have a consistent, that is, the same, view of the data.
• Availability: The data must be available to all clients for read and write operations.
• Partition tolerance: The system must work well across distributed networks.
We will continue with the CAP theorem in the next screen.

1.21 CAP Theorem (contd.)

The CAP theorem was proposed by Eric Brewer. According to this theorem, a distributed system can guarantee only two of the three properties (consistency, availability, and partition tolerance) simultaneously. This means that to tolerate a network partition, you may have to trade off either the availability or the consistency of data. Similarly, you may have to trade durability against latency if you want to survive failures with replicated data. Many NoSQL databases provide options that let a developer tune the database as per requirements. For this, understanding the following requirements is important:
• How the data is consumed by the system
• Whether the workload is read heavy or write heavy
• Whether there is a need to query data with random parameters
• Whether the system is capable of handling inconsistent data
We will focus on consistency in the next screen.
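The trade-off during a network partition can be illustrated with a deliberately simplified Python sketch. The TinyCluster class, its two replicas, and the "CP" and "AP" mode labels are assumptions made for this illustration; real databases expose this choice through far more nuanced settings.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas; mode is 'CP' or 'AP' (a deliberate simplification)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, replica, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Favor consistency: refuse the write rather than diverge.
                raise RuntimeError("unavailable during partition")
            # Favor availability: accept locally; replicas may now disagree.
            replica.data[key] = value
            return
        # Healthy network: apply the write on both replicas.
        self.a.data[key] = value
        self.b.data[key] = value


ap = TinyCluster("AP")
ap.write(ap.a, "x", 1)
ap.partitioned = True
ap.write(ap.a, "x", 2)  # accepted, but replica b still holds the old value

cp = TinyCluster("CP")
cp.write(cp.a, "x", 1)
cp.partitioned = True
try:
    cp.write(cp.a, "x", 2)
    cp_write_accepted = True
except RuntimeError:
    cp_write_accepted = False  # rejected, so both replicas still agree
```

In the AP cluster the write succeeds but the two replicas now return different values, while in the CP cluster the replicas stay identical at the cost of rejecting the request.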

1.22 Consistency

Consistency in the CAP theorem refers to atomicity and isolation. It means consistent read and write operations for the same sets of data, so that concurrent operations see the same valid and consistent data state, without any stale data. This differs from consistency in ACID, which means that data that does not satisfy predefined constraints is not persisted. In a single-machine database, consistency is achieved using ACID semantics. However, in NoSQL databases, which are scaled out and distributed, providing consistency gets complicated. We will discuss availability in the next screen.

1.23 Availability

According to the CAP theorem, availability means the database system must be available to operate when required. This means that a system that is busy, uncommunicative, unresponsive, or inaccessible is not available. If a system is not available to serve a request at a time it is needed, it is unavailable. In the next screen, we will discuss partition tolerance.

1.24 Partition Tolerance

Proven methods, such as parallel processing and scaling out, are adopted as the model for scalability and performance instead of scaling up and building huge supercomputers. Instead of building giant computers, you can add several commodity hardware units to a cluster and make them work together. This is a cost-effective and resource-effective solution. Cloud computing is a testimony to this approach. Because NoSQL databases are distributed systems by design, partitioning and occasional faults in a cluster are unavoidable. Partition tolerance, or fault tolerance, is the third element of the CAP theorem. It measures the ability of a system to continue its service when some of its nodes become unavailable. In the next screen, we will learn about MongoDB in terms of the CAP theorem.

1.25 MongoDB as Per CAP

By default, MongoDB offers strong consistency. This means that once a write operation succeeds, subsequent reads return the written data, and a read does not see the new value until the write has succeeded. MongoDB is a single-master system and, by default, all reads go to the primary node. Optionally, if you enable reading from secondary nodes, MongoDB becomes eventually consistent and may return out-of-date results. In addition, MongoDB handles network partitions very well by keeping the same data on multiple nodes, known as a replica set. Therefore, MongoDB is a consistent and partition-tolerant database that compromises on the availability aspect.
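The difference between reading from the primary and reading from a secondary can be sketched with a toy Python model. The ToyReplicaSet class and its explicit replicate step are assumptions made for illustration only; this is not MongoDB's replication mechanism or driver API, and real replication happens continuously in the background.

```python
class ToyReplicaSet:
    """One primary and one secondary with explicit, delayed replication."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self._pending = []  # writes not yet copied to the secondary

    def write(self, key, value):
        # All writes go to the primary first.
        self.primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Copy the outstanding writes to the secondary, in order.
        for key, value in self._pending:
            self.secondary[key] = value
        self._pending.clear()

    def read(self, key, from_secondary=False):
        source = self.secondary if from_secondary else self.primary
        return source.get(key)


rs = ToyReplicaSet()
rs.write("stock", 10)
rs.replicate()
rs.write("stock", 9)  # not replicated yet

fresh = rs.read("stock")                       # primary sees the new value
stale = rs.read("stock", from_secondary=True)  # secondary still has the old one

rs.replicate()
caught_up = rs.read("stock", from_secondary=True)
```

Reads from the primary always reflect the latest successful write, while reads from the secondary lag until replication catches up, which is what eventual consistency means in this context.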

1.26 Quiz

With this, we come to the end of this lesson. Following are a few questions to test your understanding of the concepts discussed here.

1.27 Summary

Here is a quick recap of what was covered in this lesson:
• NoSQL represents a class of products and a collection of diverse or related data concepts for storage and manipulation.
• NoSQL databases are used to efficiently manage large-volume and semi-structured data.
• The four basic NoSQL database types are key-value, document-based, column-based, and graph-based.
• According to the CAP theorem, a distributed computer system cannot provide all three properties (consistency, availability, and partition tolerance) together.

1.28 Conclusion

This concludes the lesson on Introduction to NoSQL databases. In the next lesson, we will discuss databases for the modern Web.
