Cassandra Data Model Tutorial

5.1 Cassandra Data Model

Hello and welcome to the fifth lesson of the Apache CassandraTM course offered by Simplilearn. This lesson will focus on the data model for Cassandra.

5.2 Course Map

The Apache Cassandra™ course by Simplilearn is divided into eight lessons, as listed: • Lesson 0—Course Overview • Lesson 1—Overview of Big Data and NoSQL Database • Lesson 2—Introduction to Cassandra • Lesson 3—Cassandra Architecture • Lesson 4—Cassandra Installation and Configuration • Lesson 5—Cassandra Data Model • Lesson 6—Cassandra Interfaces • Lesson 7—Cassandra Advanced Architecture and Cluster Management • Lesson 8—Hadoop Ecosystem around Cassandra This is the fifth lesson, ‘Cassandra Data Model.’

5.3 Objectives

After completing this lesson, you will be able to describe Cassandra data model and the components of Cassandra data model. You will also be able to explain the functions of DDL and DML statements and discuss the SELECT statement restrictions in Cassandra.

5.4 Cassandra Data Model

Cassandra is a NoSQL database, which is a key-value store. Some of the features of Cassandra data model are as follows: Data in Cassandra is stored as a set of rows that are organized into tables. Tables are also called column families. Each Row is identified by a primary key value. Data is partitioned by the primary key. You can get the entire data or some data based on the primary key.

5.5 Cassandra Data Model Components

Cassandra data model provides a mechanism for data storage. The components of Cassandra data model are keyspaces, tables, and columns. Cassandra data model consists of keyspaces at the highest level. Keyspaces are the containers of data, similar to the schema or database in a relational database. Typically, keyspaces contain many tables. Within the keyspaces, the tables are defined. Tables are also referred to as Column Families in the earlier versions of Cassandra. Tables contain a set of columns and a primary key, and they store data in a set of rows. Columns define the structure of data in a table. Each column has an associated type, such as integer, text, double, and Boolean. These Cassandra data model components will be discussed in detail in this lesson.

5.6 Keyspaces

As mentioned earlier, keyspaces in Cassandra are similar to the schema in a relational database. Some of the features of keyspaces are: Keyspace needs to be defined before creating tables, as there is no default keyspace. A keyspace can contain any number of tables and a table belongs only to one keyspace. This represents a one-to-many relationship. Replication is specified at the keyspace level. For example, replication of three implies that each data row in the keyspace will have three copies. Further, you need to specify the replication factor during the creation of keyspace. However, the replication factor can be modified later. The given sample query shows the command for creating a keyspace with the name as TestDB, replication as SimpleStrategy, and replication factor as three. It also shows the use command to start using the keyspace.

5.7 Tables

A table contains data in the horizontal and vertical formats, referred to as rows and columns respectively. Some of the features of tables are: Tables have multiple rows and columns. As mentioned earlier, a table is also called Column Family in the earlier versions of Cassandra. It is still referred to as column family in some of the error messages and documents of Cassandra. It is important to define a primary key for a table. The given query shows the statement for creating a table named employee with four columns empid, empfirstname, emplastname, and empsalary (Read as: employee id, employee first name, employee last name, and employee salary). Empid is of the integer type, empfirstname and emplastname are the text type, and empsalary is defined as a double. Note that empid is defined as the primary key for the table. In database terms, primary key refers to a column that has unique values for each row in a table. For example, in an organization, two employees may have the same first and the last names. However, each employee is assigned a unique employee ID. Therefore, if you create a table to hold employee data, the employee ID can be used as the primary key, as it is unique for all the rows of the employee table. On the other hand, employees’ first or last name cannot be used as a primary key, as it cannot uniquely identify a row in the table.

5.8 Columns

Column represents a single piece of data in Cassandra and has a type defined. Some of its features are: Column consists of various types, such as integer, big integer, text, float, double, and Boolean. Cassandra also provides collection types such as set, list, and map. Further, column values have an associated time stamp representing the time of update. This timestamp can be retrieved using the function writetime.

5.9 UUID and TimeUUID

There are two special column types for sequence numbers and timestamps called UUID (Read as: you you eye dee) and TimeUUID (Read as: time you you eye dee). Universal Unique Identity or UUID is similar to the sequence numbers in relational databases. It is a 128 bit integer, whereas TimeUUID contains a time stamp and guarantees no duplication. TimeUUID uses time in 100 nanosecond intervals. You can use the function now() (Read as: now open parentheses closed parentheses) to get the TimeUUID. The given examples for these columns are represented as hexadecimal digits with hyphens in between: Sample UUID is 01234567-0123-0123-0123-0123456789ab. Sample TimeUUID is d2177dd0-eaa2-11de-a572-001b779c76e3.

5.10 Counter

Counter is a special column which stores a number that is incremented. It can be used to count the occurrences of an event or a process. You can only update the value by adding or subtracting from the current value. Before the first update, a counter will have an initial value of zero. The update using the current value is allowed only for the counter columns in Cassandra. The given query shows a counter column being added called numEmployees (Read as: numemployees) to the table Department. It also shows the process of incrementing the column value based on the previous value for a row where deptID = 1000 (Read as: department id equals one thousand). Note that if deptID = 1000 is the first record, then a row will be inserted with counter value of 1. On subsequent updates, the previous value will be used to update to the next value.

5.11 Compound Keys

The primary key in Cassandra can be a compound key, which refers to a key consisting of multiple columns. For example, a multi-year stock market data has the primary key (tradeDate, ticker) (Read as: tradedate ticker), where tradeDate is the date of the stock trade and ticker is the stock symbol. When compound keys are specified, the first key is considered for partitioning of data and is called the partition key. The data in each partition is clustered and ordered by the remaining keys. Further, if you want to partition the data by multiple columns, group the data as first part of the primary key, as shown in the following two instances: 1. Primary key ((key1), key2, key3, key4) 2. Primary key ((key1, key2), key3, key4) In the first instance, the data is partitioned by key1 and clustered by key2, key3, key4 In the second instance, the data is partitioned by key1, key2 and clustered by key3, key4 Note that you can also specify the order for the cluster keys. In databases, clustering refers to the grouping of data together. Suppose the data is clustered by key3, key4, the rows with the key3 values are stored together on a physical storage, so that the queries such as key3 = 100 (Read as: key3 equals one hundred) can be fetched quickly.

5.12 Indexes

Indexes can be used to speed up queries. Index column can be used to search for data rows. Cassandra allows only one column in an index. You can create multiple indexes on the same table. The given query shows an index named first_index (Read as: first underscore index) being created on an employee table in the keyspace called testDB. The index is created on the column empFirstName. After creating the index, the given query that gets a row from the table with empFirstName = ‘Jack’ will use the index to search for data faster.

5.13 Collection Columns

Collection columns are used to represent a group of data in a single column. Cassandra provides three types of collection columns, such as set, list, and map. Set refers to an unordered group of values. Set values are enclosed within curly brackets, also called braces. For example, { ‘XYZ’, ‘ABC’,’PQR’ } List is an ordered group of values, like an array. List values are enclosed within square brackets. For example, [ ‘2011’, ‘2012’, ‘2013’ ] Finally, map is a set of key value pairs. Map values are also enclosed within braces. For example, { ‘key1’: ‘value1’, ‘key2’: ‘value2’ The given query shows a table being created with empid as primary key and roles as a set of type text You can add new roles to an employee by using the update command as shown: Update employee set roles = roles + {‘supervisor’} Here, the new role ‘supervisor’ is enclosed within the braces.

5.14 Collection Columns Set

The functions of collection column—Set are: You can use ALTER TABLE statement to add a collection to an existing table, using the given command: create table stocks (ticker text, year int, value double, primary key(ticker, year)); alter table stocks add supportNumber set; First, create table stocks with the columns ticker, year, and value; with primary key ticker and year. Next, use the alter table command to add supportNumber, which represents the set of support numbers for the stock. This is added as a set of text so it can hold the support number strings. Next, use the update command to initialize the data of a Set column using the given command: update stocks set supportNumber = {'1800800800'} where ticker = 'ABC' and year = 2011; Here, update the supportNumber column with value of ‘1800800800’ (Read as: one eight oh oh eight oh oh eight oh oh), enclosed in curly brackets to indicate a set. Finally, to add a new support number to an existing Set, use the given command: update stocks set supportNumber = supportNumber + {'1800800801'} where ticker = 'ABC' and year = 2011; Here, use the plus symbol to add value to an existing column. Note that the other collection types, such as lists and maps function in a similar manner.

5.15 DDL and DML Statements

Cassandra supports Cassandra Query Language or CQL, which includes Data Definition Language or DDL statements and Data Manipulation Language or DML statements. DDL statements are CREATE TABLE, ALTER TABLE, and DROP TABLE. DML statements are INSERT, UPDATE, SELECT, and DELETE. COPY is an optional DML statement of Cassandra.

5.16 DDL Statements

Cassandra includes DDL statements such as CREATE TABLE, ALTER TABLE, and DROP TABLE. The CREATE TABLE statement can be used to create a table, as this statement defines the schema of the table. You have to specify the primary key while creating the table. Once a table is created, ALTER TABLE statement can be used to modify the table definition. Further, the DROP TABLE statement can be used to remove an existing table. You need to be careful before you drop a table as you will lose any data in the table. The given query shows a table being created to hold the stock market data: create table stocks(ticker text, tradeDate timestamp , value double, primary key (ticker, tradeDate)) clustered by tradeDate desc; alter table stocks add status text; drop table stocks; The command is ‘create table stocks’ along with the list of columns and data types. Then, a composite primary key is specified with columns ticker and trade date. It also illustrates using the ‘clustered by’ option to cluster the data by trade date column in the descending order. Next, an ALTER TABLE command is used to add a new column status of the type ‘text’ to the table. Finally, the DROP TABLE command is used to remove the table.

5.17 DML Statements INSERT

Cassandra includes the standard INSERT values statement to insert data into a table. The list of values must match the columns specified. If the primary key data already exists, the data in the remaining columns will be updated. Otherwise, an entire row will be inserted. Further, it is necessary to specify the IF NOT EXISTS condition to insert row only if the primary key does not exist in the table. The given query shows a row being inserted, which contains values for empid, empFirstname, and emplastname (Read as: employee id, employee first name, and employee last name) into the table employee: Insert into employee (empid, empFirstName, empLastName) values ( 100 , ‘Jack’ , ‘Frank’) if not exists; Note that the IF NOT EXISTS condition is specified so that the insert will happen only if there is no row with the empid of 100 in the table.

5.18 DML Statements UPDATE

Cassandra includes UPDATE statement to insert or update data into a table. These statements update one or more columns of the table. In a table, if a row matching the ‘where’ clause is found, the values are updated. On the contrary, if the row is not found, the values are inserted into the table with the keys specified in the ‘where’ clause. Note that the primary key value is inserted only once and cannot be updated. The given query shows the update of the employee first name to John for the empid 100: Update employee set empFirstName = ‘John’ Where empid = 100; (Do not read the above command) If a row with the empid 100 exists in the table, the value for employee first name is updated to John. If there is no row with empid of 100, then a new row is inserted into the table with values of empid = 100 (Read as: employee id equals one hundred) and employee first name = John (Read as: employee first name equals john). Further, there is another query for incrementing the counter column numEmployees (Read as: num employees) by one: Update department set numEmployees = numEmployees + 1 Where departmentId = 1000; Note that updating a column using previous column value is allowed only for counter columns. Suppose there is a column salary of type double and you update it as salary = salary + 200.50 (Read as: salary equals salary plus two zero zero point five oh), this will fail with an error stating this is not a counter column.

5.19 DML Statements COPY

Cassandra includes COPY statement to load bulk data into a table. This is an optional DML statement of Cassandra. Note that the INSERT INTO command can be used to insert a few rows into a table, but not bulk data. The COPY statement is not part of CQL, however it is part of Cassandra Command Line Interface. Using this command, you can import the data from a data file. Further, you can import the data into particular columns instead of all the columns. The data file is assumed to be delimited by ‘comma’ by default. You can also specify a different delimiter. Some datafiles may contain a header line that can be ignored using HEADER=TRUE (Read as: header equals true) as an option. If the duplicate data exists in the table, it will not cause an error. It will be treated as an update and the last value in the table will be kept. The given query shows the data of empid, empFirstName, and empLastName being copied into the employee table from the file employeeData.csv (Read as: employee data dot c s v) in the local file system. This file has to be a comma separated file: Copy employee (empid, empFirstName, empLastName) FROM employeeData.csv ; The secondquery shows the data being imported into all the columns of the employee table from (Read as: employee data dot bar) local file: Copy employee FROM WITH HEADER = TRUE AND DELIMITER = ‘|’ ; The delimiter is specified as vertical bar character and header is set to TRUE so that the first line of the file is ignored.

5.20 DML Statements SELECT

Cassandra includes SELECT statement to extract data from the tables. You can select one or more columns or use the * (Read as: asterix) symbol to extract all the columns of a table. The general syntax is ‘select columns from tableName where’ conditions. Cassandra is restrictive on what you can specify in the ‘where’ clause. You can specify the SELECT conditions based only on the primary keys or index columns. The syntax ‘Order by’ is allowed only on clustered columns. The given query shows a few valid select statements in Cassandra: Select * from employee where empid = 1000; // Allowed as empid is the primary key First, the SELECT statement extracts all the columns from the employee table where empid (Read as: employee id) is 100. This is allowed as empid is the primary key for the table. The second query shows the selection of two columns, empid, empFirstName from the employee table where empFirstName is Jack: Select empid, empFirstName from employee where empFirstName = ‘Jack’; // allowed only if there is index on empFirstName This is allowed only if there is an index for employee table created on empFirstName column:

5.21 SELECT Statements Restrictions

The SELECT statement in Cassandra has the following restrictions: The range condition is allowed only for the last part of the primary key. For example, if a primary key is defined as (key1, key2, key3, key4), then key1 = 100 and key2 = 20 and key3 = 50, and key4 in (3, 4) is allowed; whereas Key1 = 100 and key2 in (20,30) (Read as: twenty thirty) and key3 =50 and key4 = 3 is not allowed, as only key4 is allowed to have a range in this query. You have to specify the values for all leading columns of the primary key. For example, if primary key is defined as (key1, key2, key3, key4), then Key1 = 100 is allowed; whereas key1 = 100 and key3 = 40 is not allowed.

5.22 Valid and Invalid SELECT Statements Example

The following query shows four SELECT statements in Cassandra, out of which two are valid and the rest are invalid. The query shows the testTab table that has its primary key defined as (key1, key2, key3, key4): 1. Select * from testTab where key1 = 100 ; //allowed The first statement extracts all the columns where key1 is 100. This is allowed as key1 is the first column of the primary key. 2. Select * from testTab where key1 = 100 and key2 = 200; // allowed The second statement extracts all the columns where key1 is 100 and key2 is 200. This is also allowed as key1 and key2 are the leading columns of the primary key. 3. Select * from testTab where key1 = 100 and key3 = 50; // not allowed The third statement selects all the columns where key1 is 100 and key3 is 50. This is not allowed as you cannot specify value for key3 without specifying the value for the previous key, key2. 4. Select * from testTab where key1 = 100 and key2 = 20 and key4 = 3; // not allowed The fourth statement selects all the columns where key1 is 100, key2 is 20, and key4 is 3. This is also not allowed as the value for key4 is specified without specifying a value for key3.

5.23 DML Statements DELETE

Cassandra allows removal of rows from a table using the DELETE statement. This statement removes all the matching rows from the table. If the columns are specified in the DELETE statement, only the data from the columns will be removed, however the row will not be deleted. The ‘where’ clause of the DELETE statement is similar to that of the SELECT statement. You can also specify additional conditions with the IF clause. The given query shows three valid DELETE statements in Cassandra: 1. Delete from employee where employeeId = 1000; The first statement removes the row where employee ID equals 1000. 2. Delete employeeFirstName from employee where employeeId = 2000; The second statement removes the value in the employeeFirstname column, where the employee ID equals 2000. 3. Delete from employee where employeeId = 2000 (IF salary > 100000); (Do not read this command) The third statement deletes the row with employee ID whose value is 2000 along with the IF condition, where the salary is more than 100,000.

5.30 Quiz

A few questions will be presented in the following screens. Select the correct option and click submit to see the feedback.

5.31 Summary

Let us summarize the topics covered in this lesson. Cassandra data model contains keyspaces, tables, and columns. A keyspace is the container of all data in Cassandra. Replication is specified at the keyspace level. Tables and columns contain the key value data in Cassandra. Every table should have a primary key, which can be a composite primary key. Data is partitioned by the first part of the primary key and clustered by the rest. Cassandra provides CREATE TABLE, ALTER TABLE, and DROP TABLE statements for data definition. Cassandra provides INSERT, UPDATE, SELECT, and DELETE statements for data manipulation. The ‘where’ clauses in Cassandra have restrictions.

5.32 Conclusion

This concludes the lesson on Cassandra data model. The next lesson will focus on Cassandra interfaces.

  • Disclaimer
  • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.

Request more information

For individuals
For business
Phone Number*
Your Message (Optional)
We are looking into your query.
Our consultants will get in touch with you soon.

A Simplilearn representative will get back to you in one business day.

First Name*
Last Name*
Phone Number*
Job Title*