From a bird’s eye view Apache Cassandra is a database, one that is highly scalable, high-performance, designed to handle large amounts of data. In this tutorial we will focus on the data model for Cassandra. By the end of this lesson, you will be able to:

  • Describe the Cassandra data model
  • Describe the components of Cassandra data mode
  • Explain the functions of DDL and DML statements
  • Discuss the SELECT statement restrictions in Cassandra

Let us begin with the Cassandra Data Model!

Become a Data Scientist with Hands-on Training!

Data Scientist Master’s ProgramExplore Program
Become a Data Scientist with Hands-on Training!

Cassandra Data Model

Cassandra is a NoSQL database, which is a key-value store. Some of the features of Cassandra data model are as follows:

  • Data in Cassandra is stored as a set of rows that are organized into tables.
  • Tables are also called column families.
  • Each Row is identified by a primary key value.
  • Data is partitioned by the primary key.
  • You can get the entire data or some data based on the primary key.

In the next section, we will learn about the components of the Cassandra Data Model.

Cassandra Data Model Components

Cassandra data model provides a mechanism for data storage. The components of Cassandra data model are keyspaces, tables, and columns.

Keyspaces

Cassandra data model consists of keyspaces at the highest level. Keyspaces are the containers of data, similar to the schema or database in a relational database. Typically, keyspaces contain many tables.

Tables

Within the keyspaces, the tables are defined. Tables are also referred to as Column Families in the earlier versions of Cassandra. Tables contain a set of columns and a primary key, and they store data in a set of rows.

Columns

Columns define the structure of data in a table. Each column has an associated type, such as integer, text, double, and Boolean. These Cassandra data model components will be discussed in detail in this lesson.

Let us learn about all of these in detail in the next section.

Become a Data Scientist with Hands-on Training!

Data Scientist Master’s ProgramExplore Program
Become a Data Scientist with Hands-on Training!

Keyspaces

As mentioned earlier, keyspaces in Cassandra are similar to the schema in a relational database. Some of the features of keyspaces are:

  • A keyspace needs to be defined before creating tables, as there is no default keyspace.
  • A keyspace can contain any number of tables, and a table belongs only to one keyspace. This represents a one-to-many relationship.
  • Replication is specified at the keyspace level. For example, replication of three implies that each data row in the keyspace will have three copies.
  • Further, you need to specify the replication factor during the creation of keyspace. However, the replication factor can be modified later.

The given sample query shows the command for creating a keyspace with the name as TestDB, replication as SimpleStrategy, and replication factor as three. It also shows the use command to start using the keyspace.

CassandraDataModel_1

Tables

A table contains data in the horizontal and vertical formats, referred to as rows and columns respectively. Some of the features of tables are:

  • Tables have multiple rows and columns. As mentioned earlier, a table is also called Column Family in the earlier versions of Cassandra.
  • It is still referred to as column family in some of the error messages and documents of Cassandra.
  • It is important to define a primary key for a table.

The given query shows the statement for creating a table named employee with four columns empid, empfirstname, emplastname, and empsalary.

CassandraDataModel_2

In this query, Empid is of the integer type, empfirstname and emplastname are the text type, and empsalary is defined as a double. Note that empid is defined as the primary key for the table. In database terms, primary key refers to a column that has unique values for each row in a table.

For example, in an organization, two employees may have the same first and the last names. However, each employee is assigned a unique employee ID. Therefore, if you create a table to hold employee data, the employee ID can be used as the primary key, as it is unique for all the rows of the employee table. 

On the other hand, employees’ first or last name cannot be used as a primary key, as it cannot uniquely identify a row in the table.

Columns

Column represents a single piece of data in Cassandra and has a type defined.

Some of its features are:

  • Columns consist of various types, such as integer, big integer, text, float, double, and Boolean.
  • Cassandra also provides collection types such as set, list, and map.
  • Further, column values have an associated time stamp representing the time of update.
  • This timestamp can be retrieved using the function writetime.

In the next section, let us learn all about UUID and TimeUUID.

UUID and TimeUUID

There are two special column types for sequence numbers and timestamps called UUID and TimeUUID.  

Universal Unique Identity or UUID

This is similar to the sequence numbers in relational databases. It is a 128-bit integer.

The given example for this column is represented as hexadecimal digits with hyphens in between:

Sample UUID is – 01234567-0123-0123-0123-0123456789ab.

TimeUUID

It contains a timestamp and guarantees no duplication. TimeUUID uses time in 100 nanosecond intervals. You can use the function now() to get the TimeUUID.

Sample TimeUUID is – D2177dd0-eaa2-11de-a572-001b779c76e3.

Counter

Counter is a special column which stores a number that is incremented. It can be used to count the occurrences of an event or a process. Some of the features of the counter are:

  • You can only update the value by adding or subtracting from the current value.
  • Before the first update, a counter will have an initial value of zero.
  • The update using the current value is allowed only for the counter columns in Cassandra.

The given query shows a counter column being added called numEmployees to the table Department.

CassandraDataModel_3

It also shows the process of incrementing the column value based on the previous value for a row where deptID = 1000.

Become a Data Scientist with Hands-on Training!

Data Scientist Master’s ProgramExplore Program
Become a Data Scientist with Hands-on Training!

Compound Keys

The primary key in Cassandra can be a compound key, which refers to a key consisting of multiple columns. For example, a multi-year stock market data has the primary key (tradeDate, ticker), where:

  • tradeDate is the date of the stock trade and
  • ticker is the stock symbol.

When compound keys are specified, the first key is considered for partitioning of data and is called the partition key. The data in each partition is clustered and ordered by the remaining keys.

Further, if you want to partition the data by multiple columns, group the data as first part of the primary key, as shown in the following two instances:

1. Primary key ((key1), key2, key3, key4)

  • Here the data is partitioned by key1 and clustered by key2, key3, key4

2. Primary key ((key1, key2), key3, key4)

  • Here the data is partitioned by key1, key2 and clustered by key3, key4

Note that you can also specify the order for the cluster keys.

In databases, clustering refers to the grouping of data together. Suppose the data is clustered by key3, key4, the rows with the key3 values are stored together on physical storage, so that the queries such as key3 = 100 can be fetched quickly.

Indexes

Indexes can be used to speed up queries. Index column can be used to search for data rows. Cassandra allows only one column in an index. You can create multiple indexes on the same table.

The given query shows an index named first_index being created on an employee table in the keyspace called testDB.

CassandraDataModel_4 

The index is created on the column empFirstName. After creating the index, the given query that gets a row from the table with empFirstName = ‘Jack’ will use the index to search for data faster.

Collection Columns

Collection columns are used to represent a group of data in a single column. Cassandra provides three types of collection columns, such as set, list, and map.

Set

It refers to an unordered group of values. Set values are enclosed within curly brackets, also called braces.

For example, { ‘XYZ’, ‘ABC’,’PQR’ }

List

It is an ordered group of values, like an array. List values are enclosed within square brackets.

For example, [ ‘2011’, ‘2012’, ‘2013’ ]

Map

It is a set of key-value pairs. Map values are also enclosed within braces.

For example, { ‘key1’: ‘value1’, ‘key2’: ‘value2’}

The given query shows a table being created with empid as primary key and roles as a set of type text. You can add new roles to an employee by using the update command as shown:

CassandraDataModel_5

Update employee set roles = roles + {‘supervisor’}

Here, the new role ‘supervisor’ is enclosed within the braces.

Collection Columns - Set

The functions of collection column — Set are:

You can use ALTER TABLE statement to add a collection to an existing table, using the given command:

create table stocks (ticker text, year int, value double, primary key(ticker, year));

alter table stocks add supportNumber set<text>;

First, create table stocks with the columns ticker, year, and value; with primary key ticker and year. Next, use the alter table command to add supportNumber, which represents the set of support numbers for the stock. This is added as a set of text so it can hold the support number strings.

Next, use the update command to initialize the data of a Set column using the given command:

update stocks set supportNumber = {'1800800800'} where ticker = 'ABC' and year = 2011;

Here, update the supportNumber column with value of ‘1800800800’, enclosed in curly brackets to indicate a set.

Finally, to add a new support number to an existing Set, use the given command:

update stocks set supportNumber = supportNumber + {'1800800801'}

where ticker = 'ABC' and year = 2011;

Here, use the plus symbol to add value to an existing column. Note that the other collection types, such as lists and maps function in a similar manner.

In the next section, let us talk about DDL and DML Statements.

Become a Data Scientist with Hands-on Training!

Data Scientist Master’s ProgramExplore Program
Become a Data Scientist with Hands-on Training!

DDL and DML Statements

Cassandra supports Cassandra Query Language or CQL, which includes Data Definition Language or DDL statements and Data Manipulation Language or DML statements. DDL statements are CREATE TABLE, ALTER TABLE, and DROP TABLE. DML statements are INSERT, UPDATE, SELECT, and DELETE. COPY is an optional DML statement of Cassandra.

DDL Statements

Cassandra includes DDL statements such as:

  • CREATE TABLE
  • ALTER TABLE
  • DROP TABLE

The CREATE TABLEstatement can be used to create a table, as this statement defines the schema of the table. You have to specify the primary key while creating the table.

Once a table is created, ALTER TABLEstatement can be used to modify the table definition.

Further, the DROP TABLEstatement can be used to remove an existing table. You need to be careful before you drop a table as you will lose any data in the table.

The given query shows a table being created to hold the stock market data:

create table stocks(ticker text, tradeDate timestamp, value double,

primary key (ticker, tradeDate)) clustered by tradeDate desc;

alter table stocks add status text;

drop table stocks;

The command is ‘create table stocks’ along with the list of columns and data types. Then, a composite primary key is specified with columns ticker and trade date. It also illustrates using the ‘clustered by’ option to cluster the data by trade date column in the descending order.

Next, an ALTER TABLE command is used to add a new column status of the type ‘text’ to the table. Finally, the DROP TABLE command is used to remove the table.

In the next section, let us talk about Insert Statement.

CassandraDataModel_6.

DML Statements - INSERT

Cassandra includes the standard INSERT values statement to insert data into a table. The list of values must match the columns specified. If the primary key data already exists, the data in the remaining columns will be updated. Otherwise, an entire row will be inserted.

Further, it is necessary to specify the IF NOT EXISTS condition to insert row only if the primary key does not exist in the table.

The given query shows a row being inserted, which contains values for empid, empFirstname, and emplastname into the table employee:

Insert into employee (empid, empFirstName, empLastName)

values ( 100 , ‘Jack’, ‘Frank’) if not exists;

Note that the IF NOT EXISTS condition is specified so that the insert will happen only if there is no row with the empid of 100 in the table.

In the next section, let us talk about Update statement.

DML Statements - UPDATE

Cassandra includes UPDATE statement to insert or update data into a table. These statements update one or more columns of the table. In a table, if a row matching the ‘where’ clause is found, the values are updated.

On the contrary, if the row is not found, the values are inserted into the table with the keys specified in the ‘where’ clause. Note that the primary key value is inserted only once and cannot be updated. The given query shows the update of the employee first name to John for the empid 100:

Update employee set empFirstName = ‘John’

Where empid = 100;

If a row with the empid 100 exists in the table, the value for employee first name is updated to John. If there is no row with empid of 100, then a new row is inserted into the table with values of empid = 100 and employee first name = John.

Further, there is another query for incrementing the counter column numEmployees by one:

Update department set numEmployees = numEmployees + 1

Where departmentId = 1000;

Note that updating a column using previous column value is allowed only for counter columns. Suppose there is a column salary of type double and you update it as salary = salary + 200.50, this will fail with an error stating this is not a counter column.

DML Statements - COPY

Cassandra includes COPY statement to load bulk data into a table. This is an optional DML statement of Cassandra. Note that the INSERT INTO command can be used to insert a few rows into a table, but not bulk data.

The COPY statement is not part of CQL, however, it is part of Cassandra Command Line Interface. Using this command, you can import the data from a data file. Further, you can import the data into particular columns instead of all the columns. The data file is assumed to be delimited by ‘comma’ by default. You can also specify a different delimiter.

Some data files may contain a header line that can be ignored using HEADER=TRUE as an option. If the duplicate data exists in the table, it will not cause an error. It will be treated as an update, and the last value in the table will be kept.

The given query shows the data of empid, empFirstName, and empLastName being copied into the employee table from the file employeeData.csv in the local file system. This file has to be a comma separated file:

Copy employee (empid, empFirstName, empLastName) FROM

employeeData.csv ;

The second query shows the data being imported into all the columns of the employee table from employeeData.bar local file:

Copy employee FROM employeeData.bar WITH HEADER = TRUE AND

DELIMITER = ‘|’ ;

The delimiter is specified as vertical bar character and header is set to TRUE so that the first line of the file is ignored.

Learn Job Critical Skills To Help You Grow!

Post Graduate Program In Data EngineeringExplore Program
Learn Job Critical Skills To Help You Grow!

DML Statements - SELECT

Cassandra includes SELECT statement to extract data from the tables. You can select one or more columns or use the * symbol to extract all the columns of a table.

The general syntax is ‘select columns from tableName where’ conditions. Cassandra is restrictive on what you can specify in the ‘where’ clause. You can specify the SELECT conditions based only on the primary keys or index columns. The syntax ‘Order by’ is allowed only on clustered columns.

The given query shows a few valid select statements in Cassandra:

Select * from employee where empid = 1000; // Allowed as empid is the primary key

First, the SELECT statement extracts all the columns from the employee table where empid is 100. This is allowed as empid is the primary key for the table. The second query shows the selection of two columns, empid, empFirstName from the employee table where empFirstName is Jack:

Select empid, empFirstName from employee where empFirstName = ‘Jack’; // allowed

only if there is index on empFirstName

This is allowed only if there is an index for the employee table created on empFirstName column.

Let us talk about Select Statement Restrictions in the next section.

SELECT Statements Restrictions

The SELECT statement in Cassandra has the following restrictions: 

  • The range condition is allowed only for the last part of the primary key.

For example, if a primary key is defined as (key1, key2, key3, key4), then,

  • key1 = 100 and key2 = 20 and key3 = 50, and key4 in (3, 4) is allowed;
  • Key1 = 100 and key2 in (20,30) and key3 =50 and key4 = 3 is not allowed, as only key4 is allowed to have a range in this query.

You have to specify the values for all leading columns of the primary key. For example, if the primary key is defined as (key1, key2, key3, key4), then

  • Key1 = 100 is allowed;
  • key1 = 100 and key3 = 40 is not allowed.

Let us talk about valid and invalid Select Statement example in the next section.

Valid and Invalid SELECT Statements Example

The following query shows four SELECT statements in Cassandra, out of which two are valid, and the rest are invalid.

The query shows the testTab table that has its primary key defined as (key1, key2, key3, key4):

Select * from testTab where key1 = 100; //allowed

The first statement extracts all the columns where key1 is 100. This is allowed as key1 is the first column of the primary key.

Select * from testTab where key1 = 100 and key2 = 200; // allowed

The second statement extracts all the columns where key1 is 100 and key2 is 200. This is also allowed as key1 and key2 are the leading columns of the primary key.

Select * from testTab where key1 = 100 and key3 = 50; // not allowed

The third statement selects all the columns where key1 is 100 and key3 is 50. This is not allowed as you cannot specify the value for key3 without specifying the value for the previous key, key2.

Select * from testTab where key1 = 100 and key2 = 20 and key4 = 3; // not allowed

The fourth statement selects all the columns where key1 is 100, key2 is 20, and key4 is 3. This is also not allowed as the value for key4 is specified without specifying a value for key3.

DML Statements - DELETE

Cassandra allows removal of rows from a table using the DELETE statement. This statement removes all the matching rows from the table. If the columns are specified in the DELETE statement, only the data from the columns will be removed, however, the row will not be deleted. The ‘where’ clause of the DELETE statement is similar to that of the SELECT statement. You can also specify additional conditions with the IF clause.

The given query shows three valid DELETE statements in Cassandra:

Delete from employee where employeeId = 1000;

The first statement removes the row where employee ID equals 1000.

Delete employeeFirstName from employee where employeeId = 2000;

The second statement removes the value in the employeeFirstname column, where the employee ID equals 2000.

Delete from employee where employeeId = 2000

(IF salary > 100000);

The third statement deletes the row with employee ID whose value is 2000 along with the IF condition, where the salary is more than 100,000.

Learn Job Critical Skills To Help You Grow!

Post Graduate Program In Data EngineeringExplore Program
Learn Job Critical Skills To Help You Grow!

Summary

Let us summarize the topics covered in this lesson.

  • Cassandra data model contains keyspaces, tables, and columns.
  • A keyspace is the container of all data in Cassandra. Replication is specified at the keyspace level.
  • Tables and columns contain the key value data in Cassandra.
  • Every table should have a primary key, which can be a composite primary key.
  • Data is partitioned by the first part of the primary key and clustered by the rest.
  • Cassandra provides CREATE TABLE, ALTER TABLE, and DROP TABLE statements for data definition.
  • Cassandra provides INSERT, UPDATE, SELECT, and DELETE statements for data manipulation.
  • The ‘where’ clauses in Cassandra have restrictions.

Conclusion

If you wish to succeed in your Big Data or data science career, mastering apache cassandra can turn out to be really helpful. Enrol in Simplilearn’s Big Data Hadoop Certification Training and master cassandra and all the other essential Big Data concepts and tools now!