Cassandra Installation and Configuration Tutorial
4.1 Cassandra Installation and Configuration
Hello and welcome to the fourth lesson of the Apache Cassandra™ course offered by Simplilearn. This lesson will cover the steps to install and configure Cassandra.
4.2 Course Map
The Apache Cassandra™ course by Simplilearn is divided into eight lessons, as listed.
• Lesson 0: Course Overview
• Lesson 1: Overview of Big Data and NoSQL Database
• Lesson 2: Introduction to Cassandra
• Lesson 3: Cassandra Architecture
• Lesson 4: Cassandra Installation and Configuration
• Lesson 5: Cassandra Data Model
• Lesson 6: Cassandra Interfaces
• Lesson 7: Cassandra Advanced Architecture and Cluster Management
• Lesson 8: Hadoop Ecosystem around Cassandra
This is the fourth lesson, ‘Cassandra Installation and Configuration.’
After completing this lesson, you will be able to state the various versions of Cassandra. You will also be able to explain the steps to install and configure Cassandra on the Ubuntu system. Finally, you will be able to list the steps to install Cassandra on CentOS.
4.4 Cassandra Versions
Cassandra has multiple versions, so you need to choose the right one for installation. Version 1.0, released on October 17, 2011, was the first production version. Version 1.2, released on January 02, 2013, added virtual nodes. Version 2.0, released on September 04, 2013, added lightweight transactions. Version 2.1, released on April 01, 2015, is the latest version at the time of this lesson and supports Cassandra Query Language (CQL) 3.0. Cassandra is an open source product with commercial support from DataStax. DataStax provides the package installations as well as drivers for Cassandra.
4.5 Steps to Install and Configure Cassandra on Ubuntu System
To install and configure Cassandra on the Ubuntu system, perform the following steps:
1. Select the operating system.
2. Select the machine.
3. Prepare for installation.
4. Set up the repository.
5. Install Cassandra.
6. Check the installation.
7. Configure Cassandra.
8. Configure the single-node cluster.
9. Configure the multi-node cluster.
10. Set up the property file.
11. Configure the production cluster.
12. Set up the gossiping property file.
13. Start the Cassandra services.
14. Connect to Cassandra.
Each step will be discussed in detail.
4.6 Step 1-Operating System Selection
You can choose any of the Linux operating systems for installation. Some examples are as follows:
• Ubuntu 12.04 or a later version, which can be installed on a virtual machine.
• Red Hat Enterprise Linux, referred to as RHEL.
• CentOS, a free version of RHEL.
• Debian systems.
In addition, you can also choose to install Cassandra on Windows 7 or 8.
4.7 Step 2-Machine Selection
Cassandra needs ample memory and adequate processing power. The recommended machine configurations for a Cassandra cluster are as follows:
• Development systems: a minimum of 2 GB of RAM, two CPUs, and a 1 TB hard disk.
• Production systems: a minimum of 16 GB and a maximum of 96 GB of RAM per machine, 8-core CPUs running at 2 GHz or faster, and four 2 TB hard disks.
4.8 Step 3-Preparing for Installation
The prerequisite software for installing Cassandra is:
• Java JRE 1.7 or higher. OpenJDK works; however, Oracle JRE is recommended.
• Python, for some Cassandra tools.
• Extra Packages for Enterprise Linux, or EPEL, for some systems.
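The Java prerequisite above can be checked with a short script. The following is a minimal sketch, not an official check: the function names java_version_ok and installed_java_version are made up for this illustration, and the script only applies the "1.7 or higher" rule stated in this lesson to the version string that java -version prints.

```shell
# Return success (0) if a Java version string such as "1.7.0_80" or "11.0.2"
# is 1.7 or higher. Function names here are hypothetical, for illustration only.
java_version_ok() {
    local version="$1" major rest minor
    major="${version%%.*}"         # text before the first dot, e.g. "1" or "11"
    major="${major%%[!0-9]*}"      # keep only the leading digits
    rest="${version#*.}"           # text after the first dot, e.g. "7.0_80"
    minor="${rest%%[!0-9]*}"       # leading digits of the remainder, e.g. "7"
    [ -n "$major" ] || return 1    # no usable version string
    if [ "$major" -gt 1 ]; then
        return 0                   # newer numbering scheme (9, 10, 11, ...)
    fi
    [ "$major" -eq 1 ] && [ -n "$minor" ] && [ "$minor" -ge 7 ]
}

# Print the version reported by the installed java binary.
# java -version writes to standard error, hence the 2>&1.
installed_java_version() {
    java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}
```

A typical use would be java_version_ok "$(installed_java_version)" followed by a message on failure. Whether Python is present can be checked in the same spirit with command -v python.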
4.9 Step 4-Set Up the Repository
DataStax provides packaged installations of Cassandra for many operating systems. Before installing, you need to configure the package repository. This lesson gives the instructions for the Ubuntu system; the commands for other systems are available on the DataStax site. There are two steps.
First, add the DataStax repository to the sources list. The command for this is shown:
    echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
Note the pipe character, or vertical bar, in this command. This command appends the repository line to the file cassandra.sources.list in the /etc/apt/sources.list.d directory.
Second, add the DataStax repository key to the trusted keys. The command for this is shown:
    curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
Note the pipe character, or vertical bar, in this command. Also note the hyphen at the end, which tells apt-key to read the key from standard input. This command gets the repository key from DataStax and adds it to the apt trusted keys on Ubuntu.
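Because tee -a appends every time it runs, repeating the first command adds the repository line again. The sketch below shows one way to add the line only when it is missing. The function name add_repo_line and its file argument are assumptions made for this illustration so that the logic can be tried on any file; writing to the real file under /etc/apt/sources.list.d requires root privileges.

```shell
# Repository line from this lesson.
REPO_LINE='deb http://debian.datastax.com/community stable main'

# Append a line to a file only if the exact line is not already present.
# add_repo_line is a hypothetical helper, not part of apt or DataStax tooling.
add_repo_line() {
    local line="$1" file="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as plain text
    if grep -qxF -- "$line" "$file" 2>/dev/null; then
        echo "already present: $file"
    else
        echo "$line" >> "$file"
        echo "added to: $file"
    fi
}
```

On a real system the call would be add_repo_line "$REPO_LINE" /etc/apt/sources.list.d/cassandra.sources.list, run from a root shell.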
4.10 Step 5-Install Cassandra
After the repository is set up, update the package lists and install Cassandra. To update the package lists, use the command given:
    sudo apt-get update
This command prints a series of messages while updating the repository information. Enter Y when asked to confirm any updates. Once the package lists are updated, install the Cassandra package with the command given:
    sudo apt-get install dsc21=2.1.1-1 cassandra=2.1.1
Note that the version being installed is 2.1.1, which is the stable version on Ubuntu. Once Cassandra is installed on the system, its services start automatically.
4.11 Step 6-Check the Installation
After installing Cassandra, it is important to check the installation. Go to the configuration directory, /etc/cassandra, using the command given:
    cd /etc/cassandra
Note that Linux is case sensitive, so all the letters in this command must be in lower case. On some systems, the configuration directory is /etc/cassandra/conf. Next, list the directory to check the configuration files, using the command given:
    ls -l
This command lists the configuration files, such as cassandra-env.sh, cassandra-topology.properties, and cassandra.yaml.
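Since the configuration directory differs between systems, a script can look in both locations named above. This is a small sketch under that assumption. The function name find_cassandra_conf and its optional root prefix are hypothetical and exist only so the check can be tried against a scratch directory.

```shell
# Print the first directory that contains cassandra.yaml, trying the two
# locations mentioned in this lesson. The optional first argument is a path
# prefix (hypothetical, for experiments); leave it empty on a real system.
find_cassandra_conf() {
    local root="${1:-}" dir
    for dir in "$root/etc/cassandra" "$root/etc/cassandra/conf"; do
        if [ -f "$dir/cassandra.yaml" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1    # cassandra.yaml was not found in either location
}
```

For example, cd "$(find_cassandra_conf)" && ls -l would combine both commands from this step.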
4.12 Step 7-Configuring Cassandra
The configuration files that are normally modified after the installation are as follows: The first file is cassandra-env.sh. This is a Linux shell script for setting up environment properties, such as Java heap size and JVM parameters. The next file is cassandra.yaml. This is the main configuration file used to customize the Cassandra cluster. You can set parameters, such as cluster name, number of virtual nodes, data file location, seed providers, listen address, and Remote Procedure Call or RPC ports in this file. The last file is cassandra-topology.properties, which is the cluster topology specification file. It contains the list of nodes and their topology, such as datacenter and rack configuration. You can open and check the default configuration in these files using the command given:
4.13 Step 8-Configuration for a Single-Node Cluster
For a single-node cluster, take the default configuration and modify only the cluster name. All addresses will be localhost, which is the same as 127.0.0.1. The relevant settings in cassandra.yaml are:
    cluster_name: 'Simplilearn Cluster'
    num_tokens: 256
    data_file_directories:
        - /var/lib/cassandra/data
    seed_provider:
        - seeds: "127.0.0.1"
    listen_address: localhost
    native_transport_port: 9042
    endpoint_snitch: SimpleSnitch
Some of the key points are:
• The cluster name is changed to Simplilearn Cluster.
• Other parameters, such as num_tokens and data_file_directories, keep their default values and need not be modified.
• The seed is set to 127.0.0.1 and listen_address is set to localhost; both refer to the same machine.
• The endpoint snitch is set to SimpleSnitch, as the property file is not needed for this cluster.
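The single change described here, the cluster name, can also be made from a script instead of a text editor. The following is a rough sketch that assumes the cluster_name key starts at the beginning of a line, as in the listing in this step. The function name set_cluster_name is made up for this illustration.

```shell
# Rewrite the cluster_name line of a YAML file and leave all other lines as is.
# set_cluster_name is a hypothetical helper; names with backslashes are not handled.
set_cluster_name() {
    local name="$1" file="$2" tmp
    tmp="$(mktemp)" || return 1
    # \047 is the octal escape for a single quote inside an awk string.
    if awk -v name="$name" '
        /^cluster_name:/ { print "cluster_name: \047" name "\047"; next }
        { print }
    ' "$file" > "$tmp"; then
        cat "$tmp" > "$file"    # overwrite in place, keeping owner and permissions
    fi
    rm -f "$tmp"
}
```

On an installed system the call would look like set_cluster_name "Simplilearn Cluster" /etc/cassandra/cassandra.yaml, run with root privileges.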
4.14 Step 9-Configuration for Multi-Node and Multi-Datacenter Clusters
For multi-node and multi-datacenter clusters, specify the node addresses and the cluster topology through the cassandra-topology.properties file. cassandra.yaml starts from the default settings, as in the case of a single-node cluster. The contents are:
    cluster_name: 'Simplilearn Cluster'
    num_tokens: 256
    data_file_directories:
        - /var/lib/cassandra/data
    seed_provider:
        - seeds: "127.0.0.1"
    listen_address: localhost
    native_transport_port: 9042
    endpoint_snitch: SimpleSnitch
Some of the key points about what to change in these settings are:
• Multiple seed nodes must be specified.
• The listen address can also be specified as a node address instead of localhost.
• The endpoint snitch should point to the property file by using the PropertyFileSnitch.
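To make the three key points concrete, the sketch below prints the settings that would differ from the defaults for one node. The function name multi_node_overrides and the example addresses are assumptions for this illustration, and the output uses the same abridged seed_provider layout as the listing in this step rather than the full structure of a real cassandra.yaml.

```shell
# Print the cassandra.yaml settings that change for a multi-node cluster.
# First argument: address of this node. Remaining arguments: seed node addresses.
# multi_node_overrides is a hypothetical helper for illustration only.
multi_node_overrides() {
    local listen="$1"
    shift
    local IFS=','    # makes "$*" join the remaining arguments with commas
    printf 'seed_provider:\n    - seeds: "%s"\n' "$*"
    printf 'listen_address: %s\n' "$listen"
    printf 'endpoint_snitch: PropertyFileSnitch\n'
}
```

For example, multi_node_overrides 10.0.0.10 10.0.0.10 10.0.0.11 prints the overrides for the node 10.0.0.10 with two seed nodes. The addresses are placeholders and should be replaced with the real node addresses.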
4.15 Step 10-Set Up the Property File
The cassandra-topology.properties file contains the cluster topology for the entire cluster and is read when PropertyFileSnitch is used as the snitch. The contents of the sample file are:
    # Cassandra Node IP=Data Center:Rack
    192.168.1.100=DC1:RAC1
    192.168.2.200=DC2:RAC2
    10.0.0.10=DC1:RAC1
    10.0.0.11=DC1:RAC1
    10.0.0.12=DC1:RAC2
    # default for unknown nodes
    default=DC1:r1
The lines starting with a hash are comments and will be ignored. Each remaining line contains the datacenter and rack information for one node in the cluster, in the following format:
    IP address=datacenter name:rack name
For example, if a node with IP address 192.168.1.100 is located in rack RAC1 of the datacenter DC1, then the line for this node will be:
    192.168.1.100=DC1:RAC1
Further, you can also specify a default configuration to use for nodes that are not listed in the file. To do so, use the word default in place of the IP address, as in default=DC1:r1.
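The file format and the default rule can be illustrated with a small lookup script. This is only a sketch of how the file is read by a person or a script, not the way PropertyFileSnitch is implemented inside Cassandra. The function name lookup_topology is hypothetical, and the sketch assumes there are no spaces around the equals sign.

```shell
# Print the "datacenter:rack" value for an IP address from a topology file,
# falling back to the "default" entry when the address is not listed.
# lookup_topology is a hypothetical helper for illustration only.
lookup_topology() {
    local ip="$1" file="$2"
    awk -F= -v ip="$ip" '
        /^[[:space:]]*#/ || NF < 2 { next }    # skip comments and blank lines
        $1 == ip        { found = $2 }
        $1 == "default" { fallback = $2 }
        END {
            if (found != "")         print found
            else if (fallback != "") print fallback
            else                     exit 1    # no entry and no default
        }
    ' "$file"
}
```

With the sample file from this step, looking up 192.168.2.200 gives DC2:RAC2, while an unlisted address falls back to DC1:r1.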
4.16 Step 11-Configuration for a Production Cluster
For a production cluster, specify the node addresses, the cluster topology through the cassandra-rackdc.properties file, and the snitch as GossipingPropertyFileSnitch. The gossip protocol is used to propagate the topology information. The contents of cassandra.yaml are:
    cluster_name: 'Simplilearn Production Cluster'
    num_tokens: 256
    data_file_directories:
        - /var/lib/cassandra/data
    seed_provider:
        # List of seed nodes to use for gossip bootstrap
        - seeds: "node1, node2, node3"
    listen_address: node1    # Address of this node
    native_transport_port: 9042
    endpoint_snitch: GossipingPropertyFileSnitch
This configuration is similar to the multi-datacenter configuration. However, the endpoint snitch is specified as GossipingPropertyFileSnitch.
4.17 Step 12-Set Up the Gossiping Property File
The file cassandra-rackdc.properties contains the topology information for the current node only and is read when GossipingPropertyFileSnitch is used as the snitch. The contents of the sample gossiping property file are:
    # These properties are used with GossipingPropertyFileSnitch and will
    # indicate the rack and dc for this node
    dc=DC1
    rack=RAC1
Each node has its own copy of this file, containing the datacenter and rack information only for that node. The datacenter for the node is specified using the dc= line and the rack for the node is specified using the rack= line. This reduces the amount of information shared during the gossip protocol. This completes the configuration of Cassandra.
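Because every node needs its own copy of this file, generating it from a script can help on larger clusters. The following is a minimal sketch; the function name write_rackdc and the target path argument are assumptions made for this illustration.

```shell
# Write a cassandra-rackdc.properties style file for one node.
# write_rackdc is a hypothetical helper; any existing file is overwritten.
write_rackdc() {
    local dc="$1" rack="$2" file="$3"
    printf 'dc=%s\nrack=%s\n' "$dc" "$rack" > "$file"
}
```

On a node in rack RAC1 of datacenter DC1, the call would be write_rackdc DC1 RAC1 followed by the path of cassandra-rackdc.properties in that node's configuration directory.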
4.18 Step 13-Starting Cassandra Services
After the configuration files are set up, start the Cassandra services. Typically, the installer starts the service immediately. In such a case, stop the service, remove the data, and then restart the service. First, check whether the service is running by using the given command:
    sudo service cassandra status
If the status shows running, stop the service using the given command:
    sudo service cassandra stop
Remember to enter the password as simplilearn when the sudo command prompts. Next, remove the existing system data of Cassandra, because it was created under the previous configuration and still records the old cluster name. Note that this needs to be done only once, before you put any valuable data into Cassandra. Remove the existing data using the given command:
    sudo rm -rf /var/lib/cassandra/data/system/*
Finally, start the Cassandra service using the given command:
    sudo service cassandra start
For a multi-node setup, this entire process has to be done on each node.
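The stop, clean, and start sequence can be collected into one script, which is convenient when it has to be repeated on each node. The sketch below is an illustration only: the names run, reset_and_restart, and DRY_RUN are made up here, and by default the script only prints what it would do. It wraps the rm command in sudo sh -c so that the * is expanded by the root shell, in case the calling user cannot list the data directory, and it assumes that service cassandra status exits with 0 when the service is running.

```shell
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print or execute a command, depending on DRY_RUN (hypothetical helper).
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop Cassandra, remove the system data, and start Cassandra again.
# WARNING: with DRY_RUN=0 this deletes data; use it only on a fresh installation.
reset_and_restart() {
    # In a real run, stop the service only if it reports that it is running.
    if [ "$DRY_RUN" = "1" ] || sudo service cassandra status >/dev/null 2>&1; then
        run sudo service cassandra stop
    fi
    run sudo sh -c 'rm -rf /var/lib/cassandra/data/system/*'
    run sudo service cassandra start
}
```

Calling reset_and_restart with the default setting lists the three commands without touching the system, which makes it easy to review the sequence before running it with DRY_RUN=0.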
4.19 Step 14-Connecting to Cassandra
Once the Cassandra service is running, you can connect to Cassandra using the Cassandra command line interface, cqlsh. First, set the host to connect to, using the CQLSH_HOST environment variable. You can set this variable using the given command:
    export CQLSH_HOST=localhost
Note that the commands are case sensitive. After this, you can start the command line interface with the given command:
    cqlsh
At the cqlsh prompt, type help and then press the Enter key. This shows a list of the available commands. Next, you can type exit and press the Enter key to leave the Cassandra command line interface. This completes the process of Cassandra installation and configuration.
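Cassandra may need some time after the service starts before it accepts connections, so a script that starts cqlsh right away can fail. The sketch below waits until a port accepts connections. It assumes that cqlsh connects to the native transport port 9042 shown in the configuration earlier in this lesson, and it relies on the /dev/tcp feature of bash. The function name wait_for_port is hypothetical.

```shell
# Wait until HOST:PORT accepts TCP connections, trying once per second.
# Third argument: maximum number of attempts (default 30).
# wait_for_port is a hypothetical helper and requires bash for /dev/tcp.
wait_for_port() {
    local host="$1" port="$2" tries="${3:-30}" i=0
    while [ "$i" -lt "$tries" ]; do
        # The subshell tries to open a connection; it fails if nothing is listening.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1    # the port never became reachable
}
```

A possible use is wait_for_port localhost 9042 && cqlsh, which starts the command line interface only after the port is reachable.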
4.20 Installing on CentOS
In addition to Ubuntu, CentOS is another popular Linux distribution. To install Cassandra on CentOS, use yum instead of apt-get. The instructions to install Cassandra on CentOS are: First, install EPEL, if it is not already installed, using the given command:
    sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
Next, add the yum repository specification for DataStax to the repositories using the commands as shown:
    sudo cat > /etc/yum.repos.d/DataStax.repo
Let us summarize the topics covered in this lesson.
• Cassandra has multiple versions and the latest versions are 2.0 and above.
• It is important to choose the right operating system and machine configurations before installing Cassandra.
• Cassandra can be installed using the DataStax repository.
• Cassandra configuration files are stored in /etc/cassandra or /etc/cassandra/conf.
• cassandra.yaml is the main configuration file for Cassandra.
• Use the SimpleSnitch, PropertyFileSnitch, or GossipingPropertyFileSnitch based on the type of cluster.
• Cassandra services can be started as a Linux service.
• cqlsh is the command line interface used to connect to Cassandra.
This concludes the lesson on Installation and Configuration of Cassandra. The next lesson will focus on Cassandra Data Model.