Data Science with R: Getting Started

The data explosion of recent years shows no sign of slowing down. According to a report from IDC, the volume of data generated worldwide is projected to reach 175 zettabytes by 2025. Dealing with this much data is a challenge for companies in every sector, which is why they are looking for professionals who can make sense of their data and derive meaningful, actionable insights.

Enter data science.

In this article, you will learn how to do data science with R.

We will discuss:

1. Introduction to R

  • Why do we need R?
  • CRAN (Comprehensive R Archive Network)
  • Installing R

2. Simple linear regression using R

  • Line of best fit
  • Correlation Analysis in R

3. Use case demo: predicting the class of a flower

Introduction to R

R is an open-source programming language and an implementation of the S programming language. It is free to download and is available under the GNU General Public License. Ross Ihaka and Robert Gentleman initially designed R at the University of Auckland. It has an active community and runs on all major platforms, including Linux, Windows, and macOS.

Features of R

R offers various statistical and graphical techniques. It has an extensive library of packages that makes it easy to implement machine learning algorithms. It can also be integrated with popular software such as Tableau and Microsoft SQL Server.

R is not just a programming language; it also has a worldwide repository system called CRAN (Comprehensive R Archive Network). You can access it at https://cran.r-project.org/.

CRAN hosts R sources, R binaries, R packages, critical updates, and other documentation. It hosts more than 10,000 R packages.

Installation of R

R can be easily downloaded and installed from the CRAN website.

You can select a suitable operating system and click on it to download. Here, "Download R for Windows" has been selected.

Follow the default options to finish the installation. 

You can also install RStudio, an integrated development environment (IDE) for R. It is available in two editions: RStudio Desktop is a regular desktop application, while RStudio Server runs on a remote server and provides access to RStudio through a web browser.

The following is what the interface of RStudio looks like:

[Figure: the RStudio interface]

Here is a small script that is used to perform some basic operations and plot a graph.

[Figure: a short R script and the graph it plots]

Before you start programming in R, you should install the packages you need along with their dependencies. Packages provide pre-assembled collections of functions and objects, and most of them are hosted on the CRAN repository. Only a small set of base packages is loaded by default; other packages can be installed on demand and then loaded with library().

To install a new package in RStudio, go to Tools -> Install Packages. Then, search for the package you want to install and select the library location where it should be installed.
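The same installation can also be done from the R console. The following is a minimal sketch, not part of the original walkthrough: ensure_package() is a hypothetical helper name, and rpart is used only because it appears later in this article.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

ensure_package("rpart")        # rpart is used later for decision trees
"package:rpart" %in% search()  # TRUE once the package is attached
```

Checking with requireNamespace() first avoids reinstalling a package every time the script runs.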

Now, let's discuss the different data structures available in the R programming language.

1. Vectors: A vector is the most basic R object. It holds a sequence of atomic values that all share the same type.

2. Matrices: These are R objects in which the elements are arranged in a two-dimensional layout. Like vectors, they contain elements of a single type.

3. Arrays: They can store data in more than two dimensions. For example, an array with dimensions (2, 3, 4) holds four rectangular matrices, each with two rows and three columns.

4. Data Frames: A data frame is a table in which each column contains the values of one variable, and each row contains one set of values from each column. Different columns can have different types.

5. Lists: A list can contain elements of different types (numbers, strings, vectors, etc.). It can also include a matrix or a function as an element. A list is created using the list() function.
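The five structures above can be created in a few lines. This is an illustrative sketch; all variable names and values are made up for the example.

```r
v  <- c(2.5, 4, 7)                       # vector: every element is numeric
m  <- matrix(1:6, nrow = 2, ncol = 3)    # matrix: 2 rows x 3 columns
a  <- array(1:24, dim = c(2, 3, 4))      # array: four 2 x 3 matrices
df <- data.frame(name = c("a", "b"),     # data frame: columns may differ in type
                 score = c(90, 85))
l  <- list(id = 1, tags = c("x", "y"),   # list: mixed element types,
           mat = m, f = mean)            # including a matrix and a function

dim(a)      # dimensions of the array
a[, , 1]    # the first of the four 2 x 3 matrices
str(df)     # column names and types of the data frame
class(l$f)  # the list element f is a function
```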

Importing files in R

R enables you to import data from different sources. 

1. Table: A table can be loaded in R using the read.table() function.

2. CSV: A .csv file is imported using the read.csv() function.
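As a minimal sketch of both import functions, the example below first writes two tiny files to a temporary location and then reads them back. The file contents and variable names are made up for illustration.

```r
# Create a small tab-separated file and a small CSV file to read from
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name\tage", "Ann\t31", "Bob\t45"), txt_path)
writeLines(c("name,age", "Ann,31", "Bob,45"), csv_path)

# read.table() needs to be told about the header row and the separator
tab_data <- read.table(txt_path, header = TRUE, sep = "\t")

# read.csv() assumes a header row and a comma separator by default
csv_data <- read.csv(csv_path)

str(tab_data)
str(csv_data)
```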

Exporting Files in R

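You can also export data from R to files in other locations. In the calls below, file_name stands for the data frame being exported.

1. To export a table: write.table(file_name, "c:/file_name.txt", sep = "\t")

2. To export an Excel file: write.xls(file_name, "c:/file_name.xls", sep = "\t"). Note that write.xls() is not part of base R; it comes from an add-on package, so check that package's documentation for the exact arguments.

3. To export a CSV file: write.csv(file_name, "c:/file_name.csv")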
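Here is a short sketch of the two base R export functions. It writes to a temporary directory rather than to c:/ so that it runs on any platform; the scores data frame is made up for illustration.

```r
scores <- data.frame(name = c("Ann", "Bob"), score = c(90, 85))

out_txt <- file.path(tempdir(), "scores.txt")
out_csv <- file.path(tempdir(), "scores.csv")

# row.names = FALSE leaves out the automatic 1, 2, ... row labels
write.table(scores, out_txt, sep = "\t", row.names = FALSE)
write.csv(scores, out_csv, row.names = FALSE)

readLines(out_csv)  # inspect the raw text that was written
```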

Data Visualization in R

R has powerful graphics packages that help with data visualization. Graphics can be viewed on screen and saved in various formats, including .pdf, .png, .jpg, .wmf, and .ps. They can be customized to suit a wide range of needs and can be copied and pasted into Word or PowerPoint files.

You can create a bar chart, pie chart, histogram, kernel density plot, line chart, boxplot, heat map, and word cloud.

Let's look at boxplots in R.

Boxplots are also known as box-and-whisker plots. They display the distribution of data based on the following five summary values:

  • Minimum
  • First quartile
  • Median
  • Third quartile
  • Maximum

To create a boxplot, call boxplot(data).

[Figure: example boxplot]

The bottom and top edges of the box are the first and third quartiles, and the dark line inside the box is the median. The whiskers extend to the minimum and maximum values that are not outliers, and any points lying beyond the whiskers are plotted individually as outliers.
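As an illustration, the sketch below draws a boxplot of the dist column of R's built-in cars dataset (the dataset is an assumption here; the original figure's data is not available) and prints the values the plot is based on.

```r
# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
# The hinges are what boxplot() uses for the edges of the box.
five <- fivenum(cars$dist)
names(five) <- c("min", "q1", "median", "q3", "max")
five

# Points that boxplot() would draw individually as outliers
outliers <- boxplot.stats(cars$dist)$out
outliers

boxplot(cars$dist, main = "Stopping distance", ylab = "dist (ft)")
```

For this column the maximum value is flagged as an outlier, so the upper whisker stops at the largest value that is not an outlier.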

Now that you know more about data visualization in R, let's jump into learning the different phases of the data science life cycle.

Data Science Life Cycle

A typical data science life cycle consists of the following stages:

  1. Data acquisition: The primary step in the life cycle of any data science project is to acquire the right data from multiple sources. Data acquisition involves acquiring data from different internal and external sources that can help answer business questions. Data can be extracted from various sources, such as logs from web servers, social media data, online repositories, or databases.
  2. Data preparation: Often referred to as data cleaning or data wrangling, it is a critical step in the life cycle. The data collected from different sources is frequently messy and is typically missing various values. Therefore, it is crucial to clean this data to derive value from it.
  3. Data exploration: After cleaning the data, you can perform hypothesis testing and visualize the data to understand the data better. Data exploration is sometimes called data mining. It is used to identify patterns in your data set and find important potential features with statistical analysis.
  4. Predictive modeling: To train your machine to make predictions, you need to build predictive models. For this, you have to choose the right algorithm on which the machine is to be trained. Historical data is then split into training and validation sets. The model is trained using the training set. The trained model is validated using the validation dataset, and the model is then evaluated for accuracy and efficiency.
  5. Model interpretation and deployment: After a rigorous evaluation of the model, you can deploy it into a production-like environment for final user acceptance. You will also need to present the model to non-technical stakeholders and convey the actionable insights derived from the data.

Now that we have looked at the different stages of the data science life cycle, let's look at some of the data science algorithms that can help you solve complex business problems.

Linear Regression with R

Linear regression is a statistical technique that is used to find relationships between a dependent variable and one or more independent variables. It is used to predict the outcome of a continuous (numeric) variable. It is widely used for stock market analysis, weather forecasting, and sales predictions.

Linear regression is applied in two steps:

1. Estimate the relationship between two variables.

Examples: Does body weight influence the blood cholesterol level? Will the size of the house affect house prices?

2. Predict the value of the dependent variable based on other independent variables.

The simple linear regression equation, with one dependent and one independent variable, is:

$$ y = mx + c $$

Where y is the dependent variable, x is the independent variable, m is the slope, and c is the intercept of the line.

For the least-squares line of best fit, the slope m is given by the following formula, where \bar{x} and \bar{y} are the means of x and y:

$$ m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} $$

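There are two types of linear regression:

  • Simple linear regression: one independent variable
  • Multiple linear regression: two or more independent variables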

Let's understand the intuition behind the regression line by using an example:

The table on the left represents the data; the data points are plotted on the graph on the right. 

[Figure: table of X and Y values alongside a scatter plot of the data points]

The next step is to calculate the mean of X and Y and plot the values on the graph.

Here, the mean of X is three, and the mean of Y is five.

The least-squares regression line always passes through the point defined by the mean of X and the mean of Y.

[Figure: scatter plot with the means of X and Y marked]

Now, we need to find the equation of the regression line. For that, we calculate the slope (m) and the intercept (c) from the data.

[Figure: calculation of the slope m and the intercept c]

Let's calculate the predicted values of Y for corresponding values of X using the linear equation where m=1.3 and c=1.1.

[Figure: table of predicted Y values]

The best-fit line is the one with the smallest sum of squared errors, where each error e is the difference between an actual and a predicted value of Y.

[Figure: table of squared errors]

The sum of the squared errors for this regression line is 3.9. We compute this error for each candidate line and pick as the best-fit line the one with the smallest value.
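The same calculation can be sketched in R. The article's original data table is not available, so the x and y values below are hypothetical and give different numbers from the ones quoted above; only the method is the same.

```r
# Hypothetical data, NOT the table from the figure above
x <- c(1, 2, 3, 4, 5)
y <- c(3, 4, 2, 4, 5)

# Least-squares slope and intercept computed from the means
m  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
c0 <- mean(y) - m * mean(x)   # named c0 to avoid masking R's c() function

# Predicted values and the sum of squared errors for this line
y_pred <- m * x + c0
sse    <- sum((y - y_pred)^2)

c(slope = m, intercept = c0, sse = sse)

# Cross-check against R's built-in linear model fit
coef(lm(y ~ x))
```

Because the intercept is computed as mean(y) - m * mean(x), the fitted line passes through the point of means by construction.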

Linear Regression Analysis in R 

In this analysis, we'll use R's built-in cars dataset to find the correlation between its variables.

head(cars) - Displays the first six rows of the data frame

str(cars) - Displays the structure of the data frame (50 observations of two variables)

plot(cars) - Draws a scatter plot of speed versus stopping distance; plot(cars$speed, cars$dist) produces the same scatter plot
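Put together as a runnable snippet, the exploration steps look roughly like this. The axis labels and title are additions for readability, with units taken from the dataset's help page (?cars).

```r
head(cars)     # first six rows
str(cars)      # 50 observations of 2 variables: speed and dist
summary(cars)  # quick numeric summary of both columns

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "cars: speed vs. stopping distance")
```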

Correlation analysis studies the strength of the relationship between two continuous variables. It involves computing the correlation coefficient between the two variables.

If one variable consistently increases with the increasing value of the other, then they have a strong positive correlation (value close to +1).
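For the cars dataset, the correlation coefficient can be computed with cor(). The sketch below also calls cor.test(), which is an addition to the original walkthrough, to check whether the correlation is statistically significant.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
round(r, 3)   # about 0.807, a strong positive correlation

# Significance test for the correlation
cor.test(cars$speed, cars$dist)$p.value
```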

Let's build a linear regression model on the entire dataset to estimate the coefficients, with dist as the dependent variable and speed as the independent variable.

We can use the model to predict the dependent variable only if the model is statistically significant. By convention, the p-value should be less than 0.05 for the model to be considered statistically significant.
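A minimal sketch of this step uses lm() on the full cars dataset. The variable names fit and p_value are chosen for this example.

```r
# Fit dist as a linear function of speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope of the fitted line

# The p-value of the speed coefficient is in the coefficient table
fit_summary <- summary(fit)
p_value <- fit_summary$coefficients["speed", "Pr(>|t|)"]
p_value < 0.05   # TRUE means speed is a statistically significant predictor
```

The printed summary(fit) output also reports the overall F-statistic and its p-value, which for a model with a single predictor leads to the same conclusion.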

Next, split the dataset into training and test sets, fit the model on the training data, and predict dist on the test data.
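The split, fit, and predict steps can be sketched as follows. The 80/20 split ratio and the seed value are assumptions for this example; lmMod matches the model name used in the text, while the other names are illustrative.

```r
set.seed(100)  # make the random split reproducible

# Randomly pick 80% of the rows for training; the rest is the test set
n         <- nrow(cars)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train     <- cars[train_idx, ]
test      <- cars[-train_idx, ]

# Fit on the training data only, then predict dist for the unseen test rows
lmMod    <- lm(dist ~ speed, data = train)
distPred <- predict(lmMod, newdata = test)

summary(lmMod)                                        # diagnostics for the model
head(data.frame(actual = test$dist, predicted = distPred))
```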

Review the model diagnostic measures with summary(lmMod).

A simple correlation between the actual and predicted values can be used as a measure of accuracy.

You can compute all the error metrics in one go using the regr.eval() function in the DMwR package. Use install.packages('DMwR') if you are using it for the first time. Note that DMwR may no longer be available from CRAN for recent versions of R; in that case, the installation will fail and the metrics have to be computed another way.
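If DMwR is not available in your setup, the usual error metrics are easy to compute in base R. The sketch below repeats a compact train/test split (seed and split size are assumptions) so that it runs on its own. The metric definitions are the standard ones and are not copied from regr.eval().

```r
set.seed(100)
idx       <- sample(seq_len(nrow(cars)), size = 40)   # 40 training rows
fit       <- lm(dist ~ speed, data = cars[idx, ])
actual    <- cars$dist[-idx]                           # 10 held-out values
predicted <- predict(fit, newdata = cars[-idx, ])

# Correlation between actual and predicted values as a simple accuracy measure
accuracy <- cor(actual, predicted)

# Common regression error metrics
mae  <- mean(abs(actual - predicted))            # mean absolute error
rmse <- sqrt(mean((actual - predicted)^2))       # root mean squared error
mape <- mean(abs((actual - predicted) / actual)) # mean absolute percentage error

round(c(correlation = accuracy, mae = mae, rmse = rmse, mape = mape), 3)
```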

Now that we have seen how the linear regression algorithm works in R, let's learn about decision trees.

Decision Trees

A decision tree is a tree-shaped algorithm used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.

[Figure: structure of a decision tree, with its root node, decision nodes, and leaf nodes]

Root node: Represents the entire population or sample, and this further gets divided into two or more homogeneous sets.

Splitting: The process of dividing a node into two or more sub-nodes.

Decision node: When a sub-node splits into further sub-nodes, then it is called a decision node.

Leaf/terminal node: Nodes with no children (no further split) are called leaf or terminal nodes.

Pruning: When we reduce the size of a decision tree by removing nodes (the opposite of splitting), the process is called pruning.

Branch/sub-tree: A subsection of the decision tree is called a branch or sub-tree.

Parent and child node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its child nodes.

There are two more important concepts that you should know before implementing a decision tree algorithm: entropy and information gain.

Entropy is the measure of randomness or impurity in the dataset.

Information gain is the measure of the decrease in entropy after the dataset is split. It is also known as entropy reduction.

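To make the two definitions concrete, here is a small sketch that uses the common Shannon entropy formula, H = -sum(p * log2(p)) over the class proportions p, and computes the information gain of one candidate split. The function names, the use of R's built-in iris dataset, and the split threshold are assumptions made for this example.

```r
# Entropy of a vector of class labels, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                  # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Information gain: parent entropy minus the weighted entropy of the children
information_gain <- function(labels, condition) {
  children      <- split(labels, condition)
  weights       <- lengths(children) / length(labels)
  child_entropy <- sapply(children, entropy)
  entropy(labels) - sum(weights * child_entropy)
}

# Three equally sized classes give the maximum entropy for three classes
entropy(iris$Species)            # log2(3), about 1.585

# Candidate split: short petals versus long petals
is_short_petal <- iris$Petal.Length < 2.45
information_gain(iris$Species, is_short_petal)
```

A decision tree algorithm evaluates many candidate splits like this one and prefers those that reduce impurity the most.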

Decision Tree Algorithm in R

We'll predict the class of flowers based on the petal length and petal width using R.

Install the necessary packages, including rpart.

Load the dataset and display its structure.

Use set.seed() to fix the starting point of the random number generator, so that the random split of the data is reproducible.

Build your model using the rpart() function.

Let's validate the model using the remaining 50 rows that we kept as testing data.

Install the packages required to build a confusion matrix and evaluate the model.

In this run, the model has an accuracy of 92 percent.
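The whole workflow can be sketched as below. It assumes the data is R's built-in iris dataset (150 rows, which fits the 100/50 split and the species names mentioned above) and an arbitrary seed, so the accuracy you get will likely differ from the 92 percent reported. The confusion matrix is built with base R's table() instead of a dedicated package.

```r
library(rpart)

str(iris)      # 150 observations: four measurements plus Species

set.seed(42)   # arbitrary seed, chosen only for reproducibility
train_idx <- sample(seq_len(nrow(iris)), size = 100)
train     <- iris[train_idx, ]   # 100 rows for training
test      <- iris[-train_idx, ]  # remaining 50 rows for testing

# Classification tree on petal length and petal width
model <- rpart(Species ~ Petal.Length + Petal.Width,
               data = train, method = "class")

# Predict the class of each test flower
pred <- predict(model, newdata = test, type = "class")

# Confusion matrix and overall accuracy
conf_matrix <- table(predicted = pred, actual = test$Species)
accuracy    <- sum(diag(conf_matrix)) / sum(conf_matrix)

conf_matrix
accuracy
```

Each row of the confusion matrix is a predicted class and each column an actual class, so correct predictions sit on the diagonal.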

Conclusion

After reading this article, you learned more about how data science works, and why it is useful. You looked at how to install R and RStudio and the different features of R. You also got an idea about the different data structures in R. You learned about linear regression and how it works in R, and finally, you saw how to classify flowers using the decision tree algorithm. To learn more about data science with R, watch the following video: Data Science with R.

About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.
