#### Introduction to Random Forest in R

Machine learning has become one of the hottest technologies these days, and companies use machine learning algorithms in many applications to solve business problems, most often classification, regression, and clustering. Some of the more popular algorithms include linear regression, logistic regression, decision trees, random forest, KNN, and SVM.

In this article, we’ll cover the random forest algorithm in R from the ground up. The random forest algorithm is derived from the decision tree algorithm and consists of multiple decision trees—which is how it got its name. Tin Kam Ho created the first algorithm for random decision forests.

This article covers the following topics:

- What is a random forest?
- Random forest algorithm features
- How does the random forest algorithm work?
- Random forest applications
- Terms to know in a random forest classifier
- Random forest in R use case: predicting wine quality

Random forest is a popular supervised machine learning algorithm used for both classification and regression problems. It is based on ensemble learning, which combines multiple models (here, decision trees) to solve a complex problem and to improve on the performance of any single model.

The random forest algorithm relies on multiple decision trees and collects the prediction from each tree. For classification, it determines the final result by majority vote over those predictions; for regression, it averages them.

In general, a random forest classifier is structured as follows. Several training datasets are drawn from the original data, each containing a different mix of values, and one decision tree model is built on each of them. To classify a test observation, every tree outputs a prediction, and a vote selects the class predicted most often as the final result.

Random forest has several notable features:

- Often provides higher accuracy than a single decision tree
- Gives estimates of which variables are important in the classification
- Offers methods for estimating missing data, and the generated forests can be saved for future use with other data
- Computes proximities between pairs of cases that can be used in clustering, locating outliers, or to give interesting views of the data
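The variable-importance estimates mentioned above can be inspected directly in R. The following is a minimal sketch, assuming the `randomForest` package is installed; the built-in `iris` dataset is used here only as a stand-in and is not part of the original article.

```r
# Sketch only: iris stands in for real data (an assumption, not the article's dataset)
library(randomForest)

set.seed(123)
model <- randomForest(Species ~ ., data = iris, ntree = 200)

# Mean decrease in Gini impurity per feature (larger means more important)
imp <- importance(model)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]

# Dot chart of the same values
varImpPlot(model)
```

Features at the top of the sorted table contributed most to reducing impurity across the trees.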

Before looking at how the random forest algorithm works, let’s first see how a single decision tree works, using the following example:

Suppose you want to predict whether a person will buy a phone or not based on the phone’s features. For that, you can build a simple decision tree.

In such a decision tree, the root node and the internal nodes represent the phone’s features, while the leaf nodes are the outputs. The edges connect the nodes according to the values of the features. Based on the price, RAM, and internal storage, the tree decides whether a consumer will purchase the phone. The problem with a single decision tree is that it relies on limited information, which may not always give accurate results.

Using a random forest model can improve your results, because it introduces diversity by building several trees from different features.

Suppose we build three different decision trees, each using different features, to form a random forest model.

Now, suppose a new phone is launched with specific features, and you want to decide whether to buy that phone or not.

Let’s pass this data to our random forest model and check the model’s output.

The first two trees predict that you will buy the phone, and the third decision tree predicts that you will not. Since two of the three trees vote for the purchase, our model predicts that you should buy the newly launched phone.

Random forest relies on the following assumptions:

- The feature variables of the dataset should contain actual values, which gives the classifier a better chance of predicting accurate results than working from estimated values. Missing values should be handled before training the model.
- The predictions from the individual trees must have very low correlations with each other.


A random forest is built with the following steps:

- Randomly select “k” features from the total “m” features, where k < m
- Among the “k” features, calculate the node “d” using the best split point
- Split the node into daughter nodes using the best split
- Repeat the previous steps until the tree reaches “l” nodes
- Build a forest by repeating all the steps “n” times to create “n” trees
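The randomness in the steps above can be sketched in base R. This is only an illustration of the sampling, not of split finding: the feature names `f1` to `f6`, the sizes, and the `forest_plan` structure are all hypothetical. It also draws a bootstrap sample of rows for each tree, which is the bagging step described later in the terminology section rather than one of the listed steps.

```r
set.seed(42)

# Toy setup: 10 observations and m = 6 features (hypothetical names f1..f6)
m <- 6
features <- paste0("f", seq_len(m))
n_obs <- 10

k <- 2        # features tried at a split, with k < m
n_trees <- 3  # "n" trees in the forest

# For each tree, record its bootstrap rows and the k candidate features at the root
forest_plan <- lapply(seq_len(n_trees), function(i) {
  list(
    rows = sample(n_obs, n_obs, replace = TRUE),  # bootstrap sample for this tree
    root_candidates = sample(features, k)         # k features considered at the root split
  )
})

str(forest_plan)
```

A real implementation would repeat the feature draw at every node and then search only those `k` features for the best split point.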

After the trees of the random forest are built, predictions can be made using the following steps:

- Run the test data through the rules of each decision tree to predict the outcome, and store that predicted target
- Count the votes for each of the predicted targets
- Take the predicted target with the most votes as the final prediction
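The voting steps above can be sketched in a few lines of base R. The matrix of per-tree predictions below is made up for illustration, and `majority_vote` is a hypothetical helper, not a function from any package.

```r
# Hypothetical predictions: one row per test observation, one column per tree
tree_preds <- matrix(
  c("good",   "good",   "bad",
    "bad",    "normal", "bad",
    "normal", "normal", "good"),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("tree1", "tree2", "tree3"))
)

# Count the votes for one observation and return the most frequent class
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Apply the vote to every observation (row)
final <- apply(tree_preds, 1, majority_vote)
final
```

Note that `which.max` returns the first maximum, so in this sketch a tie goes to the class that sorts first alphabetically.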

Random forest classifiers have many applications today. Let’s look at a few of them:

- In banking, random forests are used to detect customers who are likely to commit fraud
- In healthcare, they are used to analyze the symptoms of patients and diagnose diseases
- In ecommerce, they drive recommendation lists that predict purchases based on customer activity
- In finance, they are used to analyze stock market trends and predict profit or loss

Before we start working with R, we need to understand a few terms that are used in random forest algorithms:

**1. Variance** - This measures how much the model’s predictions change when the training data changes.

**2. Bagging** - Short for bootstrap aggregating, this is a variance-reducing method that trains each model on a random subsample of the training data, drawn with replacement, and then combines the models’ predictions.

**3. Out-of-bag (oob) error estimate** - The random forest classifier is trained using bootstrap aggregation, where each new tree is fit on a bootstrap sample of the training observations. The out-of-bag (oob) error is the average error over the training observations, where the prediction for each observation uses only the trees whose bootstrap sample did not contain that observation. This enables the random forest classifier to be adjusted and validated during training.
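To make the out-of-bag idea concrete, the base R sketch below draws bootstrap samples and checks which observations each tree never sees. The sizes `n` and `n_trees` and the `in_bag` matrix are illustrative assumptions; no model is fit here.

```r
set.seed(1)
n <- 1000       # training observations
n_trees <- 200  # trees in the forest

# in_bag[i, t] is TRUE when observation i was drawn into tree t's bootstrap sample
in_bag <- sapply(seq_len(n_trees), function(t) {
  seq_len(n) %in% sample(n, n, replace = TRUE)
})

# Share of (observation, tree) pairs where the observation is out-of-bag
oob_fraction <- mean(!in_bag)
oob_fraction

# Number of trees available to predict each observation out-of-bag
oob_trees_per_obs <- rowSums(!in_bag)
summary(oob_trees_per_obs)
```

The out-of-bag share should come out close to `exp(-1)`, roughly 0.37, so each observation is left out of a little over a third of the trees and can be scored by them as if it were unseen test data.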

**4. Information gain** - Used to determine which feature/attribute gives us the maximum information about a class. It is based on the concept of entropy, which is the degree of uncertainty, impurity, or disorder. It aims to reduce the level of entropy, starting from the root node to the leaf nodes.

The formula for entropy is:

$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $p_i$ is the probability (proportion) of class $i$ among the $c$ classes in the set $S$, and $E(S)$ is the entropy.
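As a small worked example, entropy and information gain can be computed in base R as sketched below. Entropy is taken as the negative sum of each class probability times its base-2 logarithm, and information gain as the parent’s entropy minus the weighted entropy of the two branches. The function names and the toy labels are illustrative, not from the original article.

```r
# Entropy of a vector of class labels, in bits
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a logical condition
information_gain <- function(labels, split) {
  w <- mean(split)  # share of observations in the TRUE branch
  entropy(labels) - (w * entropy(labels[split]) + (1 - w) * entropy(labels[!split]))
}

labels <- c("buy", "buy", "buy", "skip", "skip", "skip")

entropy(labels)  # two equally likely classes give 1 bit

# A split that separates the classes perfectly recovers all of the entropy
information_gain(labels, c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))

# A split that mixes the classes gains very little
information_gain(labels, c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
```

A tree builder that uses information gain would pick the first split, since it reduces entropy the most.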

**5. Gini index** - The Gini index, or Gini impurity, measures the probability that a randomly chosen element would be incorrectly classified if it were labeled at random according to the class distribution. It ranges from zero up to, but never reaching, one. Zero denotes that all elements belong to a single class, and values near one denote elements spread evenly across many classes. For two classes, the maximum is 0.5, reached when the elements are split equally between them.

The Gini index formula is:

$$Gini = 1 - \sum_{i=1}^{c} p_i^2$$

where $p_i$ is the probability of an object being classified to class $i$, and $c$ is the number of classes.
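The Gini index can be checked on a few toy label vectors in base R. This sketch computes one minus the sum of squared class probabilities; the `gini` helper and the labels are made up for illustration.

```r
# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini(c("a", "a", "a", "a"))  # 0: a pure node
gini(c("a", "a", "b", "b"))  # 0.5: two classes, evenly split
gini(c("a", "b", "c", "d"))  # 0.75: four classes, evenly split
```

With evenly split classes the impurity is one minus one over the number of classes, which is why it approaches but never reaches one.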

Let’s now look at how we can implement the random forest algorithm.

The following use case shows how this algorithm can be used to predict the quality of wine based on certain features, such as chloride content, alcohol content, sugar content, and pH value.

For this illustration, we have randomly assigned the variables to our root node and the internal nodes.

Usually, with decision trees or random forest algorithms, the variables for the root node and the internal nodes are chosen using the Gini index (Gini impurity) values.

1. The first decision tree takes chloride and alcohol content into consideration. If the chloride value is less than 0.08 and the alcohol content is greater than six, then the quality is high (in this case, eight). Otherwise, the quality is five.

2. The second decision tree splits on pH and sulphate content. If the sulphate value is less than 0.6 and the pH is less than 3.5, then the quality is six. Otherwise, it is five.

3. The last decision tree splits on sugar and chloride content. If the sugar is less than 2.5 and the chloride content is less than 0.08, then the quality of the wine is five. Otherwise, it is four.

For the wine in this example, two out of the three decision trees above indicate a quality of five, so the forest predicts five as well.
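The three trees described above can be written as plain R functions to see the vote in action. The thresholds come from the text; the sample wine’s feature values are hypothetical, chosen so that two trees predict five, and the column names assume those of the red wine dataset (`chlorides`, `alcohol`, `sulphates`, `pH`, `residual.sugar`).

```r
# Each tree is a hand-written rule using the thresholds from the text
tree1 <- function(w) if (w$chlorides < 0.08 && w$alcohol > 6) 8 else 5
tree2 <- function(w) if (w$sulphates < 0.6 && w$pH < 3.5) 6 else 5
tree3 <- function(w) if (w$residual.sugar < 2.5 && w$chlorides < 0.08) 5 else 4

# Hypothetical wine sample (values are assumptions for illustration)
wine_sample <- list(
  chlorides = 0.09, alcohol = 9.4, sulphates = 0.56,
  pH = 3.51, residual.sugar = 1.9
)

# Collect one vote per tree, then take the most frequent quality score
votes <- c(tree1(wine_sample), tree2(wine_sample), tree3(wine_sample))
forest_prediction <- as.numeric(names(which.max(table(votes))))

votes
forest_prediction
```

For this sample the votes are 5, 5, and 4, so the forest predicts a quality of five.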

In this demo, we will run an R program to predict the wine’s quality. The dataset holds all the attribute values required to predict the wine’s quality.

So, let’s get coding!

    # Load the red wine quality dataset
    wine <- read.csv(
      url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"),
      header = TRUE, sep = ";"
    )

    # Display the first rows and the dimensions of the wine dataset
    head(wine)
    dim(wine)

    # Bar plot of the number of wines at each quality score
    barplot(table(wine$quality))

    # Convert the numeric quality scores into a three-level taste factor:
    # below 5 is "bad", 5 or 6 is "normal", above 6 is "good"
    wine$taste <- ifelse(wine$quality < 5, "bad", "good")
    wine$taste[wine$quality == 5] <- "normal"
    wine$taste[wine$quality == 6] <- "normal"
    wine$taste <- as.factor(wine$taste)
    str(wine$taste)

    # Bar plot and counts of the wines in each taste class
    barplot(table(wine$taste))
    table(wine$taste)

    # Split the data: 80% for training, 20% for testing
    set.seed(123)
    samp <- sample(nrow(wine), 0.8 * nrow(wine))
    train <- wine[samp, ]
    test <- wine[-samp, ]

    # Data visualization
    library(ggplot2)

    # Scatter plot of fixed vs. volatile acidity, colored by taste
    ggplot(wine, aes(fixed.acidity, volatile.acidity)) +
      geom_point(aes(color = taste))

    # Stacked histogram of alcohol content, filled by taste
    ggplot(wine, aes(alcohol)) +
      geom_histogram(aes(fill = taste), color = "black", bins = 50)

    # Check the dimensions of the training and testing datasets
    dim(train)
    dim(test)

    # Install and load the randomForest package (the name is case-sensitive)
    install.packages("randomForest")
    library(randomForest)

    # Build the random forest model: predict taste from every column except quality
    model <- randomForest(taste ~ . - quality, data = train, ntree = 1000, mtry = 5)
    model
    model$confusion

    # Validate the model using the test data
    prediction <- predict(model, newdata = test)
    table(prediction, test$taste)
    prediction

    # Display the predicted vs. the actual values
    # (a data frame keeps the class labels; cbind() would show integer codes)
    results <- data.frame(pred = prediction, real = test$taste)
    results
    View(results)

    # Calculate the accuracy of the model
    sum(prediction == test$taste) / nrow(test)

You can see that this model’s accuracy is 90 percent, which is great. Now we have automated the process of predicting wine quality. This brings us to the end of this demo on random forest.


After reading this article, you should have a better understanding of the random forest algorithm: how it works, the key terms behind it, and its real-world applications. We also included a demo in which we built a random forest model to predict wine quality. We worked in RStudio for this demo and went over different commands, packages, and data visualization methods in R.

Shruti is an engineer and a technophile. She works on several trending technologies. Her hobbies include reading, dancing and learning new languages. Currently, she is learning the Japanese language.
