Machine learning algorithms are used in almost every sector of business to solve critical problems and build intelligent systems and processes. Supervised machine learning algorithms, specifically, are used for solving classification and regression problems. In this article, we’ll cover one of the most widely used supervised learning algorithms: decision trees in Python.
What is a Decision Tree?
A decision tree is a tree-based supervised learning method used to predict the value of a target variable. Supervised learning uses labeled data (data with known output variables) to train regression and classification models, so the known outputs act as a supervisor during training. A decision tree learns simple decision rules from the features of the data. Decision trees in Python can be used to solve both classification and regression problems, and they are frequently used in determining odds.
The following is an example of a simple decision tree used to classify different animals based on their features. We will be using the color and height of the animals as input features.
Fig: Decision tree to classify animals
Advantages of Using Decision Trees
- Decision trees are simple to understand, interpret, and visualize
- They can effectively handle both numerical and categorical data
- They can determine the worst, best, and expected values for several scenarios
- Decision trees require little data preparation and typically no data normalization
- They perform well even if their assumptions are somewhat violated by the true model that generated the data
Decision Tree Applications
- A decision tree is used to determine whether an applicant is likely to default on a loan.
- It can be used to determine the odds of an individual developing a specific disease.
- It can help ecommerce companies in predicting whether a consumer is likely to purchase a specific product.
- Decision trees can also be used to find customer churn rates.
Important Terms Used in Decision Trees
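1. Entropy: Entropy is the measure of uncertainty or randomness in a data set. It determines how a decision tree chooses to split the data.
It is calculated using the following formula:
Entropy(Y) = - Σ p_i log2(p_i), where p_i is the proportion of samples in the data set that belong to class i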
2. Information Gain: The information gain measures the decrease in entropy after the data set is split.
It is calculated as follows:
IG(Y, X) = Entropy(Y) - Entropy(Y | X)
3. Gini Index: The Gini Index is used to determine the best variable for splitting nodes. It measures how often a randomly chosen sample would be incorrectly classified if it were labeled at random according to the class distribution in the node.
4. Root Node: The root node is always the top node of a decision tree. It represents the entire population or data sample, and it can be further divided into different sets.
5. Decision Node: Decision nodes are subnodes that can be split into different subnodes; they contain at least two branches.
6. Leaf Node: A leaf node in a decision tree carries the final results. These nodes, which are also known as terminal nodes, cannot be split any further.
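The three impurity measures defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact routine any particular library uses; the function names and the `cat`/`dog` labels are hypothetical and chosen only for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity: chance of mislabeling a random sample from `labels`."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(labels, groups):
    """IG(Y, X) = Entropy(Y) - Entropy(Y | X); `groups` is `labels` split by X."""
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - conditional


# Hypothetical labels: two equally likely classes, split perfectly by a feature.
parent = ["cat", "cat", "dog", "dog"]
split = [["cat", "cat"], ["dog", "dog"]]

print(entropy(parent))                  # 1.0
print(gini(parent))                     # 0.5
print(information_gain(parent, split))  # 1.0
```

A split that leaves every group pure recovers the full entropy of the parent as information gain, which is why the example prints 1.0 for the last call.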
How Does a Decision Tree Algorithm Work?
Suppose there are different animals, and you want to identify each animal and classify them based on their features. We can easily accomplish this by using a decision tree.
The following is a cluttered sample data set with high entropy:
We have to determine which features split the data so that the information gain is the highest. We can do that by splitting the data using each feature and checking the information gain that we obtain from them. The feature that returns the highest gain will be used for the first split.
For our demo, we will take the following features into consideration:
We’ll use the information gain method to determine which variable yields the maximum gain; that variable will be used as the root node.
Suppose Color == Yellow results in the maximum information gain, so that is what we will use for our first split at the root node.
Fig: Using Color == Yellow for our first split of decision tree
The entropy after splitting should decrease considerably. However, we still need to split the child nodes at both the branches to attain an entropy value equal to zero.
We will split both nodes using the ‘height’ variable, with height > 10 and height < 10 as our conditions.
Fig: Splitting the decision tree with the height variable
The decision tree above can now predict all the classes of animals present in the data set.
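Once the splits are chosen, the finished tree amounts to a set of nested conditions. The sketch below writes that out by hand, assuming four hypothetical animal classes (the original data set is not reproduced here, so the class names are illustrative only) and treating a height of exactly 10 as part of the "not greater than 10" branch.

```python
def classify_animal(color: str, height: float) -> str:
    """Walk a two-level tree: color at the root, then height in each branch."""
    if color == "yellow":        # root node: Color == Yellow
        if height > 10:          # decision node on the yellow branch
            return "giraffe"     # leaf node (hypothetical class name)
        return "tiger"           # leaf node (hypothetical class name)
    if height > 10:              # decision node on the non-yellow branch
        return "elephant"        # leaf node (hypothetical class name)
    return "monkey"              # leaf node (hypothetical class name)


print(classify_animal("yellow", 15))  # giraffe
print(classify_animal("grey", 4))     # monkey
```

Each `return` corresponds to a leaf node, and each `if` corresponds to the root or a decision node, so the depth of nesting matches the depth of the tree.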
Now, it’s time to build a prediction model using the decision tree in Python.
Building a Decision Tree in Python
We’ll now predict if a consumer is likely to repay a loan using the decision tree algorithm in Python. The data set contains a wide range of information for making this prediction, including the initial payment amount, last payment amount, credit score, house number, and whether the individual was able to repay the loan.
1. First, we’ll import the libraries required to build a decision tree in Python.
2. Load the data set using the read_csv() function in pandas.
3. Display the top five rows from the data set using the head() function.
4. Separate the independent and dependent variables using the slicing method.
5. Split the data into training and testing sets.
6. Train the model using the decision tree classifier.
7. Predict the test data set values using the model above.
8. Calculate the accuracy of the model using the accuracy score function.
Our prediction model achieves an accuracy score of 93.67 percent on the test data set.
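The eight steps above can be sketched with pandas and scikit-learn as follows. Because the loan data set itself is not included here, this sketch replaces the read_csv() call with a synthetic table; the column names, value ranges, labeling rule, and hyperparameters are all assumptions made for illustration, so its accuracy is not comparable to the 93.67 percent reported above.

```python
# Step 1: import the required libraries.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 2 (stand-in): the article loads a CSV with pd.read_csv(). That file is
# not available here, so we build a synthetic frame with hypothetical columns.
rng = np.random.default_rng(seed=0)
n_rows = 1000
data = pd.DataFrame({
    "initial_payment": rng.integers(100, 500, size=n_rows),
    "last_payment": rng.integers(10_000, 15_000, size=n_rows),
    "credit_score": rng.integers(100, 800, size=n_rows),
    "house_number": rng.integers(3000, 5000, size=n_rows),
})
# Hypothetical labeling rule (1 = repaid, 0 = not repaid) so there is a pattern to learn.
data["repaid"] = (data["credit_score"] >= 400).astype(int)

# Step 3: display the top five rows.
print(data.head())

# Step 4: separate the independent variables (X) from the dependent variable (y).
X = data.iloc[:, 0:4].to_numpy()
y = data.iloc[:, 4].to_numpy()

# Step 5: hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

# Step 6: train a decision tree that splits on information gain (entropy).
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=100)
clf.fit(X_train, y_train)

# Step 7: predict the labels of the test set.
y_pred = clf.predict(X_test)

# Step 8: compare the predictions with the true labels.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```

To run this against a real file, the synthetic block in step 2 would be swapped for a read_csv() call, and the column slices in step 4 adjusted to match that file's layout.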
In this article, we covered one of the most widely used supervised learning algorithms—decision trees in Python. We not only introduced the basics of this algorithm, but we also went over its many benefits, explored more about how it works, and went through a demo that used the decision tree algorithm.