Decision Tree in Machine Learning
In the decision trees article, we discussed how decision trees model decisions through a tree-like structure, where internal nodes represent feature tests, branches represent decision rules, and leaf nodes contain the final predictions. This basic understanding is crucial for building and interpreting decision trees, which are widely used for classification and regression tasks.
Now, let’s take this understanding a step further and dive into how decision trees are implemented in machine learning. We will explore how to train a decision tree model, make predictions, and evaluate its performance.
Why Use a Decision Tree Structure in ML?
A decision tree is a supervised learning algorithm used for both classification and regression tasks. It models decisions as a tree-like structure where internal nodes represent attribute tests, branches represent attribute values, and leaf nodes represent final decisions or predictions. Decision trees are versatile, interpretable, and widely used in machine learning for predictive modeling.
We have now covered the very basics of decision trees, but it is just as important to understand the intuition behind them, so let’s move on to that.
Intuition behind the Decision Tree
Here’s a simple example to build intuition for how a decision tree works:
Imagine you’re deciding whether to buy an umbrella:
- Step 1 – Ask a Question (Root Node): Is it raining? If yes, you might decide to buy an umbrella. If no, you move to the next question.
- Step 2 – More Questions (Internal Nodes): If it’s not raining, you might ask: Is it likely to rain later? If yes, you buy an umbrella; if no, you don’t.
- Step 3 – Decision (Leaf Node): Based on your answers, you either buy or skip the umbrella.
Approach in Decision Tree
A decision tree uses a tree representation to solve a problem: each leaf node corresponds to a class label, and attributes are tested at the internal nodes of the tree. Any Boolean function on discrete attributes can be represented with a decision tree.

Example: Predicting Whether a Person Likes Computer Games
Imagine you want to predict if a person enjoys computer games based on their age and gender. Here’s how the decision tree works:
- Start with the Root Question (Age):
- The first question is: “Is the person’s age less than 15?”
- If Yes, move to the left.
- If No, move to the right.
- Branch Based on Age:
- If the person is younger than 15, they are likely to enjoy computer games (+2 prediction score).
- If the person is 15 or older, ask the next question: “Is the person male?”
- Branch Based on Gender (For Age 15+):
- If the person is male, they are somewhat likely to enjoy computer games (+0.1 prediction score).
- If the person is not male, they are less likely to enjoy computer games (-1 prediction score).
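To make this concrete, here is a purely illustrative Python sketch of that tree; the function and argument names are invented for this example and are not from any library:

```python
def game_score(age: int, is_male: bool) -> float:
    """Walk the tree above from the root question to a leaf prediction score."""
    if age < 15:          # root node: "Is the person's age less than 15?"
        return 2.0        # leaf: likely to enjoy computer games
    if is_male:           # internal node: "Is the person male?"
        return 0.1        # leaf: somewhat likely to enjoy computer games
    return -1.0           # leaf: less likely to enjoy computer games

print(game_score(age=10, is_male=False))  # 2.0
print(game_score(age=30, is_male=True))   # 0.1
```

Each `if` corresponds to an internal node, and each `return` corresponds to a leaf with its prediction score.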

Example: Predicting Whether a Person Likes Computer Games Using Two Decision Trees
Tree 1: Age and Gender
- The first tree asks two questions:
- “Is the person’s age less than 15?”
- If Yes, they get a score of +2.
- If No, proceed to the next question.
- “Is the person male?”
- If Yes, they get a score of +0.1.
- If No, they get a score of -1.
Tree 2: Computer Usage
- The second tree focuses on daily computer usage:
- “Does the person use a computer daily?”
- If Yes, they get a score of +0.9.
- If No, they get a score of -0.9.
Combining Trees: Final Prediction
The final prediction score is the sum of the scores from both trees. For example, a child under 15 who uses a computer daily gets 2 + 0.9 = 2.9, while an adult woman who does not use a computer daily gets -1 + (-0.9) = -1.9, as the sketch below shows.
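Here is a minimal Python sketch of how the two trees combine; the function names are ours, and tree 1 compresses the logic from the earlier sketch:

```python
def tree1_score(age: int, is_male: bool) -> float:
    """Tree 1: age and gender (same logic as the earlier sketch)."""
    if age < 15:
        return 2.0
    return 0.1 if is_male else -1.0

def tree2_score(uses_computer_daily: bool) -> float:
    """Tree 2: daily computer usage."""
    return 0.9 if uses_computer_daily else -0.9

def final_score(age: int, is_male: bool, uses_computer_daily: bool) -> float:
    """Final prediction: the sum of the scores from both trees."""
    return tree1_score(age, is_male) + tree2_score(uses_computer_daily)

print(final_score(age=10, is_male=True, uses_computer_daily=True))    # 2.0 + 0.9 = 2.9
print(final_score(age=60, is_male=False, uses_computer_daily=False))  # -1.0 + (-0.9) = -1.9
```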
Information Gain and Gini Index in Decision Tree
So far we have covered the basic intuition and approach of how a decision tree works, so let’s move on to the attribute selection measures used in decision trees.
We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
1. Information Gain:
Information Gain tells us how useful a question (or feature) is for splitting data into groups. It measures how much the uncertainty decreases after the split. A good question will create clearer groups, and the feature with the highest Information Gain is chosen to make the decision.
For example, if we split a dataset of people into “Young” and “Old” based on age, and all young people bought the product while all old people did not, the Information Gain would be high because the split perfectly separates the two groups with no uncertainty left.
Suppose S is a set of instances, A is an attribute, Values(A) is the set of all possible values of A, and Sv is the subset of S for which attribute A has value v. Then the information gain of attribute A relative to S is:
[Tex]Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)}\frac{\left | S_{v} \right |}{\left | S \right |}\cdot Entropy(S_{v}) [/Tex]
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content. For a set S whose instances belong to classes occurring with proportions p1, ..., pn, it is defined as:
[Tex]Entropy(S) = -\sum_{i=1}^{n} p_{i}\log_{2}p_{i} [/Tex]
For example, if a dataset has an equal number of “Yes” and “No” outcomes (like 3 people who bought a product and 3 who didn’t), the entropy is at its maximum of 1 because it’s uncertain which outcome to predict. But if all the outcomes are the same (all “Yes” or all “No”), the entropy is 0, meaning there is no uncertainty left in predicting the outcome.
Example:
For the set X = {a, a, a, b, b, b, b, b}: total instances = 8, instances of a = 3, instances of b = 5.
[Tex]\begin{aligned}\text{Entropy } H(X) & = -\left [ \left ( \frac{3}{8} \right )\log_{2}\frac{3}{8} + \left ( \frac{5}{8} \right )\log_{2}\frac{5}{8} \right ]\\& = -[0.375 (-1.415) + 0.625 (-0.678)] \\& = -(-0.531-0.424) \\& = 0.954\end{aligned}[/Tex]
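The same number can be verified with a few lines of Python; this is a quick standalone check, and the entropy helper below is our own, not a library function:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

X = ["a", "a", "a", "b", "b", "b", "b", "b"]
print(round(entropy(X), 3))  # 0.954
```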
Building Decision Tree using Information Gain
The essentials (a code sketch follows this list):
- Start with all training instances associated with the root node
- Use info gain to choose which attribute to label each node with
- Note: No root-to-leaf path should contain the same discrete attribute twice
- Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree.
- If all remaining training instances are positive or all are negative, label that node “yes” or “no” accordingly
- If no attributes remain, label with a majority vote of training instances left at that node
- If no instances remain, label with a majority vote of the parent’s training instances.
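The procedure described in this list can be sketched in plain Python. This is only an illustrative implementation under simplifying assumptions (discrete attribute values, instances represented as dictionaries, no handling of empty branches), and all function names are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain from splitting the instances on attribute attr."""
    gain = entropy(labels)
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def build_tree(rows, labels, attrs):
    """Recursively build a tree: leaves are labels, internal nodes are dicts."""
    if len(set(labels)) == 1:                      # all instances share one class
        return labels[0]
    if not attrs:                                  # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node[best][value] = build_tree(
            [rows[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attrs if a != best],       # never reuse an attribute on a path
        )
    return node
```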
Example: Now, let us draw a Decision Tree for the following data using Information gain. Training set: 3 features and 2 classes
| X | Y | Z | C |
|---|---|---|---|
| 1 | 1 | 1 | I |
| 1 | 1 | 0 | I |
| 0 | 0 | 1 | II |
| 1 | 0 | 0 | II |
Here, we have 3 features and 2 output classes. To build a decision tree using Information Gain, we take each feature in turn and calculate the information gain obtained by splitting on it.
[Figures: entropy and information gain calculated for a split on feature X, feature Y, and feature Z]
From these calculations, we can see that the information gain is maximum when we split on feature Y, so feature Y is the best-suited feature for the root node. Moreover, when the dataset is split by feature Y, each child node contains a pure subset of the target variable, so no further splitting is needed. The final tree for this dataset therefore consists of a single split: the root node tests Y, the branch Y = 1 is a leaf predicting class I, and the branch Y = 0 is a leaf predicting class II.
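For completeness, the information gains for X, Y, and Z can be reproduced with a short, self-contained Python snippet; it re-declares a small entropy helper, and none of it is library code:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

rows = [  # the training set from the table above: (X, Y, Z, class)
    (1, 1, 1, "I"), (1, 1, 0, "I"), (0, 0, 1, "II"), (1, 0, 0, "II"),
]
classes = [r[3] for r in rows]

for i, name in enumerate(["X", "Y", "Z"]):
    gain = entropy(classes)
    for v in {r[i] for r in rows}:
        subset = [r[3] for r in rows if r[i] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    print(f"Gain(S, {name}) = {gain:.3f}")

# Expected output:
# Gain(S, X) = 0.311
# Gain(S, Y) = 1.000
# Gain(S, Z) = 0.000
```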
2. Gini Index
- The Gini Index is a metric that measures how often a randomly chosen element would be incorrectly classified. This means an attribute with a lower Gini Index should be preferred.
- Scikit-learn supports the “gini” criterion for the Gini Index, and it is the default value of the splitting criterion.
For example, if we have a group of people where all bought the product (100% “Yes”), the Gini Index is 0, indicating perfect purity. But if the group has an equal mix of “Yes” and “No”, the Gini Index would be 0.5, showing higher impurity or uncertainty.
The formula for the Gini Index is given by:
[Tex]Gini = 1 - \sum_{i=1}^{n} p_i^2[/Tex]
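As a quick sanity check of the example values mentioned above, here is a small standalone sketch (our own helper, not library code):

```python
def gini(probabilities):
    """Gini Index for a list of class probabilities: 1 - sum(p_i^2)."""
    return 1 - sum(p ** 2 for p in probabilities)

print(gini([1.0]))       # pure node (all "Yes"): 0.0
print(gini([0.5, 0.5]))  # even "Yes"/"No" mix: 0.5
```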
Some additional features and characteristics of the Gini Index are:
- It is calculated by summing the squared probabilities of each outcome in a distribution and subtracting the result from 1.
- A lower Gini Index indicates a more homogeneous or pure distribution, while a higher Gini Index indicates a more heterogeneous or impure distribution.
- In decision trees, the Gini Index is used to evaluate the quality of a split by measuring the difference between the impurity of the parent node and the weighted impurity of the child nodes.
- Compared to other impurity measures like entropy, the Gini Index is faster to compute and more sensitive to changes in class probabilities.
- One disadvantage of the Gini Index is that it tends to favour splits that create equally sized child nodes, even if they are not optimal for classification accuracy.
- In practice, the choice between using the Gini Index or other impurity measures depends on the specific problem and dataset, and often requires experimentation and tuning.
Understanding Decision Tree with a Real-Life Use Case
So far we have covered the attributes and components of a decision tree. Now let’s walk through a real-life use case to see, step by step, how a decision tree works.
Step 1. Start with the Whole Dataset
We begin with all the data, which is treated as the root node of the decision tree.
Step 2. Choose the Best Question (Attribute)
Pick the best question to divide the dataset. For example, ask: “What is the outlook?”
- Possible answers: Sunny, Cloudy, or Rainy.
Step 3. Split the Data into Subsets
Divide the dataset into groups based on the question:
- If Sunny, go to one subset.
- If Cloudy, go to another subset.
- If Rainy, go to the last subset.
Step 4. Split Further if Needed (Recursive Splitting)
For each subset, ask another question to refine the groups. For example:
- If the Sunny subset is mixed, ask: “Is the humidity high or normal?”
- High humidity → “Swimming”.
- Normal humidity → “Hiking”.
Step 5. Assign Final Decisions (Leaf Nodes)
When a subset contains only one activity, stop splitting and assign it a label:
- Cloudy → “Hiking”.
- Rainy → “Stay Inside”.
- Sunny + High Humidity → “Swimming”.
- Sunny + Normal Humidity → “Hiking”.
Step 6. Use the Tree for Predictions
To predict an activity, follow the branches of the tree:
- Example: If the outlook is Sunny and the humidity is High, follow the tree:
- Start at Outlook.
- Take the branch for Sunny.
- Then go to Humidity and take the branch for High Humidity.
- Result: “Swimming”.
This is how a decision tree works: by splitting data step-by-step based on the best questions and stopping when a clear decision is made!
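The same idea can be reproduced with a library. Below is a minimal sketch using scikit-learn’s DecisionTreeClassifier on a small hypothetical weather dataset; the rows, the one-hot encoding, and the hyperparameter values are invented for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data mirroring the walkthrough above.
data = pd.DataFrame({
    "outlook":  ["Sunny", "Sunny", "Cloudy", "Rainy", "Sunny", "Rainy"],
    "humidity": ["High", "Normal", "Normal", "High", "High", "Normal"],
    "activity": ["Swimming", "Hiking", "Hiking", "Stay Inside", "Swimming", "Stay Inside"],
})

# One-hot encode the categorical features so the tree can split on them.
X = pd.get_dummies(data[["outlook", "humidity"]])
y = data["activity"]

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X, y)

# Inspect the learned splits as text.
print(export_text(clf, feature_names=list(X.columns)))

# Predict for: outlook = Sunny, humidity = High
sample = pd.DataFrame([{c: 0 for c in X.columns}])
sample.loc[0, ["outlook_Sunny", "humidity_High"]] = 1
print(clf.predict(sample))  # ['Swimming']
```

With this toy data, the tree learns splits on the outlook and humidity columns, and the Sunny + High humidity query ends in the “Swimming” leaf, matching the walkthrough above.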
Conclusion
Decision trees, a key tool in machine learning, model and predict outcomes based on input data through a tree-like structure. They offer interpretability, versatility, and simple visualization, making them valuable for both categorization and regression tasks. While decision trees have advantages like ease of understanding, they may face challenges such as overfitting. Understanding their terminologies and formation process is essential for effective application in diverse scenarios.
Frequently Asked Questions (FAQs)
1. What are the major issues in decision tree learning?
Major issues in decision tree learning include overfitting, sensitivity to small data changes, and limited generalization. Ensuring proper pruning, tuning, and handling imbalanced data can help mitigate these challenges for more robust decision tree models.
2. How does decision tree help in decision making?
Decision trees aid decision-making by representing complex choices in a hierarchical structure. Each node tests specific attributes, guiding decisions based on data values. Leaf nodes provide final outcomes, offering a clear and interpretable path for decision analysis in machine learning.
3. What is the maximum depth of a decision tree?
The maximum depth of a decision tree is a hyperparameter that determines the maximum number of levels or nodes from the root to any leaf. It controls the complexity of the tree and helps prevent overfitting.
4. What is the concept of decision tree?
A decision tree is a supervised learning algorithm that models decisions based on input features. It forms a tree-like structure where each internal node represents a decision based on an attribute, leading to leaf nodes representing outcomes.
5. What is entropy in decision tree?
In decision trees, entropy is a measure of impurity or disorder within a dataset. It quantifies the uncertainty associated with classifying instances, guiding the algorithm to make informative splits for effective decision-making.
6. What are the Hyperparameters of decision tree?
- Max Depth: Maximum depth of the tree.
- Min Samples Split: Minimum samples required to split an internal node.
- Min Samples Leaf: Minimum samples required in a leaf node.
- Criterion: The function used to measure the quality of a split (e.g., Gini Index or entropy).
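For reference, these hyperparameters map directly onto scikit-learn’s DecisionTreeClassifier arguments; the values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,           # Max Depth: maximum depth of the tree
    min_samples_split=10,  # Min Samples Split: minimum samples to split an internal node
    min_samples_leaf=5,    # Min Samples Leaf: minimum samples required in a leaf node
    criterion="entropy",   # Criterion: split-quality measure ("gini" or "entropy")
)
```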