Decision Tree

개요

skleran.tree.plot_tree의 색상을 바꿔보도록 한다.
matplotlib 객체지향의 구조를 알면 어렵지(?) 않게 바꿀 수 있다.
간단하게 plot_tree 시각화를 구현해본다.
- 언제나 예제로 희생당하는 iris 데이터에게 애도를 표한다.
구글코랩에서 실행 시, 다음 코드를 실행하여 최신 라이브러리로 업그레이드 한다.

!pip install -U matplotlib

Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (3.2.2)
Collecting matplotlib
  Downloading matplotlib-3.5.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (11.2 MB)
[K     |████████████████████████████████| 11.2 MB 27.0 MB/s 
[?25hRequirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.4.0)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (2.8.2)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (1.21.5)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (7.1.2)
Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (3.0.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from matplotlib) (21.3)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.31.2-py3-none-any.whl (899 kB)
[K     |████████████████████████████████| 899 kB 50.5 MB/s 
[?25hRequirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from kiwisolver>=1.0.1->matplotlib) (3.10.0.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7->matplotlib) (1.15.0)
Installing collected packages: fonttools, matplotlib
  Attempting uninstall: matplotlib
    Found existing installation: matplotlib 3.2.2
    Uninstalling matplotlib-3.2.2:
      Successfully uninstalled matplotlib-3.2.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.[0m
Successfully installed fonttools-4.31.2 matplotlib-3.5.1

%matplotlib inline 

import sklearn
print(sklearn.__version__)
import matplotlib
print(matplotlib.__version__)

# 필수 라이브러리 불러오기
from sklearn.datasets import load_iris
from sklearn import tree 
import matplotlib.pyplot as plt

# 데이터 불러오기
iris = load_iris()
print(iris.data.shape, iris.target.shape)
print("feature names", iris.feature_names)
print("class names", iris.target_names)

# 모형 학습 및 plot_tree 그래프 구현
dt = tree.DecisionTreeClassifier(random_state=0)
dt.fit(iris.data, iris.target)

fig, ax = plt.subplots(figsize=(10, 6))
ax = tree.plot_tree(dt, max_depth = 2, filled=True, feature_names = iris.feature_names, class_names = iris.target_names)
plt.show()

1.0.2
3.5.1
(150, 4) (150,)
feature names ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
class names ['setosa' 'versicolor' 'virginica']

png

개요

현대 머신러닝 이론의 백본(Backbone)이 되는 결정 트리에 대해 이론적으로 살짝 정리한다.
주요 수식은 Python Machine Learning Second Edition 교재를 주로 참고 하였다. (Page: 90 ~ 94)
- 교재 출처: https://www.amazon.com/Python-Machine-Learning-scikit-learn-TensorFlow/dp/1787125939

결정 트리의 예

결정 트리는 여러가지 연속된 질문을 학습하여 분류하는 것이 원칙이다.
다음의 간단한 예를 들어본다.

결정 트리는 크게 3가지로 구성이 되어 있다.
- 트리 내부 노드, 리프 노드, 그리고 가지로 구성이 되어 있다.
- 어떻게 질문을 하느냐에 따라서 분류가 결정된다.
결정 트리는 숫자에도 적용할 수 있다.
- 예) 키가 160cm보다 큰가요?

결정 트리의 주요 원리

결정 트리는 트리의 루트(Root)에서 시작하여, 정보 이득(Information Gain, IG)가 최대가 되는 특성으로 데이터를 나눔
반복 과정(분류할 수 있는 연속적인 질문)을 통해 계속적으로 분류함
- 전문 용어로는 Until the leaves are pure.
결정 트리에서 가장 중요한 것은 정보 이득 최대화임(Maximizing Information Gain)
- 가장 빠르게 확실하게 분류할 수 있는 질문(Question)
- 다음 예를 살펴본다. (출처: https://www.geeksforgeeks.org/decision-tree-introduction-example/)

There are a number of ways to validate second level models (meta-models). In this reading material you will find a description for the most popular ones. If not specified, we assume that the data does not have a time component. We also assume we already validated and fixed hyperparameters for the first level models (models).

Simple holdout scheme

Split train data into three parts: partA and partB and partC.
Fit N diverse models on partA, predict for partB, partC, test_data getting meta-features partB_meta, partC_meta and test_meta respectively.
Fit a metamodel to a partB_meta while validating its hyperparameters on partC_meta.
When the metamodel is validated, fit it to [partB_meta, partC_meta] and predict for test_meta.

Meta holdout scheme with OOF meta-features

Split train data into K folds. Iterate though each fold: retrain N diverse models on all folds except current fold, predict for the current fold. After this step for each object in train_data we will have N meta-features (also known as out-of-fold predictions, OOF). Let’s call them train_meta.
Fit models to whole train data and predict for test data. Let’s call these features test_meta.
Split train_meta into two parts: train_metaA and train_metaB. Fit a meta-model to train_metaA while validating its hyperparameters on train_metaB.
When the meta-model is validated, fit it to train_meta and predict for test_meta.

Meta KFold scheme with OOF meta-features

Obtain OOF predictions train_meta and test metafeatures test_meta using b.1 and b.2.
Use KFold scheme on train_meta to validate hyperparameters for meta-model. A common practice to fix seed for this KFold to be the same as seed for KFold used to get OOF predictions.
When the meta-model is validated, fit it to train_meta and predict for test_meta.

Holdout scheme with OOF meta-features

Split train data into two parts: partA and partB.
Split partA into K folds. Iterate though each fold: retrain N diverse models on all folds except current fold, predict for the current fold. After this step for each object in partA we will have N meta-features (also known as out-of-fold predictions, OOF). Let’s call them partA_meta.
Fit models to whole partA and predict for partB and test_data, getting partB_meta and test_meta respectively.
Fit a meta-model to a partA_meta, using partB_meta to validate its hyperparameters.
When the meta-model is validated basically do 2. and 3. without dividing train_data into parts and then train a meta-model. That is, first get out-of-fold predictions train_meta for the train_data using models. Then train models on train_data, predict for test_data, getting test_meta. Train meta-model on the train_meta and predict for test_meta.

KFold scheme with OOF meta-features

To validate the model we basically do d.1 – d.4 but we divide train data into parts partA and partB M times using KFold strategy with M folds.
When the meta-model is validated do d.5.

Validation in presence of time component

KFold scheme in time series

In time-series task we usually have a fixed period of time we are asked to predict. Like day, week, month or arbitrary period with duration of T.

개요

사이킷런(scikit-learn)은 파이썬 머신러닝 라이브러리이다.
파이썬에서 나오는 최신 알고리즘들도 이제는 사이킷런에 통합하는 형태로 취하고 있다.
구글 코랩은 기본적으로 사이킷런까지 설치가 완료되기에 별도의 설치가 필요없는 장점이 있다.
Note: 본 포스트는 머신러닝 자체를 처음 접하는 분들을 위한 것이기 때문에, 어느정도 경험이 있으신 분들은 필자의 다른 포스트를 읽어주시기를 바랍니다.

패키지 불러오기

패키지는 시간에 지남에 따라 계속 업그레이드가 되기 때문에 꼭 버전 체크를 하는 것을 권장한다.
- 필자가 글을 남겼을 때는 2020년 8월 16일에 작성했음을 기억하자.

import sklearn
print(sklearn.__version__)

0.22.2.post1

머신러닝 워크플로우

가장 간단한 데이터인 iris 데이터의 종 분류를 진행하도록 한다.
사실, 이 예제는 매우 간단하기 때문에, 전체적인 프로세스를 익히는 관점에서 확인하기를 바란다.

(1) 지도학습의 정의

지도학습(Supervised Learning)의 가장 큰 특징 중의 하나는 위와 같이 분류 결정값이 사전에 정의가 되어야 한다.
- 만��, 사전에 분류가 된 것이 없다면 어떻게 할 수 있을까? 당연한 말이지만, 머신러닝 수행하기 전 데이터 수집이 필수가 된다.
위 꽃을 분류하는 것 같이, 명확한 정답이 주어진 데이터를 먼저 학습 한 뒤, 미지의 정답을 예측하는 것을 지도학습이라 부른다.