Pipeline

Using Pipeline with PyCaret and Skorch

Overview

  • Scikit-Learn's Pipeline is a powerful tool.
  • It can also be used with PyCaret and Skorch.
  • Let's try it out in Google Colab.

Installing Required Libraries

  • After installing pycaret, be sure to click "Restart runtime".
!pip install pycaret
Collecting pycaret
  Downloading pycaret-2.3.5-py3-none-any.whl (288 kB)
.
.
Successfully installed Boruta-0.3 Mako-1.1.6 PyYAML-6.0 alembic-1.4.1 databricks-cli-0.16.2 docker-5.0.3 funcy-1.17 gitdb-4.0.9 gitpython-3.1.24 gunicorn-20.1.0 htmlmin-0.1.12 imagehash-4.2.1 imbalanced-learn-0.7.0 joblib-1.0.1 kmodes-0.11.1 lightgbm-3.3.1 mlflow-1.22.0 mlxtend-0.19.0 multimethod-1.6 pandas-profiling-3.1.0 phik-0.12.0 prometheus-flask-exporter-0.18.7 pyLDAvis-3.2.2 pycaret-2.3.5 pydantic-1.8.2 pynndescent-0.5.5 pyod-0.9.6 python-editor-1.0.4 querystring-parser-1.2.4 requests-2.26.0 scikit-learn-0.23.2 scikit-plot-0.3.7 scipy-1.5.4 smmap-5.0.0 tangled-up-in-unicode-0.1.0 umap-learn-0.5.2 visions-0.7.4 websocket-client-1.2.3
!pip install -U skorch
Requirement already satisfied: skorch in /usr/local/lib/python3.7/dist-packages (0.11.0)
Requirement already satisfied: tabulate>=0.7.7 in /usr/local/lib/python3.7/dist-packages (from skorch) (0.8.9)
Requirement already satisfied: scikit-learn>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from skorch) (0.23.2)
Requirement already satisfied: tqdm>=4.14.0 in /usr/local/lib/python3.7/dist-packages (from skorch) (4.62.3)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from skorch) (1.19.5)
Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.7/dist-packages (from skorch) (1.5.4)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.19.1->skorch) (1.0.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn>=0.19.1->skorch) (3.0.0)
from pycaret.datasets import get_data
data = get_data("electrical_grid")
       tau1      tau2      tau3      tau4        p1        p2        p3        p4        g1        g2        g3        g4     stabf
0  2.959060  3.079885  8.381025  9.780754  3.763085 -0.782604 -1.257395 -1.723086  0.650456  0.859578  0.887445  0.958034  unstable
1  9.304097  4.902524  3.047541  1.369357  5.067812 -1.940058 -1.872742 -1.255012  0.413441  0.862414  0.562139  0.781760    stable
2  8.971707  8.848428  3.046479  1.214518  3.405158 -1.207456 -1.277210 -0.920492  0.163041  0.766689  0.839444  0.109853  unstable
3  0.716415  7.669600  4.486641  2.340563  3.963791 -1.027473 -1.938944 -0.997374  0.446209  0.976744  0.929381  0.362718  unstable
4  3.134112  7.608772  4.943759  9.857573  3.525811 -1.125531 -1.845975 -0.554305  0.797110  0.455450  0.656947  0.820923  unstable

PyTorch Model

  • The skorch library works with PyTorch models.
  • Design a class that builds an MLP model.
import torch.nn as nn

class Net(nn.Module):
  # MLP: 12 input features -> 200 units -> 100 units -> 2 class probabilities
  def __init__(self, num_inputs=12, num_units_d1=200, num_units_d2=100):
    super().__init__()

    self.dense0 = nn.Linear(num_inputs, num_units_d1)
    self.nonlin = nn.ReLU()
    self.dropout = nn.Dropout(0.5)
    self.dense1 = nn.Linear(num_units_d1, num_units_d2)
    self.output = nn.Linear(num_units_d2, 2)
    self.softmax = nn.Softmax(dim=-1)

  def forward(self, X, **kwargs):
    X = self.nonlin(self.dense0(X))
    X = self.dropout(X)  # dropout is applied after the first hidden layer only
    X = self.nonlin(self.dense1(X))
    # Return class probabilities, which skorch's default criterion (NLLLoss) expects
    X = self.softmax(self.output(X))
    return X
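
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The following is a minimal sketch that uses only the default layer sizes from the class above; linear_params is a hypothetical helper written for this illustration, not a PyTorch function.

```python
# Count the trainable parameters of the MLP defined above (default sizes).
# A fully connected layer with n_in inputs and n_out outputs holds
# n_in * n_out weights plus n_out biases. ReLU, Dropout and Softmax add none.

def linear_params(n_in, n_out):
    """Hypothetical helper: parameter count of one fully connected layer."""
    return n_in * n_out + n_out

num_inputs, num_units_d1, num_units_d2, num_classes = 12, 200, 100, 2

layer_sizes = {
    "dense0": linear_params(num_inputs, num_units_d1),
    "dense1": linear_params(num_units_d1, num_units_d2),
    "output": linear_params(num_units_d2, num_classes),
}
total_params = sum(layer_sizes.values())

for name, count in layer_sizes.items():
    print(f"{name}: {count}")
print("total:", total_params)
```

Most of the parameters sit in the second layer (200 x 100), so num_units_d1 and num_units_d2 are the natural knobs to vary when tuning model capacity.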

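Skorch Classifier

  • Wrap the PyTorch class with the NeuralNetClassifier class.
  • The default optimizer, SGD, is used. To switch to a different optimizer, refer to the skorch documentation.
  • By default, skorch sets aside part of the training data for validation, based on a 5-fold split.
    • 80% of the data is used for training and the remaining 20% for validation.
  • In the code below, train_split = None turns this internal split off, because PyCaret runs its own cross-validation.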
from skorch import NeuralNetClassifier 

net = NeuralNetClassifier(
    module = Net, 
    max_epochs = 30, 
    lr = 0.1, 
    batch_size = 32, 
    train_split = None
)

Training a Neural Network with PyCaret

  • Once the skorch NN model has been initialized, it can be trained with PyCaret.
  • PyCaret uses the Pandas DataFrame as its main data object.
  • To use a skorch model, however, the pipeline must include skorch's DataFrameTransformer() as a step.
from skorch.helper import DataFrameTransformer
import numpy as np
from sklearn.pipeline import Pipeline

nn_pipe = Pipeline(
    [("transform", DataFrameTransformer()), 
     ("net", net), ]
)

PyCaret Setup

  • Train the models through PyCaret instead of the skorch API.
  • Setting log_experiment = True enables experiment logging with MLflow.
  • Setting silent = True skips the "press enter to continue" prompt that otherwise appears during setup.
from pycaret.classification import *
target = "stabf"
clf1 = setup(data = data, 
            target = target,
            train_size = 0.8,
            fold = 5,
            session_id = 123,
            log_experiment = True, 
            experiment_name = 'electrical_grid_1', 
            silent = True)
    Description                             Value
0   session_id                              123
1   Target                                  stabf
2   Target Type                             Binary
3   Label Encoded                           stable: 0, unstable: 1
4   Original Data                           (10000, 13)
5   Missing Values                          False
6   Numeric Features                        12
7   Categorical Features                    0
8   Ordinal Features                        False
9   High Cardinality Features               False
10  High Cardinality Method                 None
11  Transformed Train Set                   (8000, 12)
12  Transformed Test Set                    (2000, 12)
13  Shuffle Train-Test                      True
14  Stratify Train-Test                     False
15  Fold Generator                          StratifiedKFold
16  Fold Number                             5
17  CPU Jobs                                -1
18  Use GPU                                 False
19  Log Experiment                          True
20  Experiment Name                         electrical_grid_1
21  USI                                     9626
22  Imputation Type                         simple
23  Iterative Imputation Iteration          None
24  Numeric Imputer                         mean
25  Iterative Imputation Numeric Model      None
26  Categorical Imputer                     constant
27  Iterative Imputation Categorical Model  None
28  Unknown Categoricals Handling           least_frequent
29  Normalize                               False
30  Normalize Method                        None
31  Transformation                          False
32  Transformation Method                   None
33  PCA                                     False
34  PCA Method                              None
35  PCA Components                          None
36  Ignore Low Variance                     False
37  Combine Rare Levels                     False
38  Rare Level Threshold                    None
39  Numeric Binning                         False
40  Remove Outliers                         False
41  Outliers Threshold                      None
42  Remove Multicollinearity                False
43  Multicollinearity Threshold             None
44  Remove Perfect Collinearity             True
45  Clustering                              False
46  Clustering Iteration                    None
47  Polynomial Features                     False
48  Polynomial Degree                       None
49  Trignometry Features                    False
50  Polynomial Threshold                    None
51  Group Features                          False
52  Feature Selection                       False
53  Feature Selection Method                classic
54  Features Selection Threshold            None
55  Feature Interaction                     False
56  Feature Ratio                           False
57  Interaction Threshold                   None
58  Fix Imbalance                           False
59  Fix Imbalance Method                    SMOTE
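
The row counts in the setup summary follow directly from the arguments passed to setup(). The sketch below redoes that arithmetic and, as an extra step, estimates how many mini-batches the skorch network sees per epoch inside one cross-validation fold. It assumes the folds are of equal size and that the final partial batch, if any, is kept.

```python
# Row counts implied by the setup() arguments and the skorch settings.
n_rows = 10000      # "Original Data" in the setup summary
train_size = 0.8    # setup(train_size=0.8)
n_folds = 5         # setup(fold=5)
batch_size = 32     # NeuralNetClassifier(batch_size=32)

n_train = round(n_rows * train_size)   # rows in the transformed train set
n_test = n_rows - n_train              # rows in the transformed test set

# Within each fold, one part is held out and the rest is used for fitting.
fold_valid = n_train // n_folds
fold_train = n_train - fold_valid

# Ceiling division: a final partial batch still counts as one batch.
batches_per_epoch = -(-fold_train // batch_size)

print(n_train, n_test, fold_valid, fold_train, batches_per_epoch)
```

The first two numbers match the "Transformed Train Set" and "Transformed Test Set" rows of the summary.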

PyCaret Train Model

  • Try a Random Forest model.
model = create_model("rf")
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
0       0.9244  0.9796  0.9667  0.9189  0.9422  0.8331  0.8353
1       0.9275  0.9793  0.9549  0.9330  0.9438  0.8417  0.8422
2       0.9225  0.9810  0.9608  0.9211  0.9406  0.8294  0.8309
3       0.9081  0.9738  0.9461  0.9130  0.9293  0.7983  0.7993
4       0.9044  0.9738  0.9471  0.9071  0.9267  0.7894  0.7909
Mean    0.9174  0.9775  0.9551  0.9186  0.9365  0.8184  0.8197
SD      0.0093  0.0031  0.0079  0.0087  0.0071  0.0206  0.0206
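
The Mean and SD rows summarize the five fold scores. As a sanity check, the sketch below recomputes them for the Accuracy column using only the standard library. It assumes SD is the population standard deviation (dividing by n rather than n - 1), which is the variant that reproduces the value printed above.

```python
from statistics import mean, pstdev

# Accuracy of the Random Forest model on folds 0 to 4, copied from the table.
fold_accuracy = [0.9244, 0.9275, 0.9225, 0.9081, 0.9044]

acc_mean = round(mean(fold_accuracy), 4)
acc_sd = round(pstdev(fold_accuracy), 4)   # population SD (assumption)

print(acc_mean, acc_sd)   # 0.9174 0.0093
```

The same check can be repeated for any other column or for the skorch results.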

PyCaret Train Skorch Model

  • This time, pass the skorch model to the same PyCaret function and check the result.
skorch_model = create_model(nn_pipe)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
0       0.8831  0.9644  0.9500  0.8769  0.9120  0.7389  0.7441
1       0.8550  0.9437  0.9569  0.8385  0.8938  0.6685  0.6831
2       0.8369  0.9280  0.9638  0.8146  0.8829  0.6202  0.6446
3       0.8506  0.9347  0.8668  0.8957  0.8810  0.6805  0.6812
4       0.8081  0.9411  0.9765  0.7789  0.8666  0.5400  0.5859
Mean    0.8468  0.9424  0.9428  0.8409  0.8873  0.6496  0.6678
SD      0.0245  0.0123  0.0390  0.0421  0.0151  0.0666  0.0519

Comparing Models

  • Check which of the two models performs better.
best_model = compare_models(include=[skorch_model, model], sort = "AUC")
   Model                     Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  TT (Sec)
1  Random Forest Classifier    0.9174  0.9775  0.9551  0.9186  0.9365  0.8184  0.8197     2.114
0  NeuralNetClassifier         0.8426  0.9400  0.9547  0.8281  0.8861  0.6355  0.6565    11.878

Hyperparameter Grid

  • Apply hyperparameter tuning.
  • The parameter names available for tuning can be listed with the following commands.
skorch_model.get_params().keys()
dict_keys(['memory', 'steps', 'verbose', 'transform', 'net', 'transform__float_dtype', 'transform__int_dtype', 'transform__treat_int_as_categorical', 'net__module', 'net__criterion', 'net__optimizer', 'net__lr', 'net__max_epochs', 'net__batch_size', 'net__iterator_train', 'net__iterator_valid', 'net__dataset', 'net__train_split', 'net__callbacks', 'net__predict_nonlinearity', 'net__warm_start', 'net__verbose', 'net__device', 'net___kwargs', 'net__classes', 'net__callbacks__epoch_timer', 'net__callbacks__train_loss', 'net__callbacks__train_loss__name', 'net__callbacks__train_loss__lower_is_better', 'net__callbacks__train_loss__on_train', 'net__callbacks__valid_loss', 'net__callbacks__valid_loss__name', 'net__callbacks__valid_loss__lower_is_better', 'net__callbacks__valid_loss__on_train', 'net__callbacks__valid_acc', 'net__callbacks__valid_acc__scoring', 'net__callbacks__valid_acc__lower_is_better', 'net__callbacks__valid_acc__on_train', 'net__callbacks__valid_acc__name', 'net__callbacks__valid_acc__target_extractor', 'net__callbacks__valid_acc__use_caching', 'net__callbacks__print_log', 'net__callbacks__print_log__keys_ignored', 'net__callbacks__print_log__sink', 'net__callbacks__print_log__tablefmt', 'net__callbacks__print_log__floatfmt', 'net__callbacks__print_log__stralign'])
net.get_params().keys()
dict_keys(['module', 'criterion', 'optimizer', 'lr', 'max_epochs', 'batch_size', 'iterator_train', 'iterator_valid', 'dataset', 'train_split', 'callbacks', 'predict_nonlinearity', 'warm_start', 'verbose', 'device', '_kwargs', 'classes', 'callbacks__epoch_timer', 'callbacks__train_loss', 'callbacks__train_loss__name', 'callbacks__train_loss__lower_is_better', 'callbacks__train_loss__on_train', 'callbacks__valid_loss', 'callbacks__valid_loss__name', 'callbacks__valid_loss__lower_is_better', 'callbacks__valid_loss__on_train', 'callbacks__valid_acc', 'callbacks__valid_acc__scoring', 'callbacks__valid_acc__lower_is_better', 'callbacks__valid_acc__on_train', 'callbacks__valid_acc__name', 'callbacks__valid_acc__target_extractor', 'callbacks__valid_acc__use_caching', 'callbacks__print_log', 'callbacks__print_log__keys_ignored', 'callbacks__print_log__sink', 'callbacks__print_log__tablefmt', 'callbacks__print_log__floatfmt', 'callbacks__print_log__stralign'])
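
Comparing the two outputs above shows the naming convention: every parameter of the net step reappears in the pipeline under the prefix net__, and further double underscores descend into nested objects (so net__module__num_units_d1 is meant to reach the num_units_d1 argument of Net). The sketch below illustrates this with plain strings only. split_param is a hypothetical helper written for this illustration, not a scikit-learn or skorch function, and only a handful of the keys are reproduced.

```python
def split_param(name):
    """Hypothetical helper: split a nested parameter name into its parts."""
    return name.split("__")

# A few of the keys reported by net.get_params()
net_keys = ["module", "lr", "max_epochs", "batch_size", "optimizer"]

# Inside the pipeline, the same keys carry the step name as a prefix.
pipeline_keys = [f"net__{key}" for key in net_keys]
print(pipeline_keys)

# A grid key reads left to right: pipeline step, attribute, constructor argument.
path = split_param("net__module__num_units_d1")
print(path)
```

Any key used in a tuning grid for the pipeline therefore has to start with the step name, net or transform.
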
import torch.optim as optim

custom_grid = {
	'net__max_epochs':[20, 30],
	'net__lr': [0.01, 0.05, 0.1],
	'net__module__num_units_d1': [50, 100, 150],
	'net__module__num_units_d2': [50, 100, 150],
	'net__optimizer': [optim.Adam, optim.SGD, optim.RMSprop]
	}
  • Now apply the hyperparameter grid to tune the model.
tuned_skorch_model = tune_model(skorch_model, custom_grid = custom_grid)
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
0       0.8762  0.9667  0.9686  0.8562  0.9089  0.7182  0.7316
1       0.8675  0.9477  0.8784  0.9106  0.8942  0.7171  0.7179
2       0.8375  0.9452  0.7835  0.9535  0.8602  0.6706  0.6891
3       0.8575  0.9522  0.8208  0.9490  0.8803  0.7066  0.7180
4       0.7975  0.9315  0.9726  0.7704  0.8597  0.5127  0.5602
Mean    0.8472  0.9487  0.8848  0.8879  0.8807  0.6650  0.6834
SD      0.0280  0.0114  0.0763  0.0684  0.0192  0.0781  0.0631
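
The custom grid is larger than it may look, which helps put the modest improvement into context. The sketch below counts the combinations with itertools. The optimizers are written as strings so that the snippet runs without PyTorch. The n_iter value is an assumption: the tune_model call above does not set it, and PyCaret 2.x is understood to default to a random search over 10 candidates.

```python
from itertools import product

# Same grid as above, with the optimizer classes replaced by placeholder strings.
custom_grid = {
    "net__max_epochs": [20, 30],
    "net__lr": [0.01, 0.05, 0.1],
    "net__module__num_units_d1": [50, 100, 150],
    "net__module__num_units_d2": [50, 100, 150],
    "net__optimizer": ["Adam", "SGD", "RMSprop"],
}

# Every possible combination of one value per parameter: 2 * 3 * 3 * 3 * 3
all_combinations = list(product(*custom_grid.values()))
n_combinations = len(all_combinations)

n_iter = 10    # assumed default number of sampled candidates in tune_model
n_folds = 5    # fold=5 from setup()
n_fits = n_iter * n_folds

print(n_combinations, n_fits)
```

Under that assumption only a small sample of the grid is evaluated, so raising n_iter is one option when the tuned scores barely move.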

References

  1. https://pycaret.org/
  2. https://www.analyticsvidhya.com/blog/2020/05/pycaret-machine-learning-model-seconds/
  3. https://github.com/skorch-dev/skorch
  4. https://towardsdatascience.com/skorch-pytorch-models-trained-with-a-scikit-learn-wrapper-62b9a154623e
  5. https://towardsdatascience.com/pycaret-skorch-build-pytorch-neural-networks-using-minimal-code-57079e197f33