Record_2

2023. 2. 17. 14:42 · LG AI (AImiers) course

For tabular-data competitions, start by throwing everything at AutoML (?)!

In this kernel we will use PyCaret, an AutoML package, to enter a data science competition on structured (tabular) data. I used the given data as-is, with no feature engineering, and trained the default models with no tuning, so I expect better scores once additional features are engineered and the models are tuned.

Personally, I find PyCaret appropriate for single-output prediction tasks, but I have not yet found an easy way to apply it to multi-output prediction tasks. If you know of tutorial code that applies PyCaret to a multi-output prediction task, please share it!

Define your path

path = 'data/'

import os
os.listdir(path)

['train.csv', 'test_x.csv', 'sample_submission.csv']

Read Data

import pandas as pd
train = pd.read_csv(path + 'train.csv')
test = pd.read_csv(path + 'test_x.csv')
submission = pd.read_csv(path + 'sample_submission.csv')

Checking the shapes of the data

print(train.shape)
print(test.shape)
print(submission.shape)

(45532, 78)
(11383, 77)
(11383, 2)

Install PyCaret

!pip install pycaret

Import methods for the classification task

from pycaret.classification import *

Set up the environment

In PyCaret you have to set up the environment before experimenting with models. This is done with the 'setup' function.
During setup, PyCaret automatically infers the type of each column in the given data and asks the user to confirm that it inferred them correctly. You can override how individual columns are interpreted through the parameters of setup. In this tutorial we will just accept the automatic inference by pressing 'enter'.
It also asks what fraction of the dataset to use when constructing the train / validation sets. We will use 100% of the dataset, so just press 'enter' again.

# 'voted' is the column to predict, so pass it as the target argument
clf = setup(data = train, target = 'voted')

Setup Succesfully Completed!

Description	Value

session_id	6636
Target Type	Binary
Label Encoded	1: 0, 2: 1
Original Data	(45532, 78)
Missing Values	False
Numeric Features	42
Categorical Features	35
Ordinal Features	False
High Cardinality Features	False
High Cardinality Method	None
Sampled Data	(45532, 78)
Transformed Train Set	(31872, 201)
Transformed Test Set	(13660, 201)
Numeric Imputer	mean
Categorical Imputer	constant
Normalize	False
Normalize Method	None
Transformation	False
Transformation Method	None
PCA	False
PCA Method	None
PCA Components	None
Ignore Low Variance	False
Combine Rare Levels	False
Rare Level Threshold	None
Numeric Binning	False
Remove Outliers	False
Outliers Threshold	None
Remove Multicollinearity	False
Multicollinearity Threshold	None
Clustering	False
Clustering Iteration	None
Polynomial Features	False
Polynomial Degree	None
Trignometry Features	False
Polynomial Threshold	None
Group Features	False
Feature Selection	False
Features Selection Threshold	None
Feature Interaction	False
Feature Ratio	False
Interaction Threshold	None
Fix Imbalance	False
Fix Imbalance Method	SMOTE
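
As a quick sanity check on the table above: the transformed train and test sets have 31872 and 13660 rows, which looks like a 70/30 split of the 45532 training rows. The sketch below just reproduces that arithmetic. The 0.7 ratio is my assumption based on these numbers (this post never sets it explicitly), and the variable names are only for illustration.

```python
import math

n_rows = 45532      # rows in train.csv, from the shapes printed earlier
train_size = 0.7    # assumed split ratio, inferred from the setup table

# Rows kept for training and cross-validation
n_train = math.floor(n_rows * train_size)
# Remaining rows form the hold-out set that setup keeps aside
n_holdout = n_rows - n_train

print(n_train, n_holdout)  # 31872 13660
```

So the cross-validated scores later on are computed on roughly 70% of the data, and the remaining 30% is only used for the hold-out check.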

Train models and compare

Now that the environment is set up, we will train and compare the default models provided in PyCaret.
With the 'compare_models' function we can easily train the 15 default models in the package and compare their performance.
We will keep the top 3 models by AUC, because AUC is the evaluation metric for this competition.

best_3 = compare_models(sort = 'AUC', n_select = 3)

Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)

CatBoost Classifier	0.6931	0.7645	0.6581	0.7501	0.7011	0.3885	0.3921	26.2353
Gradient Boosting Classifier	0.6936	0.7636	0.6406	0.7613	0.6957	0.3918	0.3978	30.7960
Light Gradient Boosting Machine	0.6919	0.7625	0.6457	0.7554	0.6962	0.3876	0.3926	2.2352
Linear Discriminant Analysis	0.6914	0.7606	0.6630	0.7447	0.7014	0.3843	0.3871	0.9525
Extra Trees Classifier	0.6885	0.7576	0.6467	0.7493	0.6942	0.3803	0.3846	5.0695
Ada Boost Classifier	0.6882	0.7545	0.6534	0.7451	0.6962	0.3788	0.3823	7.5470
Extreme Gradient Boosting	0.6732	0.7432	0.6633	0.7178	0.6894	0.3458	0.3471	31.7858
Random Forest Classifier	0.6509	0.7090	0.6019	0.7147	0.6534	0.3070	0.3116	0.6434
Decision Tree Classifier	0.6101	0.6070	0.6398	0.6446	0.6421	0.2138	0.2139	2.3453
Naive Bayes	0.4535	0.5110	0.0102	0.5177	0.0199	-0.0013	-0.0064	0.1069
K Neighbors Classifier	0.5139	0.5102	0.5547	0.5556	0.5551	0.0194	0.0194	1.1944
Quadratic Discriminant Analysis	0.4532	0.5001	0.0000	0.0000	0.0000	0.0000	0.0000	0.4221
Logistic Regression	0.5466	0.4783	0.9982	0.5468	0.7065	-0.0001	0.0000	1.3739
SVM - Linear Kernel	0.5028	0.0000	0.5803	0.5428	0.5560	-0.0106	-0.0116	0.5023
Ridge Classifier	0.6915	0.0000	0.6634	0.7445	0.7016	0.3844	0.3872	0.2240

CatBoost Classifier, Gradient Boosting Classifier, and Light Gradient Boosting Machine (LGBM) are the best 3 models. They are now stored in the best_3 variable.
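
For intuition about what ranking by AUC means, here is a minimal pure-Python sketch of the metric: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The helper name roc_auc and the toy labels and scores are made up for illustration. This is not how PyCaret computes the metric internally, only an equivalent definition on small data.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration
y_true = [0, 0, 1, 1]
probs = [0.1, 0.3, 0.4, 0.8]   # predicted probabilities
hard = [0, 0, 0, 1]            # the same predictions thresholded at 0.5

auc_probs = roc_auc(y_true, probs)
auc_hard = roc_auc(y_true, hard)
print(auc_probs, auc_hard)  # 1.0 0.75
```

On this toy data the probabilities rank every positive above every negative, while the thresholded labels introduce ties and lose score. That is why an AUC-scored submission should contain probabilities rather than hard labels.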

Model Ensemble

We will now ensemble the three trained models. To optimize the score in this competition we have to predict probabilities, so we will build a soft-voting ensemble of the three models with the 'blend_models' function.

blended = blend_models(estimator_list = best_3, fold = 5, method = 'soft')

Fold	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC

0	0.6985	0.7716	0.6569	0.7593	0.7044	0.4000	0.4044
1	0.6907	0.7607	0.6388	0.7575	0.6931	0.3858	0.3915
2	0.6895	0.7603	0.6428	0.7532	0.6936	0.3829	0.3879
3	0.6961	0.7677	0.6568	0.7554	0.7027	0.3950	0.3991
4	0.6939	0.7664	0.6374	0.7638	0.6949	0.3928	0.3993
Mean	0.6937	0.7654	0.6465	0.7578	0.6977	0.3913	0.3964
SD	0.0033	0.0043	0.0086	0.0036	0.0048	0.0062	0.0059
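
To make the idea of soft voting concrete, here is a minimal sketch in plain Python: each model outputs a probability for the positive class, and the ensemble simply averages them row by row. The three probability lists are invented, and soft_vote is a hypothetical helper for illustration, not the function PyCaret calls internally.

```python
def soft_vote(prob_lists):
    """Average the positive-class probabilities of several models, row by row."""
    n_models = len(prob_lists)
    return [sum(row) / n_models for row in zip(*prob_lists)]

# Made-up probabilities from three hypothetical models for four rows
model_a = [0.9, 0.4, 0.2, 0.6]
model_b = [0.8, 0.6, 0.1, 0.4]
model_c = [0.7, 0.2, 0.3, 0.8]

blended_probs = soft_vote([model_a, model_b, model_c])
# A hard label can still be derived afterwards by thresholding
blended_labels = [int(p >= 0.5) for p in blended_probs]

print([round(p, 2) for p in blended_probs])  # [0.8, 0.4, 0.2, 0.6]
print(blended_labels)                        # [1, 0, 0, 1]
```

Because the output is still a probability, it can be submitted directly for an AUC-scored competition, which a majority (hard) vote would not allow.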

Prediction

We will now make predictions with the ensembled model.
The setup environment already contains a hold-out set, so we will predict on it to check the model's performance.

pred_holdout = predict_model(blended)

Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC

Voting Classifier	0.7001	0.7725	0.6471	0.7679	0.7024	0.4045	0.4105

With an AUC of 0.7725 on the hold-out set, the model performs fairly well.

Re-training the model on the whole data

So far we have split the given train data once more into train / validation sets for experimenting, so the models have not been trained on the full training set.
For the best performance, we will now train the model on the whole dataset.

final_model = finalize_model(blended)

Predicting on the test set for the competition

We will now use the re-trained model with the predict_model function to predict on the competition test set.

predictions = predict_model(final_model, data = test)

predictions

index	QaA	QaE	QbA	QbE	QcA	QcE	QdA	QdE	QeA	...	wr_06	wr_07	wr_08	wr_09	wr_10	wr_11	wr_12	wr_13	Label	Score

0	3.0	736	2.0	2941	3.0	4621	1.0	4857	2.0	...	0	0	1	0	1	0	1	1	2	0.6475
1	3.0	514	2.0	1952	3.0	1552	3.0	821	4.0	...	0	0	0	0	0	0	0	0	2	0.8857
2	3.0	500	2.0	2507	4.0	480	2.0	614	2.0	...	0	1	1	0	1	0	1	1	2	0.5256
3	1.0	669	1.0	1050	5.0	1435	2.0	2252	5.0	...	1	1	1	1	1	1	1	1	1	0.1998
4	2.0	499	1.0	1243	5.0	845	2.0	1666	2.0	...	0	1	1	0	1	1	1	1	2	0.7567
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
11378	5.0	427	5.0	1066	5.0	588	1.0	560	2.0	...	0	1	1	0	1	0	1	1	1	0.3924
11379	1.0	314	5.0	554	5.0	230	1.0	956	2.0	...	1	1	1	1	1	1	1	1	2	0.8792
11380	1.0	627	2.0	799	1.0	739	2.0	1123	1.0	...	0	1	1	0	1	0	1	1	1	0.2230
11381	2.0	539	1.0	2090	2.0	4642	1.0	673	2.0	...	0	1	1	0	1	1	1	0	1	0.3271
11382	2.0	541	4.0	900	5.0	691	2.0	1951	1.0	...	1	0	1	0	0	1	0	0	2	0.6550

11383 rows × 79 columns

The probability values are stored in the 'Score' column, so we will copy them into the submission file and submit it to DACON.

submission['voted'] = predictions['Score']

submission.to_csv('submission_proba.csv', index = False)
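
One thing worth double-checking before submitting: in the output above, rows labeled 1 have scores below 0.5, so 'Score' here appears to be the probability of the positive class. As far as I know, later PyCaret releases instead report the probability of the predicted label. That is an assumption about versions not used in this post, so please verify it against your installed version. If the score does refer to the predicted label, a conversion along these lines would be needed first. The helper name and the sample rows are hypothetical.

```python
def positive_class_proba(label, score, positive_label):
    """Return P(positive class), given a predicted label and the probability of that label."""
    return score if label == positive_label else 1.0 - score

# Hypothetical (label, score) pairs where score is the probability of the predicted label
rows = [(2, 0.65), (1, 0.80), (2, 0.53)]

probs = [
    round(positive_class_proba(label, score, positive_label=2), 2)
    for label, score in rows
]
print(probs)  # [0.65, 0.2, 0.53]
```

A submission built from predicted-label scores without this conversion would rank confident negatives as high as confident positives, which would hurt AUC badly.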

You will probably get around 0.77 AUC, and I expect additional work can improve this score.
