SJ_Koding

05. [Dacon Basic] Airline Customer Satisfaction Prediction Competition (2nd place overall!!)

AI Competition


성지코딩 2022. 2. 8. 01:20

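I walked through the baseline's EDA in my own words and tried several things, including scaling, a hard-voting ensemble, and soft voting.

This was my first EDA, and I am still at the stage of following references.
I also learned that GridSearchCV is commonly used for hyperparameter tuning, so I applied it here for the first time.
Put simply, GridSearchCV is a utility (not a model itself) that takes the candidate values you list for each hyperparameter, evaluates every combination with cross-validation, and reports the best one.
In other words, it is a tool used during model tuning.

I do not yet have a good feel for how the data should be processed, so I will work through that step by step.
Here I only did the basics: encoding the categorical features and removing multicollinear columns.

Thank you! Please be kind :) If this helped, please give it a like ;)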

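+) Update: the score came out at 0.939. Rather than being pleased, I was mostly baffled, because I only got a high score when the tuning and preprocessing were kept as simple as possible.
What I mean is that a more elaborate pipeline (combining the multicollinear features into a new column and dropping the originals + replacing outliers with the mean via SimpleImputer + handling the 0 values in the discrete features + log-transforming the heavily skewed features, and so on) scored lower.

Simply dropping the multicollinear features + replacing outliers with the mean (no extra module, just the mean() function) scored much higher.

+) I finished 18th on the public leaderboard and 2nd on the private leaderboard. I think I was very lucky. Still, it feels good to place near the top :)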

https://dacon.io/competitions/official/235871/overview/description


In [1]:

from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))

Airline Customer Satisfaction Prediction Competition [Dacon Basic]¶

Loading the data¶

In [2]:

import warnings 
warnings.filterwarnings('ignore')

In [3]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv('D:/Dacon/airline_satisfied_score_predict/train.csv')
data = data.drop('id', axis=1) # drop id before analysis

test = pd.read_csv('D:/Dacon/airline_satisfied_score_predict/test.csv')
test = test.drop('id', axis=1) # drop id before analysis

pd.set_option('display.max_columns', None) # there are many columns, so show them all instead of truncating
data # inspect the data

Out[3]:

	Gender	Customer Type	Age	Type of Travel	Class	Flight Distance	Seat comfort	Departure/Arrival time convenient	Food and drink	Gate location	Inflight wifi service	Inflight entertainment	Online support	Ease of Online booking	On-board service	Leg room service	Baggage handling	Checkin service	Cleanliness	Online boarding	Departure Delay in Minutes	Arrival Delay in Minutes	target
0	Female	disloyal Customer	22	Business travel	Eco	1599	3	0	3	3	4	3	4	4	5	4	4	4	5	4	0	0.0	0
1	Female	Loyal Customer	37	Business travel	Business	2810	2	4	4	4	1	4	3	5	5	4	2	1	5	2	18	18.0	0
2	Male	Loyal Customer	46	Business travel	Business	2622	1	1	1	1	4	5	5	4	4	4	4	5	4	3	0	0.0	1
3	Female	disloyal Customer	24	Business travel	Eco	2348	3	3	3	3	3	3	3	3	2	4	5	3	4	3	10	2.0	0
4	Female	Loyal Customer	58	Business travel	Business	105	3	3	3	3	4	4	5	4	4	4	4	4	4	5	0	0.0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2995	Male	Loyal Customer	30	Personal Travel	Eco	2377	1	5	1	1	1	4	4	1	3	5	3	4	2	4	211	225.0	0
2996	Female	disloyal Customer	24	Business travel	Eco	1643	2	4	3	4	5	3	5	5	2	2	4	1	3	5	20	13.0	0
2997	Female	disloyal Customer	22	Business travel	Eco	1442	2	2	2	3	4	2	4	4	3	2	3	4	3	4	64	67.0	0
2998	Female	disloyal Customer	33	Business travel	Business	2158	2	2	2	5	4	2	4	4	5	2	5	5	5	4	0	3.0	0
2999	Female	Loyal Customer	42	Business travel	Eco	624	1	2	2	2	3	3	4	1	1	1	1	3	1	3	28	6.0	0

3000 rows × 23 columns

In [4]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Gender                             3000 non-null   object 
 1   Customer Type                      3000 non-null   object 
 2   Age                                3000 non-null   int64  
 3   Type of Travel                     3000 non-null   object 
 4   Class                              3000 non-null   object 
 5   Flight Distance                    3000 non-null   int64  
 6   Seat comfort                       3000 non-null   int64  
 7   Departure/Arrival time convenient  3000 non-null   int64  
 8   Food and drink                     3000 non-null   int64  
 9   Gate location                      3000 non-null   int64  
 10  Inflight wifi service              3000 non-null   int64  
 11  Inflight entertainment             3000 non-null   int64  
 12  Online support                     3000 non-null   int64  
 13  Ease of Online booking             3000 non-null   int64  
 14  On-board service                   3000 non-null   int64  
 15  Leg room service                   3000 non-null   int64  
 16  Baggage handling                   3000 non-null   int64  
 17  Checkin service                    3000 non-null   int64  
 18  Cleanliness                        3000 non-null   int64  
 19  Online boarding                    3000 non-null   int64  
 20  Departure Delay in Minutes         3000 non-null   int64  
 21  Arrival Delay in Minutes           3000 non-null   float64
 22  target                             3000 non-null   int64  
dtypes: float64(1), int64(18), object(4)
memory usage: 539.2+ KB

Gender, Customer Type, Type of Travel and Class are categorical.
Apart from Age, Flight Distance, Departure Delay in Minutes and Arrival Delay in Minutes, the remaining features are integer ratings between 0 and 5.

In [5]:

data.isna().sum()

Out[5]:

Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
target                               0
dtype: int64

The check shows that there are no missing values.

Basic statistical analysis¶

Visualizing feature distributions¶

In [6]:

plt.style.use('ggplot')

# Use histograms to look at the distribution of each column
plt.figure(figsize=(25, 20))
plt.suptitle("Data Histogram", fontsize = 40)

# The id column was already dropped, so every remaining column is plotted
cols = data.columns
for i in range(len(cols)):
    plt.subplot(5, 5, i+1) # a 5-by-5 grid holds up to 25 feature distributions
    plt.title(cols[i], fontsize=20) # use the feature name as the subplot title
    if len(data[cols[i]].unique()) > 20: # more than 20 distinct values: treat as continuous
        plt.hist(data[cols[i]], bins=20, color='b', alpha=0.7) # histogram; bins is the number of bins, alpha the transparency
        
    else: # 20 or fewer distinct values: treat as discrete
        temp = data[cols[i]].value_counts() # count of each distinct value
        plt.bar(temp.keys(), temp.values, width=0.5, alpha=0.7)
        plt.xticks(temp.keys())
        
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

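The plots that are not blue are features that take only a few discrete values (the categorical features and the 0 to 5 ratings).
The blue plots are continuous features. Looking only at the blue plots, every feature except Age is concentrated on the left, that is, right-skewed with a long tail toward large values.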
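The visual impression of skew can be cross-checked numerically. Below is a minimal sketch that assumes a small made-up DataFrame (the values are illustrative, not the competition data); on the real data the same call would be `data[continuous_cols].skew()`, where `continuous_cols` is a hypothetical list of the continuous column names.

```python
import pandas as pd

# Toy stand-in for two continuous columns (illustrative values, not the competition data)
toy = pd.DataFrame({
    "Age": [22, 30, 37, 42, 46, 58],
    "Arrival Delay in Minutes": [0.0, 0.0, 2.0, 6.0, 18.0, 225.0],
})

# Positive skewness: most of the mass sits on the left, with a long tail to the right
skewness = toy.skew()
print(skewness.sort_values(ascending=False))
```

A value near 0 suggests a roughly symmetric distribution, while a clearly positive value matches the "concentrated on the left" shape seen in the histograms.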

Relationship between target and the features¶

In [7]:

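# Name of the target column
target = "target"
# Select the categorical (object dtype) columns
categorical_feature = data.columns[data.dtypes=='object']

plt.figure(figsize=(20,15))
plt.suptitle("Violin Plot", fontsize=40)

# The id column was already dropped; plot each categorical feature against the target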
for i in range(len(categorical_feature)):
    plt.subplot(2,2,i+1)
    plt.xlabel(categorical_feature[i])
    plt.ylabel(target)
    sns.violinplot(x= data[categorical_feature[i]], y= data[target])
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

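target takes the values 0 and 1, so focus on the regions around 0 and 1 on the y-axis.
For Gender, customers who gave 1 are mostly female and customers who gave 0 are mostly male.
For Customer Type, the target differs sharply between disloyal and loyal customers, so this is an important feature.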
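What the violin plots show can also be read off as a single number per category. This is a minimal sketch on made-up rows (only the column names match the competition data): because target is 0/1, its mean within each group is the share of satisfied customers.

```python
import pandas as pd

# Toy rows (illustrative only) with the same column names as the competition data
toy = pd.DataFrame({
    "Customer Type": ["Loyal Customer", "Loyal Customer", "Loyal Customer",
                      "disloyal Customer", "disloyal Customer", "disloyal Customer"],
    "target": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 target within each group is the satisfaction rate of that group
rate = toy.groupby("Customer Type")["target"].mean()
print(rate)
```

A large gap between the group rates is the numeric counterpart of two violins that look very different.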

In [8]:

# Select the numeric columns
numeric_feature = data.columns[(data.dtypes=='int64') | (data.dtypes=='float')]
num_data = data[numeric_feature]

# Box plots of each numeric feature, split by target
fig, axes = plt.subplots(3, 6, figsize=(25, 20))

fig.suptitle('feature distributions per target', fontsize= 40)
for ax, col in zip(axes.flat, num_data.columns[:-1]):
    sns.boxplot(x= 'target', y= col, ax=ax, data=num_data)
    ax.set_title(col, fontsize=20)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

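When the box for target 0 and the box for target 1 sit at about the same level, the feature is distributed similarly in both classes, which suggests it does little to separate them. Age, Departure/Arrival time convenient, Gate location and a few others look largely unrelated to the target. Let's confirm this with correlation coefficients.

Checking correlation coefficients¶

Encoding categorical data as numbers¶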

In [9]:

corr_df = data.copy()
corr_df[corr_df.columns[corr_df.dtypes=='O']] = corr_df[corr_df.columns[corr_df.dtypes=='O']].astype(str).apply(LabelEncoder().fit_transform)

Visualizing with a heatmap¶

In [10]:

plt.figure(figsize=(35,25))

heat_table = corr_df.corr()
mask = np.zeros_like(heat_table)
mask[np.triu_indices_from(mask)] = True
heatmap_ax = sns.heatmap(heat_table, annot=True, mask = mask, cmap='coolwarm')
heatmap_ax.set_xticklabels(heatmap_ax.get_xticklabels(), fontsize=15, rotation=45)
heatmap_ax.set_yticklabels(heatmap_ax.get_yticklabels(), fontsize=15)
plt.title('correlation between features', fontsize=40)
plt.show()

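A low correlation value does not by itself mean a weak relationship; what matters is the absolute value. In other words, the cells close to navy (strongly negative) are not the weak ones.
The weakly correlated pairs are those whose color corresponds to a value near 0, such as Departure Delay in Minutes & Gate location.
With respect to target, Age, Departure/Arrival time convenient and Gate location have low correlations, as the box plots above suggested.
Conversely, Inflight entertainment, Online support, Ease of Online booking and others show high correlations.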
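Reading the target row off a large heatmap is error-prone, so the same comparison can be done by sorting. A minimal sketch on made-up values (only the column names come from the data); on the real data, `corr_df.corr()["target"]` would take the place of `toy.corr()["target"]`.

```python
import pandas as pd

# Toy data (illustrative values): one feature that tracks target and one that does not
toy = pd.DataFrame({
    "Inflight entertainment": [1, 2, 3, 4, 5, 5],
    "Gate location": [3, 1, 4, 2, 5, 3],
    "target": [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with target, ranked by absolute value
corr_with_target = toy.corr()["target"].drop("target")
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value keeps strongly negative features near the top, which is the point made about the navy cells.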

Addendum: Departure Delay in Minutes & Arrival Delay in Minutes¶

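_(Thanks to dong_ho for the feedback on this!)_ If departure is delayed, arrival will naturally be delayed too, yet the data contains both features.
What does their correlation look like? It is 0.98, a very strong correlation. Is that a good thing?
The relevant keyword is multicollinearity. The independent variables, i.e. the features other than target, should ideally not be strongly correlated with one another.
These two features, however, are not independent and move almost as one. A common rule of thumb is that multicollinearity becomes a concern when the correlation coefficient is 0.7 or higher.

One more pair at or above 0.7 stands out: 'Food and drink' & 'Seat comfort'. My guess is that more expensive seats are more comfortable and also come with better food, which would explain the high correlation.

The simplest way to handle multicollinearity is to remove one feature from each pair. For each of the two pairs, I remove the feature with the lower correlation with target.

Departure Delay in Minutes & Arrival Delay in Minutes (correlation with target: 0.1 / 0.11)

Remove Departure Delay in Minutes (0.1)

'Food and drink' & 'Seat comfort' (correlation with target: 0.15 / 0.27)

Remove Food and drink (0.15)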
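The pairs above were picked out by eye from the heatmap. The search can also be sketched in code; `high_corr_pairs` is a hypothetical helper written for this illustration, and the toy values are made up (only the column names come from the data).

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.7):
    """Return (feature_a, feature_b, corr) for every pair with |corr| >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so that each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

# Toy data (illustrative values): the two delay columns move almost together
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 18, 0, 10, 211, 64],
    "Arrival Delay in Minutes": [0.0, 18.0, 0.0, 2.0, 225.0, 67.0],
    "Age": [22, 37, 46, 24, 30, 22],
})
pairs_found = high_corr_pairs(toy)
print(pairs_found)
```

The 0.7 threshold is the rule of thumb quoted above, not a hard limit, so it is exposed as a parameter.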

In [11]:

data.drop('Departure Delay in Minutes', axis = 1, inplace = True)
data.drop('Food and drink', axis = 1, inplace = True)
test.drop('Departure Delay in Minutes', axis = 1, inplace = True)
test.drop('Food and drink', axis = 1, inplace = True)

Checking for outliers¶

Let's use box plots to look for outliers.

In [12]:

data.plot(kind='box', subplots=True, layout=(5, 5), figsize=(15, 21))
plt.show()       

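The delay time and Flight Distance features have a large number of outliers.

I still lack the background here, but my initial thinking was this: for delay times, satisfaction naturally drops as the delay grows, even when the value counts as an outlier, so I was not sure whether treating those values as outliers is right, and I put the decision off at first. Feedback in the comments would be appreciated!

In the end I decided to handle the outliers, on the expectation that doing so helps performance. First, I extract the indices of the outliers.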

In [13]:

def outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    # np.percentile returns the values at the given percentiles (here the 25th and 75th)

    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    
    return np.where((data > upper_bound) | (data < lower_bound))
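To make the 1.5 × IQR rule concrete, here is a small worked example on made-up delay values. `iqr_bounds` is a hypothetical helper that mirrors the bound computation in `outliers_iqr` above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Lower and upper fences of the k * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up delays in minutes; only the last value is far from the rest
delays = np.array([0, 0, 2, 3, 6, 13, 18, 225])
lower, upper = iqr_bounds(delays)

# Positions (not values) of the entries outside the fences
outlier_positions = np.where((delays < lower) | (delays > upper))[0]
print(lower, upper, outlier_positions)
```

Here q1 = 1.5 and q3 = 14.25, so the fences are -17.625 and 33.375 and only the entry at position 7 (225 minutes) is flagged. Note that `np.where` returns positions, which coincide with index labels only because the DataFrame uses the default RangeIndex.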

In [14]:

ArrivalDelay_index_data = outliers_iqr(data['Arrival Delay in Minutes'])[0]
FlightDistance_index_data = outliers_iqr(data['Flight Distance'])[0]
GarageArea_index_data = outliers_iqr(data['Checkin service'])[0]

In [15]:

ArrivalDelay_index_data

Out[15]:

array([   6,   23,   30,   35,   36,   51,   57,   59,   62,   66,   78,
         82,   85,   92,  116,  134,  137,  151,  160,  162,  175,  180,
        198,  202,  206,  214,  245,  255,  258,  279,  283,  285,  290,
        292,  303,  309,  315,  325,  335,  340,  341,  349,  359,  368,
        373,  375,  377,  385,  391,  394,  416,  428,  429,  435,  454,
        457,  472,  483,  484,  490,  510,  511,  517,  522,  537,  539,
        573,  575,  587,  592,  599,  604,  620,  621,  623,  627,  638,
        641,  650,  652,  653,  670,  672,  676,  693,  695,  696,  703,
        707,  715,  721,  723,  737,  752,  763,  766,  773,  774,  783,
        792,  797,  798,  799,  809,  810,  824,  837,  846,  852,  858,
        862,  869,  883,  884,  895,  904,  906,  912,  923,  937,  947,
        948,  949,  962,  968,  978,  983,  987,  993, 1001, 1002, 1013,
       1014, 1020, 1027, 1031, 1034, 1044, 1065, 1068, 1069, 1073, 1074,
       1098, 1103, 1116, 1122, 1140, 1151, 1152, 1160, 1177, 1183, 1185,
       1187, 1190, 1191, 1195, 1203, 1222, 1238, 1242, 1246, 1252, 1256,
       1259, 1271, 1272, 1274, 1277, 1279, 1281, 1300, 1302, 1307, 1309,
       1310, 1324, 1346, 1356, 1369, 1370, 1383, 1387, 1389, 1393, 1394,
       1395, 1399, 1405, 1407, 1418, 1422, 1432, 1443, 1462, 1476, 1487,
       1490, 1491, 1497, 1504, 1510, 1518, 1519, 1525, 1532, 1533, 1541,
       1551, 1561, 1562, 1575, 1576, 1579, 1582, 1588, 1605, 1607, 1608,
       1612, 1618, 1626, 1630, 1635, 1655, 1661, 1688, 1699, 1700, 1705,
       1706, 1711, 1715, 1718, 1725, 1731, 1736, 1743, 1745, 1747, 1756,
       1776, 1792, 1795, 1797, 1801, 1802, 1803, 1809, 1820, 1827, 1831,
       1833, 1837, 1840, 1842, 1844, 1875, 1886, 1896, 1897, 1918, 1919,
       1931, 1936, 1945, 1953, 1954, 1956, 1969, 1978, 1979, 1995, 1997,
       2009, 2037, 2061, 2065, 2066, 2088, 2100, 2105, 2120, 2138, 2144,
       2171, 2180, 2198, 2199, 2210, 2213, 2215, 2226, 2239, 2246, 2247,
       2252, 2261, 2266, 2267, 2269, 2281, 2291, 2293, 2310, 2313, 2327,
       2328, 2330, 2333, 2334, 2337, 2340, 2349, 2352, 2353, 2357, 2358,
       2370, 2387, 2389, 2398, 2418, 2424, 2437, 2440, 2443, 2447, 2448,
       2449, 2461, 2462, 2466, 2469, 2480, 2484, 2488, 2489, 2490, 2495,
       2496, 2511, 2541, 2548, 2561, 2563, 2569, 2576, 2581, 2602, 2604,
       2624, 2629, 2635, 2644, 2648, 2649, 2653, 2663, 2668, 2674, 2701,
       2711, 2712, 2713, 2714, 2733, 2734, 2737, 2747, 2753, 2755, 2760,
       2761, 2764, 2777, 2784, 2785, 2787, 2788, 2838, 2850, 2852, 2853,
       2854, 2864, 2866, 2880, 2885, 2899, 2921, 2929, 2930, 2933, 2945,
       2953, 2954, 2955, 2961, 2966, 2977, 2987, 2995, 2997], dtype=int64)

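When I tried to delete the rows outright, so many rows contained outliers that training would likely suffer ('Arrival Delay in Minutes': 405 rows). So instead of removing them, I replace the outlier values with the mean. Many thanks to 유재성 KADE for the helpful feedback on replacing values at the outlier indices!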

In [16]:

data.loc[ArrivalDelay_index_data, 'Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].mean()
data.loc[FlightDistance_index_data, 'Flight Distance'] = data['Flight Distance'].mean()
data.loc[GarageArea_index_data, 'Checkin service'] = data['Checkin service'].mean()

In [17]:

data.plot(kind='box', subplots=True, layout=(5, 5), figsize=(15, 21))
plt.show()       

The outliers are now largely gone. The same treatment is applied to the test set.

In [18]:

ArrivalDelay_index_test = outliers_iqr(test['Arrival Delay in Minutes'])[0]
FlightDistance_index_test = outliers_iqr(test['Flight Distance'])[0]
GarageArea_index_test = outliers_iqr(test['Checkin service'])[0]

test.loc[ArrivalDelay_index_test, 'Arrival Delay in Minutes'] = test['Arrival Delay in Minutes'].mean()
test.loc[FlightDistance_index_test, 'Flight Distance'] = test['Flight Distance'].mean()
test.loc[GarageArea_index_test, 'Checkin service'] = test['Checkin service'].mean()
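The cell above recomputes the IQR fences and the mean from the test set itself. A common alternative, shown here only as a sketch and not as what was submitted, is to learn the fences and the fill value on the training data and reuse them on the test data. `fit_iqr_replacement` and `apply_iqr_replacement` are hypothetical helpers and the values are made up.

```python
import numpy as np
import pandas as pd

def fit_iqr_replacement(train_col, k=1.5):
    """Learn the outlier fences and the fill value from the training column."""
    q1, q3 = np.percentile(train_col, [25, 75])
    iqr = q3 - q1
    return {"lower": q1 - k * iqr, "upper": q3 + k * iqr, "fill": float(np.mean(train_col))}

def apply_iqr_replacement(col, params):
    """Replace values outside the learned fences with the learned fill value."""
    col = col.astype(float)
    is_outlier = (col < params["lower"]) | (col > params["upper"])
    return col.mask(is_outlier, params["fill"])

# Made-up delay values for illustration
train_delay = pd.Series([0, 0, 2, 3, 6, 13, 18, 200])
test_delay = pd.Series([1, 5, 300, 12])

params = fit_iqr_replacement(train_delay)
cleaned_test = apply_iqr_replacement(test_delay, params)
print(cleaned_test.tolist())
```

Whether this scores better than per-set statistics on this competition is not something I tested; the sketch only shows how the two steps can be separated.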

Label-encoding the original data¶

In [19]:

# Training set: label-encode every object-dtype column
data[data.columns[data.dtypes=='O']] = data[data.columns[data.dtypes=='O']].astype(str).apply(LabelEncoder().fit_transform)

# Test set: encoded the same way, with encoders fitted separately on the test columns
test[test.columns[test.dtypes=='O']] = test[test.columns[test.dtypes=='O']].astype(str).apply(LabelEncoder().fit_transform)
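Fitting separate encoders on train and test gives matching codes only when both sets contain exactly the same categories, because LabelEncoder numbers the sorted distinct values it sees. A minimal sketch of the safer pattern (fit on train, transform test), using made-up rows; the 'Eco Plus' category is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; 'Eco Plus' is an assumed category used only for this illustration
train_toy = pd.DataFrame({"Class": ["Eco", "Business", "Eco Plus", "Eco"]})
test_toy = pd.DataFrame({"Class": ["Eco Plus", "Eco"]})

# Fit once on the training column, then reuse the same mapping on the test column
encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_toy["Class"])
test_codes = encoder.transform(test_toy["Class"])

# For comparison: an encoder fitted on the test column alone assigns different codes
separate_codes = LabelEncoder().fit_transform(test_toy["Class"])

print(list(encoder.classes_), test_codes, separate_codes)
```

In this toy case 'Eco' is encoded as 1 by the shared encoder but as 0 by the separately fitted one, because 'Business' never appears in the test rows.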

Model training¶

Now the models are trained. After training, I check the scores and compare the results with and without scaling.

In [20]:

# Models to compare
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier

Evaluation metric¶

In [21]:

import numpy as np

def ACCURACY(true, pred):   
    score = np.mean(true==pred)
    return score

In [22]:

from sklearn.model_selection import StratifiedKFold, GridSearchCV

K-Fold¶

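K-Fold splits the training set (here, data) into folds that take turns serving as the training and validation portions in a fixed ratio, and the resulting scores are averaged. This makes the reported score less dependent on one lucky split and gives a more reliable estimate of how well the model generalizes.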
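The same idea can be expressed with scikit-learn's built-in `cross_val_score`. This is a minimal sketch on synthetic data from `make_classification`, not the competition data, and the LogisticRegression settings are assumptions chosen only so the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (a stand-in for the real features and target)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Five stratified folds: each fold keeps roughly the same class ratio as the full set
cv = StratifiedKFold(n_splits=5)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X, y)]

# One accuracy score per fold; the mean is the cross-validated estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
print(fold_sizes, scores.mean())
```

Every sample is used for validation exactly once, so the mean is computed over the whole training set rather than a single hold-out split.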

In [23]:

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

def kfold(model, train, scale = False):
    cv_accuracy = []
    cv = StratifiedKFold(n_splits=5)
    
    n_iter = 0
    
    for t, v in cv.split(train, train['target']):
        
        train_cv = train.iloc[t] # training split
        val_cv = train.iloc[v] # validation split

        train_X = train_cv.drop('target', axis=1)
        train_y = train_cv['target']

        val_X = val_cv.drop('target', axis=1)
        val_y = val_cv['target']
            
        model.fit(train_X, train_y)
        score = ACCURACY(val_y, model.predict(val_X))
        
        cv_accuracy.append(score)
        n_iter += 1
    return np.mean(cv_accuracy)

In [24]:

models = [
    KNeighborsClassifier(),
    LogisticRegression(),
    DecisionTreeClassifier(),
    RandomForestClassifier(max_depth=12, min_samples_leaf=8, min_samples_split=20, n_estimators=300),
    GradientBoostingClassifier(),
    XGBClassifier(eval_metric = 'logloss', \
                              max_depth = 5, \
                              min_child_weight = 3, \
                               gamma = 3, \
                               colsample_bytree = 0.5, \
                               n_estimators=700),
    LGBMClassifier(n_estimators=600, max_bin=400, num_leaves=24),
    CatBoostClassifier(silent=True, depth=6, l2_leaf_reg=7, learning_rate=0.1, n_estimators=500),
    ExtraTreesClassifier(max_depth=25, n_estimators=320)
]

print('Before scaling')
for model in models:
    print(f'{type(model).__name__} score: {kfold(model, data)}')

Before scaling
KNeighborsClassifier score: 0.6056666666666667
LogisticRegression score: 0.8119999999999999
DecisionTreeClassifier score: 0.874
RandomForestClassifier score: 0.8966666666666667
GradientBoostingClassifier score: 0.9123333333333333
XGBClassifier score: 0.9283333333333333
LGBMClassifier score: 0.932
CatBoostClassifier score: 0.9406666666666667
ExtraTreesClassifier score: 0.9236666666666666

Model tuning¶

Before the final fit, the models are tuned. I use GridSearchCV to search for the best hyperparameters.

RandomForest¶

In [25]:

params = { 'n_estimators' : [10, 100, 1000],
           'max_depth' : [6, 8, 10, 12],
           'min_samples_leaf' : [8, 12, 18],
           'min_samples_split' : [8, 16, 20]
            }
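Before launching a grid search it helps to know how many fits it implies. A minimal sketch using scikit-learn's `ParameterGrid` on the same grid as above; the fold count of 5 matches the `cv=5` used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the RandomForest search above
params = {
    "n_estimators": [10, 100, 1000],
    "max_depth": [6, 8, 10, 12],
    "min_samples_leaf": [8, 12, 18],
    "min_samples_split": [8, 16, 20],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(params))

# Each candidate is fitted once per fold (cv=5), plus one final refit on all the data
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

The grid has 3 × 4 × 3 × 3 = 108 candidates, so the search runs 540 fits before the final refit, which is why adding even one more value per parameter gets expensive quickly.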

In [26]:

train = data.drop('target', axis=1)
target = data['target']

model_RFC = RandomForestClassifier()
grid_cv_RFC = GridSearchCV(model_RFC, param_grid = params, cv=5, n_jobs = -1)
grid_cv_RFC.fit(train, target) # train (features) and target (labels) are defined just above
print('Best hyperparameters: ', grid_cv_RFC.best_params_)
print('Best CV accuracy: {:.4f}'.format(grid_cv_RFC.best_score_))

Best hyperparameters:  {'max_depth': 10, 'min_samples_leaf': 8, 'min_samples_split': 8, 'n_estimators': 100}
Best CV accuracy: 0.8977

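Model ensemble¶

I use the three models that performed well on the original data: RandomForestClassifier, XGBClassifier and LGBMClassifier.
I then compare the accuracy of the ensemble with that of the individual models.

The difference is not large, but I built the ensemble with hard voting, i.e. majority voting.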

XGBClassifier¶

Reference: https://m.blog.naver.com/PostView.naver?isHttpsRedirect=true&blogId=gustn3964&logNo=221431933811

In [27]:

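model_XGB = XGBClassifier(eval_metric='logloss', silent = True) # `silent` is not recognized by this XGBoost version (see the warning in the output); `verbosity=0` is the current option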

param_grid={'booster' :['gbtree'],
                 'max_depth':[5,6,8],
                 'min_child_weight':[1,3,5],
                 'gamma':[0,1,2,3],
                 'nthread':[0, 4],
                 'colsample_bytree':[0.5, 1.0],
                 'n_estimators':[50, 100, 150],
                 'objective':['binary:logistic'],
            }

grid_cv_XGB=GridSearchCV(model_XGB, param_grid=param_grid, cv=5 , n_jobs=-1)
grid_cv_XGB.fit(train, target)
print('Best hyperparameters: ', grid_cv_XGB.best_params_)
print('Best CV accuracy: {:.4f}'.format(grid_cv_XGB.best_score_))

[11:47:52] WARNING: D:\bld\xgboost-split_1643227225381\work\src\learner.cc:576: 
Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Best hyperparameters:  {'booster': 'gbtree', 'colsample_bytree': 0.5, 'gamma': 2, 'max_depth': 8, 'min_child_weight': 1, 'n_estimators': 150, 'nthread': 0, 'objective': 'binary:logistic'}
Best CV accuracy: 0.9367

LGBMClassifier¶

Reference: https://www.kaggle.com/bitit1994/parameter-grid-search-lgbm-with-scikit-learn

In [28]:

model_LGBM = LGBMClassifier()

gridParams = {
    'learning_rate': [0.005, 0.01, 0.1],
    'n_estimators': [100, 300, 500],
    'num_leaves': [15, 31, 63], # large num_leaves helps improve accuracy but might lead to over-fitting
    'boosting_type' : ['dart', 'gbdt'], # for better accuracy -> try dart
    'objective' : ['binary'],
    'max_bin':[255], # large max_bin helps improve accuracy but might slow down training progress
    }

grid_cv_LGBM = GridSearchCV(model_LGBM, param_grid=gridParams, cv=3 , n_jobs=-1)
grid_cv_LGBM.fit(train, target)
print('Best hyperparameters: ', grid_cv_LGBM.best_params_)
print('Best CV accuracy: {:.4f}'.format(grid_cv_LGBM.best_score_))

Best hyperparameters:  {'boosting_type': 'gbdt', 'learning_rate': 0.1, 'max_bin': 255, 'n_estimators': 500, 'num_leaves': 63, 'objective': 'binary'}
Best CV accuracy: 0.9267

CatBoostClassifier¶

Reference: https://catboost.ai/en/docs/concepts/python-reference_catboost_grid_search

In [29]:

model_CAT = CatBoostClassifier(silent = True)

grid = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}

grid_cv_CAT = GridSearchCV(model_CAT, param_grid=grid, cv=3 , n_jobs=-1)
grid_cv_CAT.fit(train, target)
print('Best hyperparameters: ', grid_cv_CAT.best_params_)
print('Best CV accuracy: {:.4f}'.format(grid_cv_CAT.best_score_))

Best hyperparameters:  {'depth': 6, 'l2_leaf_reg': 9, 'learning_rate': 0.1}
Best CV accuracy: 0.9417

Submission¶

The four models (XGB + LGBM + CAT + EXTRA) are ensembled with soft voting.
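The difference between soft and hard voting is easiest to see on a few numbers. This is a minimal NumPy sketch with made-up probabilities from three hypothetical models; it illustrates the two rules for a binary target and is not the VotingClassifier implementation itself.

```python
import numpy as np

# Hypothetical P(target = 1) from three models for four samples
proba = np.array([
    [0.90, 0.40, 0.55, 0.20],  # model A
    [0.45, 0.45, 0.60, 0.30],  # model B
    [0.45, 0.95, 0.40, 0.10],  # model C
])

# Soft voting: average the probabilities first, then threshold at 0.5
soft_vote = (proba.mean(axis=0) >= 0.5).astype(int)

# Hard voting: threshold each model first, then take the majority label
votes = (proba >= 0.5).astype(int)
hard_vote = (votes.sum(axis=0) * 2 > len(proba)).astype(int)

print(soft_vote, hard_vote)
```

For the first two samples a single very confident model outweighs two mildly negative ones under soft voting, while hard voting counts only the labels, so the two rules disagree there.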

In [30]:

train = data.drop('target', axis=1)
target = data['target']

# best_model_XGB = XGBClassifier(eval_metric = 'logloss', \
#                               silent = True, \
#                               max_depth = 5, \
#                               min_child_weight = 3, \
#                                gamma = 3, \
#                                colsample_bytree = 0.5, \
#                                n_estimators=700)
# best_model_LGBM = LGBMClassifier(n_estimators=600, max_bin=400, num_leaves=24)
# best_model_CAT = CatBoostClassifier(silent=True, depth=6, l2_leaf_reg=7, learning_rate=0.1, n_estimators=500)
# best_model_EXTRA = ExtraTreesClassifier(max_depth=25, n_estimators=320)

best_model_XGB = grid_cv_XGB.best_estimator_ # estimator refit with the best hyperparameters found by the grid search
# best_model_RFC = grid_cv_RFC.best_estimator_ 
best_model_LGBM = grid_cv_LGBM.best_estimator_ 
best_model_CAT = grid_cv_CAT.best_estimator_
best_model_EXTRA = ExtraTreesClassifier(max_depth=25, n_estimators=320)
from sklearn.ensemble import VotingClassifier
softVoting_model = VotingClassifier(estimators=[('XGB', best_model_XGB), ('LGBM', best_model_LGBM), ('CAT', best_model_CAT), ('EXTRA', best_model_EXTRA)], voting='soft') 
softVoting_model.fit(train, target)

[11:52:07] WARNING: D:\bld\xgboost-split_1643227225381\work\src\learner.cc:576: 
Parameters: { "silent" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.

Out[30]:

VotingClassifier(estimators=[('XGB',
                              XGBClassifier(base_score=0.5, booster='gbtree',
                                            colsample_bylevel=1,
                                            colsample_bynode=1,
                                            colsample_bytree=0.5,
                                            enable_categorical=False,
                                            eval_metric='logloss', gamma=2,
                                            gpu_id=-1, importance_type=None,
                                            interaction_constraints='',
                                            learning_rate=0.300000012,
                                            max_delta_step=0, max_depth=8,
                                            min_child_weight=1, missing=nan,
                                            monotone_...
                                            predictor='auto', random_state=0,
                                            reg_alpha=0, reg_lambda=1,
                                            scale_pos_weight=1, silent=True,
                                            subsample=1, tree_method='exact',
                                            validate_parameters=1, ...)),
                             ('LGBM',
                              LGBMClassifier(max_bin=255, n_estimators=500,
                                             num_leaves=63,
                                             objective='binary')),
                             ('CAT',
                              <catboost.core.CatBoostClassifier object at 0x000001E55333A640>),
                             ('EXTRA',
                              ExtraTreesClassifier(max_depth=25,
                                                   n_estimators=320))],
                 voting='soft')

In [31]:

soft_pred = softVoting_model.predict(test)
plt.hist(soft_pred)

Out[31]:

(array([ 903.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,    0.,
        1097.]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <BarContainer object of 10 artists>)

In [32]:

submission = pd.read_csv('D:/Dacon/airline_satisfied_score_predict/sample_submission.csv')
submission

Out[32]:

	id	target
0	1	0
1	2	0
2	3	0
3	4	0
4	5	0
...	...	...
1995	1996	0
1996	1997	0
1997	1998	0
1998	1999	0
1999	2000	0

2000 rows × 2 columns

In [33]:

submission['target'] = soft_pred
submission

Out[33]:

	id	target
0	1	1
1	2	0
2	3	1
3	4	1
4	5	1
...	...	...
1995	1996	0
1996	1997	1
1997	1998	0
1998	1999	1
1999	2000	1

2000 rows × 2 columns

In [34]:

submission.to_csv('D:/Dacon/airline_satisfied_score_predict/submission_XGB_LGBM_CAT_EXTRA_soft_6.csv', index=False)

Submission result¶

This submission scored 0.939 (22/02/14). I will try additional preprocessing to push the score higher!
