Interpretable Classification
In this Jupyter notebook we fit an Explainable Boosting Machine (EBM), a LogisticRegression, and a ClassificationTree for classification. Once fitted, we use their glassbox nature to understand their global and local explanations.
This notebook can be found in our examples folder on GitHub.
# install interpret if not already installed
try:
    import interpret
except ModuleNotFoundError:
    !pip install --quiet interpret pandas scikit-learn
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret import show
from interpret.perf import ROC
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
X = df.iloc[:, :-1]
y = (df.iloc[:, -1] == " >50K").astype(int)
seed = 42
np.random.seed(seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
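Before modeling, it is worth checking how imbalanced the label is. A minimal sketch using the split created above:
# Share of each class (0 = <=50K, 1 = >50K) in the training split
print(y_train.value_counts(normalize=True))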
Explore the dataset
from interpret.data import ClassHistogram
hist = ClassHistogram().explain_data(X_train, y_train, name='Train Data')
show(hist)
Train the Explainable Boosting Machine (EBM)
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)
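The defaults are used here; if needed, constructor arguments such as interactions and random_state can be set explicitly. An illustrative sketch (not part of the original notebook):
# An EBM restricted to main effects only, with a fixed seed (illustrative settings)
ebm_mains_only = ExplainableBoostingClassifier(interactions=0, random_state=seed)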
EBMs are glassbox models, so we can edit them
# post-process monotonize the Age feature
ebm.monotonize("Age", increasing=True)
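To sanity-check the edit, the learned shape function for Age can be inspected directly. A hedged sketch, assuming the fitted model exposes term_names_ and term_scores_ as in recent interpret releases:
# Look up the Age term and print its (now monotonized) per-bin scores
age_idx = ebm.term_names_.index("Age")
print(ebm.term_scores_[age_idx])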
Global Explanations: What the model learned overall
ebm_global = ebm.explain_global(name='EBM')
show(ebm_global)
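The same information can be pulled out programmatically. A sketch, assuming term_importances() is available as in recent interpret releases:
# Rank terms by overall importance (mean absolute score), highest first
ranked = sorted(zip(ebm.term_names_, ebm.term_importances()), key=lambda t: -t[1])
for name, score in ranked[:10]:
    print(f"{name}: {score:.3f}")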
Local Explanations: How an individual prediction was made
ebm_local = ebm.explain_local(X_test[:5], y_test[:5], name='EBM')
show(ebm_local, 0)
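A local explanation decomposes each prediction into per-term contributions; comparing it against the raw model output is a quick consistency check. A minimal sketch:
# Predicted probability of income >50K for the same five test rows, next to the labels
print(ebm.predict_proba(X_test[:5])[:, 1])
print(y_test[:5].values)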
Evaluate EBM performance
ebm_perf = ROC(ebm).explain_perf(X_test, y_test, name='EBM')
show(ebm_perf)
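The ROC explainer reports AUC interactively; the same number can be computed directly with scikit-learn if a plain scalar is needed (a sketch, not part of the original notebook):
from sklearn.metrics import roc_auc_score

# AUC computed from the predicted probability of the positive class
print(roc_auc_score(y_test, ebm.predict_proba(X_test)[:, 1]))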
Let's test out a few other interpretable models
from interpret.glassbox import LogisticRegression, ClassificationTree
# We have to transform categorical variables to use Logistic Regression and Decision Tree
X = pd.get_dummies(X, prefix_sep='.').astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
lr = LogisticRegression(random_state=seed, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train)
tree = ClassificationTree()
tree.fit(X_train, y_train)
Compare performance using the Dashboard
lr_perf = ROC(lr).explain_perf(X_test, y_test, name='Logistic Regression')
show(lr_perf)
tree_perf = ROC(tree).explain_perf(X_test, y_test, name='Classification Tree')
show(tree_perf)
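show can also take a list of explanations to put them side by side; this opens the full comparison dashboard, which may need the dashboard provider rather than the inline provider configured above (a sketch):
# Compare all three models' ROC explanations in one dashboard view
show([ebm_perf, lr_perf, tree_perf])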
Glassbox: All of our models have global and local explanations
lr_global = lr.explain_global(name='Logistic Regression')
show(lr_global)
tree_global = tree.explain_global(name='Classification Tree')
show(tree_global)
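Local explanations work the same way for these models. A brief sketch for the first few test rows:
lr_local = lr.explain_local(X_test[:5], y_test[:5], name='Logistic Regression')
show(lr_local, 0)
tree_local = tree.explain_local(X_test[:5], y_test[:5], name='Classification Tree')
show(tree_local, 0)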