EBM Internals - Binary classification

This is Part 2 of a three-part series describing EBM internals and how to make predictions. See also Part 1 and Part 3.

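In Part 2 we cover binary classification, interactions, missing values, ordinals, and the reduced binning resolution used for interactions. Before reading this part, you should be familiar with the material in Part 1.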

# boilerplate
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
import numpy as np

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

# make a dataset composed of an ordinal categorical, and a continuous feature
X = [["low", 8.0], ["medium", 7.0], ["high", 9.0], [None, None]]
y = ["apples", "apples", "oranges", "oranges"]

# Fit a classification EBM with 1 interaction
# Define an ordinal feature with specified ordering
# Limit the number of interaction bins to force a lower resolution
# Eliminate the validation set to handle the small dataset
ebm = ExplainableBoostingClassifier(
    interactions=1,
    feature_types=[["low", "medium", "high"], 'continuous'], 
    max_interaction_bins=4,
    validation_size=0, outer_bags=1, max_rounds=900, min_samples_leaf=1, min_hessian=1e-9)
ebm.fit(X, y)
show(ebm.explain_global())

/opt/hostedtoolcache/Python/3.9.21/x64/lib/python3.9/site-packages/interpret/glassbox/_ebm/_ebm.py:812: UserWarning: Missing values detected. Our visualizations do not currently display missing values. To retain the glassbox nature of the model you need to either set the missing values to an extreme value like -1000 that will be visible on the graphs, or manually examine the missing value score in ebm.term_scores_[term_index][0]
  warn(

print(ebm.classes_)

['apples' 'oranges']

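Like all scikit-learn classifiers, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are strings, but we also accept integers, as we will see in Part 3.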

print(ebm.feature_types)

[['low', 'medium', 'high'], 'continuous']

In this example we passed feature_types to the __init__ function of ExplainableBoostingClassifier. Per scikit-learn convention, it is recorded unchanged on the ebm object.

print(ebm.feature_types_in_)

['ordinal', 'continuous']

The feature_types passed to __init__ were resolved into the base feature types ['ordinal', 'continuous']. Following the spirit of scikit-learn's SLEP007 convention, we record this in ebm.feature_types_in_.

print(ebm.feature_names_in_)

['feature_0000', 'feature_0001']

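Since we did not specify feature names, the model created default names. If we had passed feature_names to the __init__ function of ExplainableBoostingClassifier, or if we had used a Pandas dataframe with column names, then ebm.feature_names_in_ would contain those names.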

print(ebm.term_features_)

[(0,), (1,), (0, 1)]

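Our model contains 3 additive terms. The first two terms are the main effect features, and the third term is the pairwise interaction between the two features. EBMs are not limited to main effects and pairs. We also support 3-way interactions, 4-way interactions, and higher-order interactions. If the model contained any higher-order interactions, they would be listed as further tuples of feature indices in ebm.term_features_.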

print(ebm.term_names_)

['feature_0000', 'feature_0001', 'feature_0000 & feature_0001']

ebm.term_names_ is a convenience attribute that joins ebm.term_features_ and ebm.feature_names_in_ to create a name for each additive term.

ebm.term_names_ is the result of:

term_names = [" & ".join(ebm.feature_names_in_[i] for i in grp) for grp in ebm.term_features_]

print(ebm.bins_)

[[{'low': 1, 'medium': 2, 'high': 3}], [array([7.5, 8.5]), array([8.5])]]

ebm.bins_ is a per-feature attribute. As described in Part 1, ebm.bins_ defines how to bin both categorical ('nominal' and 'ordinal') and 'continuous' features.

For categorical features we use a dictionary that maps the category strings to bin indexes.

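As described in Part 1, continuous feature binning is defined by a list of cut points that partition the continuous range into regions. In this example, the continuous feature has 3 unique values in our dataset: 7.0, 8.0, and 9.0. As in Part 1, the main effect in this example has 2 cut points that separate these values into 3 regions. The main effect cut points are again 7.5 and 8.5.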
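As a small illustration of the cut points above, the following sketch bins the three unique values with np.digitize. The cut points are copied by hand from the printed ebm.bins_ output rather than read from a fitted model, and the variable names are ours.

```python
import numpy as np

# Main-effect cut points for the continuous feature (copied from ebm.bins_[1][0])
main_cuts = np.array([7.5, 8.5])

# The three unique values of the continuous feature in the dataset
values = np.array([7.0, 8.0, 9.0])

# np.digitize yields 0, 1, 2 for the three regions;
# add 1 because bin index 0 is reserved for missing values
bin_indexes = np.digitize(values, main_cuts) + 1
print(bin_indexes)
```

Each value lands in its own bin (indexes 1, 2, and 3), which is why the main effect can assign a separate score to each of them.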

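EBMs support reducing the binning resolution when a feature is binned for an interaction. In the call to the __init__ function of ExplainableBoostingClassifier we specified max_interaction_bins=4, which limits the EBM to creating only 4 bins when binning for interactions. Two of those bins are reserved for 'missing' and 'unseen', which leaves only 2 bins for the continuous feature values. Our dataset has 3 unique values, however, so the EBM is forced to decide which values to group together and to choose a single cut point that separates them into 2 regions. In this example, the EBM could have chosen any cut point between 7.0 and 9.0. It chose 8.5, which puts 7.0 and 8.0 in the lower bin and 9.0 in the upper bin.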

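The binning definitions for main effects and interactions are stored in a per-feature list in the ebm.bins_ attribute. In this example, ebm.bins_[1] contains a list of arrays: [array([7.5, 8.5]), array([8.5])]. The first array, [7.5, 8.5] at ebm.bins_[1][0], is the binning resolution for the main effect. The second array, [8.5] at ebm.bins_[1][1], is the binning resolution used when binning for pairwise interactions.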

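Reduced binning resolution is not limited to pairs. If an even lower resolution were needed for triples, the list would contain a third array of cut points. The last item in the list is the resolution used for all interaction orders at or above that position. If the EBM's list contained only the [7.5, 8.5] resolution, then that resolution would be used for main effects, pairs, triples, and higher-order interactions.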
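The selection rule just described can be sketched as follows. The helper select_cuts is a hypothetical name introduced here for illustration and is not part of the interpret API. The bin list is copied by hand from the printed ebm.bins_[1] output.

```python
import numpy as np

def select_cuts(feature_bins, n_term_features):
    """Hypothetical helper: pick the binning level for a term that
    involves n_term_features features (1 = main effect, 2 = pair, ...).
    The last level in the list is reused for all higher orders."""
    level = min(n_term_features, len(feature_bins)) - 1
    return feature_bins[level]

# Copied from the printed ebm.bins_[1]
feature_bins = [np.array([7.5, 8.5]), np.array([8.5])]
values = np.array([7.0, 8.0, 9.0])

main_cuts = select_cuts(feature_bins, 1)    # main-effect resolution
pair_cuts = select_cuts(feature_bins, 2)    # pair resolution
triple_cuts = select_cuts(feature_bins, 3)  # falls back to the last level

# Add 1 because bin index 0 is reserved for missing values
main_bins = np.digitize(values, main_cuts) + 1
pair_bins = np.digitize(values, pair_cuts) + 1
print(main_bins, pair_bins)
```

With the pair resolution, 7.0 and 8.0 share bin index 1 while 9.0 falls in bin index 2, matching the grouping described above.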

print(ebm.term_scores_[0])

[ 12.56350354 -11.18647629 -11.17109277   9.79406552   0.        ]

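ebm.term_scores_[0] is the lookup table for the first feature in this example. Since the first feature is an ordinal categorical, we use the dictionary {'low': 1, 'medium': 2, 'high': 3} to look up which bin to use for each category string. If the feature value is NaN (missing), we use the score at index 0. If the feature value is "low", we use the score at index 1. If the feature value is "medium", we use the score at index 2. If the feature value is "high", we use the score at index 3. If the feature value is anything else, we use the score at index 4.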

In this example the score in bin 0 is non-zero because our dataset included a missing value for this feature.
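The lookup rules for the ordinal feature can be sketched as below. The dictionary and scores are copied by hand from the printed ebm.bins_ and ebm.term_scores_[0] outputs, and ordinal_score is a hypothetical helper name, not an interpret function.

```python
import numpy as np

# Copied from the printed ebm.bins_[0][0] and ebm.term_scores_[0]
categories = {"low": 1, "medium": 2, "high": 3}
scores = np.array([12.56350354, -11.18647629, -11.17109277, 9.79406552, 0.0])

def ordinal_score(value):
    """Hypothetical helper: return the term score for one ordinal value."""
    if value is None:
        return scores[0]            # bin 0 is reserved for missing values
    unseen_bin = len(scores) - 1    # the last bin is reserved for unseen values
    return scores[categories.get(value, unseen_bin)]

for value in ["low", "medium", "high", None, "extreme"]:
    print(value, ordinal_score(value))
```

The category "extreme" never appeared in training, so it falls through to the last bin, whose score is 0.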

print(ebm.term_scores_[1])

[ 12.51538503 -11.24334993 -11.21543351   9.94339841   0.        ]

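ebm.term_scores_[1] is the lookup table for the second feature in this example. Since the second feature is continuous, we use cut points for binning. Bin index 0 is again reserved for missing values, and the last bin index is again reserved for unseen values. In this example the score in bin 0 is non-zero because our dataset included a missing value for this feature.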

The ebm.bins_[1] attribute contains a list with 2 arrays of cut points. Here we are binning a main effect feature, so we use the bins at index 0, which is ebm.bins_[1][0].
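Putting the missing-value rule and the main-effect cut points together, a lookup for the continuous feature might look like the sketch below. The values are copied by hand from the printed outputs, and continuous_score is a hypothetical helper name.

```python
import numpy as np

# Copied from the printed ebm.bins_[1][0] and ebm.term_scores_[1]
main_cuts = np.array([7.5, 8.5])
scores = np.array([12.51538503, -11.24334993, -11.21543351, 9.94339841, 0.0])

def continuous_score(value):
    """Hypothetical helper: return the main-effect score for one value."""
    # None and NaN are both treated as missing (NaN is not equal to itself)
    if value is None or value != value:
        return scores[0]
    # Add 1 because bin index 0 is reserved for missing values
    return scores[np.digitize(value, main_cuts) + 1]

for value in [7.0, 8.0, 9.0, None]:
    print(value, continuous_score(value))
```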

print(ebm.term_scores_[2])

[[ 0.5257299  -0.3892701  -0.26111878  0.        ]
 [-0.0892701  -0.8397803  -0.03662897  0.        ]
 [-0.22928273 -0.66705678  0.35110718  0.        ]
 [-0.32111878 -0.03889283  0.98110718  0.        ]
 [ 0.          0.          0.          0.        ]]

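ebm.term_scores_[2] is the lookup table for the pair term in this example, which is composed of both features. The features involved in the pair can be found in ebm.term_features_[2]. The pair lookup table is two-dimensional, so indexing into it requires two indexes. The first index is the bin index of the first feature, and the second index is the bin index of the second feature. For example: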

pair_scores = ebm.term_scores_[2]

local_score = pair_scores[(feature_0_index, feature_1_index)]
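As a worked instance of this two-index lookup, the sketch below scores the sample ["high", 9.0] against the pair table. The table, dictionary, and pair cut point are copied by hand from the printed outputs above rather than read from a fitted model.

```python
import numpy as np

# Copied from the printed ebm.term_scores_[2]:
# rows are bins of feature 0 (missing, low, medium, high, unseen),
# columns are interaction bins of feature 1 (missing, lower, upper, unseen)
pair_scores = np.array([
    [ 0.5257299,  -0.3892701,  -0.26111878, 0.0],
    [-0.0892701,  -0.8397803,  -0.03662897, 0.0],
    [-0.22928273, -0.66705678,  0.35110718, 0.0],
    [-0.32111878, -0.03889283,  0.98110718, 0.0],
    [ 0.0,         0.0,         0.0,        0.0],
])
categories = {"low": 1, "medium": 2, "high": 3}  # ebm.bins_[0][0]
pair_cuts = np.array([8.5])                      # ebm.bins_[1][1]

sample = ["high", 9.0]
feature_0_index = categories[sample[0]]
# Use the lower-resolution pair cuts; add 1 because bin 0 means missing
feature_1_index = int(np.digitize(sample[1], pair_cuts)) + 1

local_score = pair_scores[(feature_0_index, feature_1_index)]
print(feature_0_index, feature_1_index, local_score)
```

A sample with both values missing would use index (0, 0) instead, which selects the top-left entry of the table.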

Sample code

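Finally, here is some code that puts the above considerations together into a routine that can make predictions for simplified scenarios. This code does not handle regression, multiclass, unseen values, or interactions beyond pairs.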

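If you need a complete drop-in function that works in all EBM scenarios, see the multiclass example in Part 3, which handles regression and binary classification in addition to multiclass, along with all the other nuances.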

sample_scores = []
for sample in X:
    # start from the intercept for each sample
    # intercept_ may be a 1-element array, so extract its single element explicitly
    score = float(np.ravel(ebm.intercept_)[0])
    
    # We have 3 terms: two main effects, and 1 pair interaction
    for term_idx, features in enumerate(ebm.term_features_):
        # indexing into a tensor requires a multi-dimensional index
        tensor_index = []

        # main effects will have 1 feature, and pairs will have 2 features
        for feature_idx in features:
            feature_val = sample[feature_idx]
            bin_idx = 0  # if missing value, use bin index 0

            # None and NaN both mean missing; NaN is the only value not equal to itself
            if feature_val is not None and feature_val == feature_val:
                # we bin differently for main effects and pairs,
                # so determine which resolution is needed
                if len(features) == 1 or len(ebm.bins_[feature_idx]) == 1:
                    # this is a main effect or only one bin level
                    # is available, so use the highest resolution bins
                    bins = ebm.bins_[feature_idx][0]
                elif len(features) == 2 or len(ebm.bins_[feature_idx]) == 2:
                    # use the lower resolution bins
                    bins = ebm.bins_[feature_idx][1]
                else:
                    raise Exception("Unsupported bin resolution")

                if isinstance(bins, dict):
                    # categorical feature
                    bin_idx = bins[feature_val]
                else:
                    # continuous feature
                    # add 1 because the 0th bin is reserved for 'missing'
                    bin_idx = np.digitize(feature_val, bins) + 1

            tensor_index.append(bin_idx)
        # local_score is also the local feature importance
        local_score = ebm.term_scores_[term_idx][tuple(tensor_index)]
        score += local_score
    sample_scores.append(score)

logits = np.array(sample_scores)

# use the sigmoid function to convert the logits into probabilities
probabilities = 1 / (1 + np.exp(-logits))

print("probability of " + ebm.classes_[1])
print(ebm.predict_proba(X)[:, 1])
print(probabilities)

probability of oranges
[1.93722861e-10 2.27379223e-10 1.00000000e+00 1.00000000e+00]
[1.93722861e-10 2.27379223e-10 1.00000000e+00 1.00000000e+00]

For regression, our default link function is the identity link function, so the scores are the actual predictions.

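For classification, the scores are logits, and we need to apply the inverse link function to calculate the probabilities. For binary classification the inverse link function is the sigmoid function.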
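To make the link function concrete, here is a minimal sketch of the sigmoid and its inverse, the logit. The logits are made-up values for illustration and are not taken from the model above. The two-column layout assumes, as in the predict_proba call above, that column 1 holds the probability of ebm.classes_[1].

```python
import numpy as np

# Made-up logits for illustration only
logits = np.array([-2.0, 0.0, 3.0])

# Inverse link for binary classification: the sigmoid function
p_positive = 1.0 / (1.0 + np.exp(-logits))

# Two-column layout: column 0 is the first class, column 1 the second
proba = np.column_stack([1.0 - p_positive, p_positive])

# The link function (logit) maps the probabilities back to the scores
recovered = np.log(p_positive / (1.0 - p_positive))

print(proba)
print(recovered)
```

A logit of 0 corresponds to a probability of exactly 0.5, and each row of proba sums to 1.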

Exactly as with regression in Part 1, the local_score variable holds the values that are shown in the local explanations.

show(ebm.explain_local(X, y), 0)