Why is my data drifting? #7798
Machine learning (ML) models deployed in production are usually paired with systems to monitor possible dataset drift. MLOps systems are designed to trigger alerts when drift is detected, but in order to make decisions about the strategy to follow next, we also need to understand what is actually changing in our data and what kind of abnormality the model is facing.
应用在生产中的机器学习(ML)模型经常配备了检测数据漂移的系统。人们设计了 MLOps 系统用于在检测到漂移时发出警报,但是为了决定接下来要使用的策略,我们也需要知道数据中什么地方改变了,以及模型在面对什么样的异常。
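
应用在生产中的机器学习(ML)模型经常配备了检测数据漂移的系统 => 应用于实际项目的机器学习(ML)模型通常都配置了检测数据漂移的系统。
(Suggested wording, in English: "Machine learning (ML) models applied to real-world projects are usually configured with a system for detecting data drift.")
人们设计了 MLOps 系统用于在检测到漂移时发出警报 => MLOps 系统就是其中之一,它可以在检测到漂移时发出警报
(Suggested wording, in English: "The MLOps system is one of them; it can raise an alert when drift is detected.")
但是为了决定接下来要使用的策略,我们也需要知道数据中什么地方改变了,以及模型在面对什么样的异常。 =>
但是我们还需要知道数据中哪些部分改变了,以及模型发生了什么样的异常,以此来决定后续策略。
(Suggested wording, in English: "But we also need to know which parts of the data changed and what kind of abnormality the model ran into, in order to decide the follow-up strategy.")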

Aberrations can appear in incoming data for many reasons: noisy data collection, poorly performing sensors, data poisoning attacks, and more. These examples of data corruptions are a type of covariate shift that can be efficiently captured by drift detectors analyzing the feature distributions. For a refresher on dataset shift, have a look at [this blog post](https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6) [1].
很多因素会导致得到的数据出现异常:有噪声的数据采集、性能较差的传感器、以及数据中毒攻击等等。这些数据损坏的例子是协方差漂移的一种,这种漂移可以被分析特征分布的漂移检测器捕获。如果想要复习数据集的漂移,可以参考[这篇文章](https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6) [1]。
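
很多因素会导致得到的数据出现异常 => 导致得到的数据出现异常的因素有很多
(Suggested wording, in English: "There are many factors that cause anomalies in the incoming data.")
这种漂移可以被分析特征分布的漂移检测器捕获 => 用于分析特征分布的漂移检测器可以有效捕获这种漂移
(Suggested wording, in English: "Drift detectors that analyze feature distributions can capture this kind of drift efficiently.")
如果想要复习数据集的漂移 => 欲复习数据集漂移的相关内容
(Suggested wording, in English: "To review the material on dataset drift".)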

This post describes how to leverage a domain-discriminative classifier to identify the most atypical features and samples and shows how to use SHapley Additive exPlanations (SHAP) to boost the analysis of the data corruption.
这篇文章描述了如何应用域判别分类器来识别最不正常的特征和样本,并且演示了如何使用 SHAP 来加速数据损坏的分析。
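
这篇文章描述了如何应用域判别分类器来识别最不正常的特征和样本 => 这篇文章介绍了如何应用域判别分类器来识别极端异常的特征和样本
(Suggested wording, in English: "This post introduces how to apply a domain-discriminative classifier to identify extremely atypical features and samples.")
加速数据损坏的分析 => 进行数据损坏情况的分析
(Suggested wording, in English: "to carry out the analysis of the data corruption", replacing "to speed up the analysis".)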

We train a predictor for this binary task on a random split of the dataset, constituting our source training set. We are happy with the trained model and deploy it in production together with a drift monitoring system.
我们在这个数据集上选了一个随机片段作为我们的训练集,并在这个训练集上为这个二分类任务训练了一个预测器。我们对这个训练的模型很满意,并将它和一个漂移检测系统一同应用在生产中。

并将它和一个漂移检测系统一同应用在生产中 => 并将它和一个漂移检测系统同时部署在应用程序中。
(Suggested wording, in English: "and deploy it in the application together with a drift detection system.")

Personally, I feel that “应用程序” (application program) is a bit narrow here. Would plain “应用” (application) be better?

That works too. Still, I think “应用程序” is more appropriate for the development stage; once the model is live, either is fine.

The remaining part of the adult dataset represents the dataset provided at production time. Unfortunately, a part of this target-domain dataset is corrupted.
这个成年人数据集的剩余部分代表在实际生产中提供的数据集。不幸的是,这一目标域数据集的一部分损坏了。

这个成年人数据集的剩余部分代表在实际生产中提供的数据集 => 这个成年人数据集的剩余部分是真实有效的数据。
(Suggested wording, in English: "The remaining part of the adult dataset is real, valid data.")

In this specific situation, we would probably recognize easily (and find suspicious) that all retrieved samples have constant values for some features, but this might not be the case in general. However, in some drift scenarios where the shift occurs at the distribution level, such as selection bias, looking at individual samples is not very useful. They would just be regular samples from a subpopulation of the source dataset, thus technically not an aberration. However, as we cannot know beforehand what kind of shift we are facing, it’s still a good idea to have a look at the individual samples !
在这个特别的情形下,我们可能轻易就能识别到(并且发现可疑之处),所有获取到的样本在某些特定特征上都是常量,但这可能并不是普遍规律。然而,在一些漂移出现在分布层级上的情形下,例如选择性偏差,观察个别样本就不是那么有用了。他们可能只是在源数据集的一个子集内的常规样本,因此技术上讲不能算作异常。但是,毕竟我们无法事先知道我们在面对什么样的漂移,所以看一看个别样本依然是很好的思路!

所有获取到的样本在某些特定特征上都是常量 => 所有获取到的样本的某些特征值都是常量
(Suggested wording, in English: "some feature values of all retrieved samples are constant".)
在一些漂移出现在分布层级上的情形下 => 如果漂移出现于分布层级
(Suggested wording, in English: "if the drift occurs at the distribution level".)
所以看一看个别样本依然是很好的思路 => 观察一下个别样本依然是个好办法
(Suggested wording, in English: "looking at individual samples is still a good approach".)
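
For readers following the passage above, the "constant values for some features" check can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `constant_features` and the sample rows are hypothetical, and the feature names are borrowed from the adult dataset discussed in the article.

```python
from typing import Dict, List, Sequence


def constant_features(samples: Sequence[Dict[str, object]]) -> List[str]:
    """Return the names of features that take a single value across all samples."""
    if not samples:
        return []
    names = list(samples[0].keys())
    # A feature is "constant" when the set of its observed values has one element.
    return [name for name in names if len({row[name] for row in samples}) == 1]


# Hypothetical atypical samples: `race` and `marital_status` are stuck at one value.
atypical = [
    {"age": 39, "race": "Other", "marital_status": "Widowed"},
    {"age": 52, "race": "Other", "marital_status": "Widowed"},
    {"age": 27, "race": "Other", "marital_status": "Widowed"},
]

print(constant_features(atypical))  # ['race', 'marital_status']
```

As the passage notes, such a check only helps for this kind of corruption; a shift at the distribution level (for example selection bias) would not show up as constant values.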

A SHAP decision plot displaying the top-100 most atypical samples, like the one in Figure 6, where each curve represents one atypical sample, can help us see what is drifting. We also see it going towards higher domain classifier drift probabilities.
图 6 所示的是 SHAP 决定曲线图,其中每条曲线代表一个异常样本。这种图表可以帮助我们发现什么在漂移。我们也可以发现曲线在朝向更高的域分类器漂移评分变化。

这种图表可以帮助我们发现什么在漂移 => 这种图表可以帮助我们发现漂移的情况
(Suggested wording, in English: "This kind of plot can help us see the drift that is taking place.")
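
The selection of the "top-100 most atypical samples" mentioned above can be sketched as a ranking by the domain classifier's drift probability. This is a minimal sketch, assuming the probabilities have already been computed; the name `top_k_atypical` and the probability values are hypothetical, and the plotting step itself is not shown.

```python
from typing import List, Sequence


def top_k_atypical(drift_probabilities: Sequence[float], k: int = 100) -> List[int]:
    """Return the indices of the k samples with the highest drift probability."""
    order = sorted(
        range(len(drift_probabilities)),
        key=lambda i: drift_probabilities[i],
        reverse=True,
    )
    return order[:k]


# Hypothetical domain-classifier probabilities that each sample is from the new domain.
probs = [0.12, 0.97, 0.55, 0.91, 0.30]

print(top_k_atypical(probs, k=2))  # [1, 3]
```

The returned indices would then be used to pick the rows passed to whatever plotting or inspection step follows.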

![Figure 6. SHAP decision plot. Each curve represents one of the top-100 most atypical samples. The top features are the most contributing to make the sample atypical and ‘pushing’ the domain classifier probability for New domain towards higher values.](https://cdn-images-1.medium.com/max/2350/0*TMcQpeyCp2sajxxu)
![图 6. SHAP 决定曲线图。每条曲线代表 100 个异常最显著的样本之一。最上方的特征是对特征的异常贡献最大的,并且 “推动” 域分类器给出的属于新域的概率向更大的数值靠拢](https://cdn-images-1.medium.com/max/2350/0*TMcQpeyCp2sajxxu)

并且 “推动” 域分类器给出的属于新域的概率向更大的数值靠拢 => 并且大大增加了域分类器判断样本属于新域的概率
(Suggested wording, in English: "and greatly increase the probability with which the domain classifier judges a sample to belong to the new domain".)

Of course nothing can replace a standard analysis of feature distributions, especially now that we can select the most suspicious features to focus on. In Figure 8, we can look at the distribution of the drifted features for the top-100 atypical samples in red, and compare them with the baseline of samples from the source domain training set. As discriminative analysis is more intuitive for humans, this is a simple way to highlight what kind of drift is going on in the new dataset. In this example, looking at the feature distributions we can immediately spot that feature values are constant and don’t respect the expected distribution.
当然,没有什么能取代对特征分布的标准分析,尤其是在我们可以选择重点关注那些最可疑的特征的时候。在图 8 中,我们将 100 个异常最显著样本的漂移特征的分布用红色标出,并将它们和源训练集的分布进行比较。判别分析更符合人类的直觉,所以这是一种判断新数据集漂移种类的简单手段。在本例中,通过观察特征分布,我们可以马上察觉到特征取值是常量,这并不符合期望的分布。

没有什么能取代对特征分布的标准分析 => 对特征分布的标准分析还是必不可少的
(Suggested wording, in English: "a standard analysis of feature distributions is still indispensable".)
我们可以马上察觉到特征取值是常量 => 我们可以马上发现特征取值是常量
(Suggested wording, in English: "we can immediately find that the feature values are constant", replacing "notice" with "find".)
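
The comparison described above, between a drifted feature's distribution in the atypical samples and in the source training set, can be sketched with plain relative frequencies. This is an illustration only: `value_frequencies` and both value lists are hypothetical, and a real analysis would use the actual columns and a plot such as the article's Figure 8.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def value_frequencies(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Hypothetical values of one drifted feature (`race`) in the two sets.
source_race = ["White", "Black", "White", "Asian-Pac-Islander"]
atypical_race = ["Other", "Other", "Other", "Other"]

source_freq = value_frequencies(source_race)
atypical_freq = value_frequencies(atypical_race)

# Print the two distributions side by side, one row per observed value.
for value in sorted(set(source_freq) | set(atypical_freq)):
    print(
        f"{value:20s} source={source_freq.get(value, 0.0):.2f} "
        f"atypical={atypical_freq.get(value, 0.0):.2f}"
    )
```

In this made-up example the atypical set puts all of its mass on a single value, which is the kind of pattern the passage describes as immediately visible.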

When monitoring deployed models for unexpected data changes, we can take advantage of drift detectors, such as the domain classifier, to also identify atypical samples in case of drift alert. We can streamline the analysis of a drift scenario by highlighting the most drifted features to investigate. This selection can be done thanks to feature importance measures of the domain classifier.
当我们将模型应用于意料之外的数据变动,并想监控模型时,我们可以使用域分类器等漂移检测器,在发现漂移时识别异常样本。标出漂移最严重的样本并深入调查,这一系列步骤可以组织成为漂移分析的流水线。而异常能被标记则是多亏了域分类器的重要性衡量准则。

而异常能被标记则是多亏了域分类器的重要性衡量准则 => 而异常能被标记应该归功于域分类器的重要性衡量准则
(Suggested wording, in English: "and the fact that anomalies can be flagged should be credited to the domain classifier's importance measures".)

@SamYu2000 Thanks for proofreading. It was very thorough, and it made me realize that I translated many passages too literally :/
I have gone through the proofreading comments. For a few places I personally think there is a better translation, so I wrote those in the review. Please take a look when you have time~

![Figure 2. Comparison of importance ranks attributed to features: the lower the rank, the more drifted the feature is considered to be. The SHAP rank is based on average absolute Shapley values per feature in the whole test set. The domain classifier rank is given by the Mean Decrease of Impurity due to a feature.](https://cdn-images-1.medium.com/max/2110/0*odM1VlEqPkGFGFMv)
![图 2. 特征重要性排名的比较:排名数值越低,对应的特征就被认为有更严重额漂移。SHAP 排名是基于每一个特征在全部测试集内的平均绝对沙普利值计算的。域分类器排名则是根据特征的平均不纯度降低给出的。](https://cdn-images-1.medium.com/max/2110/0*odM1VlEqPkGFGFMv)

After looking at the changes below, “等级” (rank/level) seems more fitting here than “排名” (ranking). So shall we standardize on “等级” throughout?
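
For context on the term under discussion, the "rank" in the Figure 2 caption is an ordering of features by importance, with rank 1 being the most drifted. A minimal sketch of the SHAP-style rank (average absolute Shapley value per feature) is given below. The helper names and the small matrix of Shapley values are hypothetical; in practice the values would come from an explainer run on the domain classifier.

```python
from typing import Dict, Sequence


def mean_abs_importance(
    values: Sequence[Sequence[float]], names: Sequence[str]
) -> Dict[str, float]:
    """Average the absolute per-sample values of each feature (one column per feature)."""
    n_samples = len(values)
    return {
        name: sum(abs(row[j]) for row in values) / n_samples
        for j, name in enumerate(names)
    }


def importance_ranks(importance: Dict[str, float]) -> Dict[str, int]:
    """Map each feature to its rank: 1 is the most important (most drifted)."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


# Hypothetical Shapley values: one row per sample, one column per feature.
feature_names = ["age", "race", "fnlwgt"]
shapley_values = [
    [0.01, 0.30, -0.10],
    [-0.02, 0.25, 0.15],
    [0.00, -0.35, 0.05],
]

ranks = importance_ranks(mean_abs_importance(shapley_values, feature_names))
print(ranks)  # {'race': 1, 'fnlwgt': 2, 'age': 3}
```

The same `importance_ranks` helper could be applied to any other importance measure, such as the Mean Decrease of Impurity mentioned in the caption, to compare the two rankings.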

Instead of arbitrarily selecting the top-3 drifted features, one way of identifying drifted features is to compare the feature importance with a uniform importance (1/n_features) corresponding to undistinguishable domains. Then, we would spot the features that stand out, like in Figure 3 below, where **race**, **marital_status** and **fnlwgt** clearly show up.
实际中并不是随意地选择 3 个漂移最严重的特征。作为替代,一种方法是将特征的重要性值和在未识别的域中均匀分布的特征重要性值(特征总数的倒数)做对比。之后,我们就可以识别出那些突出的特征。正如下面图 3 所示的那样,**race**,**marital_status**,和 **fnlwgt** 就凸显出来了。

I agree with the change here, but there is one issue: this is the start of a paragraph, so I think the transition word is better set off on its own. What I mean is:
『但并不是随意地选择 3 个漂移最严重的特征』 => 『但是,并不是随意地选择 3 个漂移最严重的特征』
(In English: "But, rather than arbitrarily selecting the 3 most drifted features", with the conjunction followed by a comma.)
Does that seem OK?
@SamYu2000 OK, no problem :)
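
For reference, the comparison with a uniform importance (1/n_features) described in this hunk can be sketched as follows. This is a hedged illustration: the function `standout_features`, the rule "strictly above the uniform baseline", and all importance values are assumptions made for the example, not the article's code or results.

```python
from typing import Dict, List


def standout_features(importance: Dict[str, float]) -> List[str]:
    """Return features whose normalized importance exceeds the uniform 1/n_features level."""
    total = sum(importance.values())
    baseline = 1.0 / len(importance)  # importance each feature would get if domains were indistinguishable
    normalized = {name: value / total for name, value in importance.items()}
    selected = [name for name, value in normalized.items() if value > baseline]
    # Most important feature first.
    return sorted(selected, key=normalized.get, reverse=True)


# Hypothetical domain-classifier feature importances (they sum to 1 here).
importance = {
    "race": 0.35,
    "marital_status": 0.30,
    "fnlwgt": 0.20,
    "age": 0.05,
    "education": 0.04,
    "sex": 0.03,
    "hours_per_week": 0.03,
}

print(standout_features(importance))  # ['race', 'marital_status', 'fnlwgt']
```

With seven features the uniform level is about 0.143, so only the three features above it are reported, which avoids fixing the number of selected features in advance.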
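
Revisions have been completed according to the proofreading comments.
@chzh9311 It has been merged~ Please publish it to Juejin soon and send me the link so that I can add your points promptly. The Juejin Translation Program has its own Zhihu column, and you are welcome to submit there too; I recommend using a handy plugin.
@lsvih Published~
Translation completed, resolve #7781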