Before Reporting
I have pulled the latest code of the main branch and run it again; the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the Data-Juicer issues and found no similar bugs.
OS
ubuntu
Installation Method
source
Data-Juicer Version
v0.2.0
Python Version
3.9.19
Describe the bug
Running `python -m tests.core.test_adapter` raises errors.
To Reproduce
Run `python -m tests.core.test_adapter` from the project root.
An error occurs. After tracing, it appears to come from this code in `Filter.run`:

```python
dataset = dataset.map(add_same_content_to_new_column,
                      fn_kwargs={
                          'new_column_name': Fields.stats,
                          'initial_value': {}
                      },
                      num_proc=self.runtime_np(),
                      batch_size=self.batch_size,
                      desc='Adding new column for stats')
```

The `initial_value` here is problematic: it is a single empty dict, so in `PerplexityFilter.compute_stats` the loop `for idx, stat in enumerate(samples_stats)` is never entered and no stats are computed, which later raises `KeyError: 'perplexity'`. Following other test examples, I replaced `'initial_value': {}` with `'initial_value': [{}] * dataset.num_rows` (P.S. I am not sure whether multiplying by `num_rows` is necessary). After this change, the `PerplexityFilter` op no longer errors.
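To make the failure mode concrete, here is a minimal pure-Python sketch of why a single empty dict as the stats container skips the computation entirely (no `datasets`/Data-Juicer dependency; `compute_stats` and the `len()`-based placeholder are hypothetical stand-ins for the real perplexity computation):

```python
def compute_stats(samples_stats, texts):
    # Stand-in for PerplexityFilter.compute_stats: iterate over the
    # per-sample stats containers and fill in a 'perplexity' value.
    # If the stats column was initialized as one empty dict, enumerate()
    # yields nothing and the loop body never runs.
    for idx, stat in enumerate(samples_stats):
        stat['perplexity'] = len(texts[idx])  # placeholder computation
    return samples_stats

# Broken case: a single empty dict for the whole batch -> loop skipped,
# 'perplexity' never set, so a later lookup raises KeyError.
broken = compute_stats({}, ['a', 'bb'])
print(broken)  # {}

# Workaround from the report: one dict per row.
num_rows = 2
fixed = compute_stats([{} for _ in range(num_rows)], ['a', 'bb'])
print(fixed)
```

One caveat about the workaround: `[{}] * num_rows` makes every row share the *same* dict object, so a write through one row is visible in all rows; `[{} for _ in range(num_rows)]` gives each row an independent dict.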
Continuing, the `PerplexityFilter` op no longer errors, but the `DocumentDeduplicator` op fails with roughly:

```
File "/root/data-juicer/data-juicer/data_juicer/ops/deduplicator/document_deduplicator.py", line 63, in _get_hash
    return hashlib.md5(txt.strip().encode('utf-8')).hexdigest()
AttributeError: 'list' object has no attribute 'strip'
```

Looking at the code, this is because the upstream `FixUnicodeMapper` op writes:

```python
samples[self.text_key] = list(
    map(
        lambda text: ftfy.fix_text(text, normalization=self.normalization),
        samples[self.text_key]))
```

so `samples[self.text_key]` is a list, which makes `DocumentDeduplicator._get_hash` fail.
Looking at the other mapper ops, the `samples[self.text_key]` they output seems to come in many shapes: lists, dicts, and strings. But `strip` only works on strings. Is the compatibility between these ops handled well enough? Do other ops have similar problems?
Could you please help clarify this when you have time? Thanks!
Configs
No response
Logs
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 17, in resource_monitor
    if mdict['stop']:
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR: test_execute_and_probe (__main__.AdapterTest)
Traceback (most recent call last):
  File "/root/data-juicer/data-juicer/tests/core/test_adapter.py", line 126, in test_execute_and_probe
    resource_util_list = Adapter.execute_and_probe(ds, ops)
  File "/root/data-juicer/data-juicer/data_juicer/core/adapter.py", line 42, in execute_and_probe
    dataset, resource_util_per_op = Monitor.monitor_func(
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 201, in monitor_func
    ret = func()
  File "/root/data-juicer/data-juicer/data_juicer/ops/base_op.py", line 318, in run
    new_dataset = dataset.filter(self.process,
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3709, in filter
    indices = self.map(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 6477, in get_indices_from_mask_function
    mask.append(function(example, *additional_args, **fn_kwargs))
  File "/root/data-juicer/data-juicer/data_juicer/core/data.py", line 72, in wrapped_f
    return f(*args, **kargs)
  File "/root/data-juicer/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 87, in process
    return samples[Fields.stats][StatsKeys.perplexity] <= self.max_ppl
KeyError: 'perplexity'
```
Screenshots
No response
Additional
No response