Before Reporting
I have pulled the latest code of the main branch and run it again; the bug still exists.
I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the Data-Juicer issues and found no similar bugs.
OS
ubuntu
Installation Method
source
Data-Juicer Version
v0.2.0
Python Version
3.9.19
Describe the bug
Running `python -m tests.core.test_adapter` raises errors.
To Reproduce
Run `python -m tests.core.test_adapter` from the project root.
An error occurs. After tracing, it appears to come from this code in `Filter.run`:

```python
dataset = dataset.map(add_same_content_to_new_column,
                      fn_kwargs={
                          'new_column_name': Fields.stats,
                          'initial_value': {}
                      },
                      num_proc=self.runtime_np(),
                      batch_size=self.batch_size,
                      desc='Adding new column for stats')
```

The `initial_value` here is problematic: it is a single empty dict, so in `PerplexityFilter.compute_stats` the loop `for idx, stat in enumerate(samples_stats)` is never entered and no stats are computed, which later raises `KeyError: 'perplexity'`. Following other test examples, I replaced `'initial_value': {}` with `'initial_value': [{}] * dataset.num_rows` (P.S. I am not sure whether multiplying by `num_rows` is necessary). After this change, the `PerplexityFilter` op no longer errors.
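To make the failure mode concrete, here is a minimal pure-Python sketch of why a single empty dict as the stats container skips the computation entirely (no `datasets`/Data-Juicer dependency; `compute_stats` and the `len()`-based placeholder are hypothetical stand-ins for the real perplexity computation):

```python
def compute_stats(samples_stats, texts):
    # Stand-in for PerplexityFilter.compute_stats: iterate over the
    # per-sample stats containers and fill in a 'perplexity' value.
    # If the stats column was initialized as one empty dict, enumerate()
    # yields nothing and the loop body never runs.
    for idx, stat in enumerate(samples_stats):
        stat['perplexity'] = len(texts[idx])  # placeholder computation
    return samples_stats

# Broken case: a single empty dict for the whole batch -> loop skipped,
# 'perplexity' never set, so a later lookup raises KeyError.
broken = compute_stats({}, ['a', 'bb'])
print(broken)  # {}

# Workaround from the report: one dict per row.
num_rows = 2
fixed = compute_stats([{} for _ in range(num_rows)], ['a', 'bb'])
print(fixed)
```

One caveat about the workaround: `[{}] * num_rows` makes every row share the *same* dict object, so a write through one row is visible in all rows; `[{} for _ in range(num_rows)]` gives each row an independent dict.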
Continuing, the `PerplexityFilter` op no longer errors, but the `DocumentDeduplicator` op fails with roughly:

```
File "/root/data-juicer/data-juicer/data_juicer/ops/deduplicator/document_deduplicator.py", line 63, in _get_hash
    return hashlib.md5(txt.strip().encode('utf-8')).hexdigest()
AttributeError: 'list' object has no attribute 'strip'
```

Looking at the code, this is because the upstream `FixUnicodeMapper` op writes:

```python
samples[self.text_key] = list(
    map(
        lambda text: ftfy.fix_text(text, normalization=self.normalization),
        samples[self.text_key]))
```

so `samples[self.text_key]` is a list, which makes `DocumentDeduplicator._get_hash` fail.
Looking at the other mapper ops, the `samples[self.text_key]` they output seems to come in many shapes: lists, dicts, and strings. But `strip` only works on strings. Is the compatibility between these ops handled well enough? Do other ops have similar problems?
Could you please help clarify this when you have time? Thanks!
Configs
No response
Logs
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 17, in resource_monitor
    if mdict['stop']:
  File "<string>", line 2, in __getitem__
  File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ERROR: test_execute_and_probe (__main__.AdapterTest)
Traceback (most recent call last):
  File "/root/data-juicer/data-juicer/tests/core/test_adapter.py", line 126, in test_execute_and_probe
    resource_util_list = Adapter.execute_and_probe(ds, ops)
  File "/root/data-juicer/data-juicer/data_juicer/core/adapter.py", line 42, in execute_and_probe
    dataset, resource_util_per_op = Monitor.monitor_func(
  File "/root/data-juicer/data-juicer/data_juicer/core/monitor.py", line 201, in monitor_func
    ret = func()
  File "/root/data-juicer/data-juicer/data_juicer/ops/base_op.py", line 318, in run
    new_dataset = dataset.filter(self.process,
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3709, in filter
    indices = self.map(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3156, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 6477, in get_indices_from_mask_function
    mask.append(function(example, *additional_args, **fn_kwargs))
  File "/root/data-juicer/data-juicer/data_juicer/core/data.py", line 72, in wrapped_f
    return f(*args, **kargs)
  File "/root/data-juicer/data-juicer/data_juicer/ops/filter/perplexity_filter.py", line 87, in process
    return samples[Fields.stats][StatsKeys.perplexity] <= self.max_ppl
KeyError: 'perplexity'
```
Screenshots
No response
Additional
No response