A question regarding the translation of long texts. #28

Closed
weizjajj opened this issue Jun 26, 2024 · 5 comments
Labels
question Further information is requested

Comments

@weizjajj

In your code, long texts are segmented so that they do not exceed the model's maximum token limit. However, during the actual translation step, all of the segmented pieces are still included in the context.
[image: problem]

@j-dominguez9 j-dominguez9 added the question Further information is requested label Jun 28, 2024
@j-dominguez9
Collaborator

Hi @weizjajj, yes, that is accurate. We found in our testing that including the full text allows for better reflections and translations, since the model has context of the entire text, which would be lost if it were processed piecemeal. Even when a text is under the ~4k-token output limit allowed by current LLM vendors, the reflections, and thus the translations, were better when processing at a smaller scale (~1k tokens). If you have any further questions, let me know. If not, we'll close this issue, and thank you for trying out translation-agent!
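
For anyone following along, here is a minimal sketch of the pattern being described: the full text is supplied as context on every call, and only one tagged chunk is translated per call. The tag names and prompt wording are illustrative assumptions, not the project's exact prompt:

```python
# A minimal sketch of the pattern described above, not the project's exact
# prompt: the full source text stays in the prompt for context, and only the
# chunk wrapped in the (illustrative) <TRANSLATE_THIS> tags gets translated.

def build_chunk_prompt(full_text: str, chunk: str,
                       source_lang: str, target_lang: str) -> str:
    # Tag the first occurrence of the chunk inside the full text.
    tagged = full_text.replace(
        chunk, f"<TRANSLATE_THIS>{chunk}</TRANSLATE_THIS>", 1
    )
    return (
        f"Translate the following text from {source_lang} to {target_lang}.\n"
        "The full text is provided for context, but translate ONLY the part "
        "delimited by <TRANSLATE_THIS> and </TRANSLATE_THIS>.\n\n"
        f"{tagged}"
    )

# Usage: one model call per ~1k-token chunk, each call seeing the whole text.
# for chunk in chunks:
#     prompt = build_chunk_prompt(full_text, chunk, "English", "Chinese")
```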

@siddhantx0

siddhantx0 commented Jun 28, 2024 via email

@weizjajj
Author

Thank you for your response. If I understand correctly, the reason for segmenting the text is to have the model translate only about 1,000 words at a time, because outputs of roughly 1,000 tokens or fewer result in more accurate translations than longer ones, correct? However, doesn't this approach still have a problem with input texts that exceed the model's max_context_length? In my testing, when I attempted to translate the first chapter of a novel from English to Chinese, I got an error indicating the input exceeded the model's maximum token limit.
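
For reference, a quick pre-flight check for this failure mode might look like the sketch below. tiktoken, the encoding, and the 8k limit are all assumptions; the right tokenizer and context size depend on your actual model:

```python
# A rough pre-flight check for the overflow described above. tiktoken and
# the 8k limit are assumptions; use your model's real tokenizer and limit.
import tiktoken

MAX_CONTEXT_TOKENS = 8192  # assumed; set to your model's actual context size

def exceeds_context(text: str, budget: int = MAX_CONTEXT_TOKENS) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text)) > budget
```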

@j-dominguez9
Collaborator

You're right, that is a case that is not accounted for if the text exceeds the model's max_context_length. However, to the extent that the entire text contributes to improving the translation of a section (although we didn't test it), I would surmise that the full text does not add much benefit once it is beyond ~8k tokens. So in that case, I don't expect the quality to suffer if you break the text up into sections that do fit in the context_length. That is, if your text is longer than max_context_length, splitting it into large sections should not hurt translation quality the way omitting the full text for a given section would.
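
In other words, the workaround would be a two-level split: first into large sections that each fit the context window, then the usual ~1k-token chunking within each section. A minimal sketch, assuming paragraph boundaries and tiktoken for counting (neither is part of the project's code):

```python
# A minimal sketch of the workaround described above: split a too-long text
# into large sections that each fit the context window, then run the usual
# ~1k-token chunked translation within each section. The paragraph-based
# splitting and tiktoken counting are assumptions, not the project's code.
import tiktoken

def split_into_sections(text: str, max_tokens: int = 8000) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    sections: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(enc.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            if current:
                sections.append(current)
            current = para  # assumes no single paragraph exceeds max_tokens
    if current:
        sections.append(current)
    return sections

# Each section is then translated independently, with the section itself
# serving as the "full text" context for its own ~1k-token chunks.
```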

@methanet
Collaborator

@weizjajj closing this for now; we welcome follow-up questions and PRs if you see an opportunity for improvement.
