A question regarding the translation of long texts #28

In your code, to ensure that long texts do not exceed the model's maximum token limit, you have segmented the text. However, during the actual translation process, all the segmented pieces are included in the context.

Comments
Hi @weizjajj, yes that is accurate. We found in our testing that including all the text allows for better reflections/translations, since the model has context of the entire text, which would be lost if it were processed piecemeal. Even when the text is under the 4k-token output limit allowed by current LLM vendors, the reflections, and thus the translations, were better when processing at a smaller scale (~1k tokens). If you have any further questions, let me know. If not, we'll close this issue. Thank you for trying out the translation-agent!
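To make that concrete, here is a minimal Python sketch of the idea, not the repository's actual implementation; `call_llm` and the prompt wording are hypothetical placeholders:

```python
# Minimal sketch, NOT the repo's actual code: translate a long text in
# ~1k-token chunks while giving the model the full text as context.
def call_llm(prompt: str) -> str:
    """Hypothetical helper wrapping whatever LLM client you use."""
    raise NotImplementedError  # plug in your LLM vendor's API here

def translate_with_full_context(full_text: str, chunks: list[str],
                                source_lang: str, target_lang: str) -> str:
    """Translate each ~1k-token chunk while the model sees the whole text."""
    translations = []
    for chunk in chunks:
        prompt = (
            f"Translate the following from {source_lang} to {target_lang}.\n\n"
            f"Full source text (context only, do not translate all of it):\n"
            f"{full_text}\n\n"
            f"Translate only this part:\n{chunk}"
        )
        translations.append(call_llm(prompt))
    return "".join(translations)
```

The point of the design is that every per-chunk prompt repeats the full source text, so the model can resolve references and keep terminology consistent across chunk boundaries.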
Wow
Thank you for your response. If I understand correctly, the reason for segmenting the text is to have the model translate only about 1,000 tokens at a time, since outputs of roughly 1,000 tokens or fewer result in more accurate translations than longer ones, correct? However, doesn't this pose a problem for input texts that exceed the model's max_context_length? In my testing, when attempting to translate the first chapter of a novel from English to Chinese, I encountered an error indicating that the input exceeded the model's maximum token limit.
You're right, input that exceeds the model's max_context_length is a case that is not accounted for. However, to the extent that the entire text contributes to improving the translation of a section (although we didn't test this), I would surmise that the full text provides little additional benefit once it exceeds ~8k tokens. So in that case, I don't expect quality would suffer if you break the text into sections that each fit in the context length. That is, if your text is longer than max_context_length, splitting it should not degrade translation quality the way omitting the full text from the translation of a section would.
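As a rough illustration of that fallback, a pre-splitting step along these lines could be used. This is a sketch under the assumptions above, not part of the translation-agent codebase, and it uses tiktoken only as one possible way to count tokens:

```python
# Sketch (an assumption, not the repo's code): split a very long text into
# sections that each fit within the model's context window, preferring to
# break on paragraph boundaries.
import tiktoken

def split_to_fit(text: str, max_tokens: int, model: str = "gpt-4") -> list[str]:
    enc = tiktoken.encoding_for_model(model)
    sections, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(enc.encode(candidate)) <= max_tokens:
            current = candidate  # paragraph still fits; keep accumulating
        else:
            if current:
                sections.append(current)
            current = para  # assumes any single paragraph fits on its own
    if current:
        sections.append(current)
    return sections
```

Each returned section could then be run through the chunked-translation flow sketched earlier, with that section (rather than the whole book) serving as the "full text" context.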
@weizjajj closing this for now; we welcome follow-up questions and PRs in case you see an opportunity for improvements.