A question regarding the translation of long texts. #28

Closed
weizjajj opened this issue Jun 26, 2024 · 5 comments
Labels
question Further information is requested

Comments

@weizjajj

In your code, long texts are segmented so that they do not exceed the model's maximum token limit. However, during the actual translation step, all of the segmented pieces are still included in the context.
[image: problem]

@j-dominguez9 j-dominguez9 added the question Further information is requested label Jun 28, 2024
@j-dominguez9
Collaborator

Hi @weizjajj, yes, that is accurate. We found in our testing that including the full text allows for better reflections and translations, since the model has context of the entire text, which would be lost if it were processed piecemeal. Even when a text is under the ~4k-token output limit allowed by current LLM vendors, the reflections, and thus the translations, were better when processing at a smaller scale (~1k tokens). If you have any further questions, let me know. If not, we'll close this issue, and thank you for trying out translation-agent!
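
For anyone following along, here is a minimal sketch of the pattern being described: the full text is supplied as context on every call, and only one tagged chunk is translated per call. The tag names and prompt wording are illustrative assumptions, not the project's exact prompt:

```python
# A minimal sketch of the pattern described above, not the project's exact
# prompt: the full source text stays in the prompt for context, and only the
# chunk wrapped in the (illustrative) <TRANSLATE_THIS> tags gets translated.

def build_chunk_prompt(full_text: str, chunk: str,
                       source_lang: str, target_lang: str) -> str:
    # Tag the first occurrence of the chunk inside the full text.
    tagged = full_text.replace(
        chunk, f"<TRANSLATE_THIS>{chunk}</TRANSLATE_THIS>", 1
    )
    return (
        f"Translate the following text from {source_lang} to {target_lang}.\n"
        "The full text is provided for context, but translate ONLY the part "
        "delimited by <TRANSLATE_THIS> and </TRANSLATE_THIS>.\n\n"
        f"{tagged}"
    )

# Usage: one model call per ~1k-token chunk, each call seeing the whole text.
# for chunk in chunks:
#     prompt = build_chunk_prompt(full_text, chunk, "English", "Chinese")
```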

@siddhantx0

siddhantx0 commented Jun 28, 2024 via email

@weizjajj
Author

Thank you for your response. If I understand correctly, the reason for segmenting the text is to have the model translate only about 1,000 words at a time, because outputs of roughly 1,000 tokens or fewer result in more accurate translations than longer ones, correct? However, doesn't this approach still have a problem with input texts that exceed the model's max_context_length? In my testing, when I attempted to translate the first chapter of a novel from English to Chinese, I got an error indicating the input exceeded the model's maximum token limit.
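
For reference, a quick pre-flight check for this failure mode might look like the sketch below. tiktoken, the encoding, and the 8k limit are all assumptions; the right tokenizer and context size depend on your actual model:

```python
# A rough pre-flight check for the overflow described above. tiktoken and
# the 8k limit are assumptions; use your model's real tokenizer and limit.
import tiktoken

MAX_CONTEXT_TOKENS = 8192  # assumed; set to your model's actual context size

def exceeds_context(text: str, budget: int = MAX_CONTEXT_TOKENS) -> bool:
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text)) > budget
```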

@j-dominguez9
Collaborator

You're right, that is a case that is not accounted for if the text exceeds the model's max_context_length. However, to the extent that the entire text contributes to improving the translation of a section (although we didn't test it), I would surmise that the full text does not add much benefit once it is beyond ~8k tokens. So in that case, I don't expect the quality to suffer if you break the text up into sections that do fit in the context_length. That is, if your text is longer than max_context_length, splitting it into large sections should not hurt translation quality the way omitting the full text for a given section would.
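
In other words, the workaround would be a two-level split: first into large sections that each fit the context window, then the usual ~1k-token chunking within each section. A minimal sketch, assuming paragraph boundaries and tiktoken for counting (neither is part of the project's code):

```python
# A minimal sketch of the workaround described above: split a too-long text
# into large sections that each fit the context window, then run the usual
# ~1k-token chunked translation within each section. The paragraph-based
# splitting and tiktoken counting are assumptions, not the project's code.
import tiktoken

def split_into_sections(text: str, max_tokens: int = 8000) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    sections: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(enc.encode(candidate)) <= max_tokens:
            current = candidate
        else:
            if current:
                sections.append(current)
            current = para  # assumes no single paragraph exceeds max_tokens
    if current:
        sections.append(current)
    return sections

# Each section is then translated independently, with the section itself
# serving as the "full text" context for its own ~1k-token chunks.
```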

@methanet
Collaborator

@weizjajj closing this for now; we welcome follow-up questions and PRs if you see an opportunity for improvement.
