some suggestion about render accelerate&small problem fix #141
Wow! Thank you for your continuing contributions and support. Also, about point 5: I did not know about this dump yet. Thanks!
It is an honor to contribute to such an excellent repository. Thank you for creating it and for being so accommodating. Regarding the arXiv dump, do you mean this Kaggle PDF dataset? I know the dataset, but I didn't know it could be decompiled back into LaTeX code.
Nice work! I've added some comments. Please check if you approve
```python
if type is not None:
    type2expression = {
        "en": r"[a-zA-Z]+",
        "zh": r"[\u4e00-\u9fa5]+",
        "num": r"\d+",
        "punctuation": u"[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b]",
    }
    expression = type2expression[type]
```
Add Windows support:
- use shell in get_installed_fonts
- resolve backslash and `convert` to `magick convert`
support single page output.
Thank you for your help over the past few days. I would like to improve the project with a few small suggestions.
1. Motivation
I remember what you told me in this reply.
I then read through the logic of render.py. If I have understood it correctly, the script tries to render a batch of equations at once. If even one equation in the batch fails, the indices of all equations in that batch are added to the faulty array, and the script continues with the next batch.
Let me call one pass over all batches an epoch. If the number of remaining equations after the current epoch differs from that after the last epoch, all the data is shuffled and the next epoch begins, with the batch size reduced if certain conditions are met.
So I think that if we add only the failed equations to the faulty array and keep the successfully rendered equations in that batch, we can save time. (In fact, it saved me a lot of time, because the number of epochs is also reduced.)
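The idea above can be sketched roughly as follows. This is only an illustration, not the code in render.py: `split_batch`, `batch_indices` and `failed_positions` are hypothetical names, and the sketch assumes the positions of the failed equations inside the batch are already known.

```python
def split_batch(batch_indices, failed_positions):
    """Split one batch into rendered and faulty dataset indices.

    batch_indices: dataset indices of the equations in this batch, in order.
    failed_positions: positions inside the batch (0-based) that failed.
    Both names are hypothetical and only serve this illustration.
    """
    failed = set(failed_positions)
    rendered = [idx for pos, idx in enumerate(batch_indices) if pos not in failed]
    faulty = [idx for pos, idx in enumerate(batch_indices) if pos in failed]
    return rendered, faulty
```

With this split, only the `faulty` indices would need to be retried in a later epoch, instead of the whole batch.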
2. Reference data
I found an approach in "How can I ignore latex error while compiling?": compile without `-halt-on-error`. I wrote some regular expressions to extract the equations that failed to render and to validate the number of rendered pages against the number of equations.
3. Example
Here is a tex file with 100 equations.
3.1 before modification
Only 10 pages were rendered. Not only is the page count lower than the equation count, but this also causes problems with index matching, the subsequent Image.open operation, and so on.
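A mismatch like this can be detected from the compiler log itself. The regular patterns mentioned in section 2 could look roughly like the sketch below. It is not the code from this PR: the function names are hypothetical, and it assumes that errors are logged in the `./file.tex:LINE` form and that the log contains an `Output written on ... (N pages` line.

```python
import re

# Error lines in file:line form, e.g. "./eq.tex:59: Undefined control sequence."
ERROR_LINE = re.compile(r"^(?:\./)?[^:\n]+\.tex:(\d+)", re.MULTILINE)
# Summary line, e.g. "Output written on eq.pdf (100 pages)."
PAGE_COUNT = re.compile(r"Output written on \S+\.pdf \((\d+) pages?")


def parse_latex_log(log_text):
    """Return (sorted line numbers that failed, number of pages written)."""
    failed_lines = sorted({int(n) for n in ERROR_LINE.findall(log_text)})
    match = PAGE_COUNT.search(log_text)
    pages = int(match.group(1)) if match else 0
    return failed_lines, pages


def pages_match(log_text, n_equations):
    """True if the number of rendered pages equals the number of equations."""
    _, pages = parse_latex_log(log_text)
    return pages == n_equations
```

How a failed line number maps back to an equation index depends on the tex template, so that step is left out here.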
3.2 after modification
Above, `./eq_58jtai6.tex:59` indicates that line 59 failed to compile, and in `Output written on eq_58jtai6.pdf (100 pages)` the page count of 100 can be used for validation.
4. some other modifications
4.1 font problem
I am not sure whether it is an environment problem or a mistake, but none of the fonts you list here can be rendered except `Latin Modern Math` and `GFSNeohellenicMath.otf`. Some examples:
4.1.1
`Asana Math`
but I do have `Asana-Math.otf`. I think they are the same thing, and `Asana-Math.otf` works.
(latex_ocr) yhtao@LG-gram:/mnt/d/tmp/texfile/normal$ find /usr/local/texlive/ -name Asana*.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/asana-math/Asana-Math.otf
4.1.2
XITS Math
and
(latex_ocr) yhtao@LG-gram:/mnt/d/tmp/texfile/normal$ find /usr/local/texlive/ -name XITS*.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-Bold.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITSMath-Regular.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-Regular.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-BoldItalic.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-Italic.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITSMath-Bold.otf
So I wrote a function that automatically finds the installed fonts and adds them to the font list if the user has not specified any.
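A minimal sketch of such a lookup is shown below. It is not the function from this PR: `find_math_fonts` is a hypothetical name, and the sketch assumes that math fonts are `.otf` files with "math" in the file name, as in the `find` output above.

```python
from pathlib import Path


def find_math_fonts(root):
    """Return sorted names of .otf files under root whose name contains 'math'.

    Assumption: a font is treated as a math font if 'math' appears in its
    file name (case-insensitive). Returns an empty list if root is missing.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.rglob("*.otf") if "math" in p.name.lower())
```

For example, `find_math_fonts("/usr/local/texlive/")` would search the same tree as the `find` commands above. The result could then be used as the default font list when the user passes none.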
4.2 all-white images
This causes an error that the code currently just passes over, which may be fine. But a small check up front would save time on this processing.
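The check could be as small as the sketch below. `is_blank` is a hypothetical helper, and it assumes the image is available as 8-bit grayscale pixel data, for example from Pillow via `Image.open(path).convert("L").tobytes()`.

```python
def is_blank(gray_pixels, white=255):
    """True if every 8-bit grayscale pixel equals `white`.

    gray_pixels: bytes-like object or iterable of ints in 0..255.
    An empty image is also reported as blank.
    """
    data = bytes(gray_pixels)
    return data.count(white) == len(data)
```

Skipping images for which `is_blank` returns True avoids running the later processing steps on pages that contain nothing.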
4.3 index may not match
Here is a step that removes blank lines from the formulas.
But if the formulas contain a blank line, the number of formulas no longer equals the batch size. Furthermore, the output image filenames no longer match the line indices of the equations.
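One way to avoid the mismatch is to carry the original line index along when blank lines are dropped. A minimal sketch, with the hypothetical name `nonblank_with_index`:

```python
def nonblank_with_index(lines):
    """Drop blank lines but keep each formula's original line index.

    Returns a list of (original_index, stripped_formula) pairs.
    """
    return [(i, line.strip()) for i, line in enumerate(lines) if line.strip()]
```

The output image can then be named after `original_index` rather than after the position in the filtered batch.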
5. data suggestion
If you already know about the Wikipedia dump file, just ignore this. If you don't, I think it may help you: download `enwiki-latest-pages-articles.xml.bz2` and unzip it. With some simple regular expressions and a few modifications, you will get 900 thousand equations.
Thanks again for taking the time to read this PR. I have tested these modifications on a set of 380 thousand equations and they work fine.