some suggestion about render accelerate&small problem fix #141

Merged
merged 10 commits into from
Apr 28, 2022

TITC
Collaborator

@TITC TITC commented Apr 27, 2022

Thank you for your help over the past few days. I'd like to make the project better with a few small suggestions.

1. Motivation

I remember you told me in this reply:

If there is only one equation in each batch that is faulty for one reason or another no equations can be rendered.

Then I read the logic of render.py. If I haven't misunderstood, the script tries to render a whole batch of equations at once. If even one equation in that batch fails, the indices of all equations in the batch are added to the faulty array and the script continues with the next batch.

Let me call one pass through all batches an epoch. After each epoch, if the number of remaining equations differs from the previous epoch, all data is shuffled and the next epoch starts, with the batch size shrunk if certain conditions are met.

So I think that if we add only the failed equations to the faulty array and keep the successfully rendered equations from that batch, we can save time. (In fact it saved me a lot of time, because the number of epochs is also reduced.)
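The bookkeeping this implies could look roughly like the sketch below. It is only an illustration, not the code in render.py: `split_batch` and its arguments are hypothetical names, and it assumes the renderer can report which positions inside the batch failed.

```python
from typing import Iterable, List, Tuple


def split_batch(batch_indices: List[int],
                failed_positions: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Split one batch into (rendered, faulty) dataset indices.

    batch_indices: dataset index of each equation in the batch, in order.
    failed_positions: 0-based positions inside the batch that failed to render
    (assumed to be known from the compiler output).
    """
    failed = set(failed_positions)
    rendered = [idx for pos, idx in enumerate(batch_indices) if pos not in failed]
    faulty = [idx for pos, idx in enumerate(batch_indices) if pos in failed]
    return rendered, faulty


# Only the two failing equations go to the faulty list; the other three are kept.
rendered, faulty = split_batch([40, 41, 42, 43, 44], failed_positions=[1, 3])
print(rendered, faulty)  # [40, 42, 44] [41, 43]
```

With this split, a single bad equation no longer sends the whole batch back for another epoch.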

2. Reference data

I found an approach in "How can I ignore latex error while compiling?" that works without -halt-on-error. I wrote some regular expressions to extract the equations that failed to render, and I validate the number of rendered pages against the number of equations.

3. Example

Here is a tex file with 100 equations.

3.1 before modification

(latex_ocr) yhtao@LG-gram:/mnt/d/tmp/texfile/normal$ xelatex  -halt-on-error  eq_58jtai6.tex 
This is XeTeX, Version 3.141592653-2.6-0.999994 (TeX Live 2022) (preloaded format=xelatex)
 restricted \write18 enabled.
entering extended mode
(./eq_58jtai6.tex
LaTeX2e <2021-11-15> patch level 1
L3 programming layer <2022-04-10>
(/usr/local/texlive/2022/texmf-dist/tex/latex/standalone/standalone.cls
Document Class: standalone 2018/03/26 v1.3a Class to compile TeX sub-files stan
dalone
(/usr/local/texlive/2022/texmf-dist/tex/latex/tools/shellesc.sty)
.......
(./eq_58jtai6.aux)
Preview: Fontsize 10pt
Preview: PDFoutput 1
Preview: Tightpage -32891 -32891 32891 32891
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
! Undefined control sequence.
l.18 $$ \mathbb { F } = \Complex
                                 $$
Output written on eq_58jtai6.pdf (10 pages).

Only 10 pages were rendered. Not only is the page count lower than the equation count, this also causes problems with index matching, the subsequent Image.open operation, and so on.

3.2 after modification

(/usr/local/texlive/2022/texmf-dist/tex/generic/luatex85/luatex85.sty)
...
(/usr/local/texlive/2022/texmf-dist/tex/latex/preview/prtightpage.def))
(./eq_58jtai6.aux)
Preview: Fontsize 10pt
Preview: PDFoutput 1
Preview: Tightpage -32891 -32891 32891 32891
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
./eq_58jtai6.tex:18: Undefined control sequence.
l.18 $$ \mathbb { F } = \Complex
                                 $$
[11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25]
[26] [27]
./eq_58jtai6.tex:35: Undefined control sequence.
l.35 $$ H _ { i } ( X , \Z
                           ) = \{ 0 \} $$
...
./eq_58jtai6.tex:52: Undefined control sequence.
l.52 $$ a ( x ) = \sum _ { n = 0 } ^ { \infin
                                              } x ^ { a _ { n } } $$
[45] [46] [47] [48] [49] [50] [51]
./eq_58jtai6.tex:59: Undefined control sequence.
l.59 ... \psi _ { p } \colon \mathcal { X } \to \R
                                                   \mid 1 \le p \le M \rbrac...

[52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] [65] [66]
[67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81]
[82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96]
[97] [98] [99] [100] (./eq_58jtai6.aux) )
(see the transcript file for additional information)
Output written on eq_58jtai6.pdf (100 pages).
Transcript written on eq_58jtai6.log.

In the output above, ./eq_58jtai6.tex:59 means that line 59 failed to compile, and the 100 in "Output written on eq_58jtai6.pdf (100 pages)" can be used for validation.
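A minimal sketch of how such output might be parsed is shown here. The two patterns are assumptions based only on the output quoted above (the `file:line:` message style is what TeX engines typically print with the -file-line-error option); the regular expressions actually used in this PR may differ, and `parse_latex_output` is a hypothetical name.

```python
import re
from typing import List, Tuple

# "./eq_58jtai6.tex:18: Undefined control sequence." -> captures 18
ERROR_LINE = re.compile(r"^\./[^:\n]+\.tex:(\d+): ", re.MULTILINE)
# "Output written on eq_58jtai6.pdf (100 pages)." -> captures 100
PAGE_COUNT = re.compile(r"Output written on \S+ \((\d+) pages?")


def parse_latex_output(output: str) -> Tuple[List[int], int]:
    """Return (source line numbers that failed, number of pages written)."""
    failed_lines = sorted({int(n) for n in ERROR_LINE.findall(output)})
    match = PAGE_COUNT.search(output)
    pages = int(match.group(1)) if match else 0
    return failed_lines, pages


sample = r"""[1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
./eq_58jtai6.tex:18: Undefined control sequence.
l.18 $$ \mathbb { F } = \Complex
[11] [12]
./eq_58jtai6.tex:35: Undefined control sequence.
Output written on eq_58jtai6.pdf (100 pages).
"""
failed_lines, pages = parse_latex_output(sample)
print(failed_lines, pages)  # [18, 35] 100
```

If the page count does not equal the number of equations in the file, the batch can be treated as suspect instead of trusting the page-to-equation mapping.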

4. some other modifications

4.1 font problem

I am not sure whether it's an environment problem or a mistake, but the fonts you list here can't be rendered, except 'Latin Modern Math' and 'GFSNeohellenicMath.otf'.
Illustrative example:

\documentclass[varwidth]{standalone}
\usepackage{fontspec,unicode-math}
\usepackage[active,tightpage,displaymath,textmath]{preview}
\setmathfont{here is the font}
\begin{document}
\thispagestyle{empty}
$$ 0 . 3 < = k _ { \mathit { r w , S o r } } < = 0 . 5 $$
\end{document}

4.1.1 Asana Math

! Package fontspec Error: The font "Asana Math" cannot be found.

For immediate help type H <return>.
 ...                                              
                                                  
l.6 \begin
          {document}
No pages of output.
Transcript written on eq4a4vl_in.log.

But I do have Asana-Math.otf. I think they are the same thing, and Asana-Math.otf works.

(latex_ocr) yhtao@LG-gram:/mnt/d/tmp/texfile/normal$ find /usr/local/texlive/ -name Asana*.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/asana-math/Asana-Math.otf

4.1.2 XITS Math

! Package fontspec Error: The font "XITS Math" cannot be found.

and

(latex_ocr) yhtao@LG-gram:/mnt/d/tmp/texfile/normal$ find /usr/local/texlive/ -name XITS*.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-Bold.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITSMath-Regular.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-Regular.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-BoldItalic.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITS-Italic.otf
/usr/local/texlive/2022/texmf-dist/fonts/opentype/public/xits/XITSMath-Bold.otf

So I wrote a function that automatically finds fonts and adds them to the font list if the user has not specified any.
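One way such a lookup could work is sketched below. This is not the function from the PR: `find_math_fonts` is a hypothetical name, and selecting files whose name contains "math" is an assumed heuristic based on the `find` results above.

```python
from pathlib import Path
from typing import List


def find_math_fonts(root: str) -> List[str]:
    """Return sorted file names of .otf fonts under root whose name contains 'math'."""
    names = {p.name for p in Path(root).rglob("*.otf") if "math" in p.name.lower()}
    return sorted(names)
```

Called as `find_math_fonts("/usr/local/texlive/")` on the installation shown above, this would pick up Asana-Math.otf, XITSMath-Bold.otf and XITSMath-Regular.otf among others, and each name could then be passed to \setmathfont as a file name.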

4.2 whole white pixel images

An image that consists entirely of white pixels causes an error in the code below. The error is currently just passed over, which may be fine, but a small check up front would save time on these cases.

                        coords = cv2.findNonZero(gray)  # Find all non-zero points (text)
                        a, b, w, h = cv2.boundingRect(coords)  # Find minimum spanning bounding box
                        rect = data[b:b+h, a:a+w]
                        im = Image.fromarray((255-rect[..., -1]).astype(np.uint8)).convert('L')
                        dims = []
                        for x in [w, h]:
                            div, mod = divmod(x, args.divable)
                            dims.append(args.divable*(div + (1 if mod > 0 else 0)))
                        padded = Image.new('L', dims, 255)
                        padded.paste(im, im.getbbox())
                        padded.save(outpath)
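A minimal sketch of such a check is shown here. It assumes, as the comment in the excerpt suggests, that `gray` is zero for background and non-zero for text; `is_blank` is a hypothetical helper name.

```python
import numpy as np


def is_blank(gray: np.ndarray) -> bool:
    """True if the array has no non-zero pixel, i.e. there is nothing to crop."""
    return not np.any(gray)


blank_page = np.zeros((4, 4), dtype=np.uint8)
page_with_text = blank_page.copy()
page_with_text[1, 2] = 255  # a single "ink" pixel
print(is_blank(blank_page), is_blank(page_with_text))  # True False
```

In the loop, a blank page could then be skipped (for example with `continue`) before the bounding-box code runs.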

4.3 index may not match

Here is the step that removes blank lines from the formulas.

math = [math_mode+x+math_mode for x in batch if x != '']

But if the batch contains a blank line, the length of math no longer equals the batch size, and the output image filenames no longer match the line indices of the equations.

            for j, k in enumerate(range(i, i+len(pngs))):
                outpath = os.path.join(args.out, '%07d.png' % names[order[k]])
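One possible way to avoid the mismatch is to filter the indices together with the formulas, as in this sketch. `wrap_nonblank` and its parameters are hypothetical names, not the code from the PR.

```python
from typing import List, Tuple


def wrap_nonblank(batch: List[str], indices: List[int],
                  math_mode: str = "$$") -> Tuple[List[str], List[int]]:
    """Drop blank formulas while keeping each survivor's original line index."""
    kept = [(idx, math_mode + x + math_mode)
            for idx, x in zip(indices, batch) if x != ""]
    math = [m for _, m in kept]
    kept_indices = [idx for idx, _ in kept]
    return math, kept_indices


math, kept_indices = wrap_nonblank(["a", "", "b"], [7, 8, 9])
print(math, kept_indices)  # ['$$a$$', '$$b$$'] [7, 9]
```

The second return value can then name the output files, so page j of the PDF is saved under the index of the equation it actually came from.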

5. data suggestion

If you already know about the Wikipedia dump file, just pretend I didn't say anything. If you don't, I think it may help you: download enwiki-latest-pages-articles.xml.bz2 and unzip it. With some simple regular expressions and a few modifications you will get about 900 thousand equations.
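As a rough illustration of the kind of regular expression involved, the sketch below pulls formulas out of a chunk of dump text. It assumes formulas appear as `<math>...</math>` wikitext tags whose angle brackets are XML-escaped inside the dump; `extract_equations` is a hypothetical name and the pattern is not the one I actually used.

```python
import html
import re
from typing import List

# Matches <math>...</math> and <math display="block">...</math> after unescaping.
MATH_TAG = re.compile(r"<math(?:\s[^>]*)?>(.*?)</math>", re.DOTALL)


def extract_equations(text: str) -> List[str]:
    """Return the formulas found in one chunk of dump text, whitespace-normalized."""
    unescaped = html.unescape(text)  # &lt;math&gt; -> <math>
    return [" ".join(eq.split()) for eq in MATH_TAG.findall(unescaped) if eq.strip()]


chunk = ("Let &lt;math&gt;E = mc^2&lt;/math&gt; and "
         "&lt;math display=\"block\"&gt;a^2 + b^2 = c^2&lt;/math&gt;.")
print(extract_equations(chunk))  # ['E = mc^2', 'a^2 + b^2 = c^2']
```

The extracted formulas would still need filtering, since Wikipedia-specific macros such as \Complex or \R are exactly the ones that fail in the log in section 3.2.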


Thanks again for taking the time to read this PR. I have tested these modifications on 380 thousand equations and they work fine.

@lukas-blecher
Owner

Wow! Thank you for your continuing contributions and support.
I'll have a closer look at this tomorrow. That's definitely a game changer.

Also 5. I did not know about this dump yet. Thanks!
With the recent updates to the demacro functionality (not done with that yet) I was able to get 1.3M raw equations (quite a few were still faulty) from an arxiv dump. I can't find the source right now. It was all of hep-th 2000 to 2002
Combined that would be a mighty dataset.

@TITC
Collaborator Author

TITC commented Apr 28, 2022

1.3M raw equations (quite a few were still faulty) from an arxiv dump

It is an honor for me to contribute to such an excellent repository. Thank you for creating it and for your accommodating attitude.

As for the arxiv dump, do you mean this Kaggle PDF dataset? I know this dataset, but I didn't know it could be decompiled into LaTeX code.

@lukas-blecher
Owner

Found it: https://www.cs.cornell.edu/projects/kddcup/datasets.html

Owner

@lukas-blecher lukas-blecher left a comment

Nice work! I've added some comments. Please check if you approve

Review comments on:
pix2tex/dataset/latex2png.py
pix2tex/dataset/render.py
TITC and others added 8 commits April 28, 2022 20:56
```python
    if type is not None:
        type2expression = {"en": r"[a-zA-Z]+", "zh": r"[\u4e00-\u9fa5]+", "num": r"\d+",
                           "punctuation": u"[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b]"}
        expression = type2expression[type]
```
Add windows support:
- use shell in get_installed_fonts
- resolve backslash and `convert` to `magick convert`
@lukas-blecher lukas-blecher merged commit 1c2ef16 into lukas-blecher:main Apr 28, 2022
@TITC TITC deleted the render branch May 6, 2022 00:41