Memory consumption increases continuously when generating multiple PDFs in a Loop. #2130

masahh · 2024-04-22T05:44:58Z

First of all, thank you very much for developing and maintaining this incredibly useful library!

We are encountering the following issues regarding memory consumption when converting HTML to PDF:

Memory usage steadily increases and eventually causes a memory error when repeatedly generating PDFs in a loop.
Memory consumption rises significantly when the HTML includes multibyte characters (e.g., Japanese).

We have created a minimal setup to reproduce this problem:
https://github.com/yamap55/weasyprint_memory_check
(using Python==3.9.7 and 3.12.3, WeasyPrint==61.2, memory_profiler==0.61.0)

In the above repository, when running the container and executing python main.py:

Memory consumption keeps growing with each PDF generation, as observed with memory_profiler.

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    6     56.4 MiB     56.4 MiB           1   @profile()
    7                                         def create_pdf(file_path: str):
    8    117.9 MiB     61.5 MiB           1       generate(file_path)
    9    182.9 MiB     65.0 MiB           1       generate(file_path)
   10    200.0 MiB     17.1 MiB           1       generate(file_path)
   11    236.8 MiB     36.8 MiB           1       generate(file_path)
   12    272.8 MiB     36.0 MiB           1       generate(file_path)
   13    272.8 MiB      0.0 MiB           1       return True

Is this behavior expected? Or are there any methods to reduce memory usage in this scenario?

Memory consumption increases by roughly tens of MiB per PDF generation when the HTML includes multibyte characters. Are there any strategies to mitigate this memory usage?


liZe · 2024-04-22T06:19:12Z

Hi!

Thanks for your report.

WeasyPrint can take a lot of memory; that’s a known behavior, and we’re open to solutions to improve this. But a memory leak is a different problem.

> Memory consumption increases by roughly tens of MiB per PDF generation when the HTML includes multibyte characters.

Many bug reports like this have already been opened, and we have to be sure that it’s a real memory leak. A few (~20) generations are not enough to detect this, because Python’s interpreter can do what it wants with memory. There’s an interesting issue about this, showing that what may appear to be a memory leak is not necessarily one: #1977

So, you can try with 200+ generations and see if you find a "real" memory leak.

That being said, your problem seems to be related to fonts, just as in #1977. Even if it’s not a memory leak, maybe there’s something we can do about this.
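As a rough illustration of the "200+ generations" check suggested above, here is a minimal sketch that runs a task many times and looks at whether memory is still climbing at the end. The names `measure_growth`, `looks_like_leak`, `leaky_task` and `bounded_task` are hypothetical and not part of WeasyPrint; the two demo tasks are stand-ins for a real PDF generation. Note that `tracemalloc` only sees allocations made through Python's allocator, so it is an assumption that this is representative; the process-level figures from memory_profiler used in this thread remain the reference.

```python
import gc
import tracemalloc
from collections import deque


def measure_growth(task, iterations=200):
    """Run task repeatedly and record traced Python memory after each run."""
    tracemalloc.start()
    samples = []
    try:
        for _ in range(iterations):
            task()
            gc.collect()  # drop unreachable cycles before sampling
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


def looks_like_leak(samples, tail=50, tolerance=0.05):
    """Heuristic: memory still grew by more than `tolerance` over the last `tail` runs."""
    window = samples[-tail:]
    return window[-1] > window[0] * (1 + tolerance)


# Hypothetical stand-ins for "generate one PDF".
_retained = []            # grows forever: behaves like a real leak
_bounded = deque(maxlen=20)  # grows, then plateaus: behaves like a bounded cache


def leaky_task():
    _retained.append(bytearray(10_000))


def bounded_task():
    _bounded.append(bytearray(10_000))


if __name__ == "__main__":
    print("leaky:", looks_like_leak(measure_growth(leaky_task)))
    print("bounded:", looks_like_leak(measure_growth(bounded_task)))
```

To try this against a real workload, one would replace the stand-in task with a function that calls `weasyprint.HTML(string=...).write_pdf()`. The tail length and tolerance are arbitrary choices for the sketch and would need tuning.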

masahh · 2024-04-22T11:11:48Z

Thank you for your reply!

> So, you can try with 200+ generations and see if you find a "real" memory leak.

I tried generating 200 PDFs and, as in #1977, memory usage stabilizes from around the 80th iteration.
(At that point, the usage was approximately 2.8 GiB.)

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     42.9 MiB     42.9 MiB           1   @profile()
     8                                         def create_pdf(file_path: str):
     9    116.0 MiB     73.1 MiB           1       generate(file_path)
    10    155.2 MiB     39.1 MiB           1       generate(file_path)
    11    195.4 MiB     40.2 MiB           1       generate(file_path)
    12    251.5 MiB     56.2 MiB           1       generate(file_path)
   [...]
    21    583.0 MiB     56.1 MiB           1       generate(file_path)
    22    639.2 MiB     56.2 MiB           1       generate(file_path)
    23    636.2 MiB     -3.0 MiB           1       generate(file_path)
    24    691.3 MiB     55.0 MiB           1       generate(file_path)
    25    749.5 MiB     58.2 MiB           1       generate(file_path)
    26    743.8 MiB     -5.6 MiB           1       generate(file_path)
    27    799.1 MiB     55.3 MiB           1       generate(file_path)
    28    857.0 MiB     57.8 MiB           1       generate(file_path)
    29    910.1 MiB     53.1 MiB           1       generate(file_path)
    30    889.1 MiB    -21.0 MiB           1       generate(file_path)
    31    944.3 MiB     55.2 MiB           1       generate(file_path)
    32    999.9 MiB     55.6 MiB           1       generate(file_path)
    33   1053.9 MiB     54.0 MiB           1       generate(file_path)
    34   1038.1 MiB    -15.9 MiB           1       generate(file_path)
    35   1093.4 MiB     55.3 MiB           1       generate(file_path)
    36   1146.3 MiB     53.0 MiB           1       generate(file_path)
   [...]
    72   2574.5 MiB     56.1 MiB           1       generate(file_path)
    73   2421.6 MiB   -152.8 MiB           1       generate(file_path)
    74   2491.7 MiB     70.1 MiB           1       generate(file_path)
    75   2547.2 MiB     55.6 MiB           1       generate(file_path)
    76   2603.2 MiB     55.9 MiB           1       generate(file_path)
    77   2658.5 MiB     55.3 MiB           1       generate(file_path)
    78   2714.4 MiB     55.9 MiB           1       generate(file_path)
    79   2770.6 MiB     56.3 MiB           1       generate(file_path)
    80   2825.3 MiB     54.6 MiB           1       generate(file_path)
    81   2880.6 MiB     55.3 MiB           1       generate(file_path)
    82   2936.4 MiB     55.8 MiB           1       generate(file_path)
    83   2456.2 MiB   -480.2 MiB           1       generate(file_path)
    84   2508.1 MiB     51.9 MiB           1       generate(file_path)
    85   2561.7 MiB     53.6 MiB           1       generate(file_path)
    86   2615.5 MiB     53.8 MiB           1       generate(file_path)
    87   2669.1 MiB     53.6 MiB           1       generate(file_path)
    88   2722.8 MiB     53.8 MiB           1       generate(file_path)
    89   2776.6 MiB     53.8 MiB           1       generate(file_path)
    90   2830.7 MiB     54.1 MiB           1       generate(file_path)
    91   2885.8 MiB     55.1 MiB           1       generate(file_path)
    92   2940.7 MiB     54.9 MiB           1       generate(file_path)
    93   2996.3 MiB     55.6 MiB           1       generate(file_path)
    94   2460.7 MiB   -535.6 MiB           1       generate(file_path)
    95   2511.5 MiB     50.7 MiB           1       generate(file_path)
    96   2565.1 MiB     53.6 MiB           1       generate(file_path)
   [...]
   188   2866.4 MiB     19.5 MiB           1       generate(file_path)
   189   2885.8 MiB     19.4 MiB           1       generate(file_path)
   190   2905.2 MiB     19.4 MiB           1       generate(file_path)
   191   2924.7 MiB     19.5 MiB           1       generate(file_path)
   192   2944.0 MiB     19.4 MiB           1       generate(file_path)
   193   2963.5 MiB     19.5 MiB           1       generate(file_path)
   194   2982.9 MiB     19.4 MiB           1       generate(file_path)
   195   2808.2 MiB   -174.7 MiB           1       generate(file_path)
   196   2827.6 MiB     19.4 MiB           1       generate(file_path)
   197   2846.9 MiB     19.4 MiB           1       generate(file_path)
   198   2866.4 MiB     19.5 MiB           1       generate(file_path)
   199   2885.8 MiB     19.4 MiB           1       generate(file_path)
   200   2905.2 MiB     19.4 MiB           1       generate(file_path)
   201   2924.7 MiB     19.5 MiB           1       generate(file_path)
   202   2944.1 MiB     19.4 MiB           1       generate(file_path)
   203   2963.6 MiB     19.5 MiB           1       generate(file_path)
   204   2983.1 MiB     19.5 MiB           1       generate(file_path)
   205   2808.3 MiB   -174.8 MiB           1       generate(file_path)
   206   2827.6 MiB     19.4 MiB           1       generate(file_path)
   207   2847.0 MiB     19.4 MiB           1       generate(file_path)
   208   2866.5 MiB     19.5 MiB           1       generate(file_path)
   209   2866.5 MiB      0.0 MiB           1       return True

Actually, we encountered memory errors in a container with a memory limit of 1-2GB. Therefore, it might be necessary to increase the memory limit for the container.

> That being said, your problem seems to be related to fonts, just as in #1977. Even if it’s not a memory leak, maybe there’s something we can do about this.

Regarding the font-related problem, I tried the suggested method below but it resulted in an error. It seems that it doesn't work with the type of font we are using.
#1977 (comment)

Code:

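import weasyprint


class PdfWriter:
    # Class-level dict, shared by every instance and every call.
    _fonts = {}

    def write_pdf(self, html_str):
        doc = weasyprint.HTML(string=html_str).render()
        # Hack from #1977: reuse the same fonts dict across documents.
        doc.fonts = self._fonts
        doc.write_pdf(None)


PdfWriter().write_pdf(f"<div>{'<div>あ</div>' * 20}</div>")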

Output:

Traceback (most recent call last):
  File "/app/main.py", line 36, in <module>
    create_pdf(html_str)
  File "/usr/local/lib/python3.9/site-packages/memory_profiler.py", line 1188, in wrapper
    val = prof(func)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/memory_profiler.py", line 761, in f
    return func(*args, **kwds)
  File "/app/main.py", line 10, in create_pdf
    generate(file_path)
  File "/app/main.py", line 28, in generate
    PdfWriter().write_pdf(html_str)
  File "/app/main.py", line 21, in write_pdf
    doc.write_pdf()
  File "/usr/local/lib/python3.9/site-packages/weasyprint/document.py", line 399, in write_pdf
    pdf = generate_pdf(self, target, zoom, **options)
  File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/__init__.py", line 268, in generate_pdf
    pdf_fonts = build_fonts_dictionary(
  File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/fonts.py", line 27, in build_fonts_dictionary
    font.clean(cmap, hinting)
  File "/usr/local/lib/python3.9/site-packages/weasyprint/pdf/stream.py", line 116, in clean
    subsetter.subset(self.ttfont)
  File "/usr/local/lib/python3.9/site-packages/fontTools/subset/__init__.py", line 3499, in subset
    self._subset_glyphs(font)
  File "/usr/local/lib/python3.9/site-packages/fontTools/subset/__init__.py", line 3423, in _subset_glyphs
    retain = table.subset_glyphs(self)
  File "/usr/local/lib/python3.9/site-packages/fontTools/subset/cff.py", line 111, in subset_glyphs
    del csi.file, csi.offsets
AttributeError: file

liZe · 2024-04-22T20:08:14Z

> Regarding the font-related problem, I tried the suggested method below but it resulted in an error. It seems that it doesn't work with the type of font we are using.

The snippet is just a hack that could help in specific cases. We have yet to find a reliable way to fix this problem.

liZe · 2024-04-25T13:50:39Z

OK, I’ve found where the problem comes from:

WeasyPrint/weasyprint/pdf/stream.py

Lines 329 to 340 in 3a208fe

    
    @lru_cache()
    def add_font(self, pango_font):
        description = pango.pango_font_describe(pango_font)
        mask = (
            pango.PANGO_FONT_MASK_SIZE +
            pango.PANGO_FONT_MASK_GRAVITY)
        pango.pango_font_description_unset_fields(description, mask)
        key = pango.pango_font_description_hash(description)
        pango.pango_font_description_free(description)
        if key not in self._fonts:
            self._fonts[key] = Font(pango_font)
        return self._fonts[key]

This method is cached, meaning that the font is stored in memory once for each Font object. Of course, it’s important to have a cache to avoid calculating the key multiple times for the same Pango font. But storing the generated font is stupid, because it’s way too big to be stored in memory.

Let’s just store the (Pango font + key) couple instead!
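The idea above can be illustrated with a small, self-contained sketch. This is not the actual WeasyPrint patch: `HeavyFont`, `LeakyStream`, `LeanStream`, `font_key` and `survives_after_use` are hypothetical names, and a plain string stands in for the Pango font. It shows that `lru_cache` on a method keeps both `self` and the returned object alive in the cache, whereas caching only a small key leaves the heavy objects free to be released with their owner.

```python
import gc
import weakref
from functools import lru_cache


class HeavyFont:
    """Hypothetical stand-in for a large generated font object."""

    def __init__(self, name):
        self.name = name
        self.data = bytearray(1_000_000)  # pretend this is font content


class LeakyStream:
    """Caches the whole method result, so the cache pins self and the font."""

    def __init__(self):
        self._fonts = {}

    @lru_cache()
    def add_font(self, name):
        key = hash(name)
        if key not in self._fonts:
            self._fonts[key] = HeavyFont(name)
        return self._fonts[key]


@lru_cache()
def font_key(name):
    """Cache only the small key; it holds no reference to any stream."""
    return hash(name)


class LeanStream:
    """Looks up the cached key, but owns its fonts itself."""

    def __init__(self):
        self._fonts = {}

    def add_font(self, name):
        key = font_key(name)
        if key not in self._fonts:
            self._fonts[key] = HeavyFont(name)
        return self._fonts[key]


def survives_after_use(stream_class):
    """Return True if a stream is still alive after its last reference is dropped."""
    stream = stream_class()
    stream.add_font("Noto Sans CJK JP")
    ref = weakref.ref(stream)
    del stream
    gc.collect()
    return ref() is not None


if __name__ == "__main__":
    print("LeakyStream kept alive:", survives_after_use(LeakyStream))
    print("LeanStream kept alive:", survives_after_use(LeanStream))
```

`functools.lru_cache()` defaults to `maxsize=128`, so in the first variant old entries are only evicted once that many distinct calls have accumulated.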

liZe · 2024-04-25T13:57:41Z

Before:

pmem(rss=49848320, vms=69750784, shared=16351232, text=4096, lib=0, data=34238464, dirty=0)
pmem(rss=142221312, vms=242700288, shared=23134208, text=4096, lib=0, data=136609792, dirty=0)
pmem(rss=208916480, vms=309448704, shared=22990848, text=4096, lib=0, data=203358208, dirty=0)
pmem(rss=282812416, vms=384045056, shared=22990848, text=4096, lib=0, data=277954560, dirty=0)
pmem(rss=344322048, vms=445583360, shared=22990848, text=4096, lib=0, data=339492864, dirty=0)
pmem(rss=407535616, vms=508080128, shared=22990848, text=4096, lib=0, data=401989632, dirty=0)
pmem(rss=478806016, vms=579670016, shared=22908928, text=4096, lib=0, data=473579520, dirty=0)
pmem(rss=538480640, vms=639197184, shared=22908928, text=4096, lib=0, data=533106688, dirty=0)
pmem(rss=598171648, vms=698810368, shared=22990848, text=4096, lib=0, data=592719872, dirty=0)
pmem(rss=662573056, vms=829816832, shared=22908928, text=4096, lib=0, data=657289216, dirty=0)
pmem(rss=734072832, vms=901210112, shared=22908928, text=4096, lib=0, data=728682496, dirty=0)
pmem(rss=794628096, vms=961871872, shared=22908928, text=4096, lib=0, data=789344256, dirty=0)
pmem(rss=858517504, vms=1025773568, shared=22908928, text=4096, lib=0, data=853245952, dirty=0)
pmem(rss=927875072, vms=1095065600, shared=22908928, text=4096, lib=0, data=922537984, dirty=0)
pmem(rss=986566656, vms=1154682880, shared=22908928, text=4096, lib=0, data=982155264, dirty=0)
pmem(rss=1058230272, vms=1225568256, shared=22908928, text=4096, lib=0, data=1053040640, dirty=0)
pmem(rss=1121529856, vms=1289117696, shared=22908928, text=4096, lib=0, data=1116590080, dirty=0)
pmem(rss=1191395328, vms=1359392768, shared=22908928, text=4096, lib=0, data=1186865152, dirty=0)
pmem(rss=1249349632, vms=1417478144, shared=22908928, text=4096, lib=0, data=1244950528, dirty=0)
pmem(rss=1311813632, vms=1479188480, shared=22908928, text=4096, lib=0, data=1306660864, dirty=0)
pmem(rss=1380007936, vms=1547632640, shared=22908928, text=4096, lib=0, data=1375105024, dirty=0)

After:

pmem(rss=49188864, vms=69419008, shared=16166912, text=4096, lib=0, data=33906688, dirty=0)
pmem(rss=102486016, vms=269824000, shared=22962176, text=4096, lib=0, data=96722944, dirty=0)
pmem(rss=106692608, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106500096, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106704896, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106528768, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106418176, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=106651648, vms=274108416, shared=22962176, text=4096, lib=0, data=101011456, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107048960, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107053056, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107208704, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107094016, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107057152, vms=274108416, shared=22872064, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107024384, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107048960, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107008000, vms=274108416, shared=22876160, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107200512, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107200512, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
pmem(rss=107204608, vms=274108416, shared=22962176, text=4096, lib=0, data=101584896, dirty=0)
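For readers comparing the two runs, the `rss` values are in bytes. The following sketch converts them to MiB and computes the average growth per generation. The helper names `rss_mib` and `average_growth_mib` are hypothetical; the sample lines are the first and last entries of each listing above, which contain 21 samples each, so 20 intervals.

```python
import re


def rss_mib(pmem_line):
    """Extract the rss field from a pmem(...) line and convert bytes to MiB."""
    match = re.search(r"rss=(\d+)", pmem_line)
    if match is None:
        raise ValueError("no rss field found")
    return int(match.group(1)) / (1024 * 1024)


def average_growth_mib(first_line, last_line, intervals):
    """Average RSS growth per interval between two samples, in MiB."""
    return (rss_mib(last_line) - rss_mib(first_line)) / intervals


# First and last samples copied from the "Before" and "After" listings.
BEFORE_FIRST = "pmem(rss=49848320, vms=69750784, shared=16351232)"
BEFORE_LAST = "pmem(rss=1380007936, vms=1547632640, shared=22908928)"
AFTER_FIRST = "pmem(rss=49188864, vms=69419008, shared=16166912)"
AFTER_LAST = "pmem(rss=107204608, vms=274108416, shared=22962176)"

if __name__ == "__main__":
    print("before:", average_growth_mib(BEFORE_FIRST, BEFORE_LAST, 20))
    print("after:", average_growth_mib(AFTER_FIRST, AFTER_LAST, 20))
```

Before the change, RSS climbs by roughly 63 MiB per generation. After it, RSS settles at about 102 MiB, and nearly all of the increase happens between the first two samples.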

Fix Kozea#2130.

iqbalhusen · 2024-05-08T17:05:37Z

> OK, I’ve found where the problem comes from:
>
> WeasyPrint/weasyprint/pdf/stream.py
>
> Lines 329 to 340 in 3a208fe
>
>     @lru_cache()
>     def add_font(self, pango_font):
>         description = pango.pango_font_describe(pango_font)
>         mask = (
>             pango.PANGO_FONT_MASK_SIZE +
>             pango.PANGO_FONT_MASK_GRAVITY)
>         pango.pango_font_description_unset_fields(description, mask)
>         key = pango.pango_font_description_hash(description)
>         pango.pango_font_description_free(description)
>         if key not in self._fonts:
>             self._fonts[key] = Font(pango_font)
>         return self._fonts[key]
>
> This method is cached, meaning that the font is stored in memory once for each Font object. Of course, it’s important to have a cache to avoid calculating the key multiple times for the same Pango font. But storing the generated font is stupid, because it’s way too big to be stored in memory.
>
> Let’s just store the (Pango font + key) couple instead!

This pointed me in the right direction after I had struggled for a whole day with a font-caching issue. When I generated PDFs from a static Japanese webpage in a loop, the first 2-3 PDFs were generated correctly, but the subsequent PDFs contained garbled text. I could not figure out why, because every iteration used the same webpage. Once I started invalidating the cache after every PDF generation, the results were as expected. This is not mentioned in the documentation.

liZe · 2024-05-08T22:17:28Z

> This is not mentioned in the documentation.

That’s because it’s a bug, and it’s fixed in the latest release. Update WeasyPrint and the problem will be gone.

See #2144 and #1977.

liZe added the performance Too slow renderings label Apr 22, 2024

liZe closed this as completed in 2e778eb Apr 25, 2024

liZe mentioned this issue Apr 25, 2024

Memory leak when writing pdf #1977

Closed

liZe added this to the 62.0 milestone Apr 25, 2024

okkays pushed a commit to okkays/WeasyPrint that referenced this issue May 1, 2024

Cache font key instead of whole font content

4693087

Fix Kozea#2130.

liZe mentioned this issue May 2, 2024

Fonts breaking in v62 #2144

Closed

liZe mentioned this issue Jun 7, 2024

Memory leak on render #1123

Closed
