How to Extract Images or Fonts from a PDF
Please note that the comments below the line are obsolete since v1.16.8: PyMuPDF can now be run as a module from the command line, so the script extract-imga.py is no longer required.
Among other things you can extract images like this:
python -m fitz extract input.pdf -images
As before, this is very fast; the runtime depends mostly on the amount of data saved. Scanning the 1,310 pages of the Adobe manual and extracting all of its 180 images should take only 1.5 to 2 seconds.
Fonts are handled in exactly the same way, and fonts and images can be extracted simultaneously. You can also choose an existing folder to receive the results and restrict the extraction to the pages you want.
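The options mentioned above can be combined in a single call. The following is a minimal sketch that only assembles the command line; `build_extract_command` is a hypothetical helper, and the flag names `-fonts`, `-output` and `-pages` are assumptions modelled on `-images`, so please confirm them with `python -m fitz extract -h` for your installed version.

```python
import sys


def build_extract_command(pdf_path, images=True, fonts=False, output=None, pages=None):
    """Assemble the argument list for `python -m fitz extract` (hypothetical helper).

    Only `-images` appears in the text above; `-fonts`, `-output` and `-pages`
    are assumed flag names and should be checked against the CLI help.
    """
    cmd = [sys.executable, "-m", "fitz", "extract", pdf_path]
    if images:
        cmd.append("-images")
    if fonts:
        cmd.append("-fonts")
    if output is not None:
        cmd += ["-output", output]  # existing folder that receives the files
    if pages is not None:
        cmd += ["-pages", pages]  # page selection, passed through as a string
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Building a list instead of a shell string avoids quoting problems with file names that contain spaces.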
You can extract and save all images from a PDF as PNG files on a page-by-page basis with this little script. If an image has a CMYK colorspace, it will be converted to RGB first.
    import fitz  # PyMuPDF

    # Note: later PyMuPDF releases rename getPageImageList / writePNG
    # to get_page_images / Pixmap.save.
    doc = fitz.open("file.pdf")
    for i in range(len(doc)):
        for img in doc.getPageImageList(i):
            xref = img[0]  # check if this xref was handled already?
            pix = fitz.Pixmap(doc, xref)
            if pix.n - pix.alpha < 4:  # this is GRAY or RGB (alpha channel not counted)
                pix.writePNG("p%s-%s.png" % (i, xref))
            else:  # CMYK needs to be converted to RGB first
                pix1 = fitz.Pixmap(fitz.csRGB, pix)  # make RGB pixmap copy
                pix1.writePNG("p%s-%s.png" % (i, xref))
                pix1 = None  # release storage early (optional)
            pix = None  # release storage early (optional)
This runs very fast: on a 4.0 GHz desktop PC, it takes less than 2 seconds to extract the 180 images of Adobe's manual, a PDF with 1,310 pages, more than 30 MB in size and more than 330,000 PDF objects.
Find a more advanced version of the script here. Major differences include support for masked images and respecting the original image format (i.e. "JPEG", "TIFF", etc. as opposed to converting everything to PNG). It also tries to decide about the "worthiness" of the extraction and excludes images which are too small or just decorations, etc.
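To illustrate the "worthiness" idea, here is a minimal sketch of a size filter. It is not the logic of the advanced script: `is_worth_extracting`, `output_name` and both thresholds are hypothetical. The `info` dictionary is assumed to have the shape returned by `Document.extract_image` (`extractImage` in older releases), i.e. it contains at least the keys `width`, `height`, `ext` and `image`.

```python
def is_worth_extracting(info, min_side=100, min_bytes=2048):
    """Return True if an image looks large enough to keep.

    info:      dict with 'width', 'height' (pixels) and 'image' (raw bytes);
               assumed to match what Document.extract_image returns.
    min_side:  hypothetical threshold, shorter side must have this many pixels.
    min_bytes: hypothetical threshold, stored image must be at least this big.
    """
    if min(info["width"], info["height"]) < min_side:
        return False  # likely an icon, bullet or other decoration
    return len(info["image"]) >= min_bytes


def output_name(page_number, xref, info):
    """Build a file name that keeps the original format extension ('ext' key)."""
    return "p%s-%s.%s" % (page_number, xref, info["ext"])
```

Writing `info["image"]` to a file named by `output_name` stores the image in its original format instead of converting it to PNG.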
- The script relies on the PDF's structural health. For example, it will not work if the document's page tree is damaged. There are alternatives for problem PDFs - see below.
- If images are referenced by multiple pages, they will of course be extracted more than once. Use the image's xref number (the first entry of each item in getPageImageList) to check this.
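The xref check from the last note can be sketched as follows. `first_occurrences` is a hypothetical helper; it assumes only that each item of a page's image list starts with the xref number, as described above.

```python
def first_occurrences(page_image_lists):
    """Yield (page_number, xref) once for every distinct image xref.

    page_image_lists: iterable with one list per page; each item is a
    sequence whose first entry is the image's xref number.
    """
    seen = set()  # xrefs that were already handled
    for page_number, items in enumerate(page_image_lists):
        for item in items:
            xref = item[0]
            if xref in seen:
                continue  # image was already extracted via an earlier page
            seen.add(xref)
            yield page_number, xref
```

With an open document this could be fed as `first_occurrences(doc.getPageImageList(i) for i in range(len(doc)))`, so each image is written only for the first page that references it.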
There is another image extractor, which scans all PDF objects (ignoring pages). It will recover from many PDF structure problems.
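The idea of scanning objects instead of pages can be sketched like this. It is not the referenced extractor, only an illustration: `iter_image_xrefs` is a hypothetical helper, and it assumes the newer method names `Document.xref_length` and `Document.xref_get_key`, with the latter returning a `(type, value)` tuple such as `("name", "/Image")`. Older releases use different names, and a badly damaged file may still raise errors that this sketch does not handle.

```python
def iter_image_xrefs(doc):
    """Yield the xref of every object whose /Subtype is /Image.

    doc: an open PyMuPDF document (or any object offering the two
    methods used here). The page tree is never consulted.
    """
    for xref in range(1, doc.xref_length()):  # xref 0 is reserved
        # Assumed return shape: (type, value), e.g. ("name", "/Image")
        if doc.xref_get_key(xref, "Subtype") == ("name", "/Image"):
            yield xref
```

Each xref found this way could then be turned into a pixmap with `fitz.Pixmap(doc, xref)`, exactly as in the page-based script. Because every image object is visited exactly once, no duplicate check is needed here.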
You may want to read this recipes chapter of the documentation to find out more about image handling in PyMuPDF.