Acha0203/Extract_Superscripts_from_PDF

Python Program to Extract Subscripts from a PDF file

Description

This Python program extracts subscript text from a PDF file.

Usage

  1. In your project root directory, create and activate a Python virtual environment ('pymupdf-venv' in the following commands is the name of the virtual environment).
     • Windows:
         py -m venv pymupdf-venv
         .\pymupdf-venv\Scripts\activate
         python -m pip install --upgrade pip
     • Linux or macOS:
         python -m venv pymupdf-venv
         . pymupdf-venv/bin/activate
         python -m pip install --upgrade pip
  2. Install PyMuPDF with the following command:
         pip install --upgrade pymupdf
  3. Run this program with the following command:
         python extract_subscripts.py
  4. Follow the on-screen instructions.
  5. When you are finished, exit the virtual environment with the following command:
         deactivate

Tips

  • Because of how this program works, text inside graphs and figures may be extracted by mistake. Use the page number to identify and ignore such text.
  • To extract subscripts successfully, first set the subscript font size to around 8.5, run the extraction once, and check the result. If the subscripts are not extracted, specify a value larger than 8.5 and try again.
  • If a lot of non-subscript text is extracted along with the subscripts, check the font size of the subscript text in your output file. Within a single file, subscripts are usually all the same size, so specify a value slightly larger than that size and try again.
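
The tips above suggest that extraction depends on a font-size threshold. The following is a minimal sketch of that idea, not the actual code of extract_subscripts.py: it assumes the program keeps text spans whose font size is at or below the value you enter. The function names `small_spans`, `size_counts`, and `extract_from_pdf` are hypothetical. The span layout (`blocks`, `lines`, `spans`, `size`, `text`) is what PyMuPDF's `page.get_text("dict")` returns.

```python
from collections import Counter


def iter_spans(page_dict):
    """Yield every non-empty text span from a PyMuPDF page dictionary."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if span["text"].strip():
                    yield span


def small_spans(page_dict, max_size):
    """Return (size, text) for spans no larger than max_size (assumed rule)."""
    return [
        (round(span["size"], 2), span["text"].strip())
        for span in iter_spans(page_dict)
        if span["size"] <= max_size
    ]


def size_counts(page_dict):
    """Count spans per font size, to help pick a threshold."""
    return Counter(round(span["size"], 2) for span in iter_spans(page_dict))


def extract_from_pdf(path, max_size=8.5):
    """Collect (page number, size, text) for small spans in a PDF file."""
    import pymupdf  # imported here so the helpers above work without it

    results = []
    with pymupdf.open(path) as doc:
        for page_number, page in enumerate(doc, start=1):
            page_dict = page.get_text("dict")
            for size, text in small_spans(page_dict, max_size):
                results.append((page_number, size, text))
    return results
```

Calling `size_counts` on a page first shows which font sizes occur and how often, which can make it easier to choose a threshold just above the subscript size, as the last tip recommends.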
