CPREx is an end to end tool for Named Entity Recognition (NER) and Relation Extraction (RE) specifically designed for chemical compounds and their properties. The goal of the tool is to identify, extract and link chemical compounds and their properties from scientific literature. For ease of use, CPREx provides a custom spaCy pipeline to perform NER and RE.
The pipeline performs the following steps
flowchart LR
crawler("`**crawler**
fetch PDF articles
from online archives`")
parser("`**parser**
Extract text
from PDF`")
crawler --> parser
parser --> ner
ner("`**NER**
extract named
entities`")
ner --> chem["`**Chem**
*1,3,5-Triazine*
*Zinc bromide*
*C₃H₄N₂*`"] --> rel
ner --> prop["`**Property**
*fusion enthalpy*
*Tc*`"] --> rel
ner --> quantity["`**Value**
*169°C*
*21.49 kJ/mol*`"] --> rel
rel("`**Relation Extraction**
link entities`")
rel --> res
res("`**(Chem, Property, Value)**
*2,2'-Binaphthalene, ΔHfus, 38.9 kJ/mol*`")
CPREx works with a recent version of python (>=python 3.11). Make sure to install CPREx in a virtual environment of your choice.
CPREx depends on GROBID and its extension grobid-quantities for parsing PDF documents and extracting quantities from their text. In order to install and run GROBID, a JDK must also be installed on your machine. GROBID currently supports JDK versions 11 to 17.
You can install CPREx directly with pip:
pip install cprex
This installation is recommended for users who want to customize the pipeline or train some models on their own dataset.
Clone the repository and install the project in your python environment, either using pip
git clone [email protected]:jonasrenault/cprex.git
cd cprex
pip install --editable .
or poetry
git clone [email protected]:jonasrenault/cprex.git
cd cprex
poetry install
CPREx depends on GROBID and its extension grobid-quantities for parsing PDF documents and extracting quantities from their text. For convenience, CPREx provides a command line interface (CLI) to install grobid and start a grobid server.
Run
cprex install-grobid
to install a grobid server and the grobid-quantities extension (by default, grobid and models required by CPREx are installed in a .cprex
directory in your home directory).
Run
cprex start-grobid
to start a grobid server and enable parsing of PDF documents from CPREx.
To perform Named Entity Recognition (NER) of chemical compounds and Relation Extraction (RE), CPREx requires some pretrained models. These models can be installed by running
cprex install-models
This will install a PubmedBert model finetuned on the NLM-CHEM corpus for extraction of chemical named entities. This model was finetuned by the BioCreative VII track.
It will also install a RE model pre-trained on our own annotated dataset.
A base spaCy model, such as en_core_web_sm
, is required for tokenization, lemmatization, and all other features offered by spaCy. To install a spaCy model, you can run
python -m spacy download en_core_web_sm
or directly install it with pip by specifying the model's url
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
The easiest way to run CPREx is to use Docker to start a container running CPREx. Refer to the official documentation for instructions on how to install Docker on your system.
You can then pull the CPREx image from github's container registry with
docker pull ghcr.io/jonasrenault/cprex:latest
You can start a container running this image with
docker run -t --rm -p 80:8501 ghcr.io/jonasrenault/cprex:latest
Note that the image is only compiled for amd64 architectures. Add --platform=linux/amd64
if running on an ARM architecture.
Once the container is started, you can access CPREx's streamlit UI by opening a browser at the http://localhost URL.
CPREx provides a User Interface built with streamlit. The UI lets you upload a PDF file and see the results of running CPREx on it. If you've cloned the CPREx repository locally and installed the projet in a python environment, you can run the UI by executing the following command, in the python environment and from CPREx's root directory
streamlit run cprex/ui/streamlit.py
This will start a web server exposing the UI at http://localhost:8501. Note that for CPREx's pipeline to work, you must have installed the models with the cprex install-models
command, and started the grobid services with cprex start-grobid
.
Notebook examples showing how to use CPREx directly in a Python script are available in the notebooks directory. To run the notebooks, install jupyterlab in your Python environment, start it with jupyter lab
, and open one of the example notebooks. Note that for CPREx's pipeline to work, you must have installed the models with the cprex install-models
command, and started the grobid services with cprex start-grobid
.