TAC corpus

The work is part of the "Networked Mathematics" project at the Topos Institute.

You can read about the project in the blog posts:

Introducing the MathFoldr Project (11 Jul 2021)
The many facets of Networked Mathematics (18 Apr 2022)
Mathematical concepts: how do you recognize them? (16 Nov 2022)
Preparing for Networked Mathematics (5 Jan 2023)

There are also the preprints:

Mathematical Entities: Corpora and Benchmarks (Parmesan3), presented at LREC-COLING 2024. A short video is here.
https://arxiv.org/abs/2311.12649 (MathGloss)
https://arxiv.org/abs/2309.00642 (MathAnnotator)
Parmesan: mathematical concept extraction for education (Parmesan2)
“Extracting Mathematical Concepts from Text”, presented at the 8th Workshop on Noisy User-generated Text (W-NUT 2022), associated with COLING 2022. Here's a short video about the work.

There were some preliminary investigations in the Parmesan0.11 and Parmesan0.12 prototypes. For the unmodified results of our 2022 paper, see this branch.

Parmesan 0.2 is now up at http://www.jacobcollard.com/parmesan2/

This repository

This repository contains a corpus built from the abstracts of the electronic journal Theory and Applications of Categories (TAC), as of around December 2020. It is used as a training and testing corpus for mathematical NLP and machine learning projects.

The corpus contains the following data files:

tac.conll contains an automatically annotated version of the corpus, with dependency structures and POS tags.
tac.json contains the original corpus, in JSON format.
tac_metadata.json contains the original corpus, in JSON format, with additional metadata such as authors and keywords.
tac_stats.json contains some basic statistics about the corpus, including the frequency of common words and parts of speech.
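As a starting point for working with these files, here is a minimal sketch of loading tac.json and tac.conll with the Python standard library. It assumes a local checkout of the repository, that tac.conll uses the usual CoNLL conventions (one token per line, tab-separated columns, a blank line between sentences), and that the second column holds the word form. None of this is confirmed by the README, so check the files before relying on it. The names read_conll and load_corpus are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path


def read_conll(text):
    """Split CoNLL-style text into sentences, each a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment or sentence-level metadata lines are skipped.
            continue
        else:
            # Assumption: columns are separated by tabs.
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def load_corpus(directory="."):
    """Load tac.json and tac.conll from a local checkout (location assumed)."""
    root = Path(directory)
    documents = json.loads((root / "tac.json").read_text(encoding="utf-8"))
    sentences = read_conll((root / "tac.conll").read_text(encoding="utf-8"))
    return documents, sentences


# Small made-up sample, so the parser can be tried without the corpus files.
SAMPLE = (
    "1\tEvery\tevery\tDET\tDT\n"
    "2\ttopos\ttopos\tNOUN\tNN\n"
    "\n"
    "1\tFunctors\tfunctor\tNOUN\tNNS\n"
)
parsed = read_conll(SAMPLE)
print(len(parsed), [row[1] for row in parsed[0]])
```

With the repository checked out, load_corpus() returns the parsed JSON alongside the list of sentences. The structure of the JSON itself is not described here, so inspect it before indexing into it.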

The tac-experiments folder contains a series of simple experiments evaluating various automatic terminology extraction methods on the TAC corpus. To run the original experiments, you need installations of DyGIE++ and Parmenides. Unfortunately, the latter is not freely available, but you can contact its authors to ask about distribution.

Parmesan 0.2 can be run without problems; see the instructions at https://github.com/ToposInstitute/parmesan.
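For readers who want to compare a term extractor's output with a reference list without installing the tools above, one common approach is set-based precision, recall, and F1. The sketch below illustrates that generic technique only; it is not necessarily the metric used in tac-experiments. The function name score_terms, the lowercase normalization, and the example terms are all assumptions made for illustration.

```python
def score_terms(predicted, gold):
    """Return (precision, recall, F1) for extracted terms against a gold list."""
    # Assumption: terms match after lowercasing and trimming whitespace.
    pred = {term.lower().strip() for term in predicted}
    ref = {term.lower().strip() for term in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example terms, not taken from the corpus.
scores = score_terms(
    ["functor", "Natural transformation", "paper"],
    ["functor", "natural transformation", "topos", "monad"],
)
print(scores)
```

Exact matching after normalization is the strictest choice. Partial or lemma-based matching would give different numbers and may suit multiword terms better.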

Corpus statistics

There are two types of part-of-speech tags in the corpus statistics, both generated by spaCy. The first tagset, labeled "pos" in tac_stats.json, represents coarse-grained parts of speech and is taken from the Universal POS tag set. The second tagset, labeled "tag", is fine-grained and specific to spaCy's pretrained English model.

Details about the different tagsets, as well as the other label schemes for this model, can be found on spaCy's website.
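To make the difference between the two tagsets concrete, here is a small sketch that ranks tags by frequency within each tagset. The layout is an assumption: the README does not describe the schema of the statistics file, so the sketch supposes that "pos" and "tag" each map a tag name to a count. The counts are placeholders, not corpus figures, and most_common_tags is a hypothetical helper.

```python
import json
from collections import Counter

# Hypothetical layout with placeholder counts; the real file may differ.
SAMPLE_STATS = json.loads("""
{
  "pos": {"NOUN": 120, "ADJ": 45, "VERB": 60},
  "tag": {"NN": 80, "NNS": 40, "JJ": 45, "VBZ": 35, "VBN": 25}
}
""")


def most_common_tags(stats, tagset, n=3):
    """Return the n most frequent tags of one tagset as (tag, count) pairs."""
    return Counter(stats[tagset]).most_common(n)


# Coarse-grained Universal POS tags.
print(most_common_tags(SAMPLE_STATS, "pos", 2))
# Fine-grained tags from the English model.
print(most_common_tags(SAMPLE_STATS, "tag", 2))
```

In this toy data the single coarse tag NOUN corresponds to the two fine-grained tags NN and NNS, which is the kind of refinement the "tag" tagset provides.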

