Releases: JSv4/OpenContracts
v2.4.0 - Txt-Based Format Annotator + Style Overhaul
This is a pretty significant upgrade vs 2.3.1. We added a number of features:
- We now support ingesting, rendering and annotating txt-based formats like plaintext, markdown, etc.
- Our document ingestion pipeline has a parser for txt-based formats.
- The task decorator for custom tasks will automatically switch from span-based to token-based annotations depending on the underlying format. At the moment this is just pdf vs non-pdf, but it could become a richer taxonomy.
- Substantial styling improvements.
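As a rough illustration of the format-dependent switch described above, here is a minimal sketch of a decorator that picks an annotation style from a document's file type. The names `doc_analyzer_sketch`, `Doc`, `TOKEN_BASED` and `SPAN_BASED` are hypothetical and are not taken from the OpenContracts codebase. The sketch also assumes that PDFs get token-based annotations and all other formats get span-based ones.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, List, Tuple

# Hypothetical label-type names, for illustration only.
TOKEN_BASED = "TOKEN_LABEL"
SPAN_BASED = "SPAN_LABEL"


@dataclass
class Doc:
    """Minimal stand-in for a stored document (not the real model)."""
    file_type: str
    text: str


def annotation_style_for(file_type: str) -> str:
    # Assumed binary choice: PDF vs. everything else.
    return TOKEN_BASED if file_type == "application/pdf" else SPAN_BASED


def doc_analyzer_sketch(func: Callable[[Doc], List[Tuple[int, int, str]]]):
    """Wrap an analyzer so each result is tagged with the chosen style."""
    @wraps(func)
    def wrapper(doc: Doc) -> List[dict]:
        style = annotation_style_for(doc.file_type)
        return [
            {"start": start, "end": end, "label": label, "type": style}
            for start, end, label in func(doc)
        ]
    return wrapper


@doc_analyzer_sketch
def find_year(doc: Doc) -> List[Tuple[int, int, str]]:
    # Toy analyzer: flag the first occurrence of "2024", if any.
    idx = doc.text.find("2024")
    return [(idx, idx + 4, "YEAR")] if idx >= 0 else []
```

The analyzer itself only returns character offsets and a label. The wrapper decides how those results are typed, so the same task can run unchanged on a PDF or a plaintext file.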
What's Changed
- Bump pytest from 8.2.2 to 8.3.3 by @dependabot in #227
- Bump pytz from 2022.7 to 2024.2 by @dependabot in #226
- Bump psycopg2 from 2.9.5 to 2.9.9 by @dependabot in #229
- Bump traefik from 3.1.4 to 3.1.5 in /compose/production/traefik by @dependabot in #232
- Bump actions/checkout from 4.1.7 to 4.2.0 by @dependabot in #231
- Bump cryptography from 43.0.0 to 43.0.1 by @dependabot in #228
- Bump traefik from 3.1.5 to 3.1.6 in /compose/production/traefik by @dependabot in #238
- Bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #236
- Add Txt Annotator by @JSv4 in #233
Full Changelog: v2.3.1...v2.4.0
v2.3.1 - Improved Admin & Annotation Loading for Analyses
Two primary improvements in this release:
- The admin views have been built out with more filters, raw_id renders (to cut down on M2M and FK pulls), and custom actions - including a custom dropdown action on selected Corpus(es) to make them public.
- We were previously loading ALL annotations for an analysis in each document view. First off, that's really inefficient for large corpuses. Second, it meant that the annotator got cluttered with random annotations that weren't actually in the loaded document. Added a filter on the `fullAnnotationList` prop of AnalysesType to filter to `document_id`. Updated frontend to only request annotation analyses for `opened_document`.
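The effect of the new filter can be sketched in plain Python. This is not the actual GraphQL resolver: the `Annotation` dataclass and the `full_annotation_list` function below are hypothetical stand-ins that only show the idea of narrowing an analysis's annotations to one document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Hypothetical, simplified annotation record."""
    id: int
    analysis_id: int
    document_id: int


def full_annotation_list(
    annotations: List[Annotation],
    analysis_id: int,
    document_id: Optional[int] = None,
) -> List[Annotation]:
    """Return an analysis's annotations, optionally for a single document."""
    selected = [a for a in annotations if a.analysis_id == analysis_id]
    if document_id is not None:
        # Without this filter, every annotation in the analysis is returned,
        # including those belonging to documents that are not open.
        selected = [a for a in selected if a.document_id == document_id]
    return selected
```

Passing `document_id` keeps the payload proportional to the open document rather than to the whole corpus.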
What's Changed
- Bump traefik from 3.1.2 to 3.1.3 in /compose/production/traefik by @dependabot in #217
- Bump pillow from 9.4.0 to 10.4.0 by @dependabot in #186
- Bump djangorestframework from 3.14.0 to 3.15.2 by @dependabot in #214
- Bump gunicorn from 20.1.0 to 23.0.0 by @dependabot in #194
- Improve Admin Views by @JSv4 in #219
- Bump traefik from 3.1.3 to 3.1.4 in /compose/production/traefik by @dependabot in #225
- Bump mypy from 1.11.1 to 1.11.2 by @dependabot in #223
- Bump drf-extra-fields from 3.4.1 to 3.7.0 by @dependabot in #221
Full Changelog: v.2.3.0...v2.3.1
v2.3.0 - Add User Feedback
It is now possible to collect feedback from users on public corpuses where `can_comment` is set to true. Added some nice GUI enhancements to the labels to support more action buttons - including a cool parabolic spiral button cloud that sprouts from an action zone.
What's Changed
Full Changelog: v2.2.0...v.2.3.0
v2.2.0 - Document UI Overhaul
This release brings an enormous number of frontend improvements and tweaks, primarily focused on unifying the document annotation and viewer components into a single component that has a single, clean workflow for viewing different extracts and analyses for a given document.
What's Changed
- Finalize 2.1 by @JSv4 in #200
- Bump crispy-bootstrap5 from 0.7 to 2024.2 by @dependabot in #196
- Bump redis from 4.5.1 to 5.0.8 by @dependabot in #201
- Bump pytest-django from 4.5.2 to 4.9.0 by @dependabot in #204
- Bump django-debug-toolbar from 3.7.0 to 4.4.6 by @dependabot in #203
- Enhancement: Sane, Smooth UX for Document-Based Workflows by @JSv4 in #206
Full Changelog: v2.1.0...v2.2.0
v2.1.0 - Corpus Actions
TLDR
This release brings the addition of `CorpusActions`, GitHub Action-style automatic analyzers or data extractors that run when a document is uploaded. See more here.
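To make the idea concrete, here is a minimal sketch of upload-triggered actions. It is an illustration only: `register_action`, `on_document_added` and the in-memory registry are hypothetical names and do not reflect how CorpusActions are actually stored or scheduled.

```python
from typing import Callable, Dict, List

# An action takes a document name and returns a short result string.
ActionFn = Callable[[str], str]

# Hypothetical in-memory registry: corpus id -> actions to run on upload.
_actions: Dict[str, List[ActionFn]] = {}


def register_action(corpus_id: str, action: ActionFn) -> None:
    """Attach an action to a corpus."""
    _actions.setdefault(corpus_id, []).append(action)


def on_document_added(corpus_id: str, document_name: str) -> List[str]:
    """Run every action registered for the corpus against the new document."""
    return [action(document_name) for action in _actions.get(corpus_id, [])]


register_action("contracts", lambda name: f"analyzed {name}")
register_action("contracts", lambda name: f"extracted {name}")
```

With this setup, `on_document_added("contracts", "nda.pdf")` runs both registered actions in registration order, while a corpus with no actions returns an empty list.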
What's Changed
- Upgrade Django App Dependencies to work with Django LTS by @JSv4 in #172
- Add Document Analysis Row by @JSv4 in #175
- Bump django from 4.2.14 to 4.2.15 by @dependabot in #180
- Bump flake8-isort from 6.0.0 to 6.1.1 by @dependabot in #181
- Bump pytest-cov from 4.0.0 to 5.0.0 by @dependabot in #182
- Bump cryptography from 38.0.1 to 43.0.0 by @dependabot in #184
- Bump traefik from 3.1.0 to 3.1.2 in /compose/production/traefik by @dependabot in #179
- Bump django-crispy-forms from 1.14.0 to 2.3 by @dependabot in #166
- Add Corpus Actions by @JSv4 in #183
- Bump pylint-django from 2.5.3 to 2.5.5 by @dependabot in #129
- Bump flower from 1.0.0 to 2.0.1 by @dependabot in #125
- Bump django-coverage-plugin from 2.0.3 to 3.1.0 by @dependabot in #190
- Bump werkzeug from 2.2.2 to 3.0.3 by @dependabot in #188
- Bump celery from 5.2.7 to 5.4.0 by @dependabot in #187
- Bump python-slugify from 6.1.2 to 8.0.4 by @dependabot in #192
- Bump ipdb from 0.13.9 to 0.13.13 by @dependabot in #189
- Bump mypy from 0.991 to 1.11.1 by @dependabot in #191
- Bump marvin from 2.3.4 to 2.3.7 by @dependabot in #195
- Improved doc analyzer task decorator to do more I/O handling by @JSv4 in #185
- Bump factory-boy from 3.2.1 to 3.3.1 by @dependabot in #197
- Added Sample Doc Action Task and Cleanup Task Execution by @JSv4 in #198
- Bump coverage from 6.5.0 to 7.6.1 by @dependabot in #199
Full Changelog: v2.0.0...v2.1.0
v2.0.0.post1 - Post 2.0.0 Fixes
Upgrade Dependencies
The upgrade from Django 3.2.* to 4.2.* introduced a syntax change in the management command that caused two Django app dependencies to break. In the process of upgrading these, some other dependency issues cropped up.
This release:
- Upgrades django app dependencies for full Django 4.2.* compatibility
- Upgrades opencv and related dependencies
- Introduces additional test cases to improve test coverage.
What's Changed
Full Changelog: v2.0.0...v2.0.0.post1
v2.0.0 - Stable Data Extract Release
This release includes:
- A table-based data extract interface and related models
- Improved test coverage
- Upgrade to Django 4.2.* LTS
What's Changed
- Add Data Extraction by @JSv4 in #117
- Bump pytest from 6.2.5 to 8.2.2 by @dependabot in #126
- v2 Bugfixes by @JSv4 in #128
- Bump actions/upload-artifact from 3 to 4 by @dependabot in #123
- Bump actions/setup-node from 3 to 4 by @dependabot in #121
- Bump actions/checkout from 3.3.0 to 4.1.7 by @dependabot in #120
- Better Docs and Modular Extract Tasks by @JSv4 in #130
- Bump actions/setup-python from 4 to 5 by @dependabot in #122
- Improve Docs and Diagrams by @JSv4 in #131
- Add Testing Docs by @JSv4 in #132
- Update Production Compose by @JSv4 in #136
- Fix Injection of Configurations into Frontend from Env Variables by @JSv4 in #137
- Fix GUI Bugs by @JSv4 in #138
- Create Funding.yaml by @JSv4 in #142
- Update README.md by @JSv4 in #143
- File inspection and Mimetype Limits on Document Upload Mutation. by @JSv4 in #144
- Bump traefik from 2.9.6 to 3.0.4 in /compose/production/traefik by @dependabot in #133
- Use Default Icon for Labelset Where None Provided by @JSv4 in #146
- Updated Terms of Service and Opening Modal by @JSv4 in #147
- Install Embeddings Model @ /models in Production Container + Fix Extract Where Search Text is None by @JSv4 in #156
- Improve Document Selection Workflows by @JSv4 in #157
- Bump traefik from 3.0.4 to 3.1.0 in /compose/production/traefik by @dependabot in #160
- Frontend Cleanup by @JSv4 in #163
- Fix CorpusCards by @JSv4 in #164
- Fix Corpus Query Source Action by @JSv4 in #165
- Dynamically Apply OCR, Improve PDF Utilities and Tests by @JSv4 in #167
- Improve DB Performance with Additional Indexes by @JSv4 in #168
- Long Poll Documents When Document is Processing by @JSv4 in #169
- Upgrade Django LTS by @JSv4 in #170
Full Changelog: v1.3.0...v2.0.0
Improved OCR and PDF Parsing
Some PDF-handling-related improvements:
- Merged some nlm-ingestor changes from the upstream repo to fix an issue with missing style tags in certain PDFs
- Improved test coverage for pdf utils
- Turn on OCR dynamically for PDFs that appear to need it. This avoids spending processing power running OCR on every PDF, while still making sure that PDFs which need OCR do not end up without text.
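One way to decide whether a PDF "appears to need" OCR is to look at how much text a normal extraction pass recovers. The heuristic below is an assumption made for illustration, not the check the project actually uses. The function name `needs_ocr` and the `min_chars_per_page` threshold are hypothetical.

```python
from typing import Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF needs OCR from its per-page extracted text.

    page_texts holds the text recovered from each page without OCR.
    """
    if not page_texts:
        # Nothing could be extracted at all, so fall back to OCR.
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    # Too little text on average suggests scanned, image-only pages.
    return total_chars < min_chars_per_page * len(page_texts)
```

A scanned document typically yields empty or near-empty strings for every page and is routed to OCR, while a born-digital PDF skips the expensive step.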
Also includes some minor GUI bug fixes.
v2.0.0 b2 - Improved Documentation and Modular Data Extract
Features:
- The data extract tasks are now dynamically loaded and can be applied on a column-by-column basis. So, you can write very specific extract logic for a given column / data field. Newly-registered tasks are displayed automatically on the frontend and can be selected by the user when building a fieldset for a datagrid.
- Added a search to the Extracts view and fixed various load and performance issues.
- Removed the LanguageModel model as it's almost completely subsumed by the ability to create custom extract pipelines. Moreover, it wasn't really doing anything before.
- Expanded our docs and tutorials to explain how data extract works and walk someone through writing a custom data extract task.
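The column-by-column task selection described above can be sketched with a small registry. This is an illustrative pattern only: `register_extract_task`, `EXTRACT_TASKS` and `run_extract` are hypothetical names, and the real tasks are loaded and executed differently (for example, as background jobs).

```python
from typing import Callable, Dict

# Hypothetical registry of extract tasks, keyed by a display name.
EXTRACT_TASKS: Dict[str, Callable[[str], object]] = {}


def register_extract_task(name: str):
    """Decorator that makes a function selectable as a column's task."""
    def decorator(func: Callable[[str], object]):
        EXTRACT_TASKS[name] = func
        return func
    return decorator


@register_extract_task("word_count")
def word_count(text: str) -> int:
    return len(text.split())


@register_extract_task("first_line")
def first_line(text: str) -> str:
    return text.splitlines()[0] if text else ""


def run_extract(text: str, column_tasks: Dict[str, str]) -> Dict[str, object]:
    """Apply the task chosen for each column to the document text."""
    return {
        column: EXTRACT_TASKS[task_name](text)
        for column, task_name in column_tasks.items()
    }
```

Because the registry is just a mapping, a frontend could list `EXTRACT_TASKS.keys()` to show newly registered tasks without any other change.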
What's Changed
- Bump pytest from 6.2.5 to 8.2.2 by @dependabot in #126
- v2 Bugfixes by @JSv4 in #128
- Bump actions/upload-artifact from 3 to 4 by @dependabot in #123
- Bump actions/setup-node from 3 to 4 by @dependabot in #121
- Bump actions/checkout from 3.3.0 to 4.1.7 by @dependabot in #120
- Better Docs and Modular Extract Tasks by @JSv4 in #130
- Bump actions/setup-python from 4 to 5 by @dependabot in #122
Full Changelog: v2.0.0b1...v2.0.0.b2
v2.0.0 b1 - Add Data Extract and Corpus Querying
2.0.0 Beta 1
Added Grid-based Data Extraction and Corpus Querying
This update extends the analytical capabilities of the application, allowing for automated and background extraction of structured data from documents, improving efficiency and scalability.
We've added several models on the backend:
- Extract: Represents a headless, background annotation task linked to a Corpus and Fieldset.
- Fieldset: Defines a reusable set of fields for Extracts, linked to Columns.
- Column: Represents a discrete data structure to extract from a document, with various properties like query, match_text, output_type, and more.
- Datacell: Represents extracted data for each column and document, storing data as JSON.
- LanguageModel: Represents a language model to be used in the extraction process.
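The relationships between these models can be sketched with plain dataclasses. This is a simplified picture, not the actual Django models: only `query`, `match_text` and `output_type` come from the description above, and every other field name is an assumption. `LanguageModel` is left out of the sketch.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str                # assumed field
    query: str
    output_type: str = "str"
    match_text: str = ""


@dataclass
class Fieldset:
    name: str                # assumed field
    columns: List[Column] = field(default_factory=list)


@dataclass
class Extract:
    corpus_name: str         # stands in for the link to a Corpus
    fieldset: Fieldset


@dataclass
class Datacell:
    extract: Extract
    column: Column
    document_name: str       # stands in for the link to a document
    data: str                # extracted value, stored as a JSON string


parties = Column(name="Parties", query="Who are the parties?", output_type="list")
fieldset = Fieldset(name="Contract basics", columns=[parties])
extract = Extract(corpus_name="NDAs", fieldset=fieldset)
cell = Datacell(extract, parties, "nda.pdf", json.dumps({"data": ["A", "B"]}))
```

Each `Datacell` sits at the intersection of one column and one document, which is what makes the result displayable as a grid.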
Improved Test Suite
- LlamaIndex is being tested with vcr.py so we actually have realistic tests and mocks for corpus query and corpus extract tasks
- Added a lot of graphql query and endpoint tests
New GUI Elements
- There is now an extract tab and a number of GUI elements to make it easy to construct an extract grid made up of documents, corpora and re-usable columns.
- Within the Corpus view, there is a query tab you can use to ask questions of the corpus
What's Changed
Full Changelog: v1.3.0...v2.0.0b1