Pulling together all the bits for text extraction #140

dannylamb · 2019-09-12T17:51:32Z

GitHub Issue: Islandora/documentation#933

Pulls together

What does this Pull Request do?

Installs a version of islandora_text_extraction that I've taken from @ajstanley and the UPEI crew and shuffled a few things around.

What's new?

islandora_text_extraction, but with fewer files. All remaining files are untouched from @ajstanley's original work, but I did delete some files that were no longer necessary because...

pdftotext is called in Hypercube now
Some conditions have been replaced by just configuring the existing contexts we have differently
All configuration has been split out into its own feature

How should this be tested?

Pull down this PR
rm -rf roles/external
vagrant up

Once successful, you should be able to trigger derivatives for both tesseract and pdftotext. For tesseract:

Make a node, give it a model of 'Page'
Add a File media, upload a tiff, and tag it as an original file.

For pdftotext:

Make a node, give it a model of 'Digital Document'
Add a File media, upload a PDF, and tag it as an original file

Of course this is all customizable through Context, but by default that's what will work. You should get an Extracted Text media for both, which will have the text on it as a field.
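The default behaviour described in the test steps can be summarised as a small lookup. This is only an illustrative sketch: `pick_extractor` is a hypothetical helper, not code from this PR, and the real decision is made by Context configuration in Drupal rather than by a shell function.

```shell
# Hypothetical helper (not part of this PR): prints the extraction tool that
# the default contexts are described as triggering for a given node model.
pick_extractor() {
  model="$1"
  case "$model" in
    "Page")             echo "tesseract" ;;   # tiff original file on a Page
    "Digital Document") echo "pdftotext" ;;   # PDF original file on a Digital Document
    *)                  echo "none" ;;        # no text extraction by default
  esac
}
```

For example, `pick_extractor "Page"` prints `tesseract`, while any other model prints `none` unless the contexts are reconfigured.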

Additional Notes:

Ideally we'd ship with better solr config to have these text fields indexed by default.

Interested parties

@Islandora-Devops/committers @ajstanley @dbernstein @Natkeeran

Natkeeran · 2019-09-12T19:57:46Z

@dannylamb

This works as advertised.

As you noted, we can extend it later to give an option to extract or not via UI, when the user uploads the file.

However, I do want to highlight another issue. Currently, it does not seem to pass the language parameter to the cmd. The language parameter would have to be extracted from the Media of the node.

For example, an additional language can be added by issuing the following command:

sudo apt-get install tesseract-ocr-tam

Then, to specify a language, the command can be as follows:

tesseract input.tiff output.txt -l tam

I am good with merging this, then extending it to add the language parameter.

(On a side note, tesseract 4.0 provides good/acceptable support for more languages. Thus, it would be good to support it as the LOE to support is relatively low).
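The language handling discussed in this comment could look roughly like the sketch below. It assumes the language code has already been looked up somewhere (from the node or the Media, which is the open question here); `build_tesseract_cmd` and its argument order are hypothetical and not part of this PR. The function only prints the command line so it can be inspected before anything is run.

```shell
# Hypothetical helper (not part of this PR): prints a tesseract command line,
# adding -l only when a language code is supplied.
build_tesseract_cmd() {
  input="$1"        # path to the source image, e.g. a tiff
  outputbase="$2"   # output base name; tesseract appends .txt itself
  lang="${3:-}"     # optional language code, e.g. tam
  if [ -n "$lang" ]; then
    printf 'tesseract %s %s -l %s\n' "$input" "$outputbase" "$lang"
  else
    # No language given: fall back to tesseract's default language.
    printf 'tesseract %s %s\n' "$input" "$outputbase"
  fi
}
```

For example, `build_tesseract_cmd input.tiff output tam` prints `tesseract input.tiff output -l tam`. The matching language pack (such as `tesseract-ocr-tam`) still has to be installed for the command to succeed when it is actually run.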

dannylamb · 2019-09-12T20:04:43Z

@Natkeeran We do ship with a handful of other languages, but don't specify any in the command. We can set that up in the Action if you want.

As far as merging goes, it's gotta be in a certain order. Merge Islandora/Crayfish#77 and Islandora/documentation#163. Then I have to update the composer.json in Islandora/islandora_defaults#7. Then merge it and islandora-deprecated/ansible-role-crayfish#30. Then I'll update this PR and we'll merge. 😓

dannylamb · 2019-09-12T20:05:29Z

@dbernstein ^^ That's the rigamarole I was talking about in Slack. Bit of a juggle to get everything in.

Natkeeran · 2019-09-12T20:12:05Z

@dannylamb

hmm, I don't think you can specify the language in the action.

Currently we don't have language metadata in the media, because it is set at the node level. Thus, we would need a hook or another lookup to get the language. Or is there an easier way to do this via Action?

Let's do the merging tomorrow with fresh coffee.

dannylamb · 2019-09-12T20:17:25Z

You can give it the -l param as args in the Action, but I don't know if you can use tokens to get the actual language value for the entity.

dannylamb · 2019-09-12T20:17:55Z

and +1 to fresh ☕, it's almost quittin' time here today.

seth-shaw-unlv · 2019-09-19T16:52:57Z

@dannylamb, make your updates then I will give a final test before merging.

dannylamb · 2019-09-19T17:04:14Z

All done @seth-shaw-unlv. Feel free to merge at your next earliest convenience.

seth-shaw-unlv · 2019-09-19T18:03:51Z

Just 🚀'd a final test.

seth-shaw-unlv · 2019-09-19T18:54:39Z

Huh, it mostly works. The original image context doesn't have the extract text action selected though... I was able to add it manually, but somehow it missed the earlier merges. Anyway, I'm going to merge this because that context edit can be done separately afterward.

seth-shaw-unlv · 2019-09-19T18:58:14Z

OH! because we only have it marked for Pages, not for stand-alone images. Gotcha.

Pulling together all the bits for text extraction

62db10f

dannylamb added 4 commits September 19, 2019 14:01

Update crayfish.yml

700b0b5

Update drupal.yml

ac040eb

Update requirements.yml

e3fb681

Update requirements.yml

0649722

dannylamb mentioned this pull request Sep 19, 2019

Text extraction #138

Closed

seth-shaw-unlv merged commit f3be6d5 into Islandora-Devops:dev Sep 19, 2019
