Pulling together all the bits for text extraction #140

Merged on Sep 19, 2019 (5 commits)

Conversation

dannylamb
Member

GitHub Issue: Islandora/documentation#933

Pulls together

What does this Pull Request do?

Installs a version of islandora_text_extraction that I've taken from @ajstanley and the UPEI crew and shuffled a few things around.

What's new?

islandora_text_extraction, but with fewer files. All remaining files are untouched from @ajstanley's original work, but I did delete some files that were no longer necessary because...

  • pdftotext is called in Hypercube now
  • Some conditions have been replaced by just configuring the existing contexts we have differently
  • All configuration has been split out into its own feature

How should this be tested?

  • Pull down this PR
  • rm -rf roles/external
  • vagrant up

Once successful, you should be able to trigger derivatives for both tesseract and pdftotext. For tesseract:

  • Make a node, give it a model of 'Page'
  • Add a File media, upload a tiff, and tag it as an original file.

For pdftotext

  • Make a node, give it a model of 'Digital Document'
  • Add a File media, upload a PDF, and tag it as an original file

Of course this is all customizable through Context, but by default that's what will work. You should get an Extracted Text media for both, which will have the text on it as a field.

Additional Notes:

Ideally we'd ship with a better Solr config so that these text fields are indexed by default.

Interested parties

@Islandora-Devops/committers @ajstanley @dbernstein @Natkeeran

@Natkeeran
Contributor

Natkeeran commented Sep 12, 2019

@dannylamb

This works as advertised.

As you noted, we can extend it later to give an option to extract or not via UI, when the user uploads the file.

However, I do want to highlight another issue. Currently, it does not seem to pass the language parameter to the command. The language parameter would have to be extracted from the Media or its node.

For example, an additional language can be installed with the following command:

sudo apt-get install tesseract-ocr-tam

Then, to specify a language, the command can be as follows:

tesseract input.tiff output -l tam

I am good with merging this, then extending it to add the language parameter.

(On a side note, tesseract 4.0 provides good/acceptable support for more languages. Thus, it would be good to support it as the LOE to support is relatively low).
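A minimal sketch of the language handling described above, assuming a hypothetical helper named `build_tesseract_cmd` (it is not part of islandora_text_extraction or Hypercube). It only prints the command line, so it can be tried without tesseract installed. Note that tesseract treats its second argument as an output base and appends `.txt` itself, so the sketch passes `output` rather than `output.txt`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an illustration, not code from this PR):
# prints the tesseract command line for one page image.
# Usage: build_tesseract_cmd INPUT OUTPUT_BASE [LANG]
build_tesseract_cmd() {
  local input="$1" outbase="$2" lang="${3:-}"
  local cmd=(tesseract "$input" "$outbase")
  # Add -l only when a language code was supplied; without it,
  # tesseract falls back to its default language (eng).
  if [ -n "$lang" ]; then
    cmd+=(-l "$lang")
  fi
  printf '%s\n' "${cmd[*]}"
}

build_tesseract_cmd input.tiff output tam   # with the Tamil language pack
build_tesseract_cmd input.tiff output       # default language
```

Before passing a code such as `tam`, `tesseract --list-langs` can be used to check that the matching language pack is actually installed.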

@dannylamb
Member Author

@Natkeeran We do ship with a handful of other languages, but don't specify any in the command. We can set that up in the Action if you want.

As far as merging goes, it's gotta be in a certain order. Merge Islandora/Crayfish#77 and Islandora/documentation#163. Then I have to update the composer.json in Islandora/islandora_defaults#7. Then merge it and islandora-deprecated/ansible-role-crayfish#30. Then I'll update this PR and we'll merge. 😓

@dannylamb
Member Author

@dbernstein ^^ That's the rigamarole I was talking about in Slack. Bit of a juggle to get everything in.

@Natkeeran
Contributor

@dannylamb

hmm, I don't think you can specify the language in the action.

Currently we don't have language metadata in the media, because it is set at the node level. Thus, we would need a hook or another lookup to get the language. Or is there an easier way to do this via Action?

Let's do the merging tomorrow with fresh coffee.

@dannylamb
Member Author

You can give it the -l param as args in the Action, but I don't know if you can use tokens to get the actual language value for the entity.

@dannylamb
Member Author

and +1 to fresh ☕, it's almost quittin' time here today.

@seth-shaw-unlv
Contributor

@dannylamb, make your updates then I will give a final test before merging.

@dannylamb
Member Author

All done @seth-shaw-unlv. Feel free to merge at your next earliest convenience.

dannylamb mentioned this pull request Sep 19, 2019
@seth-shaw-unlv
Contributor

Just 🚀'd a final test.

@seth-shaw-unlv
Contributor

Huh, it mostly works. The original image context doesn't have the extract text action selected though... I was able to add it manually, but somehow it missed the earlier merges. Anyway, I'm going to merge this because that context edit can be done separately afterward.

seth-shaw-unlv merged commit f3be6d5 into Islandora-Devops:dev Sep 19, 2019
@seth-shaw-unlv
Contributor

OH! because we only have it marked for Pages, not for stand-alone images. Gotcha.
