Pulling together all the bits for text extraction #140
This works as advertised. As you noted, we can extend it later to give the user an option in the UI to extract text or not when uploading the file. However, I do want to highlight another issue. Currently, it does not seem to pass the language parameter to the cmd. The language parameter would have to be extracted from the ... For example, an additional language can be added by issuing the following command:
Then, to specify a language, the command can be as follows:
I am good with merging this, then extending it to add the language parameters. (On a side note, tesseract 4.0 provides good/acceptable support for more languages. Thus, it would be good to support it, as the LOE to support it is relatively low.)
@Natkeeran We do ship with a handful of other languages, but don't specify any in the command. We can set that up in the Action if you want. As far as merging goes, it's gotta be in a certain order. Merge Islandora/Crayfish#77 and Islandora/documentation#163. Then I have to update the composer.json in Islandora/islandora_defaults#7. Then merge it and islandora-deprecated/ansible-role-crayfish#30. Then I'll update this PR and we'll merge. 😓
@dbernstein ^^ That's the rigamarole I was talking about in Slack. Bit of a juggle to get everything in.
Hmm, I don't think you can specify the language in the Action. Currently we don't have language metadata in the media, because it is set at the node level. Thus, we would need a hook or another lookup to get the language. Or is there an easier way to do this via the Action? Let's do the merging tomorrow with fresh coffee.
You can give it the -l param as args in the Action, but I don't know if you can use tokens to get the actual language value for the entity.
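For illustration, here is a minimal sketch of what passing -l would amount to on the tesseract side. The helper name build_ocr_cmd and the file names are hypothetical, not part of Islandora or Crayfish, and the exact command line Hypercube builds may differ. The sketch only prints the command it would run, and assumes the language code has already been looked up somewhere else (for example from the node).

```shell
#!/bin/sh
# Hypothetical helper (not part of Islandora or Crayfish): print the tesseract
# command line for an input image and an optional language code.
# Assumption: the language code (e.g. "fra") has already been looked up
# elsewhere, for instance from the parent node.
build_ocr_cmd() {
  input="$1"
  lang="${2:-eng}"   # fall back to English when no language is given
  # `-l LANG` selects the trained language data; using `stdout` as the output
  # base makes tesseract write the recognized text to standard output.
  printf 'tesseract %s stdout -l %s\n' "$input" "$lang"
}

build_ocr_cmd page.tif fra
build_ocr_cmd page.tif
```

In the Action itself, only the `-l fra` part would go in the args field; the open question of getting the value from a token still stands.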
and +1 to fresh ☕, it's almost quittin' time here today.
@dannylamb, make your updates then I will give a final test before merging.
All done @seth-shaw-unlv. Feel free to merge at your next earliest convenience.
Just 🚀'd a final test.
Huh, it mostly works. The original image context doesn't have the extract text action selected though... I was able to add it manually, but somehow it missed the earlier merges. Anyway, I'm going to merge this because that context edit can be done separately afterward.
OH! because we only have it marked for Pages, not for stand-alone images. Gotcha.
GitHub Issue: Islandora/documentation#933
Pulls together
What does this Pull Request do?
Installs a version of islandora_text_extraction that I've taken from @ajstanley and the UPEI crew and shuffled a few things around.
What's new?
islandora_text_extraction, but with fewer files. All remaining files are untouched from @ajstanley's original work, but I did delete some files that were no longer necessary because...
pdftotext is called in Hypercube now
How should this be tested?
rm -rf roles/external
vagrant up
Once successful, you should be able to trigger derivatives for both tesseract and pdftotext. For tesseract:
For pdftotext
Of course this is all customizable through Context, but by default that's what will work. You should get an Extracted Text media for both, which will have the text on it as a field.
Additional Notes:
Ideally we'd ship with a better Solr config to have these text fields indexed by default.
Interested parties
@Islandora-Devops/committers @ajstanley @dbernstein @Natkeeran