Use the Java Tika text extraction library on the .NET platform with OCR

LeeBear35/tikaondotnet


Tika on .NET

Join the chat at https://gitter.im/KevM/tikaondotnet

This project is a simple wrapper around the excellent and robust Tika text extraction Java library. It produces two NuGet packages:

  • TikaOnDotNet - A straight IKVM-hosted port of the Java Tika project.
  • TikaOnDotNet.TextExtractor - Use Tika to extract text from rich documents.

Building

The build automation expects the source to reside within a Git repo so the first step is to clone the repo using Git.

git clone https://github.com/KevM/tikaondotnet.git

This project uses FAKE for build automation and Paket for managing dependencies.

Note: Your first build should be from the command line to get the assembly version file created.

./build.cmd

The default build will run our Tika text extraction integration tests.

Building Nugets

It's easy to produce updated .nupkg packages.

./build.cmd PackageNugets

Look in ./artifacts for the resulting .nupkg files.

Updating Tika

When a new Tika release comes out you can follow the instructions below to get on the newest version.

  1. Edit the paket.dependencies file to point to the new release of the Tika Jar file.
  2. ./build.cmd PackageNugets

Follow this quick procedure to find the latest Tika release Jar archive:

  1. Visit the Tika download page
  2. Click on the Mirrors for tika-app-.jar link.
  3. Find the Jar hosted on www-us.apache.org.
  4. Copy this URL into paket.dependencies.

Note: The automation looks for the Tika Jar file under paket-files/<hostname>/*.jar. If you do not use the www-us.apache.org url you'll need to update build.fsx.
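
As a quick sanity check of the layout described in the note above, the following sketch lists the Jar files Paket has downloaded for a given host. The helper name `find_tika_jar` and its arguments are hypothetical and not part of this repo's build scripts. It only assumes the `paket-files/<hostname>/*.jar` layout mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of build.fsx): list the Tika Jar files that
# Paket downloaded for a given host.
# Assumes the layout paket-files/<hostname>/*.jar described in the note above.
find_tika_jar() {
  host="$1"                    # e.g. www-us.apache.org
  root="${2:-paket-files}"     # Paket download folder, relative to the repo root
  found=1                      # exit status: 1 until at least one Jar is seen
  for jar in "$root/$host"/*.jar; do
    # An unmatched glob stays literal, so skip anything that is not a file.
    [ -f "$jar" ] || continue
    printf '%s\n' "$jar"
    found=0
  done
  return "$found"
}
```

Running `find_tika_jar www-us.apache.org` from the repo root after a restore should print the Jar path. A non-zero exit status suggests the URL in paket.dependencies uses a different host, in which case build.fsx needs the update mentioned above.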

Updating IKVM

When a new release of IKVM comes out you can follow the instructions below to get on the newest version.

  1. Edit the paket.dependencies file:
     • Point the IKVM tools binary to the new release.
     • Point the IKVM nuget to the matching version of the Nuget.
  2. ./build.cmd PackageNugets

Note: The automation looks for the IKVM compiler in ./bin/ikvmc.exe of the expanded archive in paket-files.

Make sure that paket.dependencies references the same IKVM version for both the NuGet package and the build tools:

//IKVM dependencies - the nuget and tool versions need to be in sync.
nuget IKVM <version>
http http://www.frijters.net/ikvmbin-<version>.zip
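
The sync requirement can be checked mechanically. The sketch below is one possible way to do it and is not part of the repo's automation. The function name `ikvm_versions_in_sync` is hypothetical, and it assumes the two entries look exactly like the lines above: an exact version pin on the `nuget IKVM` line and the version embedded in the `ikvmbin-<version>.zip` file name.

```shell
#!/bin/sh
# Hypothetical check (not part of the repo): compare the IKVM NuGet version
# with the version embedded in the ikvmbin-<version>.zip download URL.
# Assumes paket.dependencies contains lines shaped like:
#   nuget IKVM <version>
#   http http://www.frijters.net/ikvmbin-<version>.zip
ikvm_versions_in_sync() {
  deps="${1:-paket.dependencies}"
  # Third field of the first "nuget IKVM <version>" line.
  nuget_ver=$(awk '$1 == "nuget" && $2 == "IKVM" { print $3; exit }' "$deps")
  # Text between "ikvmbin-" and ".zip" on the first matching "http" line.
  tool_ver=$(sed -n '/^[[:space:]]*http[[:space:]]/s/.*ikvmbin-\(.*\)\.zip.*/\1/p' "$deps" | head -n 1)
  if [ -z "$nuget_ver" ] || [ -z "$tool_ver" ]; then
    echo "could not find both IKVM entries in $deps" >&2
    return 2
  fi
  if [ "$nuget_ver" != "$tool_ver" ]; then
    echo "IKVM mismatch: nuget=$nuget_ver tools=$tool_ver" >&2
    return 1
  fi
  echo "IKVM $nuget_ver"
}
```

Calling `ikvm_versions_in_sync` from the repo root before `./build.cmd PackageNugets` would catch a mismatch early. A version range on the `nuget` line would not match this simple comparison.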

Looking for updated versions of IKVM? Check out their blog.

Releasing a Nuget

  1. Update Release-Notes.md, adding a new section for the next version. This step matters because it controls the version number of the assemblies and NuGet packages.
  2. Tag the release commit: git tag -a v{version} -m "Ship it!"
  3. Push the tag: git push origin --tags

AppVeyor is set up to automatically push tagged commits to NuGet.
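
The tagging steps above can be wrapped in a small helper. This is only a sketch: the function name `tag_release` and the `DRY_RUN` switch are hypothetical, and it assumes the version passed in matches the newest section of Release-Notes.md.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repo): create and push the annotated
# tag for a release. Set DRY_RUN=1 to print the git commands instead of
# running them.
tag_release() {
  version="$1"
  case "$version" in
    # Reject an empty version or one with characters outside [0-9A-Za-z.-].
    ''|*[!0-9A-Za-z.-]*)
      echo "usage: tag_release <version>" >&2
      return 2
      ;;
  esac
  run=${DRY_RUN:+echo}         # "echo" when DRY_RUN is set, otherwise empty
  $run git tag -a "v$version" -m "Ship it!" &&
  $run git push origin --tags
}
```

Running `DRY_RUN=1 tag_release {version}` first shows the two git commands without touching the repository.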
