For a number of years I’ve been involved in the Apache Tika project as both a committer and PMC member.
With the increase in container technology usage over the past few years, we spun up a separate repository for Apache Tika Server in Docker, called tika-docker, with convenience images hosted on Docker Hub.
This has resulted in questions on how to customise configuration and host instances that link to other services. To help people get started, we’ve created some example scenarios.
So let’s dive in and check them out.
The tika-docker examples
To help you get started, we’ve created Docker Compose examples for the following scenarios:
- Recognising and Captioning Video and Images with TensorFlow REST
- Enriching Academic PDF Parsing with Grobid REST
- OCR of PDF or Images with Tesseract including a Custom Configuration
- Named Entity Recognition
Using the examples
Install Docker and Docker Compose
Follow the installation instructions for Docker.
Follow the installation instructions for docker-compose.
Clone the tika-docker repository
Now fetch the docker-compose files and sample configuration by cloning the tika-docker project from GitHub.
Run Docker Compose for the Example You Want
First, change into the tika-docker directory.
Then you can execute docker-compose for the example you wish to try. For example, to try the Named Entity Recognition (NER) example, point docker-compose at that example's Compose file and bring it up in detached mode with -d.
You can drop the -d if you want to stay attached to the containers.
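As a rough sketch of what this step looks like when driven from a script, the Python helper below builds the docker-compose invocation for one example and runs it from the cloned directory. It makes several assumptions that are not confirmed here: that each example's Compose file follows the docker-compose-tika-<name>.yml pattern (as docker-compose-tika-vision.yml does), that the repository was cloned into a directory named tika-docker, and that down is the subcommand used to stop things. The names compose_command and run_compose are made up for this illustration. By hand, the equivalent would be something along the lines of docker-compose -f docker-compose-tika-ner.yml up -d.

```python
import subprocess


def compose_command(example, action, detach=False):
    """Build the docker-compose argument list for one example scenario.

    Assumption: Compose files are named docker-compose-tika-<example>.yml;
    check the repository for the exact file names.
    """
    if action not in ("up", "down"):
        raise ValueError("action must be 'up' or 'down'")
    cmd = ["docker-compose", "-f", f"docker-compose-tika-{example}.yml", action]
    if action == "up" and detach:
        cmd.append("-d")  # run in the background instead of staying attached
    return cmd


def run_compose(example, action, detach=False, cwd="tika-docker"):
    """Run the command from inside the cloned directory (assumed name: tika-docker)."""
    return subprocess.run(compose_command(example, action, detach), cwd=cwd, check=True)
```

Calling run_compose("ner", "up", detach=True) would start the NER example in the background, and run_compose("ner", "down") would stop and remove its containers afterwards.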
Then, if you send a text file with some sample data to the /meta endpoint, the RegEx Entity Recogniser configured in the NER sample configuration files extracts the email address and returns it in the metadata.
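If you would rather call the endpoint from code, here is a minimal Python sketch using only the standard library. It assumes the example exposes Tika Server on localhost:9998 (Tika Server's default port) and that recognised entities come back under metadata keys starting with NER_. The exact key names depend on the sample configuration, so treat that prefix as an assumption. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: the running example publishes Tika Server on its default port.
TIKA_URL = "http://localhost:9998"


def build_meta_request(data, base_url=TIKA_URL):
    """Build a PUT request that sends raw document bytes to the /meta endpoint."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/meta",
        data=data,
        method="PUT",
        # Ask for JSON; without this header /meta answers with CSV.
        headers={"Accept": "application/json"},
    )


def ner_entries(metadata, prefix="NER_"):
    """Keep only the metadata keys that look like entity-recognition output.

    Assumption: entity keys share the "NER_" prefix.
    """
    return {key: value for key, value in metadata.items() if key.startswith(prefix)}


def fetch_metadata(path, base_url=TIKA_URL):
    """Send the file at `path` to the server and return the parsed metadata."""
    with open(path, "rb") as handle:
        request = build_meta_request(handle.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the NER example running, ner_entries(fetch_metadata("sample.txt")) would return just the recognised entities, where sample.txt is a placeholder for your own text file containing an email address.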
When you are finished, you can stop the running containers with docker-compose, pointing it at the same Compose file.
Customising the examples
Each of the examples comes with an associated set of configuration files in the sample-configs directory.
Each sample has an appropriately named subfolder, with the associated Tika Config XML file(s) and any other configuration resources, such as properties files specifying URLs or settings.
Tika Config XML
All of the scenarios have Tika Config XML files. These files configure the parsers or recognisers for the example.
In some cases this file is named tika-config.xml and is then loaded in the docker-compose file directly. In other examples, such as the Vision and OCR ones, the docker-compose file loads only one of the XML configurations as the default tika-config.xml through a volume mount.
For example, in the Vision configuration for Recognising and Captioning Video and Images with TensorFlow REST, you have a choice of three configurations:
- inception-rest-caption.xml - for image captioning
- inception-rest-video.xml - for object recognition in videos
- inception-rest.xml - for object recognition in images
You can choose which one to use by leaving the appropriate configuration uncommented, and the others commented out, in the volumes section of the docker-compose-tika-vision.yml file.
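To make the comment/uncomment mechanism concrete, the sketch below scans a volumes section and reports which configuration is currently mounted as tika-config.xml and which ones are commented out. The SAMPLE_VOLUMES text is a hypothetical illustration of what such a section might look like. The host paths and the /tika-config.xml mount target are assumptions, so compare them against the real docker-compose-tika-vision.yml before relying on them. The function name mounted_configs is also made up for this example.

```python
# Hypothetical volumes section: the paths and the mount target are assumptions.
SAMPLE_VOLUMES = """\
    volumes:
      # - ./sample-configs/vision/inception-rest.xml:/tika-config.xml
      - ./sample-configs/vision/inception-rest-caption.xml:/tika-config.xml
      # - ./sample-configs/vision/inception-rest-video.xml:/tika-config.xml
"""


def mounted_configs(compose_text, target="/tika-config.xml"):
    """Return (active, commented_out) host paths that are mounted at `target`."""
    active, commented_out = [], []
    for raw_line in compose_text.splitlines():
        line = raw_line.strip()
        is_commented = line.startswith("#")
        entry = line.lstrip("#").strip()
        if not entry.startswith("- "):
            continue  # not a list item, so not a volume mapping
        mapping = entry[2:].strip().strip("\"'")
        host_path, separator, container_path = mapping.rpartition(":")
        if separator and container_path == target:
            (commented_out if is_commented else active).append(host_path)
    return active, commented_out


active, commented_out = mounted_configs(SAMPLE_VOLUMES)
print("active:", active)
print("commented out:", commented_out)
```

Exactly one entry should end up in the active list. To switch, for instance, from captioning to video recognition, you would comment out the caption line and uncomment the video one, then restart the containers.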
You can find more on object recognition with TensorFlow in Apache Tika Server in my previous blog post.
I plan to write more specific blog posts, similar to the TensorFlow REST one, on each of these example scenarios. So please subscribe to the RSS feed for this blog if you are interested.
If you would like to see other examples like this, either let me know directly on GitHub or Twitter, or message on the Apache Tika Users or Developer mailing lists.