Apache Tika Docker Examples

Photo by Priscilla Du Preez on Unsplash

For a number of years I’ve been involved in the Apache Tika project as both a committer and PMC member.

With container technology usage increasing over the past few years, we spun up a separate repository for Apache Tika Server in Docker, called tika-docker, with convenience images hosted on Docker Hub.

This has resulted in questions on how to customise configuration and host instances that link to other services. To help people get started, we’ve created some example scenarios.

So let’s dive in and check them out.

The tika-docker examples

To help you get started, we’ve used Docker Compose to create examples of the following scenarios:

  • Recognising and Captioning Video and Images with TensorFlow REST (see here)
  • Enriching Academic PDF Parsing with Grobid REST (see here)
  • OCR of PDF or Images with Tesseract including a Custom Configuration (see here)
  • Named Entity Recognition (see here)

Using the examples

Install Docker and Docker Compose

Follow the instructions for installing Docker from here.

Follow the instructions for installing docker-compose from here.

Clone the tika-docker repository

Now fetch the docker-compose files and sample configuration from the tika-docker project on GitHub:

git clone https://github.com/apache/tika-docker

Run Docker Compose for the Example You Want

First change into the tika-docker directory:

cd tika-docker

Then you can run docker-compose for the example you wish to try. For example, to try the Named Entity Recognition (NER) example, run:

docker-compose -f docker-compose-tika-ner.yml up -d

You can drop the -d if you want to stay attached to the containers.
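
The containers may take a little while to become ready after the command returns. A minimal sketch of a polling helper follows, assuming the server listens on localhost:9998 (the port used by the curl example in this post) and that Tika Server's /version endpoint responds once it is up. The wait_for function name is hypothetical and is not part of tika-docker.

```shell
# wait_for ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND roughly once per second until it succeeds or ATTEMPTS run out.
# (Hypothetical helper for illustration; not shipped with tika-docker.)
wait_for() {
  wf_attempts=$1
  shift
  wf_i=0
  while [ "$wf_i" -lt "$wf_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0   # probe succeeded
    fi
    wf_i=$((wf_i + 1))
    sleep 1
  done
  return 1       # gave up after ATTEMPTS failed probes
}

# Example (assumes the NER example is running on localhost:9998):
#   wait_for 30 curl -sf http://localhost:9998/version && echo "Tika Server is up"
```

Passing the probe as a command keeps the helper independent of curl, so you could swap in any other readiness check.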

Then, if you send a text file with some sample data to the /meta endpoint:

cat <<EOT > test.txt
Hello world from the Apache Tika Team (dev@tika.apache.org).
EOT
curl -T test.txt http://localhost:9998/meta

The RegEx Entity Recogniser configured in the NER sample configuration files extracts the email address and includes it in the returned metadata:

"X-Parsed-By","org.apache.tika.parser.CompositeParser","org.apache.tika.parser.ner.NamedEntityParser"
"language","en"
"NER_EMAIL","dev@tika.apache.org"
"Content-Type","text/plain"
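
If you only want one field out of that response, you can filter it in the shell. This is a rough sketch, assuming the quoted "key","value" layout shown above. The meta_field name is a hypothetical helper, and it prints only the first value of each matching field.

```shell
# meta_field NAME
# Reads Tika's CSV-style metadata on stdin and prints the first value of NAME.
# (Hypothetical helper; assumes values contain no embedded double quotes.)
meta_field() {
  grep "^\"$1\"," | cut -d'"' -f4
}

# Example (assumes the NER example is running on localhost:9998):
#   curl -s -T test.txt http://localhost:9998/meta | meta_field NER_EMAIL
```

Splitting on the double quote means the fourth field is the first value, which is enough for single-valued fields such as NER_EMAIL.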

You can then stop the running containers using:

docker-compose -f docker-compose-tika-ner.yml down
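
If you switch between examples often, the up and down commands can be wrapped in a small function. This sketch assumes every example follows the docker-compose-tika-NAME.yml naming seen in this post for the NER and Vision examples; check the repository for the actual file names. Both compose_file and tika_example are hypothetical names.

```shell
# compose_file NAME
# Prints the compose file name for an example, e.g. "ner".
# (Assumes the docker-compose-tika-NAME.yml pattern; verify against the repo.)
compose_file() {
  printf 'docker-compose-tika-%s.yml\n' "$1"
}

# tika_example ACTION NAME
# Starts (up) or stops (down) the containers for the named example.
tika_example() {
  case "$1" in
    up)   docker-compose -f "$(compose_file "$2")" up -d ;;
    down) docker-compose -f "$(compose_file "$2")" down ;;
    *)    echo "usage: tika_example up|down NAME" >&2; return 2 ;;
  esac
}

# Example:
#   tika_example up ner
#   tika_example down ner
```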

Customising the examples

Each of the examples comes with an associated set of configuration files in the sample-configs directory.

Each sample has an appropriately named subfolder, with the associated Tika Config XML file(s) and any other configuration resources, such as properties files specifying URLs or settings.

├── customocr
│   ├── org
│   │   └── apache
│   │       └── tika
│   │           └── parser
│   │               └── ocr
│   │                   └── TesseractOCRConfig.properties
│   ├── tika-config-inline.xml
│   └── tika-config-rendered.xml
├── grobid
│   ├── org
│   │   └── apache
│   │       └── tika
│   │           └── parser
│   │               └── journal
│   │                   └── GrobidExtractor.properties
│   └── tika-config.xml
├── ner
│   ├── run_tika_server.sh
│   └── tika-config.xml
└── vision
    ├── inception-rest-caption.xml
    ├── inception-rest-video.xml
    └── inception-rest.xml

Tika Config XML

All of the scenarios have Tika Config XML files. These files configure the parsers or recognisers for the example.

In some cases this file is named tika-config.xml and is then loaded in the docker-compose file directly. In other examples, such as the Vision and OCR ones, the docker-compose file loads only one of the XML configurations as the default tika-config.xml through a volume mount.

For example, in the Vision configuration for Recognising and Captioning Video and Images with TensorFlow REST, you have a choice of three configurations:

  • inception-rest.xml
  • inception-rest-video.xml
  • inception-rest-caption.xml

You can choose which one to use by leaving the appropriate configuration uncommented, and the others commented out, in the volumes section of the docker-compose-tika-vision.yml file:

#... snip ...

volumes:
  # Replace the below with the configuration you want to use, or with your own custom one 
  # -  ./sample-configs/vision/inception-rest.xml:/tika-config.xml
  # -  ./sample-configs/vision/inception-rest-video.xml:/tika-config.xml
  -  ./sample-configs/vision/inception-rest-caption.xml:/tika-config.xml

#... snip ...
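
Since every candidate maps to the same /tika-config.xml path, only one of them should be uncommented at a time. A quick sanity check is sketched here as a hypothetical active_tika_configs helper, which assumes the list-item layout shown above:

```shell
# active_tika_configs FILE
# Prints the uncommented volume entries in FILE that mount onto /tika-config.xml.
# (Hypothetical helper; it is a plain text filter, not a YAML parser.)
active_tika_configs() {
  grep -E '^[[:space:]]*-[[:space:]]+.*:/tika-config\.xml[[:space:]]*$' "$1"
}

# Example: expect exactly one line of output.
#   active_tika_configs docker-compose-tika-vision.yml
```

If the helper prints more than one line, comment out the entries you do not want before starting the containers.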

You can find more on ObjectRecognition and TensorFlow in Apache Tika Server in my previous blog post here.

Want more?

I plan to write more specific blog posts, similar to the TensorFlow REST one, on each of these example scenarios. So please subscribe to the RSS feed for this blog if you are interested.

If you would like to see other examples like this, either let me know directly on GitHub or Twitter, or post a message to the Apache Tika Users or Developer mailing lists.