Apache Tika and the ObjectRecognitionParser for Object Recognition and Captioning Using TensorFlow REST.

One of the coolest new features added to Apache Tika in the past few years is a set of Parsers that leverage Deep Learning to perform object recognition and captioning.

Contributed by Chris Mattmann and Thejan Wijesinghe through their work with USC Data Science, these parsers let you configure Apache Tika to call out to predefined models and get the deep learning equivalent of ‘Hello World’: tagging dog or cat pictures!

So let’s try it out.

Apache Tika and the ObjectRecognitionParser

What is the ObjectRecognitionParser?

The ObjectRecognitionParser is a Parser that can be configured to recognise objects within content and annotate the metadata with information on the objects it has recognised.

Internally, recognised objects are returned as RecognisedObject instances for generic objects, or as instances of its CaptionObject sub-class for captioning:

  • RecognisedObject instances contain an ID, Label, Label Language and Confidence Score.
  • CaptionObject instances contain an ID, Caption Sentence, Caption Language and Confidence Score.

Both types are placed in the metadata collection during parsing.

The type of recognition to be performed needs to be defined within Apache Tika’s tika-config.xml through the configuration of ObjectRecogniser instances to be used by the parser.

Available Object Recognisers

There are a number of ObjectRecogniser implementations in Apache Tika, including offline recognisers that need Deep Learning tools installed on the local machine (e.g. DL4J or TensorFlow) as well as online recognisers that make REST calls to services.

For now we are going to focus on the online recognisers, specifically the ones from the USC Data Science team that use TensorFlow REST APIs and can be run in Docker.

These are:

  • TensorflowRESTRecogniser - which uses a custom REST API around TensorFlow to perform recognition on images
  • TensorflowRESTVideoRecogniser - which uses a custom REST API around TensorFlow to perform recognition on videos
  • TensorflowRESTCaptioner - which uses a custom REST API around TensorFlow to perform image captioning

These recognisers use an implementation based on the paper “Show and Tell: A Neural Image Caption Generator” for captioning images, and the Inception-V4 model from TensorFlow for recognition in video and images.

I’ll come back to the offline ones in another post.

Let’s try it out on Tika Server

Get the Docker Images

To make it easier to get up and running we’ll use Apache Tika Docker and the helper docker-compose file.

First, get the tika-docker project from GitHub:

git clone https://github.com/apache/tika-docker

In here is a file called docker-compose-tika-vision.yml which contains everything you need.

To make it easier we’ll create a symlink to allow us to execute docker-compose without specifying the file each time:

ln -s docker-compose-tika-vision.yml docker-compose.yml

Configure our instance

Like most things in Apache Tika, the ObjectRecogniser can be configured using the tika-config.xml file format.

To make things easier, there are three sample configurations to choose from to get you started:

  • sample-configs/inception-rest.xml - for image recognition
  • sample-configs/inception-rest-caption.xml - for image captioning
  • sample-configs/inception-rest-video.xml - for video recognition

You select one by leaving only the configuration file entry you wish to use uncommented (or present) in the volumes section of the docker-compose file.

For example, to use image captioning you can leave the following set:

 - ./sample-configs/inception-rest-caption.xml:/tika-config.xml

Run Tika Server + Inception Services

With the above configuration set in the docker-compose.yml file, you can now load up the containers:

docker-compose up

Apache Tika Server will keep trying to reload until it can detect the configured Inception Service instance. If you want to avoid this, you can start the Inception Service first and then Tika.
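
If you would rather script that start order, here is a minimal sketch. It is an illustration rather than anything shipped in tika-docker: wait_for and start_in_order are hypothetical helper names, and the tika service name is an assumption you should check against your docker-compose.yml.

```shell
# wait_for ATTEMPTS CMD [ARGS...]
# Hypothetical helper: re-run CMD once a second until it succeeds,
# returning non-zero if it still fails after ATTEMPTS tries.
wait_for() {
  attempts=$1
  shift
  try=1
  until "$@"; do
    if [ "$try" -ge "$attempts" ]; then
      return 1
    fi
    try=$((try + 1))
    sleep 1
  done
}

# start_in_order HEALTH_URL
# Start the captioning service in the background, wait until HEALTH_URL
# answers, then start Tika Server in the foreground.
# "tika" is an assumed service name; "inception-caption" is the one used in this post.
start_in_order() {
  docker-compose up -d inception-caption &&
    wait_for 60 curl -sf -o /dev/null "$1" &&
    docker-compose up tika
}
```

For the captioning setup you might call it as start_in_order http://localhost:8765/inception/v3/ping, where both the host port and the /ping health path are assumptions to verify against your own port mapping and service.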

Once they are loaded, you can send a file to Tika Server:

wget https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg -O test.jpg
curl -T test.jpg http://localhost:9998/meta

This should give you suggested captions as part of the parsed metadata:

"CAPTION","a man standing next to a dog on a leash . (0.00022)","a man standing next to a dog on a bench . (0.00017)","a man and a dog sitting on a bench . (0.00011)","a man standing next to a dog in a park . (0.00007)","a man and a dog sitting on a bench (0.00006)"

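The scores are embedded in the caption strings, so a little post-processing helps if you want to sort or filter them. Below is a rough sketch using awk; parse_captions is a hypothetical helper, and it assumes the CSV layout shown above, with each caption ending in its score in parentheses.

```shell
# parse_captions: read Tika Server /meta CSV output on stdin and print one
# "score<TAB>caption" line per value in the CAPTION row.
# Assumes every value ends with its score in parentheses and that no caption
# contains the three-character sequence quote, comma, quote.
parse_captions() {
  awk -F '","' '
    /^"CAPTION"/ {
      for (i = 2; i <= NF; i++) {
        value = $i
        sub(/"\r?$/, "", value)      # closing quote (and any CR) on the last field
        if (match(value, /\([0-9.]+\)$/)) {
          score = substr(value, RSTART + 1, RLENGTH - 2)
          caption = substr(value, 1, RSTART - 1)
          sub(/ +$/, "", caption)    # trim the space before the score
          printf "%s\t%s\n", score, caption
        }
      }
    }'
}
```

You could then pipe the server response straight through it, for example curl -s -T test.jpg http://localhost:9998/meta | parse_captions.
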
How about the Tika App?

You can also re-use the Inception Services from the docker-compose.yml file with the Apache Tika App when running it interactively.

To do the captioning, you can just start the inception service you want - in this case inception-caption:

docker-compose up inception-caption 

You can then create a custom tika-config.xml and set the appropriate apiBaseUri:

cat << 'EOT' > tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
            <params>
                <param name="apiBaseUri" type="uri">http://localhost:8765/inception/v3</param>
                <param name="captions" type="int">5</param>
                <param name="maxCaptionLength" type="int">15</param>
                <param name="class" type="string">org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner</param>
            </params>
        </parser>
    </parsers>
</properties>
EOT

Note that the sample configurations set apiBaseUri to the Docker Compose service name and internal port. When Tika runs outside Docker Compose, as it does here, you need to use the externally mapped port and the IP address or hostname of the machine the service is running on.
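
To double-check which endpoint a given configuration points at, you can pull the value out of the file. This is a quick line-based sketch rather than a real XML parser: api_base_uri is a hypothetical helper, and it assumes each param element sits on a single line, as in the configuration above.

```shell
# api_base_uri FILE: print every apiBaseUri value found in a tika-config.xml.
# Line-based sketch only: it matches <param name="apiBaseUri" ...>VALUE</param>
# when the whole element is on one line.
api_base_uri() {
  sed -n 's/.*<param name="apiBaseUri"[^>]*>\([^<]*\)<\/param>.*/\1/p' "$1"
}
```

Running api_base_uri tika-config.xml should print the URI you set; docker-compose port inception-caption followed by the container's internal port shows the host mapping to compare it with.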

Then you can run the Apache Tika App JAR using your custom configuration. For example, to launch it in GUI mode you could use:

java -jar tika-app-1.25.jar --config=tika-config.xml -g
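
If you would rather skip the GUI, the Tika App can print just the metadata for a file with its -m option. A small wrapper sketch follows; tika_metadata, TIKA_JAR and TIKA_CONFIG are hypothetical names chosen for illustration, with defaults matching the file names used in this post.

```shell
# tika_metadata FILE: run the Tika App without the GUI and print only metadata.
# TIKA_JAR and TIKA_CONFIG are hypothetical environment variables; when unset
# they fall back to the JAR and configuration file names used in this post.
tika_metadata() {
  java -jar "${TIKA_JAR:-tika-app-1.25.jar}" \
    --config="${TIKA_CONFIG:-tika-config.xml}" \
    -m "$1"
}
```

For example, tika_metadata test.jpg would run the same JAR and configuration as above against the downloaded image.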

What’s next?

These REST-based TensorFlow models are great examples of how Deep Learning can be used to augment the logic-based approach of Apache Tika for content parsing or detection.

If you want to try adding basic tagging or captioning to your search or asset pipelines, these models could provide a start, and the REST API implementation could provide inspiration for hosting your own TensorFlow models.

It is an area that will continue to expand in the project and provides another API extension point where you can build your own ObjectRecogniser implementations. Happy Parsing!