Indexing UIMA Annotated Docs, with Solr

January 7, 2016 by guidj

In this post, I'm going to walk you through the process of indexing UIMA annotated documents with Solr.

In a previous post, Finding Movie Stars, I demonstrated how we can use UIMA to find and tag structured information in unstructured data. In most scenarios, once we have extracted that data, we want to be able to query it. To do this, we can put our data into a data store, be it an RDBMS, document store, graph database, or other. Likewise, it is very common to need some kind of search capability in our system, so that we or our users can find relevant information. This is what we're going to do. Having UIMA annotations with information on which directors and actors are mentioned in a review, we want to be able to search for reviews that mention specific actors or directors, or simply for reviews that mention any screenwriter.

Lucene is a very popular and powerful information retrieval system, or indexing engine. It was built as a library by Doug Cutting, who also created Hadoop. It offers a rich set of features and capabilities when it comes to configuring how we want to index our data. In recent years though, several projects have been built on top of Lucene to offer more features than the library alone. Solr and ElasticSearch are two such projects. While ElasticSearch offers near real-time search and ease of setup for a distributed environment, we're going to be using Solr, as it still offers great functionality for standard search use cases, and it easily integrates with UIMA.

There was a project named LUCAS, created to allow people to index UIMA documents with Lucene. However, it seems to have since been abandoned in favour of SolrCas, a project designed to index UIMA annotated documents with Solr. That one, in turn, has been abandoned in favour of Solr4UIMA, which is what we're going to use here to feed UIMA annotated documents directly to Solr.

Solr4UIMA makes it relatively easy to index CAS files generated by UIMA into Solr. We're going to take the following steps to do this:

  1. Install Solr-4.9.1
  2. Configure Solr4UIMA
  3. Index some CAS documents

Installing Solr

We'll be making use of Solr version 4.9.1. You can get it from here, as either the solr-4.9.1.tgz or the solr-4.9.1.zip archive. Download it to a folder of your preference. I'll use ~/SDK/:

cd ~/SDK/
wget http://archive.apache.org/dist/lucene/solr/4.9.1/solr-4.9.1.tgz
tar -xvf solr-4.9.1.tgz
cd solr-4.9.1/

Solr was designed to be deployed in a Java container. Here, we're just gonna run a simple server that comes shipped with Solr 4:

cd example/
java -jar start.jar

Once this is done, go to http://{IP/domain}:{Port}/solr on the server. The default port in Solr 4 is 8983; you should be able to see the admin panel. There, you can select a collection, and do things like check its documents, run queries and analyse the schema. When we started the simple server earlier, it created a collection named collection1. This is now the default collection for our data on the server.

If you head to the Query section, you will notice that the request handler is set to /select, which is used for just reading data, and the q parameter, which represents a query, is set to *:*, meaning we will be querying for documents that contain anything in any field. Executing the query, we should get an empty set as our result. That's because we don't have any documents in Solr yet. Before we insert our CAS files, we have to configure Solr4UIMA with our annotator.
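The same match-all query can also be sent from the command line instead of the admin panel. Here's a minimal sketch, assuming Solr is running locally on the default port with the default collection1; the variable names are just for illustration, and wt=json and indent=true only affect how the response is formatted.

```shell
# Assumption: Solr runs locally on the default port and serves collection1.
SOLR_URL="http://localhost:8983/solr/collection1"

# q=*:* matches every document; wt=json and indent=true only change the output format.
QUERY_URL="${SOLR_URL}/select?q=*:*&wt=json&indent=true"

# Send the query; print a hint instead of failing when the server is not reachable.
curl -s --max-time 5 "$QUERY_URL" || echo "Solr is not reachable at ${SOLR_URL}"
```

On a fresh install, the response should report that no documents were found.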

Configuring Solr4UIMA

In order to set up Solr4UIMA, we have to configure 3 things: the annotations’ field mappings, Solr's request handlers, and Solr's schema indexing options.

Configuring the Annotators’ Fields Mapping, and Solr's Request Handler

This configuration is done on the solrconfig.xml file, located under our collection folder, i.e. solr-4.9.1/example/solr/collection1/conf/solrconfig.xml. I am going to assume here that you have a working annotator, neatly packaged into a jar with all of its dependencies. I'm going to use an annotator I wrote to extract film figures from movie reviews. If you want to follow the semantics of what we're doing, I suggest you pause here, follow that tutorial, and then come back.

We're first going to copy our jar, and a few others, into a lib folder under our collection, inside the example folder of our Solr 4 installation:

cd ~/SDK/solr-4.9.1/
# all paths below are relative to the unpacked solr-4.9.1/ folder
mkdir -p example/solr/collection1/lib
cp dist/solr-uima*.jar example/solr/collection1/lib/
cp contrib/uima/lib/*.jar example/solr/collection1/lib/
cp contrib/uima/lucene-libs/lucene-analyzers-uima*.jar example/solr/collection1/lib/
# our UIMA annotator in a jar with its dependencies
cp ~/{src-dir}/spotlight-0.0.1-SNAPSHOT-jar-with-dependencies.jar example/solr/collection1/lib/

The purpose of this step is to add the necessary Solr-UIMA libraries, as well as to add our annotator, to the classpath of our server instance.

Next, we configure our annotator, and the Solr request handler:

Annotator

We configure the annotator by defining an updateRequestProcessorChain, named uima. You can name it anything you'd like. In addition to this:

  • We created a folder named desc, where we placed our descriptor files, and then we placed this folder under the solr-4.9.1/example folder.
  • Under the property fieldMappings, we defined how each property in each annotation should be mapped to a document property. For example, for the dsc.spotlight.entity.Director annotation, we mapped the field name to a field directors.
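To make the mapping concrete, here's a rough sketch of what such a chain could look like, written to a scratch file so it can be reviewed and then pasted into solrconfig.xml. It's modelled on the layout used by the Solr UIMA contrib as I understand it, so treat it as a starting point and not as the exact configuration used for this post. The descriptor path /desc/SpotlightAE.xml, the analysed field content and the file name uima-chain-sketch.xml are assumptions; only the Director mapping comes from the description above.

```shell
# Write a sketch of the update chain to a scratch file (hypothetical file name).
cat > uima-chain-sketch.xml <<'EOF'
<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <!-- no overriding parameters in this sketch -->
      <lst name="runtimeParameters"/>
      <!-- hypothetical path: point this at your own analysis engine descriptor -->
      <str name="analysisEngine">/desc/SpotlightAE.xml</str>
      <bool name="ignoreErrors">false</bool>
      <!-- assumption: the review text lives in a field named content -->
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <!-- map the name feature of each Director annotation to the directors field -->
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">dsc.spotlight.entity.Director</str>
          <lst name="mapping">
            <str name="feature">name</str>
            <str name="field">directors</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
EOF
```

Each further annotation type would get its own type entry under fieldMappings, following the same pattern.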

Solr has two data abstractions: documents and collections. A document is basically a set of fields with values. A field can be multi-valued, and the basic data types are supported. You can get more insight on Solr's structure and inner workings from the official documentation. Right now, the most important thing to understand is that in Solr, documents are flat entities, i.e. there are no nested fields. So, if we have a tree-like structure, or another type of document with nested values that we want to index, we have to figure out a way to map it to a flat document. For CAS files, this has an implication for annotations that can have multiple instances in a single document. For example, we can find several screenwriters and directors in a single review, each having their own name field. The only way to capture all of them is to map them to a multi-valued field that will store the values of all instances, as we did here.

Configuring the request handler is slightly more straightforward. We just tell Solr to route the /update endpoint through our chain, by giving the update.processor field a value that matches the name of our updateRequestProcessorChain. Note that the class I used for the request handler is solr.UpdateRequestHandler, and not solr.XmlUpdateRequestHandler as recommended in the documentation, because the latter is now deprecated.

Configuring Schema

In our field mapping configuration, we told Solr which of its fields it should map our UIMA annotation fields to. Now, we need to make sure those fields are defined in the collection's schema. The file of interest here is schema.xml, and it is located in the same folder as solrconfig.xml, i.e. solr-4.9.1/example/solr/collection1/conf/schema.xml.

Solr's default schema provides many fields. It's important though that we specify how we want our fields to be indexed. Here, we specify configuration options for each field such as whether it's multi-valued, if it should be indexed, stored, as well as its data type. Since Solr runs on Lucene, if you're familiar with it, then these options should be intuitive for you. If not, here's a brief rundown of the basics:

  • multi-value (multiValued): false means the field holds just one value, true means it can hold a list/array of values.
  • index (indexed): false means the field will not be indexed, true means it will. A field that is not indexed cannot be searched on, so skip indexing only for fields you just want to retrieve.
  • store (stored): true to keep the original value in Solr (Lucene, actually), and false otherwise. Storing lets us retrieve the data directly from the system, but it's usually recommended only for small fields, such as names and titles. However, it is perfectly valid to store larger fields, especially if you intend to use Solr (or Lucene) as your data store as well as your indexing/search system.

Because every annotation can have multiple instances in a single document (review), I made the corresponding fields in Solr multi-valued, and named each field after its annotation to keep the mapping obvious. This way, for instance, every Director's name will be in a directors field, and every Screenwriter's name in a screenwriters field, for each document/review we index. Unfortunately though, we lose the meta information of each annotation here, such as its start and end position. While I think it would be possible to capture that as well, for instance by adding two more multi-valued fields per annotation type, one for start positions and one for end positions, I believe this approach would make data manipulation tricky for the application. In any case, our solution here fits our purpose, as we're only interested in which reviews mention actors, directors, or specific screenwriters.
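As an illustration, field definitions along these lines could sit alongside the other field definitions in schema.xml. The field names are the ones used in this post. The text_general type, the indexed and stored choices, and the scratch file name schema-fields-sketch.xml are assumptions on my part. An analysed text type would at least be consistent with a lowercase query matching a capitalised name, but the original schema may well differ.

```shell
# Write the sketched field definitions to a scratch file (hypothetical file name),
# so they can be reviewed and pasted into schema.xml by hand.
cat > schema-fields-sketch.xml <<'EOF'
<!-- one searchable, retrievable, list-valued field per annotation type; the type is an assumption -->
<field name="directors" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="screenwriters" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="cinematographers" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="persons" type="text_general" indexed="true" stored="true" multiValued="true"/>
EOF
```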

Indexing Some Docs

After making all of these changes, we can restart our server:

java -jar start.jar

If we did everything correctly, things should be running fine. Be sure to check the server log to see if our libs, annotator and configuration were all loaded successfully.

Once the server is running, we are ready to feed some data to it. We're going to use some sample reviews, in XML format:
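The actual review texts are not reproduced here. As a stand-in, the sketch below writes a data.xml in Solr's XML update format, with three placeholder documents that each have an id and a content field. The ids and the content strings are made up for illustration; swap in real review text before indexing, otherwise the annotator will have nothing to find.

```shell
# Create a placeholder data.xml in Solr's XML update format.
# The ids and content values are illustrative, not the reviews used in this post.
cat > data.xml <<'EOF'
<add>
  <doc>
    <field name="id">review-1</field>
    <field name="content">Placeholder text of the first review.</field>
  </doc>
  <doc>
    <field name="id">review-2</field>
    <field name="content">Placeholder text of the second review.</field>
  </doc>
  <doc>
    <field name="id">review-3</field>
    <field name="content">Placeholder text of the third review.</field>
  </doc>
</add>
EOF
```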

We have 3 reviews in our file. Each review is represented by a document with two fields: id and content. This is the XML format that Solr accepts for adding documents. With the data ready, we can now use Solr's update endpoint to index our review annotations:

curl -X POST --header "Content-Type:text/xml;charset=UTF-8" --data-binary @data.xml "http://localhost:8983/solr/update?commit=true"

data.xml is an XML file with our docs. If our request is successful, we can run the Solr query we ran earlier to see our data, and the output should look like this:

I removed the content field that is also returned from Solr. We can see the directors, screenwriters, cinematographers, etc., as we specified in our field mapping.

Let's run two more queries, to see how Solr's indexing can help us narrow results.

First, we run a query to select reviews that do mention screenwriters:

screenwriters:*

We get the following result:

One document matched our query: the first review, which mentions the name Terry Hayes.
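Field queries like this one can also be run with curl. Below is a small sketch, assuming a local Solr on the default port with collection1; solr_query is a hypothetical helper name, not something that ships with Solr.

```shell
# Assumption: Solr runs locally on the default port and serves collection1.
SOLR_URL="http://localhost:8983/solr/collection1"

# Hypothetical helper: send a query string to /select and print the JSON response.
# -G turns the --data-urlencode pairs into URL query parameters, so characters
# such as ':', '*' and '"' are escaped for us.
solr_query() {
  curl -s --max-time 5 -G "${SOLR_URL}/select" \
    --data-urlencode "q=$1" \
    --data-urlencode "wt=json" \
    --data-urlencode "indent=true"
}

# Reviews that have at least one value in the screenwriters field.
solr_query 'screenwriters:*' || echo "Solr is not reachable at ${SOLR_URL}"
```

The same helper could run any other field query by passing a different query string.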

Finally, we're going to search for any documents that mention a person with the name Shaw:

persons:"shaw"

Result:

As expected, we get the third review document. And we're done. We have a system ready to index our annotations and let us search through them.

Closing

In this tutorial, we saw how we can use Solr4UIMA to take advantage of Solr's search capabilities on our UIMA annotated files. We can add new annotators, or modify our existing ones, and quickly set up a search platform for our CAS data.

References