How are passages constructed?

Before we discuss the API for constructing highlighted search results, we need to talk about how passages are constructed. For the sake of clarity, we'll say that we want to highlight the search results for a query that contains the terms t1 ... tn. For the moment we won't worry about which query operators occur in the query; we'll just worry about the terms.

The first question we need to ask is: how do we decide which occurrences of the query terms to highlight? In Minion, we use our passage retrieval algorithm to rank the passages from the search result that best match a passage query constructed from the query terms. We use a slightly modified version of the algorithm that provides for smaller gap and out-of-order penalties. Using the passage retrieval algorithm means that we can find the most compact passages from the document that match the query, even when the query didn't use any of the passage retrieval operators.

The passage retrieval algorithm provides a set of passages from the search result. These passages are really just the word positions in the document where the search terms occur, and they are divided up by the fields in which they occur. Note that it might be the case that a passage will not contain all of the query terms. This is possible when the query terms are widely separated in a document. The highlighting API provides ways to select and combine those passages in order to display the highlighted results.

The passage that we want to display is made up of three parts: the first is the query terms themselves. The query terms that appear in a passage will always be displayed. The second part of the passage is made up of words that occur between the query terms. We call these words the passage body. The third part of the passage is made up of words that occur before the first query term or after the last query term in the passage. These words are called the context for the passage.

We distinguish among these parts because we want to be able to highlight them differently. For example, the query terms could be displayed in a bold font, the body terms in a normal font, and the context in a grayed-out font. For a query like install fonts solaris, we could see a highlighted passage that looks like:

customers problem is, and how to resolve it. We have installed OCR fonts through Solaris fontadmin gui tools, once fonts installed we are trying to

In this example, the passage itself is "installed OCR fonts through Solaris" and the leading and trailing text are the context for the passage. We provide the context because, when we find a passage that matches the query perfectly, we want the user to be able to determine from the context whether the result is worth looking at in full. Note that passage highlighting can highlight passages that contain morphological variations of the query terms: here it has highlighted installed when the query term was install. This is a handy side effect of using the passage retrieval algorithm to do the passage selection.
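To make the three parts concrete, here is a minimal sketch in Java. It is not Minion's implementation: the class PassagePartsSketch, the render method, and the bracket and asterisk markers are all hypothetical, and the sketch assumes that the word positions of the query terms have already been found. It keeps a fixed number of context words on each side, brackets the passage, and stars the query terms.

```java
import java.util.List;

// Hypothetical illustration only; none of these names exist in Minion.
class PassagePartsSketch {

    /**
     * Renders a passage with its surrounding context.
     *
     * @param words   the field text, split into words
     * @param hits    ascending word positions of the query terms (non-empty)
     * @param context number of words to keep before the first hit and after the last hit
     */
    static String render(List<String> words, int[] hits, int context) {
        int first = hits[0];
        int last = hits[hits.length - 1];
        // Clamp the context window to the bounds of the field.
        int start = Math.max(0, first - context);
        int end = Math.min(words.size() - 1, last + context);

        StringBuilder sb = new StringBuilder();
        int h = 0;
        for (int i = start; i <= end; i++) {
            if (i == first) {
                sb.append('[');                 // the passage starts at the first query term
            }
            if (h < hits.length && hits[h] == i) {
                sb.append('*').append(words.get(i)).append('*');  // a query term
                h++;
            } else {
                sb.append(words.get(i));        // a body word (inside) or context word (outside)
            }
            if (i == last) {
                sb.append(']');                 // the passage ends at the last query term
            }
            if (i < end) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }
}
```

With the words of "We have installed OCR fonts through Solaris fontadmin gui tools", hits at positions 2, 4 and 6, and a context of 2, this sketch returns "We have [*installed* OCR *fonts* through *Solaris*] fontadmin gui".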

What to highlight, and how

The passage highlighting API consists of a few parts. The first part is an implementation of the PassageBuilder interface, which is used to determine what fields should be highlighted, how multiple passages from the fields should be treated, and how big the passages should be. The second part of the highlighting API is an implementation of the PassageHighlighter interface, which can be used to add application-specific markup to the passages selected from a search result. The third part is an implementation of the Result.getPassageBuilder() method.

The first thing that you want to do with the PassageBuilder is define the fields from which you want highlighted results. Let's say that we've indexed some email messages and we want to highlight the results of a search against those messages. We'd like to display the subject of the email as the link to click on to get the full message and we'd like to provide a highlighted passage from the body of the email as the "snippet" to show with the result. Here's some code that we could use for that:

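   // r is a Result from the search
   PassageBuilder pb = r.getPassageBuilder();
   // Subject: join all passages, keep the whole field, no size limit
   pb.addPassageField("subject", Passage.Type.JOIN, -1, -1, false);
   // Body: one passage per match, 6 words of context, size limit of 512, sorted by score
   pb.addPassageField("body", Passage.Type.UNIQUE, 6, 512, true);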

The addPassageField method requires a number of parameters.

The first parameter is the name of the field for which you would like to get passages. If this parameter is the string NonField, then the passage builder will collect up passages from the text that doesn't occur in any defined field. If this parameter is null, then the rest of the parameters apply to any fields that have not been explicitly named in any other call to addPassageField.

The second parameter tells the passage builder how it should handle multiple passages occurring in the same field. If this parameter is JOIN, then the passages will be joined together to create a single passage that spans all of the sub-passages. This is useful for fields like our subject where we want a single string with all occurrences of the query terms highlighted. If this parameter is UNIQUE, then a separate highlightable passage will be generated for each occurrence of a valid passage in that field.

The third parameter is the size of the context to keep around the passage. This size is provided as the number of words to keep before and after the passage. A value of -1 indicates that the entire rest of the field is to be kept as context. Again, this is useful for our subject field, because we want to display the whole field. For our body passages, we specified a context size of 6 words, which should give a good idea of the gist of the document.

The fourth parameter is the maximum size of the highlighted passage, exclusive of any markup that may be added during highlighting. This parameter can be used to ensure that a highlighted passage doesn't get so large that it affects the look of the highlighted results. A value of -1 means that any size passage is acceptable. If the size provided is less than the size of the passage, then some words will be elided out of the passage when performing the highlighting.

The fifth and final parameter specifies the order in which multiple passages are returned. If this parameter is true, then the passages will be returned to the application sorted by their passage score. This allows the application to highlight the best matching passage when multiple passages are available. If this parameter is false, then the passages will be returned in the order in which they were found in the search result.
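The effect of that last flag can be illustrated with a small, self-contained sketch. This is an assumption about the behavior described above, not Minion's code: the Candidate class, its start and score fields, and the order method are hypothetical stand-ins for passages and their scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Minion.
class PassageOrderSketch {

    /** Stand-in for a passage: where it starts in the document and how well it scored. */
    static class Candidate {
        final int start;
        final double score;

        Candidate(int start, double score) {
            this.start = start;
            this.score = score;
        }
    }

    /**
     * Returns the candidates either best-first (sortByScore == true)
     * or in the order in which they were found (sortByScore == false).
     */
    static List<Candidate> order(List<Candidate> found, boolean sortByScore) {
        List<Candidate> out = new ArrayList<>(found);   // leave the input untouched
        if (sortByScore) {
            // Highest score first; the sort is stable, so ties keep document order.
            out.sort(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        }
        return out;
    }
}
```

An application that only shows one snippet per result would ask for the sorted order and take the first element; one that shows every match in reading order would leave the flag off.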

Getting the passages from a result

Once the passages that you're interested in have been defined for the passage builder, you need to pass your document through the passage builder so that it can select out the words that make up the passages. The document is provided as a map from field names to field values, and it's essential that the field names and values are presented to the passage builder in exactly the same order as when the document was originally indexed! If you don't, the wrong words will be highlighted in the resulting passages.

If you're using a SimpleIndexer to index your data, then you need to make sure that the fields and values in the map are in the same order as they were provided to the indexer. You can ensure the order of elements in a map by using the java.util.LinkedHashMap as the map that you pass to the indexer and to the passage builder.
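As a minimal sketch of that advice, the following builds a document map whose iteration order is the insertion order. The class DocumentMapSketch, the buildDocument method, and the field names and values are hypothetical examples; only java.util.LinkedHashMap and its ordering guarantee come from the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; the field names and values are made up.
class DocumentMapSketch {

    /** Builds a field/value map that iterates in the order the fields were added. */
    static Map<String, Object> buildDocument() {
        // LinkedHashMap iterates in insertion order; a plain HashMap makes no such promise.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("subject", "Installing OCR fonts");
        doc.put("from", "someone@example.com");
        doc.put("body", "We have installed OCR fonts through Solaris fontadmin gui tools");
        return doc;
    }
}
```

If the same method (or the same insertion order) is used both when indexing and when highlighting, the two passes see the fields in the same sequence.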

Assuming that you have a method that can get the field/value map corresponding to a given document key, you can pass your map through the passage builder using code like the following:

    Map<String,Object> docMap = getDocumentMap(r.getKey());
    Map<String,List<Passage>> pmap = pb.getPassages(docMap);

There are a couple of variations of the getPassages method that provide ways to specify what to do with fields that were not mentioned in any addPassageField calls. The result of the call to getPassages is a map from field names to a list of Passage.

Highlighting passages

Once you have an instance of Passage that you would like to highlight, you need an instance of PassageHighlighter to actually highlight the passage. In our example above, we could highlight the passages in the subject field in the following way:

   Passage sp = pmap.get("subject").get(0);
   SimpleHighlighter sh = new SimpleHighlighter("<font color=\"#00ff00\">",
                                          "</font>",
                                          "<b>", "</b>");
   String hlSubj = sp.highlight(sh);

SimpleHighlighter is a simple, tag-based highlighter that can be used to place tags around the passage and around each of the query terms in the passage. The highlight method uses the passage highlighter to mark up the passage and returns the highlighted text. There are other methods on Passage to get un-highlighted field values and to get field values that are not cut down to the size specified when the field was defined.
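To show what tag-based highlighting amounts to, here is a minimal, self-contained sketch. It is not the SimpleHighlighter source: the class TagHighlightSketch and its highlight method are hypothetical, and the sketch assumes that the passage has already been split into words and that the positions of the query terms are known.

```java
import java.util.Set;

// Hypothetical illustration only; this is not Minion's SimpleHighlighter.
class TagHighlightSketch {

    /**
     * Wraps the whole passage in passageStart/passageEnd and each
     * query term in termStart/termEnd.
     *
     * @param words         the words of the passage, in order
     * @param termPositions indexes into words that hold query terms
     */
    static String highlight(String[] words, Set<Integer> termPositions,
                            String passageStart, String passageEnd,
                            String termStart, String termEnd) {
        StringBuilder sb = new StringBuilder(passageStart);
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            if (termPositions.contains(i)) {
                sb.append(termStart).append(words[i]).append(termEnd);
            } else {
                sb.append(words[i]);
            }
        }
        return sb.append(passageEnd).toString();
    }
}
```

The sketch does no HTML escaping, so an application that emits HTML would need to escape the words itself before wrapping them in tags.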