How are passages constructed?
Before we discuss the API for constructing highlighted search results, we need to talk about how passages are constructed. For the sake of clarity, we'll say that we want to highlight the search results for a query that contains the terms t1 ... tn. For the moment we won't worry about which query operators occur in the query; we'll just worry about the terms.
The first question we need to ask is: how do we decide which occurrences of the query terms to highlight? In Minion, we use our passage retrieval algorithm to rank the passages from the search result that best match a passage query constructed from the query terms. We use a slightly modified version of the algorithm that provides for smaller gap and out-of-order penalties. Using the passage retrieval algorithm means that we can find the most compact passages from the document that match the query, even when the query didn't use any of the passage retrieval operators.
The passage retrieval algorithm provides a set of passages from the search result. These passages are really just the word positions in the document where the search terms occur, and they are divided up by the fields in which they occur. Note that it might be the case that a passage will not contain all of the query terms. This is possible when the query terms are widely separated in a document. The highlighting API provides ways to select and combine those passages in order to display the highlighted results.
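To make the idea of a "most compact passage" concrete, here is a minimal sketch of a shortest-span search over word positions. It is not Minion's passage retrieval algorithm: it matches terms exactly (no morphological variants), applies no gap or out-of-order penalties, and returns nothing when a term is missing instead of returning a partial passage. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not part of the Minion API. */
class CompactWindow {
    /**
     * Returns {start, end} word positions of the shortest span that contains
     * every term at least once, or null if some term never occurs.
     */
    static int[] shortestSpan(String[] words, String[] terms) {
        // Occurrence count of each query term inside the current window.
        Map<String, Integer> counts = new HashMap<>();
        for (String t : terms) {
            counts.put(t, 0);
        }
        int covered = 0;   // number of distinct terms present in the window
        int left = 0;      // left edge of the window
        int[] best = null;
        for (int right = 0; right < words.length; right++) {
            Integer c = counts.get(words[right]);
            if (c == null) {
                continue;  // not a query term, so the window cannot become complete here
            }
            counts.put(words[right], c + 1);
            if (c == 0) {
                covered++;
            }
            // Shrink from the left while the window still covers every term.
            while (covered == counts.size()) {
                if (best == null || right - left < best[1] - best[0]) {
                    best = new int[] {left, right};
                }
                Integer lc = counts.get(words[left]);
                if (lc != null) {
                    counts.put(words[left], lc - 1);
                    if (lc == 1) {
                        covered--;
                    }
                }
                left++;
            }
        }
        return best;
    }
}
```

A real implementation would also have to deal with fields and scoring, which this sketch leaves out entirely.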
The passage that we want to display is made up of three parts: the first is the query terms themselves. The query terms that appear in a passage will always be displayed. The second part of the passage is made up of words that occur between the query terms. We call these words the passage body. The third part of the passage is made up of words that occur before the first query term or after the last query term in the passage. These words are called the context for the passage.
We make a distinction among these parts because we want to be able to highlight them differently. So, for example, the query terms could be displayed in a bold font, the body terms could be in a normal font, and the context could be in a grayed out font. So, for a query like install fonts solaris, we could see a highlighted passage that looks like:
customers problem is, and how to resolve it. We have installed OCR fonts through Solaris fontadmin gui tools, once fonts installed we are trying to
In this example, the passage itself is "installed OCR fonts through Solaris" and the leading and trailing text are the context for the passage. We provide the context because, even when we find a passage that matches the query perfectly, we want the user to be able to judge from the surrounding text whether the result is worth looking at in full. Note that passage highlighting can highlight morphological variations of the query terms: here it has highlighted "installed" when the query term was "install". This is a handy side effect of using the passage retrieval algorithm to do the passage selection.
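The three parts of a displayed passage can be illustrated with a small sketch. It assumes the term positions are already known and that words are separated by single spaces; the class name, the method, and the choice of <b> and <i> tags are all hypothetical and are not how Minion marks up passages. A negative context size keeps the whole text, mirroring the -1 convention described for fields.

```java
/** Hypothetical illustration; not part of the Minion API. */
class PassageParts {
    /**
     * Renders a passage from space-separated text: query-term hits in <b>,
     * body words (between the first and last hit) plain, context words in <i>.
     */
    static String render(String text, int[] hits, int context) {
        String[] words = text.split(" ");
        int first = Integer.MAX_VALUE;
        int last = Integer.MIN_VALUE;
        for (int h : hits) {
            first = Math.min(first, h);
            last = Math.max(last, h);
        }
        // A negative context size means "keep everything".
        int start = context < 0 ? 0 : Math.max(0, first - context);
        int end = context < 0 ? words.length - 1 : Math.min(words.length - 1, last + context);
        StringBuilder out = new StringBuilder();
        for (int i = start; i <= end; i++) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (isHit(hits, i)) {
                out.append("<b>").append(words[i]).append("</b>");   // query term
            } else if (i > first && i < last) {
                out.append(words[i]);                                // passage body
            } else {
                out.append("<i>").append(words[i]).append("</i>");   // context
            }
        }
        return out.toString();
    }

    private static boolean isHit(int[] hits, int position) {
        for (int h : hits) {
            if (h == position) {
                return true;
            }
        }
        return false;
    }
}
```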
What to highlight, and how
The passage highlighting API consists of a few parts. The first part is an implementation of the PassageBuilder interface, which is used to determine what fields should be highlighted, how multiple passages from the fields should be treated, and how big the passages should be. The second part of the highlighting API is an implementation of the PassageHighlighter interface, which can be used to add application-specific markup to the passages selected from a search result. The third part is an implementation of the
The first thing that you want to do with the PassageBuilder is define the fields from which you want highlighted results. Let's say that we've indexed some email messages and we want to highlight the results of a search against those messages. We'd like to display the subject of the email as the link to click on to get the full message and we'd like to provide a highlighted passage from the body of the email as the "snippet" to show with the result. Here's some code that we could use for that:
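PassageBuilder pb = r.getPassageBuilder();
// Subject: join all passages into one and keep the whole field as context.
pb.addPassageField("subject", Passage.Type.JOIN, -1, -1, false);
// Body: one passage per match, 6 words of context, maximum size 512, best first.
pb.addPassageField("body", Passage.Type.UNIQUE, 6, 512, true);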
The addPassageField method takes five parameters.
The first parameter is the name of the field for which you would like to get passages. If this parameter is the string NonField, then the passage builder will collect passages from the text that doesn't occur in any defined field. If this parameter is null, then the rest of the parameters apply to any fields that have not been explicitly named in any other call to addPassageField.
The second parameter tells the passage builder how it should handle multiple passages occurring in the same field. If this parameter is JOIN, then the passages will be joined together to create a single passage that spans all of the sub-passages. This is useful for fields like our subject, where we want a single string with all occurrences of the query terms highlighted. If this parameter is UNIQUE, then a separate highlightable passage will be generated for each occurrence of a valid passage in that field.
The third parameter is the size of the context to keep around the passage. This size is provided as the number of words to keep before and after the passage. A value of -1 indicates that the entire rest of the field is to be kept as context. Again, this is useful for our subject field, because we want to display the whole field. For our body passages, we specified a context size of 6 words, which should give a good idea of the gist of the document.
The fourth parameter is the maximum size of the highlighted passage, exclusive of any markup that may be added during highlighting. This parameter can be used to ensure that a highlighted passage doesn't get so large that it affects the look of the highlighted results. A value of -1 means that any size passage is acceptable. If the size provided is less than the size of the passage, then some words will be elided out of the passage when performing the highlighting.
The fifth and final parameter specifies the order in which multiple passages are returned. If this parameter is true, then the passages will be returned to the application sorted by their passage score. This allows the application to highlight the best matching passage when multiple passages are available. If this parameter is false, then the passages will be returned in the order in which they were found in the search result.
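The difference between JOIN and UNIQUE, and the effect of the sort flag, can be sketched with plain arrays. This is a hypothetical model only: each passage is an int[] of {start word, end word, score}, and none of the names below belong to the Minion API. How Minion scores a joined passage is not covered by the text above, so the sketch returns only the joined span.

```java
import java.util.Arrays;

/** Hypothetical illustration; not part of the Minion API. */
class PassageModes {
    /** JOIN-like behavior: a single span covering every sub-passage, or null if there are none. */
    static int[] join(int[][] spans) {
        if (spans.length == 0) {
            return null;
        }
        int start = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (int[] s : spans) {
            start = Math.min(start, s[0]);
            end = Math.max(end, s[1]);
        }
        return new int[] {start, end};
    }

    /**
     * UNIQUE-like behavior: every span is kept as its own passage.
     * With sortByScore the best score comes first; otherwise the input order is kept.
     */
    static int[][] unique(int[][] spans, boolean sortByScore) {
        int[][] out = spans.clone();   // copy so the caller's order is untouched
        if (sortByScore) {
            // Arrays.sort on object arrays is stable, so ties keep their found order.
            Arrays.sort(out, (a, b) -> Integer.compare(b[2], a[2]));
        }
        return out;
    }
}
```

With three passages found at words 5-9, 40-44 and 90-91, joining yields one span from 5 to 91, while the unique mode yields three passages whose order depends on the sort flag.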
Getting the passages from a result
Once the passages that you're interested in have been defined for the passage builder, you need to pass your document through the passage builder so that it can select out the words that make up the passages. The document is provided as a map from field names to field values, and it's essential that the field names and values are presented to the passage builder in exactly the same order as when the document was originally indexed! If you don't ensure that this is the case, you will see that the incorrect words are highlighted in the resulting passages.
If you're using a SimpleIndexer to index your data, then you need to make sure that the fields and values in the map are in the same order as they were provided to the indexer. You can ensure the order of elements in a map by using a java.util.LinkedHashMap as the map that you pass to the indexer and to the passage builder.
Assuming that you have a method that can get the field/value map corresponding to a given document key, you can pass your map through the passage builder using code like the following:
Map<String,Object> docMap = getDocumentMap(r.getKey());
Map<String,List<Passage>> pmap = pb.getPassages(docMap);
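The ordering requirement is easy to satisfy if the map is always built by the same code. The sketch below shows one way a helper in the spirit of getDocumentMap might build the map for the email example; the class, the method names, and the two-field layout are assumptions for illustration. It relies only on the documented behavior of java.util.LinkedHashMap, which iterates in insertion order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration; not part of the Minion API. */
class DocumentMaps {
    /** Builds the field/value map in a fixed order: subject first, then body. */
    static Map<String, Object> emailMap(String subject, String body) {
        // LinkedHashMap iterates in insertion order, unlike HashMap.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("subject", subject);
        doc.put("body", body);
        return doc;
    }

    /** Returns the field names in iteration order, joined with commas. */
    static String fieldOrder(Map<String, Object> doc) {
        return String.join(",", doc.keySet());
    }
}
```

Using the same builder method at indexing time and at highlighting time keeps the two orders from drifting apart.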
There are a couple of variations of the getPassages method that provide ways to specify what to do with fields that were not mentioned in any call to addPassageField. The result of the call to getPassages is a map from field names to a list of Passage instances.
Once you have an instance of Passage that you would like to highlight, you need an instance of PassageHighlighter to actually highlight the passage. In our example above, we could highlight the passages in the subject field in the following way:
Passage sp = pmap.get("subject").get(0);
SimpleHighlighter sh = new SimpleHighlighter("<font color=\"#00ff00\">", "</font>", "<b>", "</b>");
String hlSubj = sp.highlight(sh);
SimpleHighlighter is a simple, tag-based highlighter that can be used to place tags around the passage and around each of the query terms in the passage. The highlight method uses the passage highlighter to highlight the passage and returns the highlighted text. There are other methods on Passage to get un-highlighted field values and to get field values that are not cut down to the size specified when the field was defined.
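To show what tag-based highlighting amounts to, here is a minimal stand-in written from scratch. It is not the SimpleHighlighter class: the class name is hypothetical, and the argument order (passage tags first, then term tags) is an assumption modeled on the constructor call in the example above. It also does no HTML escaping, which a real application would need.

```java
/** Hypothetical illustration; not Minion's SimpleHighlighter. */
class TagHighlighterSketch {
    private final String passageOpen;
    private final String passageClose;
    private final String termOpen;
    private final String termClose;

    TagHighlighterSketch(String passageOpen, String passageClose,
                         String termOpen, String termClose) {
        this.passageOpen = passageOpen;
        this.passageClose = passageClose;
        this.termOpen = termOpen;
        this.termClose = termClose;
    }

    /** Wraps the whole passage in the passage tags and each hit word in the term tags. */
    String highlight(String[] words, int[] hits) {
        StringBuilder sb = new StringBuilder(passageOpen);
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            boolean hit = false;
            for (int h : hits) {
                if (h == i) {
                    hit = true;
                }
            }
            sb.append(hit ? termOpen + words[i] + termClose : words[i]);
        }
        return sb.append(passageClose).toString();
    }
}
```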