Bug #6044: Enable sub-text searching in Solr queries - MetacatUI - Ecoinformatics Redmine

Bug #6044

open

Enable sub-text searching in Solr queries

Added by Chris Jones over 11 years ago. Updated over 11 years ago.

Status: In Progress
Priority: Normal
Assignee: Chris Jones
Target version:
Start date: 08/01/2013
Due date:
% Done:
Estimated time:
Bugzilla-Id:

Description

When submitting a search to the Solr index, we only get hits for whole-word matches. We want to return results for fragments of text as well, so that, for example, a search for 'ocean' would also return documents containing 'oceanographic'.

Updated by Chris Jones over 11 years ago

In researching this, it looks like we'll need to change the Solr schema to use a different analyzer chain than the standard whitespace-delimited one. We can chain filters together and use the NGramFilterFactory, or potentially the EdgeNGramFilterFactory. These filters analyze text fields in the documents and decompose each word into partial words with lengths between the minGramSize and maxGramSize parameters. So, for the word 'oceanographic', 'ocean' would be a hit for EdgeNGramFilterFactory with a minGramSize of 5 or less (and a maxGramSize of at least 5), since edge n-grams are anchored at the start of the word by default. To match the sub-term 'graphic', which sits at the end of the word, we'd need the NGramFilterFactory with a minGramSize of no more than 7 and a maxGramSize of at least 7.
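To make the difference between the two filters concrete, here is a minimal Python sketch of the decomposition described above. It is not Solr code: the helper names `edge_ngrams` and `ngrams` are made up for illustration, and the gram sizes of 3 and 7 are assumed values, not settings taken from our schema.

```python
def edge_ngrams(word, min_gram, max_gram):
    """Prefixes of `word` with lengths min_gram..max_gram (anchored at the start)."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


def ngrams(word, min_gram, max_gram):
    """All substrings of `word` with lengths min_gram..max_gram, at any position."""
    upper = min(max_gram, len(word))
    return [
        word[i:i + n]
        for n in range(min_gram, upper + 1)
        for i in range(len(word) - n + 1)
    ]


word = "oceanographic"
edge = edge_ngrams(word, 3, 7)   # assumed minGramSize=3, maxGramSize=7
full = ngrams(word, 3, 7)

print("edge n-grams:", edge)
print("'ocean' in edge n-grams:", "ocean" in edge)
print("'graphic' in edge n-grams:", "graphic" in edge)
print("'graphic' in full n-grams:", "graphic" in full)
```

With these assumed sizes, 'ocean' shows up among the edge n-grams, while 'graphic' only shows up in the full n-gram list, which is why the non-edge filter would be needed for matches in the middle or at the end of a word.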

The impact of these filters on the index is that they will increase the number of indexed terms many times over. I would think a minGramSize of 3 is the shortest fragment we would want to match, possibly 4 or 5.
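As a rough feel for the growth, the sketch below counts how many grams a single 13-letter word such as 'oceanographic' would produce at different minimum sizes. The function names are hypothetical, and the maxGramSize of 7 is an assumption. These are raw gram counts for one word before any deduplication, so they only hint at the real effect on index size, which depends on the corpus.

```python
def count_ngrams(word_length, min_gram, max_gram):
    """Number of substrings with lengths min_gram..max_gram in a word."""
    upper = min(max_gram, word_length)
    return sum(word_length - n + 1 for n in range(min_gram, upper + 1))


def count_edge_ngrams(word_length, min_gram, max_gram):
    """Number of prefixes with lengths min_gram..max_gram in a word."""
    upper = min(max_gram, word_length)
    return max(0, upper - min_gram + 1)


length = len("oceanographic")
max_gram = 7  # assumed maxGramSize, for illustration only

for min_gram in (3, 4, 5):
    print(
        f"minGramSize={min_gram}: "
        f"edge={count_edge_ngrams(length, min_gram, max_gram)}, "
        f"full={count_ngrams(length, min_gram, max_gram)}"
    )
```

Under these assumptions the edge variant adds only a handful of terms per word, while the full n-gram variant adds tens of terms per word, and raising the minimum size from 3 to 5 cuts the full count roughly in half.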

Changing the schema filters will require a re-index of the database contents.

Updated by Chris Jones over 11 years ago

Target version changed from 1.0.0 to 1.1.0

Moving this to 1.1.0. The filters above could bloat the index considerably. The current filters in schema.xml already allow * as a wildcard in searches, but we're finding inconsistent results: *oil seems to match both oil and soil, but *henology doesn't match phenology. Trailing wildcards seem to work better. This needs more investigation.
