Tuesday, 4 February 2014

ElasticSearch with WordNet

We will create a simple Elasticsearch index that uses WordNet synonyms, so that a search for one word also matches its synonyms.

What is ElasticSearch?

Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near limitless promises of search technology.

What is WordNet?

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and can be downloaded and used freely. The database can also be browsed online.
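To make the idea of synsets concrete, here is a minimal sketch that groups words by synset from entries in the WordNet Prolog synonym format, which is the format of the "wn_s.pl" file used later in this post. Each entry has the shape s(synset_id,w_num,'word',ss_type,sense_number,tag_count). The function name parse_synsets and the sample entries (including their ids) are made up for illustration; they are not taken from the real database.

```python
import re
from collections import defaultdict

# One entry looks like: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
# A single quote inside a word is written as two quotes (e.g. 'o''clock').
ENTRY = re.compile(r"^s\((\d+),(\d+),'(.*)',([a-z]),(\d+),(\d+)\)\.$")


def parse_synsets(lines):
    """Group words by synset id, keeping the order in which they appear."""
    synsets = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        synset_id = match.group(1)
        word = match.group(3).replace("''", "'")
        synsets[synset_id].append(word)
    return dict(synsets)


# Hypothetical sample entries in the same shape as the real file.
sample = [
    "s(100000001,1,'child',n,1,0).",
    "s(100000001,2,'kid',n,1,0).",
    "s(100000002,1,'baby',n,1,0).",
]
print(parse_synsets(sample))
```

Words that share a synset id are treated as synonyms of each other, which is the information the synonym filter relies on.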

What are ElasticSearch Analyzers?

Analyzers are composed of a single Tokenizer and zero or more TokenFilters. The tokenizer may be preceded by one or more CharFilters. The analysis module allows you to register Analyzers under logical names which can then be referenced either in mapping definitions or in certain APIs.
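The order of the pieces can be pictured with a toy model in Python. This is not Elasticsearch code; the function names (analyze, whitespace_tokenizer and so on) and the tiny synonym table are assumptions made up for this sketch. It only shows the flow: char filters run on the raw text, one tokenizer splits it, then token filters run in sequence.

```python
def ampersand_char_filter(text):
    """Char filter: rewrites the raw text before it is tokenized."""
    return text.replace("&", " and ")


def whitespace_tokenizer(text):
    """Tokenizer: splits the text into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: changes each token."""
    return [token.lower() for token in tokens]


def make_synonym_filter(synonyms):
    """Token filter factory: emits each token followed by its synonyms."""
    def synonym_filter(tokens):
        out = []
        for token in tokens:
            out.append(token)
            out.extend(synonyms.get(token, []))
        return out
    return synonym_filter


def analyze(text, char_filters, tokenizer, token_filters):
    """Zero or more char filters, exactly one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "Child & Baby",
    char_filters=[ampersand_char_filter],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, make_synonym_filter({"child": ["kid"]})],
)
print(tokens)
```

The analyzer defined later in this post has the same shape: a whitespace tokenizer followed by a single synonym token filter.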

What are ElasticSearch Filters?

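The filters described here are query filters, which restrict the documents a search returns. They are distinct from the token filters used inside an analyzer, such as the synonym filter configured later in this post. Query filters can be a great candidate for caching. Caching the result of a filter does not require a lot of memory, and will cause other queries executing against the same filter (same parameters) to be blazingly fast.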

Some filters already produce a result that is easily cacheable, so the only difference between caching and not caching them is whether the result is placed in the cache. These filters, which include the term, terms, prefix, and range filters, are cached by default. They are recommended over the equivalent query version when the same filter (same parameters) will be used across multiple different queries (for example, a range filter with age higher than 10).
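The benefit of caching a filter by its parameters can be illustrated with a toy model. This is only a sketch of the idea, not how Elasticsearch implements its filter cache; the class name CachedRangeFilter and the sample documents are made up for illustration.

```python
class CachedRangeFilter:
    """Toy range filter that remembers results per (field, lower bound)."""

    def __init__(self, docs):
        self.docs = docs        # doc id -> document (a dict)
        self.cache = {}         # (field, gt) -> frozenset of matching ids
        self.evaluations = 0    # how many times the documents were scanned

    def matching_ids(self, field, gt):
        key = (field, gt)
        if key not in self.cache:
            # Cache miss: scan the documents once and remember the result.
            self.evaluations += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, doc in self.docs.items()
                if field in doc and doc[field] > gt
            )
        return self.cache[key]


docs = {1: {"age": 8}, 2: {"age": 12}, 3: {"age": 30}}
age_filter = CachedRangeFilter(docs)
first = age_filter.matching_ids("age", 10)   # scans the documents
second = age_filter.matching_ids("age", 10)  # served from the cache
print(sorted(first), age_filter.evaluations)  # prints: [2, 3] 1
```

The second call with the same parameters does no scanning at all, which is why a filter reused across many queries is cheap once cached.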

Steps to Configure WordNet in ElasticSearch

After installing Elasticsearch, you need to configure WordNet so that its synonyms are available to the index.

Step 1: Create a directory called "analysis" in the elasticsearch config directory.
  
Step 2: Download the WordNet Prolog database zip file from the internet.

Step 3: Extract the Zip file.

Step 4: Copy the "wn_s.pl" file from the extracted WordNet folder into the elasticsearch "analysis" folder.

Step 5: Start the elasticsearch server.
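Steps 1, 3 and 4 above can be sketched as a small Python script. This is a minimal sketch under assumptions: the function name install_wordnet is made up, and both paths passed to it are placeholders that depend on where you downloaded the zip file and where Elasticsearch is installed. Downloading (step 2) and starting the server (step 5) are left to you.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_wordnet(zip_path, es_config_dir):
    """Copy wn_s.pl from a WordNet zip file into <es_config_dir>/analysis."""
    # Step 1: create the "analysis" directory inside the config directory.
    analysis_dir = Path(es_config_dir) / "analysis"
    analysis_dir.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as extract_dir:
        # Step 3: extract the zip file into a temporary folder.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)

        # Step 4: locate wn_s.pl anywhere in the extracted folder and copy it.
        matches = sorted(Path(extract_dir).rglob("wn_s.pl"))
        if not matches:
            raise FileNotFoundError("wn_s.pl not found in " + str(zip_path))
        target = analysis_dir / "wn_s.pl"
        shutil.copyfile(matches[0], target)
    return target


# Example call with placeholder paths (adjust both to your machine):
# install_wordnet("/path/to/wordnet.zip", "/path/to/elasticsearch/config")
```

The file ends up at "analysis/wn_s.pl" relative to the config directory, which matches the synonyms_path used in the index settings in the next section.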

ElasticSearch Synonyms Filter using WordNet

The following example creates an ElasticSearch synonym filter backed by WordNet.

PUT Requests:

Create an index with WordNet mappings:

http://localhost:9200/projects/
{
  "settings" : {
    "index" : {
        "analysis" : {
            "analyzer" : {
                "synonym" : {
                    "tokenizer" : "whitespace",
                    "filter" : ["synonym"]
                }
            },
            "filter" : {
                "synonym" : {
                    "type" : "synonym",
                    "format" : "wordnet",
                    "synonyms_path" : "analysis/wn_s.pl"
                }
            }
        }
    }
  },
  "mappings" : {
       "_default_": {
           "properties" : {
               "name" : {
                   "type" : "string",
                   "analyzer" : "synonym"
               }
           }
        }
  }
}

Add values to that index:

http://localhost:9200/projects/project/1
{
  "name" : "child"
}

http://localhost:9200/projects/project/2
{
  "name" : "baby"
}


POST Request:

http://localhost:9200/projects/_search?pretty=true
{
  "query" : {
    "match" : {
      "name" : {
        "query" : "child"
      }
    }
  }
}

Output:

    {
       "took": 2,
       "timed_out": false,
       "_shards":
       {
           "total": 1,
           "successful": 1,
           "failed": 0
       },
       "hits":
       {
           "total": 2,
           "max_score": 2.3731742,
           "hits":
           [
               {
                   "_index": "projects",
                   "_type": "project",
                   "_id": "1",
                   "_score": 2.3731742,
                   "_source":
                   {
                       "name": "child"
                   }
               },
               {
                   "_index": "projects",
                   "_type": "project",
                   "_id": "2",
                   "_score": 0.028331274,
                   "_source":
                   {
                       "name": "baby"
                   }
               }
           ]
       }
    }
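If you prefer a script to a REST client, the same requests can be built with Python's standard library. This is a minimal sketch, assuming Elasticsearch is listening on localhost:9200 as in the requests above; the helper names build_request and send are made up for illustration. The script only prints the requests by default, and sending them requires a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as in the walkthrough


def build_request(method, path, body):
    """Build (but do not send) an HTTP request with a JSON body."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Settings and mappings copied from the index definition in this post.
index_body = {
    "settings": {"index": {"analysis": {
        "analyzer": {"synonym": {"tokenizer": "whitespace", "filter": ["synonym"]}},
        "filter": {"synonym": {
            "type": "synonym",
            "format": "wordnet",
            "synonyms_path": "analysis/wn_s.pl",
        }},
    }}},
    "mappings": {"_default_": {"properties": {
        "name": {"type": "string", "analyzer": "synonym"},
    }}},
}

search_body = {"query": {"match": {"name": {"query": "child"}}}}

requests_in_order = [
    build_request("PUT", "/projects/", index_body),
    build_request("PUT", "/projects/project/1", {"name": "child"}),
    build_request("PUT", "/projects/project/2", {"name": "baby"}),
    build_request("POST", "/projects/_search?pretty=true", search_body),
]

if __name__ == "__main__":
    for request in requests_in_order:
        print(request.get_method(), request.full_url)
        # Uncomment to actually send the request to a running server:
        # print(send(request))
```

Because Elasticsearch is near real-time, a search sent immediately after indexing may not see the new documents yet; waiting a moment before the search request is usually enough.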


NOTE: I am using the Firefox REST client to run this example.




6 comments:

  1. You can check that the index contains synonyms for a given word like this:
    curl -XGET 'localhost:9200/projects/_analyze?pretty&analyzer=synonym' -d 'child'

    Note that the index with synonyms takes 3-4x more disk space than the one without.

  2. I was struggling to find the WordNet file to integrate with Elasticsearch.
    Worked like a charm, thanks man!
