Full-Text Search in MongoDB

MongoDB, one of the leading NoSQL databases,
is well known for its fast performance, flexible schema, scalability and
great indexing capabilities. At the core of this fast performance lie MongoDB
indexes, which support efficient execution of queries by avoiding full-collection
scans and hence limiting the number of documents MongoDB searches.

Starting from version 2.4, MongoDB introduced an experimental feature supporting
full-text search using text indexes. This feature has now
become an integral part of the product (and is no longer an experimental
feature). In this article we are going to explore the full-text search
functionalities of MongoDB right from the fundamentals.

If you are new to MongoDB, I recommend that you first read the
introductory MongoDB articles on Envato Tuts+, which will help you understand
the basic concepts of MongoDB.

The Basics

Before we get into any details, let us look at some background.
Full-text search refers to the technique of searching a full-text database against the search criteria specified by the user.
It is something similar to how we search any content on Google (or in fact any
other search application) by entering certain string keywords/phrases and
getting back the relevant results sorted by their ranking.

Here are some more scenarios where we would see a full-text search happening:

Consider searching your favorite topic on Wiki. When you enter a search text on Wiki, the search engine brings up results of all the articles related to the keywords/phrase you searched for (even if those keywords were used deep inside the article). These search results are sorted by relevance based on their matched score.
As another example, consider a social networking site where the user can make a search to find all the posts which contain the keyword cats in them; or to be more complex, all the posts which have comments containing the word cats.

Before we move on, there are certain general terms related
to full-text search which you should know. These terms apply to any
full-text search implementation (they are not MongoDB-specific).

Stop Words

Stop words are the irrelevant words that should be filtered
out from a text. For example: a, an, the, is, at, which, etc.

Stemming

Stemming is the process of reducing the words to their stem.
For example: words like standing, stands, stood, etc. have a common base stand.
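To make the two ideas above a little more concrete, here is a minimal sketch in plain JavaScript. It is a toy: the stop-word list and the suffix rules are my own assumptions for illustration, and a real stemmer uses far richer rules than this.

```javascript
// Toy text normalizer: NOT MongoDB's real tokenizer or stemmer.
// The stop-word list and suffix rules below are illustrative assumptions.
const STOP_WORDS = new Set(["a", "an", "the", "is", "at", "which", "and", "too"]);

// Strip a couple of common English suffixes to approximate a stem.
function naiveStem(word) {
  if (word.length > 5 && word.endsWith("ing")) return word.slice(0, -3);
  if (word.length > 3 && word.endsWith("s")) return word.slice(0, -1);
  return word;
}

// Lowercase, split on non-letters, drop stop words, then stem each token.
function normalize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w))
    .map(naiveStem);
}

console.log(normalize("Dogs eat cats and dog eats pigeons too"));
console.log(normalize("The dog is standing at the gate"));
```

Note how both dogs and dog end up as the same token, which is what lets a search for one find the other.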

Scoring

A relative ranking to measure which of the search results is
most relevant.

Alternatives to Full-Text Search in MongoDB

Before MongoDB came up with the concept of text indexes, we
would either model our data to support keyword searches or use regular expressions to implement such search
functionality. However, each of these approaches had its own limitations:

Firstly, neither of these approaches supports functionalities like stemming, stop words, ranking, etc.
Using keyword searches would require the creation of multi-key indexes, which fall short of what full-text search offers.
Using regular expressions is not efficient from the performance point of view, since these expressions do not effectively utilize indexes.
In addition to that, neither of these techniques can be used to perform phrase searches (like searching for ‘movies released in 2015’) or weighted searches.
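The first limitation is easy to see with a small sketch. The subjects below are made-up sample strings (an assumption for illustration), and the regular expression is applied in plain JavaScript rather than inside MongoDB, but the matching behaviour is the same idea: a regex is a literal pattern test with no notion of stems.

```javascript
// Made-up sample subjects, used only for this illustration.
const subjects = ["Joe owns a dog", "Dogs eat cats", "Cats eat rats"];

// A case-insensitive regex is a plain pattern test: no stemming is applied.
const plural = /dogs/i;
const matches = subjects.filter((s) => plural.test(s));
console.log(matches); // misses "Joe owns a dog"

// Loosening the pattern has the opposite problem: unrelated words match too.
const loose = /dog/i;
console.log(loose.test("Dogma is rigid"));
```

Neither pattern gives us ranking either: a regex match is simply true or false.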

Apart from these approaches, for more advanced and complex
search-centric applications, there are alternative solutions like
Elasticsearch or Solr. But
using any of these solutions increases the architectural complexity of the
application, since MongoDB now has to talk to an additional external database.

Note that MongoDB’s full-text search is not proposed as a complete replacement for search
engine databases like Elasticsearch, Solr, etc. However, it can be effectively used
for the majority of applications that are built with MongoDB today.

Introducing MongoDB Text Search

Using MongoDB full-text search, you can define a text index
on any field in the document whose value is a string or an array of strings. When we create a text
index on a field, MongoDB tokenizes and stems the indexed field’s text content,
and sets up the indexes accordingly.

To understand things further, let us now dive into some practical
things. I want you to follow the tutorial with me by trying out the
examples in mongo shell. We will first create some sample data which we will be
using throughout the article, and then we’ll move on to discuss key concepts.

For the purpose of this article, consider a collection messages which stores documents of the
following structure:

{
    "subject":"Joe owns a dog", 
    "content":"Dogs are man's best friend", 
    "likes": 60, 
    "year":2015, 
    "language":"english"
}

Let us insert some sample documents using the insert command to create our test data:

db.messages.insert({"subject":"Joe owns a dog", "content":"Dogs are man's best friend", "likes": 60, "year":2015, "language":"english"})

db.messages.insert({"subject":"Dogs eat cats and dog eats pigeons too", "content":"Cats are not evil", "likes": 30, "year":2015, "language":"english"})

db.messages.insert({"subject":"Cats eat rats", "content":"Rats do not cook food", "likes": 55, "year":2014, "language":"english"})

db.messages.insert({"subject":"Rats eat Joe", "content":"Joe ate a rat", "likes": 75, "year":2014, "language":"english"})

Creating a Text Index

A text index is created in much the same way as a
regular index, except that it specifies the text
keyword instead of an ascending/descending order.

Indexing a Single Field

Create a text index on the subject
field of our document using the following query:

db.messages.createIndex({"subject":"text"})

To test this newly created text index on the subject field, we will search documents using the $text operator. We will be looking for
all the documents that have the keyword dogs
in their subject field.

Since we are running a text search, we are also interested in getting some
statistics about how relevant the resultant documents are. For this purpose, we
will use the { $meta: "textScore" } expression, which provides information on the processing
of the $text operator. We will also sort
the documents by their textScore using the sort command. A higher textScore indicates a more relevant
match.
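Since the mongo shell is JavaScript, the three pieces just described (the filter, the projection and the sort) are ordinary objects. The sketch below bundles them in a small helper; textQuery is a hypothetical name of my own, not something MongoDB provides.

```javascript
// Hypothetical helper (not part of MongoDB): bundles the three documents
// that a relevance-ranked text query needs.
function textQuery(search) {
  return {
    filter: { $text: { $search: search } },        // which documents match
    projection: { score: { $meta: "textScore" } }, // expose the score as "score"
    sort: { score: { $meta: "textScore" } },       // order results by that score
  };
}

const q = textQuery("dogs");
console.log(JSON.stringify(q, null, 2));
// In the mongo shell these would be passed along as:
//   db.messages.find(q.filter, q.projection).sort(q.sort)
```

The field name score is our own choice; any name works as long as the projection and the sort agree on it.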

db.messages.find({$text: {$search: "dogs"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

The above query returns the following documents containing
the keyword dogs in their subject field.

{ "_id" : ObjectId("55f4a5d9b592880356441e94"), "subject" : "Dogs eat cats and dog eats pigeons too", "content" : "Cats are not evil", "likes" : 30, "year" : 2015, "language" : "english", "score" : 1 }

{ "_id" : ObjectId("55f4a5d9b592880356441e93"), "subject" : "Joe owns a dog", "content" : "Dogs are man's best friend", "likes" : 60, "year" : 2015, "language" : "english", "score" : 0.6666666666666666 }

As you can see, the first document has a score of 1 (since
the keyword dog appears twice in its subject) as opposed to the second document
with a score of 0.66. The query has also sorted the returned documents in descending
order of their score.
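If you are curious where numbers like 1 and 0.66 could come from, here is a toy scoring model in plain JavaScript. Treat it as an assumption for illustration rather than a documented MongoDB formula: it rewards repeated terms (with diminishing returns) and terms that make up a larger share of the field. The token lists are prepared by hand, with stop words removed and words reduced to their stems.

```javascript
// Toy scoring model (an assumption for illustration, not a documented formula).
// `tokens` is a field's text after stop-word removal and stemming.
function fieldScore(tokens, term) {
  let count = 0; // occurrences of the term seen so far
  let freq = 0;  // 1 + 1/2 + 1/4 + ... : each repeat counts for half as much
  for (const t of tokens) {
    if (t === term) {
      freq += 1 / 2 ** count;
      count += 1;
    }
  }
  if (count === 0) return 0;
  // Terms that make up a larger share of the field get a higher coefficient.
  const coeff = 0.5 + (0.5 * count) / tokens.length;
  return freq * coeff;
}

// "Dogs eat cats and dog eats pigeons too", tokenized by hand.
const first = fieldScore(["dog", "eat", "cat", "dog", "eat", "pigeon"], "dog");
// "Joe owns a dog", tokenized by hand.
const second = fieldScore(["joe", "own", "dog"], "dog");
console.log(first, second);
```

With these inputs the model gives roughly 1 for the first subject and roughly 0.67 for the second, in line with the scores shown above.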

One question that might
arise in your mind is: if we are searching for the keyword dogs, why is the search engine taking
the keyword dog (without ‘s’) into
consideration? Remember our discussion on stemming, where any search keywords
are reduced to their base? This is the reason why the keyword dogs is reduced to dog.

Indexing Multiple Fields (Compound Indexing)

More often than not, you will be using text search on
multiple fields of a document. In our example, we will enable compound text
indexing on the subject and content fields. Go ahead and execute
the following command in mongo shell:

db.messages.createIndex({"subject":"text","content":"text"})

Did this work? No! Creating a second text index will give
you an error message saying that a full-text search index already exists. Why? Because text
indexes come with a limitation: a collection can have only one text index. Hence if
you would like to create another text index, you will have to drop the existing
one and create the new one.

db.messages.dropIndex("subject_text") 
db.messages.createIndex({"subject":"text","content":"text"})
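The name subject_text passed to dropIndex follows MongoDB's default naming convention: each key and its direction or type, joined with underscores. Here is a minimal sketch of that rule in plain JavaScript (defaultIndexName is a hypothetical helper of my own, and it assumes you did not give the index a custom name):

```javascript
// Sketch of the default naming rule: each key and its direction/type,
// joined with underscores. Indexes created with a custom `name` option
// are not covered here.
function defaultIndexName(keys) {
  return Object.entries(keys)
    .map(([field, type]) => `${field}_${type}`)
    .join("_");
}

console.log(defaultIndexName({ subject: "text" }));
console.log(defaultIndexName({ subject: "text", content: "text" }));
console.log(defaultIndexName({ "$**": "text" }));
```

This is handy later in the article, where we drop and recreate the text index a few more times. If in doubt, db.messages.getIndexes() lists the actual names.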

After executing the above index creation queries, try
searching for all documents with keyword cat.

db.messages.find({$text: {$search: "cat"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

The above query would output the following documents:

{ "_id" : ObjectId("55f4af22b592880356441ea4"), "subject" : "Dogs eat cats and dog eats pigeons too", "content" : "Cats are not evil", "likes" : 30, "year" : 2015, "language" : "english", "score" : 1.3333333333333335 }

{ "_id" : ObjectId("55f4af22b592880356441ea5"), "subject" : "Cats eat rats", "content" : "Rats do not cook food", "likes" : 55, "year" : 2014, "language" : "english", "score" : 0.6666666666666666 }

You can see that the score of the first document, which contains
the keyword cat in both subject
and content fields, is higher.

Indexing the Entire Document (Wildcard Indexing)

In the last example, we put a combined index on the subject and content fields. But there can be scenarios where you want any text
content in your documents to be searchable.

For example, consider storing
emails in MongoDB documents. In the case of emails, all the fields, including
Sender, Recipient, Subject and Body, need to be searchable. In such scenarios you
can index all the string fields of your document using the $** wildcard specifier.

The query would go something like this (make sure you
drop the existing text index before creating a new one):

db.messages.createIndex({"$**":"text"})

This query would automatically set up text indexes on any
string fields in our documents. To test this out, insert a new document with a
new field location in it:

db.messages.insert({"subject":"Birds can cook", "content":"Birds do not eat rats", "likes": 12, "year":2013, location: "Chicago", "language":"english"})

Now if you try text searching with keyword chicago (query below), it will return
the document which we just inserted.

db.messages.find({$text: {$search: "chicago"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

A few things I would like to focus on here:

Observe that we did not explicitly define an index on the location field after we inserted a new document. This is because we have already defined a text index on the entire document using the $** specifier.
Wildcard indexes can be slow at times, especially in scenarios where your data is very large. For this reason, plan your wildcard indexes wisely, as they can cause a performance hit.

Advanced Searching

Phrase Search

You can search for phrases like “smart birds who cook” using text indexes. By default, the
phrase search makes an OR search on
all the specified keywords, i.e. it will look for documents which contain
any of the keywords smart, bird or cook (who is dropped as a stop word).

db.messages.find({$text: {$search: "smart birds who cook"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

This query would output the following documents:

{ "_id" : ObjectId("55f5289cb592880356441ead"), "subject" : "Birds can cook", "content" : "Birds do not eat rats", "likes" : 12, "year" : 2013, "location" : "Chicago", "language" : "english", "score" : 2 }

{ "_id" : ObjectId("55f5289bb592880356441eab"), "subject" : "Cats eat rats", "content" : "Rats do not cook food", "likes" : 55, "year" : 2014, "language" : "english", "score" : 0.6666666666666666 }

In case you would like to perform an exact phrase search
(logical AND), you can do so by
specifying double quotes in the search text.

db.messages.find({$text: {$search: "\"cook food\""}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

This query would result in the following document, which
contains the phrase “cook food” together:

{ "_id" : ObjectId("55f5289bb592880356441eab"), "subject" : "Cats eat rats", "content" : "Rats do not cook food", "likes" : 55, "year" : 2014, "language" : "english", "score" : 0.6666666666666666 }
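A practical note on quoting: the double quotes have to be part of the search string itself, so they must survive JavaScript's own string syntax. The sketch below shows three equivalent ways of writing the string, plus a tiny helper (phrase is a hypothetical name of my own) for wrapping a phrase.

```javascript
// Three equivalent ways to write the same search string in JavaScript.
const escaped = "\"cook food\""; // double-quoted, inner quotes escaped
const single = '"cook food"';    // single-quoted, no escaping needed
const template = `"cook food"`;  // template literal

// Hypothetical helper: wrap a phrase in the quotes that mark an exact phrase.
function phrase(text) {
  return `"${text}"`;
}

console.log(escaped === single && single === template);
console.log(phrase("cook food"));
```

Whichever form you pick, the value that reaches $search should start and end with a double-quote character.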

Negation Search

Prefixing a search keyword with - (minus sign) excludes all the documents that contain the negated
term. For example, try searching for any document which contains the
keyword rat but does not contain birds using the following query:

db.messages.find({$text: {$search: "rat -birds"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

Looking Behind the Scenes

One important functionality I have not disclosed until now is
how you look behind the scenes and see how your search keywords are being stemmed,
stripped of stop words, negated, etc. explain()
to the rescue. You can run the explain query by passing true as its parameter, which will give you detailed stats on the
query execution.

db.messages.find({$text: {$search: "dogs who cats dont eat ate rats \"dogs eat\" -friends"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}}).explain(true)

If you look at the queryPlanner object
returned by the explain command, you will be able to see how MongoDB parsed the
given search string. Observe that it neglected stop words like who, and stemmed dogs to dog.

You can also see the terms which we negated from our search
and the phrases we used in the parsedTextQuery
section.

"parsedTextQuery" : {
         "terms" : [
                 "dog",
                 "cat",
                 "dont",
                 "eat",
                 "ate",
                 "rat",
                 "dog",
                 "eat"
         ],
         "negatedTerms" : [
                 "friend"
         ],
         "phrases" : [
                 "dogs eat"
         ],
         "negatedPhrases" : [ ]
 }

The explain query will be highly useful as we perform more
complex search queries and want to analyze them.
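To get a feel for what happens to the search string, here is a toy parser in plain JavaScript that produces the same three lists for our example. It is only an illustrative assumption, not MongoDB's implementation: the stop-word list and the stemming rule are deliberately tiny, and negated phrases are not handled.

```javascript
// Toy parser: an illustrative assumption, not MongoDB's implementation.
const STOP_WORDS = new Set(["who", "a", "an", "the", "is", "at", "which"]);
// Crude stemmer: drop a trailing "s" from words longer than three letters.
const stem = (w) => (w.length > 3 && w.endsWith("s") ? w.slice(0, -1) : w);

function parseSearch(search) {
  const parsed = { terms: [], negatedTerms: [], phrases: [] };
  // Quoted runs are kept verbatim as phrases.
  for (const m of search.matchAll(/"([^"]+)"/g)) parsed.phrases.push(m[1]);
  // Every word, including those inside phrases, is also treated as a term.
  for (const raw of search.toLowerCase().replace(/"/g, " ").split(/\s+/)) {
    if (raw === "" || raw === "-") continue;
    const negated = raw.startsWith("-");
    const word = negated ? raw.slice(1) : raw;
    if (STOP_WORDS.has(word)) continue; // stop words are dropped entirely
    (negated ? parsed.negatedTerms : parsed.terms).push(stem(word));
  }
  return parsed;
}

const result = parseSearch('dogs who cats dont eat ate rats "dogs eat" -friends');
console.log(result);
```

Notice that the phrase is stored as typed, while its individual words are stemmed and added to the terms list as well.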

Weighted Text Search

When we have indexes on more than one field in our document,
most of the time one field will be more important (i.e. carry more weight) than
the other. For example, when you are searching across a blog, the title of the
blog should be of highest weight, followed by the blog content.

The default weight for every indexed field is 1. To assign
relative weights for the indexed fields, you can include the weights option while using the createIndex
command.

Let’s understand this with an example. If you try searching for the cook keyword with our current
indexes, it will result in two documents, both of which have the same
score.

db.messages.find({$text: {$search: "cook"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

{ "_id" : ObjectId("55f5289cb592880356441ead"), "subject" : "Birds can cook", "content" : "Birds do not eat rats", "likes" : 12, "year" : 2013, "location" : "Chicago", "language" : "english", "score" : 0.6666666666666666 }

{ "_id" : ObjectId("55f5289bb592880356441eab"), "subject" : "Cats eat rats", "content" : "Rats do not cook food", "likes" : 55, "year" : 2014, "language" : "english", "score" : 0.6666666666666666 }

Now let us modify our index to include weights, with the subject field having a weight of 3
against the content field having a weight
of 1. As before, drop the existing text index first.

db.messages.createIndex( {"$**": "text"}, {"weights": { subject: 3, content:1 }} )

Try searching for keyword cook now, and you will see that the document which contains this keyword
in the subject field has a greater score
(of 2) than the other (which has 0.66).

{ "_id" : ObjectId("55f5289cb592880356441ead"), "subject" : "Birds can cook", "content" : "Birds do not eat rats", "likes" : 12, "year" : 2013, "location" : "Chicago", "language" : "english", "score" : 2 }

{ "_id" : ObjectId("55f5289bb592880356441eab"), "subject" : "Cats eat rats", "content" : "Rats do not cook food", "likes" : 55, "year" : 2014, "language" : "english", "score" : 0.6666666666666666 }
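One way to picture what the weights do is to treat a document's score as the sum, over the indexed fields, of the field's weight multiplied by a per-field score. The sketch below does exactly that in plain JavaScript. It is a toy model and an assumption for illustration, not a documented MongoDB formula; the token lists are prepared by hand so that the numbers line up with the output above.

```javascript
// Toy per-field score: repeats count for less each time, and terms that make
// up a larger share of the field score higher. An illustrative assumption.
function fieldScore(tokens, term) {
  const count = tokens.filter((t) => t === term).length;
  if (count === 0) return 0;
  let freq = 0;
  for (let i = 0; i < count; i++) freq += 1 / 2 ** i;
  return freq * (0.5 + (0.5 * count) / tokens.length);
}

// Document score: sum of weight * fieldScore over all fields (default weight 1).
function docScore(fields, weights, term) {
  let total = 0;
  for (const [name, tokens] of Object.entries(fields)) {
    const weight = name in weights ? weights[name] : 1;
    total += weight * fieldScore(tokens, term);
  }
  return total;
}

// Hand-tokenized fields of the two matching documents.
const birds = { subject: ["bird", "can", "cook"], content: ["bird", "eat", "rat"] };
const cats = { subject: ["cat", "eat", "rat"], content: ["rat", "cook", "food"] };
const weights = { subject: 3, content: 1 };

console.log(docScore(birds, weights, "cook"), docScore(cats, weights, "cook"));
```

With equal weights both documents come out the same; with subject weighted 3, the document that has cook in its subject scores three times higher.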

Partitioning Text Indexes

As the data stored in your application grows, the size of your text indexes keeps on growing too. With this increase in size of text indexes,
MongoDB has to search against all the indexed entries whenever a text search is
made.

As a technique to keep your text search efficient with growing indexes, you
can limit the number of scanned index entries by using equality conditions with a regular $text search. A very common
example of this would be searching all the posts made during a certain
year/month, or searching all the posts with a certain category/tag.

If you observe the documents which we are working upon, we
have a year field in them which we
have not used yet. A common scenario would be to search messages by year, along
with the full-text search that we have been learning about.

For this, we can
create a compound index that specifies an ascending/descending index key on year followed by a text index on the subject field. By doing this, we are
doing two important things:

We are logically partitioning the entire collection data into sets separated by year.
This limits the text search to scanning only those documents which fall under a specific year (or call it set).

Drop the indexes that you already have and create a new
compound index on (year, subject):

db.messages.createIndex( { "year":1, "subject": "text"} )

Now execute the following query to search all the messages
that were created in 2015 and contain the cats keyword:

db.messages.find({year: 2015, $text: {$search: "cats"}}, {score: {$meta: "textScore"}}).sort({score:{$meta:"textScore"}})

The query would return only one matched document as expected.
If you explain this query and look
at the executionStats, you will find
that totalDocsExamined for this
query was 1, which confirms that our new index got utilized correctly and
MongoDB had to only scan a single document while safely ignoring all other
documents which did not fall under 2015.
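The partitioning idea can be modelled with a small in-memory structure: a map from year to a map from term to document ids. The sketch below is plain JavaScript and purely illustrative (the ids, the crude tokenizer and the search helper are my own assumptions), but it shows why an equality match on year narrows the work before any text matching happens.

```javascript
// Toy in-memory model of a { year: 1, subject: "text" } index (illustrative only).
const messages = [
  { id: 1, year: 2015, subject: "Joe owns a dog" },
  { id: 2, year: 2015, subject: "Dogs eat cats and dog eats pigeons too" },
  { id: 3, year: 2014, subject: "Cats eat rats" },
  { id: 4, year: 2014, subject: "Rats eat Joe" },
];

// Crude normalizer: lowercase words with a trailing "s" removed.
const tokenize = (text) =>
  text.toLowerCase().split(/\W+/).filter(Boolean).map((w) => w.replace(/s$/, ""));

// index: year -> term -> Set of document ids
const index = new Map();
for (const { id, year, subject } of messages) {
  if (!index.has(year)) index.set(year, new Map());
  const terms = index.get(year);
  for (const term of tokenize(subject)) {
    if (!terms.has(term)) terms.set(term, new Set());
    terms.get(term).add(id);
  }
}

// Equality on year selects one partition; only its entries are consulted.
function search(year, word) {
  const terms = index.get(year) || new Map();
  const [term] = tokenize(word);
  return [...(terms.get(term) || [])];
}

console.log(search(2015, "cats"));
```

The 2014 partition also contains the term cat, but a search for 2015 never touches it.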

Text Indexes: Benefits

What More Can Text Indexes Do?

We have come a long way in this article learning about text
indexes. There are many other concepts that you can experiment with when using text
indexes. But owing to the scope of this article, we will not be able to discuss
them in detail today. Nevertheless, let’s have a brief look at what these
functionalities are:

Text indexes provide multi-language support, allowing you to search in different languages using the $language operator. MongoDB currently supports around 15 languages, including French, German, Russian, etc.
Text indexes can be used in aggregation pipeline queries. The $match stage in an aggregate search can specify the use of a full-text search query.
You can use your regular operators for projections, filters, limits, sorts, etc., while working with text indexes.
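As a rough sketch of the first two points, an aggregation pipeline is just an array of stage documents, so it can be written out as plain JavaScript data. The example assumes the messages collection has a plain text index (with no leading key such as year), and the choice of fields in the projection is my own.

```javascript
// Pipeline expressed as plain data. The $match stage that contains $text
// has to be the first stage of the pipeline.
const pipeline = [
  { $match: { $text: { $search: "cats", $language: "english" } } },
  { $sort: { score: { $meta: "textScore" } } },
  { $project: { _id: 0, subject: 1, score: { $meta: "textScore" } } },
];

console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell this would be run as:
//   db.messages.aggregate(pipeline)
```

Here $language is optional; when it is omitted, the language of the index is used.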

MongoDB Text Indexing vs. External Search Databases

Keeping in mind the fact that MongoDB full-text search is not
a complete replacement for traditional search engine databases used with
MongoDB, using the native MongoDB functionality is recommended for the
following reasons:

As per a recent talk at MongoDB, the current scope of text search works perfectly fine for a majority of applications (around 80%) that are built using MongoDB today.
Building the search capabilities of your application within the same application database reduces the architectural complexity of the application.
MongoDB text search works in real time, without any lags or batch updates. The moment you insert or update a document, the text index entries are updated.
Because text search is integrated into the database kernel of MongoDB, it is totally consistent and works well even with sharding and replication.
It integrates perfectly with your existing Mongo features such as filters, aggregation, updates, etc.

Text Indexes: Drawbacks

Full-text search being a relatively new feature in MongoDB,
there are certain functionalities which it currently lacks. I would divide them into three categories. Let’s have a look.

Functionalities Missing From Text Search

Text indexes currently do not have the capability to support pluggable interfaces like pluggable stemmers, stop words, etc.
They do not currently support features like searching based on synonyms, similar words, etc.
They do not store term positions, i.e. the number of words by which the two keywords are separated.
You cannot specify the sort order for a sort expression from a text index.

Restrictions in Existing Functionalities

A compound text index cannot include any other type of index, like multi-key indexes or geo-spatial indexes. Additionally, if your compound text index includes any index keys before the text index key, all the queries must specify the equality operators for the preceding keys.
There are some query-specific limitations. For example, a query can specify only a single $text expression, you can’t use $text with $nor, you can’t use the hint() command with $text, using $text with $or needs all the clauses in your $or expression to be indexed, etc.

Performance Downsides

Text indexes create an overhead while inserting new documents. This in turn hits the insertion throughput.
Some queries like phrase searches can be relatively slow.

Wrapping Up

Full-text search has always been one of the most demanded
features of MongoDB. In this article, we started with an introduction to what full-text search is, before moving on to the basics of creating text indexes.

We then explored
compound indexing, wildcard indexing, phrase searches and negation searches. Further,
we explored some important concepts like analyzing text indexes, weighted
search, and logically partitioning your indexes. We can expect some major updates to this functionality in the
upcoming releases of MongoDB.

I recommend that you give text-search a try and share your thoughts. If you have already implemented it in your application, kindly share your experience here. Finally, feel free to post your questions, thoughts and
suggestions on this article in the comment section.

Source: Envato Tuts+ CodeNew feed
