OLX Search powers India’s leading classified platform, serving over 30 million monthly active users. We serve over 2 million active ads at any given point in time, with more than 80,000 new ads posted daily. This scale makes us a go-to destination for buying and selling across the country.

In this article, we’ll explore the inner workings of our search system, delving into the technologies and concepts that power the OLX search engine. We’ll take a closer look at how our search handles the terms you enter into our search box as free-text queries.

Screenshot of Search Screen from OLX Web App

Building a Performant Search Experience

To provide an efficient search experience on the OLX classified platform, our system is designed to deliver quick and relevant results from both titles and descriptions. It needs to smoothly handle large volumes of data and scale seamlessly as demand grows. Key features include delivering accurate search results with typo tolerance and enabling users to filter results based on various attributes. Autocomplete, search suggestions, and highlighted popular searches further enhance the user experience. Additionally, tracking search patterns and performance metrics keeps the system efficient, while support for multiple languages makes it accessible to all users.

Building Block

We are using ElasticSearch for building a performant free text search engine.


What is ElasticSearch?

ElasticSearch is a distributed, RESTful search and analytics engine, scalable data store, and vector database capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning-fast search, fine‑tuned relevancy, and powerful analytics that scale with ease.

Why are we using ElasticSearch?

ElasticSearch helps meet these expectations by offering fast search responses and handling large datasets efficiently, thanks to its distributed architecture. It ensures relevant results with advanced full-text search capabilities, including fuzzy matching for typos, combined with our own ranking algorithms. Users benefit from flexible query options, including Boolean operators and phrase matching, while dynamic filtering and faceted search allow for precise result refinement. Enhancements like autocomplete, search suggestions, and popular searches improve the user experience. Additionally, ElasticSearch scales seamlessly to manage high data volumes and high traffic, tracks search patterns and performance for ongoing optimization, and supports multiple languages for diverse user needs.

ElasticSearch Mappings

Mappings define how a document and its fields are indexed and stored. A mapping is usually compared to a database schema: it describes the fields and properties that documents hold, the datatype of each field (e.g., string, integer, or date), and how those fields should be indexed and stored by Lucene.

Types of Mappings:

  • Static Mapping: It is used in cases where we know well in advance what kind of data the documents are going to store. So we can define the fields and their types while creating the index.
  • Dynamic Mapping: This method, which is used at OLX, allows for more flexibility. When indexing a document, you don’t necessarily need to predefine field names and types. Elasticsearch automatically adds new fields based on predefined custom rules. This includes adding fields to both the top-level mapping and any inner objects or nested fields. Moreover, dynamic mapping rules can be set up to adjust the existing mappings. These rules can be applied using dynamic field mapping or dynamic templates.

Let’s look at an example of the mappings in the OLX search index, which includes some dynamic templates (dynamic mapping rules) and properties (fields in a document) –

{
  "index_name" : {
    "mappings" : {
      "dynamic_templates" : [
        {
          "textfield" : {
            "match" : "*_txt",
            "mapping" : {
              "analyzer" : "text_en",
              "search_analyzer" : "text_en_search",
              "store" : true,
              "type" : "text_en"
            }
          }
        },
        {
          "longfield" : {
            "match" : "*_pl",
            "mapping" : {
              "type" : "long"
            }
          }
        }
      ],
      "properties" : {
        "title" : {
          "type" : "text",
          "store" : true,
          "copy_to" : [
            "title_legacy",
            "title_extended",
            "title_exact",
            "_text_",
            "title_exact_extended"
          ],
          "analyzer" : "text_en",
          "search_analyzer" : "text_en_search"
        },
        "title_exact" : {
          "type" : "text",
          "analyzer" : "text_exactish"
        },
        "title_extended" : {
          "type" : "text",
          "analyzer" : "text_en",
          "search_analyzer" : "text_en_search"
        },
      }
    }
  }
}

FYI – This configuration is just a shorter sample version of the actual configuration.
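
To see how these dynamic templates behave in practice, consider indexing a document with suffixed fields. The field names below (mileage_pl, description_txt) are hypothetical examples chosen only to match the *_pl and *_txt patterns; any field following those patterns would be handled the same way.

PUT index_name/_doc/1
{
  "title": "Royal Enfield Bullet 350, single owner",
  "mileage_pl": 35000,
  "description_txt": "Well maintained, new tyres, all documents available"
}

Because mileage_pl matches *_pl, it is indexed as a long; because description_txt matches *_txt, it is indexed as text with the text_en analyzer at index time and text_en_search at search time, even though neither field was declared in the mapping beforehand.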

ElasticSearch Settings:

In ElasticSearch, we can configure cluster-level settings, node-level settings and index-level settings. Here we are going to talk about only index-level settings which involve text analysis and other configurations.

Let’s look at an example of the settings in the OLX search index –

{
  "index_name" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "<REFRESH_INTERVAL>",
        "number_of_shards" : "<NUMBER_OF_SHARDS>",
        "blocks" : {
          "read_only_allow_delete" : "false",
          "write" : "false"
        },
        "max_result_window" : "<MAX_RESULT_WINDOW>",
        "analysis" : {
          "filter" : {
            "stopword_en" : {
              "type" : "stop",
              "stopwords_path" : "analyzers/F204423403" 
            },
            "synonym_en" : {
              "type" : "synonym",
              "synonyms_path" : "analyzers/F220045584",
              "updateable" : "true"
            }
          },
          "analyzer" : {
            "text_en" : {
              "filter" : [ "stopword_en", "lowercase", "asciifolding", "porter_stem" ],
              "type" : "custom",
              "tokenizer" : "standard"
            },
            "text_exactish" : {
              "filter" : [ "lowercase" ],
              "type" : "custom",
              "tokenizer" : "standard"
            },
            "text_en_search" : {
              "filter" : [ "lowercase", "synonym_en", "stopword_en", "porter_stem" ],
              "type" : "custom",
              "tokenizer" : "standard"
            }
          }
        }
      }
    }
  }
}

How is text analysis carried out?

Text analysis

The text analysis process is tasked with two functions: tokenization and normalization. 

  • Tokenization is the process of dividing text content into separate units called tokens, typically representing individual words. This is done by a component known as a tokenizer, which splits the content based on criteria such as whitespace, specific characters, or patterns. The primary role of the tokenizer is to follow set rules to break the content into these tokens.
  • Normalization involves modifying, transforming and enriching these tokens by applying processes like stemming which reduces words to their base or root form (e.g., “gaming” and “gamer” become “game”). It also includes handling synonyms, removing stop words, and applying other transformations.

The analysis of text is performed by analyzers, which encompass both tokenization and normalization processes. An analyzer utilizes one tokenizer along with zero or more token filters to process the text.
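
The _analyze API is a handy way to see tokenization and normalization in action. The request below is a simple sketch using Elasticsearch’s built-in standard analyzer on an arbitrary sample sentence:

POST _analyze
{
  "analyzer": "standard",
  "text": "Gaming Laptops for Sale!"
}

The response lists the tokens gaming, laptops, for, and sale, each with its position and character offsets, showing the text being split on word boundaries and lowercased.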

ElasticSearch offers a variety of prebuilt analyzers suited to most standard use cases. Besides the common standard and keyword analyzers, it includes options like the simple, stop, whitespace, and pattern analyzers.

  • Standard analyzer: The default analyzer; it tokenizes input text based on grammar, punctuation, and whitespace, and lowercases the output tokens. Stop-word removal is supported but disabled by default.
  • Simple analyzer: Splits input text on any non-letter character (whitespace, dashes, digits, etc.) and lowercases the output tokens.
  • Stop analyzer: A variation of the simple analyzer with English stop words enabled by default.
  • Whitespace analyzer: Tokenizes input text on whitespace delimiters.
  • Keyword analyzer: Doesn’t mutate the input text; the field’s value is stored as is, as a single token.
  • Pattern analyzer: Splits tokens based on a regular expression (regex); by default, any non-word character splits the text into tokens.
  • Fingerprint analyzer: Sorts the tokens and removes duplicates to produce a single concatenated token.

Table 1: Default Analyzers

While the default analyzers generally meet common needs, Elasticsearch also allows for the creation of custom analyzers. This customization can be achieved by combining different tokenizers and character and token filters from a predefined set.
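
As a generic illustration (not the OLX configuration shown earlier), the settings below assemble a custom analyzer from a character filter, a tokenizer, and a chain of token filters; the index name my_index and analyzer name my_custom_analyzer are placeholders:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "porter_stem" ]
        }
      }
    }
  }
}

Fields can then reference my_custom_analyzer in their mappings, just as title references text_en in the OLX index.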

Match query

The match query is used to find documents that contain specific searched tokens, such as texts, numbers, dates, or boolean values. It supports fuzzy matching and is the standard method for performing full-text searches. This type of query allows for a flexible and comprehensive search experience by identifying relevant documents based on the specified criteria.

The following is part of our Elasticsearch query for performing a match query:

GET index_name/_search
{
  "from": 0,
  "size": 1000,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "title_extended": {
              "query": <field>,
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        },
        {
          "match": {
            "title_exact": {
              "query": <field>,
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}

<field> is a placeholder for the search text entered by the user, i.e., the query value matched against title_extended and title_exact.

Let’s look at the definitions of the match query parameters:

  • Query: A required parameter containing the text, number, boolean value, or date that the user is searching for in the specified field.
  • Analyzer: An optional parameter of type string that specifies the analyzer used to convert the query text into tokens. It defaults to the analyzer mapped for the field being searched.
  • Auto_generate_synonyms_phrase_query: An optional parameter of type boolean. When set to true, multi-term synonym match phrase queries are automatically generated. This is particularly relevant if the index uses a graph token filter, allowing for more comprehensive matching of phrases that include synonyms. The default value for this parameter is true.
  • Fuzziness: An optional parameter of type string that defines the maximum edit distance allowed for matching in a search query (a usage sketch follows this list). Edit distance measures how similar two strings are by counting the number of single-character edits (insertions, deletions, or substitutions) needed to change one string into another. A lower edit distance means higher similarity between the strings. For example, the edit distance between “lisence” and “license” is 2, since two substitutions (changing “s” to “c” and “c” to “s”) are required. The fuzziness parameter can be set to a specific number (e.g., 1 or 2) or to “auto”, which allows variable fuzziness depending on the length of the word. This setting is useful for handling minor typographical errors or variations in the query terms.
  • Max_expansions: An optional parameter of type integer that specifies the maximum number of terms into which a query can be expanded during a fuzzy search. Fuzzy searches rewrite the query into multiple terms based on the specified fuzzy parameters, allowing for variations in the search terms. The max_expansions parameter limits the number of these expanded terms, helping to control the load on the cluster. The default value for this parameter is 50.
  • Prefix_length: An optional parameter of type integer that specifies the number of starting characters in a word that should remain unchanged during fuzzy matching. The default value is 0. By setting a prefix_length, the search assumes that spelling mistakes are unlikely to occur at the beginning of a word, thereby improving the efficiency of fuzzy queries. This parameter allows the query to focus on variations in the latter part of the word while keeping the prefix consistent.
  • Fuzzy_transpositions: An optional parameter of type boolean that determines whether fuzzy matching can include transpositions of two adjacent characters (e.g., “ab” to “ba”). When enabled, a transposition counts as a single edit rather than two. Its default value is true.
  • Lenient: An optional parameter of type boolean. If true, format-based errors are ignored, like entering a text query value for a number field. Its default value is false.
  • Operator: An optional parameter of type string that represents the boolean logic used in interpreting the text in the query value. It has two valid values, OR and AND, where OR is the default value.
  • Zero_terms_query: An optional parameter of type string that controls what happens when the analyzer removes all tokens, for example when a stop filter strips a query such as “the” under an English-language analyzer. It has two valid values, none and all. With none, the default, no documents are returned in that case; with all, all documents are returned, as with a match_all query.
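
Putting these parameters together, a minimal match query with fuzzy matching enabled might look like the sketch below. Note that prefix_length, max_expansions, and fuzzy_transpositions only take effect when fuzziness is set; the values here are illustrative and not our production configuration.

GET index_name/_search
{
  "query": {
    "match": {
      "title_extended": {
        "query": "bulet",
        "operator": "OR",
        "fuzziness": "AUTO",
        "prefix_length": 1,
        "max_expansions": 50,
        "fuzzy_transpositions": true
      }
    }
  }
}

With fuzziness set to AUTO, a five-character term like “bulet” is allowed one edit, so it can still match documents containing “bullet”.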

Understanding the Search Flow From Start to End

Suppose a user searches for the query “bullet.” As shown in the match query above, there are two fields (title_exact and title_extended) being used to retrieve the results. However, a lot of text analysis, tokenization, and processing occur in the background to find the relevant result.

For these two fields, we have created custom analyzers (refer to the index mappings shown above). According to the definitions of these analyzers (refer to the index settings shown above), the custom analyzers used are:

  • For title_exact, the text_exactish analyzer is used.
  • For title_extended, both text_en and text_en_search analyzers are used.
"text_en" : {
    "filter" : [ "stopword_en", "lowercase", "asciifolding", "porter_stem" ],
    "type" : "custom",
    "tokenizer" : "standard"
},
"text_exactish" : {
    "filter" : [ "lowercase" ],
    "type" : "custom",
    "tokenizer" : "standard"
},
"text_en_search" : {
    "filter" : [ "lowercase", "synonym_en", "stopword_en", "porter_stem" ],
    "type" : "custom",
    "tokenizer" : "standard"
}

In all three of the above custom analyzers, we have a tokenizer and token filters:

  • Standard Tokenizer: This is the default tokenizer, which splits the input text on word boundaries based on grammar, punctuation, and whitespace. Lowercasing and stop-word removal are not performed by the tokenizer itself; they are handled by the token filters below.
  • Stopword_en: This is our custom stop-word filter, which removes standard English stopwords as well as profane words.
  • Synonym_en: This is our custom synonym filter, which expands the query into its synonyms before matching. For “bullet”, the synonym list includes bullet, bulat, bolet, bult, boolet, bulit, buleat, bulett, bullate, bullet-a, and built.
  • Lowercase: This filter converts all characters to lowercase.
  • Asciifolding: This filter converts characters with diacritical marks (accents) to their ASCII equivalents. This process is known as “folding.” For example, it transforms characters like “é” to “e,” “ü” to “u,” and “ñ” to “n.”
  • Porter_stem: This filter reduces words to their root form. For example, the words “programming,” “programmer,” and “programs” can all be reduced to the common word stem “program.”

First, the text passes through the tokenizer, and then through the token filters.
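
You can trace this pipeline for the example query by running it through the search-time analyzer with the _analyze API. The request below is a sketch against the index described above; the exact output depends on the synonym and stop-word files, but conceptually the standard tokenizer emits a single token, lowercase normalizes it, synonym_en expands it into the variants listed above, stopword_en removes nothing in this case, and porter_stem reduces each variant to its stem.

GET index_name/_analyze
{
  "analyzer": "text_en_search",
  "text": "bullet"
}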

Now, the match parameters come into play, as configured in the match query above. For instance, the operator being used is OR, and max_expansions is 50, meaning the query can be expanded into at most fifty terms during fuzzy searching. Additionally, fuzzy transpositions are enabled.

Here’s a breakdown of possible fuzzy transpositions for “bullet”:

Single Transpositions

These involve swapping one pair of adjacent characters in the word:

  • “bullet” (no transposition, the original word)
  • “blulet” (swap “u” and “l”)
  • “bulelt” (swap “l” and “e”)
  • “bullte” (swap “e” and “t”)

Two Transpositions

These involve two independent adjacent swaps:

  • “blulte” (swap “u” and “l”, then swap “e” and “t”)
  • “ubllte” (swap “b” and “u”, then swap “e” and “t”)

Since max_expansions is 50, Elasticsearch will expand the original fuzzy search term into at most fifty candidate terms; this cap limits the number of expanded terms and helps control the load on the cluster.

Thank you for reading 🙂

In conclusion, the OLX search system is a sophisticated and dynamic component of our platform, designed to offer users a seamless and efficient search experience. By leveraging advanced technologies and implementing key features such as typo tolerance, filters, and multilingual support, we ensure that users can easily find what they’re looking for. Our continuous focus on performance monitoring and system enhancements keeps the platform responsive and user-friendly. As we move forward, we remain committed to evolving and improving our search capabilities to meet the growing needs of our diverse user base, making OLX the preferred destination for buying and selling across India.

Author

  • I’m a tech enthusiast passionate about advanced search solutions. At OLX, I focus on backend development, optimizing search performance, and elevating user experiences.
