Beyond Keywords: A Practical Guide to Hybrid Search with Qdrant and a RoBERTa-based Query Router
Model Card: 🤗 timofeyk/roberta-query-router-ecommerce
Demo: 🌌 roberta-query-router-demo
Introduction
In the competitive landscape of e-commerce, the search bar is more than a utility; it's the primary conduit between a customer's intent and a potential purchase. For decades, this interaction has been dominated by lexical search systems powered by algorithms like BM25. While incredibly fast and effective for specific queries like SKUs or exact product names, they often fail when faced with the ambiguity and nuance of human language. A customer searching for a "warm top for winter" isn't just matching keywords; they are expressing a semantic need that traditional systems struggle to comprehend.
This article details the journey to overcome these limitations. I engineered a hybrid search system that combines the precision of traditional full-text search with the semantic understanding of a modern vector search engine, Qdrant. The cornerstone of this architecture is a custom-trained RoBERTa-based classifier that acts as an intelligent "query router," dynamically weighting the influence of each search system based on the nature of the query itself. By adopting this data-driven approach, I achieved a significant and consistent improvement in the key relevance metric, nDCG, increasing it by 20%.
Where Lexical-Only Search Fails
The journey began with a standard, battle-tested full-text search engine (Apache Solr) using a BM25 relevance model. This system excels at "known-item" searches where the user knows exactly what they are looking for. However, its performance degrades on more exploratory or abstract queries.
The core limitations fall into three categories:
- The Synonym Problem: A user might search for "sneakers," but the product title uses the word "trainers." Without a massive, manually curated synonym dictionary, the lexical system will fail to make the connection.
- The Intent Gap: A query like "outfit for a summer wedding" has no keywords that directly match product titles. The user's intent is conceptual, requiring an understanding of style, occasion, and product attributes that go beyond simple text matching.
- The "Long-Tail" Challenge: Highly specific, multi-word queries often have too few matching documents, leading to zero-result pages and customer frustration, even when relevant products exist.
Fig.1. Lexical vs. Conceptual queries
It became clear that to serve customers better, I needed a system that could search by meaning, not just by words.
Building the Semantic Engine with Qdrant
To address the shortcomings of lexical search, I introduced a second search system: a vector (semantic) search engine powered by Qdrant. Vector search represents data—in my case, product descriptions and attributes—as numerical vectors (embeddings) in a high-dimensional space. In this space, proximity equals semantic similarity.
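As a toy illustration of "proximity equals semantic similarity": two near-synonymous products end up with nearby vectors, an unrelated one does not. The vectors below are made up for the sketch; a real system uses the 768-dimensional embeddings produced by the model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (real ones are 768-d, generated by the model).
sneakers = np.array([0.9, 0.1, 0.2])
trainers = np.array([0.85, 0.15, 0.25])  # near-synonym: points the same way
lawnmower = np.array([0.1, 0.9, 0.1])    # unrelated concept

assert cosine_similarity(sneakers, trainers) > cosine_similarity(sneakers, lawnmower)
```

This is exactly the property that rescues the "sneakers" vs. "trainers" case: no synonym dictionary is needed when both terms land in the same neighborhood of the embedding space.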
The indexing pipeline was straightforward:
- Embedding Model Selection: I chose the open-source `multilingual-e5-base` model for its strong performance in generating meaningful text embeddings.
- Product Corpus Embedding: I then processed a diverse product catalog (>1 million SKUs), concatenating key text fields (title, description, brand, color) for each product and passing them through the model to generate a 768-dimensional vector.
- Indexing in Qdrant: These vectors were indexed into a Qdrant collection. I leveraged its `int8` scalar quantization feature to reduce the memory footprint by 4x with minimal impact on accuracy, a critical factor for handling large datasets.
Fig.2. Qdrant indexing pipeline
With this, I had two powerful but separate search systems. The next challenge was figuring out how to make them work together.
The Query Router: An Intelligent Blender for Search Scores
A naive approach to combining results might be a 50/50 static blend of the two systems' scores. However, this is suboptimal. A query for "B003O0MNGC" should rely almost entirely on the lexical score, while "a gift for my dad who likes fishing" should lean more on the vector score.
To solve this, I decided to train a model to predict the optimal blend for any given query. I framed it as a binary classification problem: for a given query, is it better suited for "lexical_search" or "vector_search"?
Creating the Training Dataset
This was the most critical step. I generated a high-quality training set using search results as a signal:
- Sample Queries: I took a large sample of real, anonymized user queries (around 90,000 queries).
- Ground Truth: For each query, I had a corresponding set of "ideal" product IDs. These are known relevant matches for selected queries.
- Dual Search Execution: Each sample query was run against both Solr (lexical) and Qdrant (vector) systems.
- Performance Comparison: The script calculated the nDCG score for both sets of search results against the ground truth.
- Label Assignment: If the Solr result had a higher nDCG score, the query was labeled as `lexical_search`. Otherwise, it was labeled as `vector_search`.
This process yielded a rich dataset of queries labeled with their optimal search strategy, forming the foundation for the router model.
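The labeling logic above can be condensed into a short sketch. The binary-relevance nDCG implementation and the function names are mine, not the exact production script, but the decision rule matches the steps described.

```python
import math

def ndcg(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k against a set of 'ideal' product IDs."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, pid in enumerate(ranked_ids[:k])
        if pid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def label_query(solr_results: list[str], qdrant_results: list[str],
                relevant_ids: set[str]) -> str:
    """Label the query with whichever system scored higher on nDCG."""
    if ndcg(solr_results, relevant_ids) > ndcg(qdrant_results, relevant_ids):
        return "lexical_search"
    return "vector_search"

# Example: lexical results contain the relevant product, vector results miss it.
assert label_query(["A1", "B2"], ["C3", "D4"], {"A1"}) == "lexical_search"
```

Running this over the 90,000 sampled queries produces the `(query, label)` pairs used for training.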
Fig.3. Dataset generation pipeline
Training the RoBERTa Classifier
With the dataset in hand, I fine-tuned a `roberta-base` model on my `(query, label)` pairs using the Hugging Face `transformers` library. Key aspects of the training included:
- Weighted Loss Function: To counteract a potential imbalance in the labels, I used a custom `Trainer` with a weighted `CrossEntropyLoss` to ensure the model didn't simply favor the majority class.
- Hyperparameter Tuning: Through multiple trial-and-error iterations, I carefully adjusted the training hyperparameters to achieve the best accuracy from the resulting model. If you're interested in more technical details, check out the Hugging Face page for the trained model: https://huggingface.co/timofeyk/roberta-query-router-ecommerce
The final output was a highly accurate model capable of predicting the ideal search strategy for a given query.
The Hybrid System in Action
The runtime integration of the classifier is seamless and adds minimal latency:
- A user's query is first sent to the trained RoBERTa classifier.
- The model outputs raw logits for each class. The system applies a softmax function to these logits to get a probability distribution, for example: `{'lexical_search': 0.85, 'vector_search': 0.15}`.
- These probabilities become the weights (`w_lexical` and `w_vector`) for the final relevance score.
- The query is sent in parallel to both Solr and Qdrant.
- As results are returned, they are combined using a weighted fusion formula. For each product appearing in either result set, the final score is calculated as a weighted sum of the two systems' (normalized) scores: `final_score = w_lexical * score_lexical + w_vector * score_vector`.
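The routing-and-fusion step can be sketched as follows. The min-max score normalization is my assumption (the two systems score on different scales, so some normalization is needed before blending); the softmax and weighted sum follow the description above.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw classifier logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize one system's scores so the two are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}

def fuse(logits: list[float], lexical: dict[str, float],
         vector: dict[str, float]) -> dict[str, float]:
    """Blend per-product scores using classifier-derived weights."""
    w_lexical, w_vector = softmax(logits)  # [lexical_search, vector_search]
    lex, vec = normalize(lexical), normalize(vector)
    return {
        pid: w_lexical * lex.get(pid, 0.0) + w_vector * vec.get(pid, 0.0)
        for pid in set(lex) | set(vec)
    }
```

For example, `fuse([1.7, 0.0], {"A": 12.3, "B": 8.1}, {"B": 0.83, "C": 0.71})` blends Solr's BM25 scores for products A and B with Qdrant's similarity scores for B and C, leaning lexical because of the logits.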
Fig.4. Final system architecture
Results: A Clear Win for Relevance
I evaluated this new hybrid system against the original lexical-only baseline on a hold-out test set of 5,000 queries. The results demonstrated a clear and consistent improvement across all key information retrieval metrics.
| Model | nDCG (mean) | Recall@100 (mean) | Recall@10 (mean) |
|---|---|---|---|
| Lexical (Solr) Baseline | 28.94% | 36.67% | 16.84% |
| Vector (Qdrant) Baseline | 29.08% | 36.95% | 16.36% |
| RoBERTa-based Query Router | 34.86% | 36.28% | 17.16% |
Table 1. Relevance metrics performance comparison
The nDCG score improved by 20% compared to either plain lexical search or vector search alone, which means the hybrid system consistently ranks relevant products much higher in the search results. The Recall scores stayed roughly flat, with Recall@10 seeing a slight improvement.
Fig.5. Relevance metrics performance comparison
The hybrid system's ability to dynamically adjust its strategy was the key to its success. For specific queries, it behaved like a precise lexical engine. For ambiguous ones, it seamlessly transitioned to a semantic engine, uncovering relevant products that the baseline system would have missed entirely.
Summary and Future Work
By augmenting an existing full-text search with a Qdrant-powered vector search and intelligently blending the results with a custom-trained classifier, I have built a more intuitive and effective search experience. This hybrid approach respects the strengths of both lexical and semantic search, using a data-driven model to apply the right tool for each unique query.
The significant gains in nDCG validate this architecture as a powerful, practical, and achievable upgrade for any e-commerce platform looking to move beyond the limitations of keyword matching.
A logical next step would be to target the Recall score, improving how many relevant results are being fetched from the search index. But that is a story for another article.