On-site search in ecommerce has improved massively in recent years, thanks to search systems such as Lucene, Solr, Algolia, and Elastic. Despite on-site search generating massive amounts of revenue for online businesses, it’s still fairly common for ecommerce sites to return poor search results.
Improving the conversion rate of on-site search has therefore been a major focus for many ecommerce teams. Learning to Rank (also known as LETOR or LTR) models are designed to help automatically improve the quality of users’ search experiences on on-site search systems and can give a boost to on-site search conversion rates.
Learning to Rank, or machine-learned ranking (MLR), is the application of machine learning techniques for the creation of ranking models for information retrieval systems. LTR is most commonly associated with on-site search engines, particularly in the ecommerce sector, where just small improvements in the conversion rate of those using the on-site search engine can make a massive difference to revenue.
The models train themselves to improve the relevance of the results they provide by predicting the optimal order in which to display results, causing site search conversion rates to rise. Learning to Rank isn’t a single model, but rather a whole class of algorithms, including models such as RankNet, LambdaRank, and LambdaMART, which all apply supervised learning approaches to tackle the problem.
While LTR is most commonly associated with information retrieval, such as search engines, it’s also been used for other things, such as recommender systems, ad placement, and even the placement of news stories on website home pages. It’s a powerful and versatile technique.
If a customer uses your on-site search, they’re looking for something specific - either because they’re researching it before they buy, or because they’re ready to buy it now and want to add it to their basket. As a result, your ecommerce team is likely to have two important KPIs in their remit: the percentage of customers using on-site search, and the ecommerce conversion rate of those using on-site search. Improve either one and sales will go up.
Typically, those customers using your on-site search will have a conversion rate that is significantly higher than the general traffic not using site search. They’re much more likely to be qualified leads, further down the purchase funnel, and much more likely to transact during their visit. And they spend more.
If you can persuade more customers to use your site search, and make the site search conversion rate go up, by improving the results returned, your profits will rise. Potentially, by quite a lot.
In the hypothetical example below, an ecommerce website sees 50,000 of its 950,000 monthly visitors (just over 5%) using the search facility. The conversion rate for visitors who don’t use site search is 1%, but it rises to 10% for those who do. Their average order values are typically higher too: non-search visitors generate an average of £125 per order, while site-search visitors bring in £140.
| Site search status | Visits | Transactions | Conversion rate | Revenue |
|---|---|---|---|---|
| Visits without site search | 900,000 | 9,000 | 1% | £1,125,000 |
| Visits with site search | 50,000 | 5,000 | 10% | £700,000 |
If you can increase the volume of visitors using site search by 10%, from 50,000 to 55,000, you’ll generate an extra 500 orders at an AOV of £140, bringing in an extra £70K. You’ll get another £70K by boosting the on-site search conversion rate by 10%, and more than double that if you can achieve both.
With over £140K extra per month up for grabs for a 10% improvement in two areas, you can see why focusing on your site search can be so lucrative.
| Visits | Transactions | Conversion rate | Revenue |
|---|---|---|---|
| 55,000 (+10%) | 5,500 | 10% | £770,000 (up £70K) |
| 50,000 | 5,500 | 11% (+10%) | £770,000 (up £70K) |
| 55,000 (+10%) | 6,050 | 11% (+10%) | £847,000 (up £147K) |
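The arithmetic behind these tables is straightforward to verify:

```python
# The uplift arithmetic from the tables above, using the example's figures.
search_visits = 50_000
search_cr = 0.10    # conversion rate of site-search visitors
search_aov = 140    # average order value in £

baseline    = search_visits * search_cr * search_aov                # £700K
more_visits = search_visits * 1.10 * search_cr * search_aov         # +10% visits
better_cr   = search_visits * search_cr * 1.10 * search_aov         # +10% conversion
both        = search_visits * 1.10 * search_cr * 1.10 * search_aov  # both at once

print(round(more_visits - baseline))  # 70000
print(round(better_cr - baseline))    # 70000
print(round(both - baseline))         # 147000: the two gains compound
```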
Improving site search performance really comes down to two main things: having good basic search technology that can support the typical searches entered by your customers, and having well-written consistent content that allows the search system to return relevant results.
To better understand how and why Learning to Rank works, it’s worth considering what happened before - and still does on most well-run ecommerce websites. These are the approaches used on my ecommerce teams, which go beyond the core functionality provided by the standard Elastic search tool we use:
- **Monitor successful site searches:** Set up Google Analytics search tracking so your content writers and merchandisers can monitor what customers search for. Ensure they use the right keywords in their product copy so searches return appropriate results. This will aid SEO too, so your organic search traffic should rise.
- **Monitor search refinements:** Search refinements, where a user searches for one phrase, views the results, and then fine-tunes their search, are a good indicator that your search engine isn’t returning what customers expect to find first time. You can view this data in Google Analytics and use it to make adjustments to your content to reduce the percentage of refinements and improve the user experience.
- **Set up synonyms:** Most ecommerce platforms, such as Magento, include a synonyms system that allows you to create a lexicon of alternative names or misspellings for products. If your customers often call a widget by another name, or refer to it using a trademark or competitor name you can’t use, add hidden synonyms to your lexicon so users searching for that term see results for whatever you call it.
- **Check that popular searches return appropriate results:** In ecommerce, it’s common for a handful of your top-selling product types to dominate the search terms. It’s therefore not a major hassle to periodically check that on-site searches for these phrases return results that look relevant and match what a user would expect to see. We’ve even written scrapers to check for the presence of certain products in these searches and triggered them to send us Slack alerts should the items go missing.
- **Optimise content for on-site search:** Content writers and merchandisers are used to optimising their content for external search engines, such as Google and Bing, but it’s also beneficial to optimise content so it works well with your own search engine. Sometimes, keyword placement or density can mess with the relevance of on-site search results, as these platforms are less sophisticated than full-blown search engines.
- **Use weightings or scorings:** Solr, Lucene, and Elastic all support additional weighting or scoring fields, and it’s worth applying these. If someone searches for a given term and it’s found in the product name, the product code, the brand, or the short description, consider boosting the weight, as a match there is likely to be an indicator of relevance.
- **Add sales-based ranking factors:** Adding scores based on recent sales volumes can also help improve results. Calculate the recent sales volume for each product and pass this to your search tool to help it rank the results, boosting the better-selling products.
- **De-rank out-of-stock lines:** If products are out of stock, customers won’t be able to purchase them, so rather than giving them the top spot in the search results, consider de-ranking them slightly so products you can sell take the top spots. At the very least, your search results should clearly show whether a product is in stock, so customers don’t need to click to find out.
- **Set up redirects:** Redirects can also help. If a search is extremely specific, such as the unique product code for a product, consider redirecting the user automatically to the product page itself, bypassing the search results page completely.
- **Measure search positions and clicks:** For many years, Google Analytics has made it possible to record the position of each product in a set of search results (or on a category landing page) and to measure the clicks each generates. At the most basic level, you can use this to monitor search performance and check that content re-writes haven’t caused a relevant result to drop off the top of the results, harming conversion.
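Field boosting, mentioned above, can be sketched as an Elasticsearch query body. This is a hedged example: the field names and boost values are illustrative, not taken from any particular catalogue or schema.

```python
import json

# A sketch of field boosting in an Elasticsearch multi_match query.
# The "^N" suffix multiplies the score contribution of matches in that field;
# field names here ("name", "short_description", etc.) are hypothetical.
query = {
    "query": {
        "multi_match": {
            "query": "apple laptop",
            "fields": [
                "name^5",               # strongest signal: the product name
                "product_code^4",
                "brand^3",
                "short_description^2",
                "long_description",     # no boost: weakest signal
            ],
        }
    }
}

print(json.dumps(query, indent=2))
```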
Older site search systems generally used “full text” search functions provided by relational databases, such as MySQL or PostgreSQL. These look for the presence of the text in the document and return the ones with the most matches. They work OK, but have some drawbacks.
If the “Apple Laptop Sleeve” product contains more mentions of the phrase “apple laptop”, it may be returned above actual Apple laptops in the search results, forcing users to sift through irrelevant results before they find what they’re seeking, or to refine their query with a secondary search.
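The problem is easy to reproduce with a toy term-frequency ranker (the documents below are invented for illustration):

```python
# A toy illustration of the "apple laptop" problem: naive term-frequency
# ranking can place an accessory above the product itself.
docs = {
    "MacBook Pro 13":      "Apple laptop with M1 chip and Retina display.",
    "Apple Laptop Sleeve": "Sleeve for your Apple laptop. This Apple laptop "
                           "sleeve protects any Apple laptop from scratches.",
}

def term_frequency_score(query: str, text: str) -> int:
    """Count how many times each query term appears in the text."""
    text = text.lower()
    return sum(text.count(term) for term in query.lower().split())

ranked = sorted(docs, key=lambda d: term_frequency_score("apple laptop", docs[d]),
                reverse=True)
print(ranked)  # the sleeve outranks the actual laptop
```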
Developers could work around some of the issues with this by restricting the full text searches to certain fields, perhaps the product name, subtitle, or short description, or by using weights, but the approach is rarely perfect - particularly as some site users expect support for the natural language queries they can use on Google.
Learning to Rank models are designed specifically “to present a list of documents that maximises the satisfaction of the user.” Basically, given a search query such as “apple laptop”, what’s the best order in which to return the results, or documents.
While the manual approach I explained above really helps, Learning to Rank models actually only look at a single part of this - the rank in which the documents appear based on relevance.
Traditionally, LTR has been tackled with supervised learning techniques, where the satisfaction or relevance of the results has been annotated manually by human moderators in the training data.
For example, in the early RankNet model, human moderators had to label data as “good match” or “excellent match”, making this quite a laborious exercise. However, the recent trend is to instead use “click models” which use the interactions of users with the search results to infer satisfaction or relevance.
If a user clicks a result, it would be reasonable to assume that this is an indicator that the result shown was a good match for what they were expecting to find after entering the given search term.
The major benefit of the click model approach to LTR is that there are huge volumes of data readily available, so assembling a training data set is much easier and it’s relatively trivial to create a data pipeline of click-stream data to keep the model topped up with fresh information, using Apache Airflow or similar.
The aim of using machine learning is effectively to capture the information overlap between the search query and the document, which is achieved by creating a numeric feature vector for each query-document pair, and then using this information to generate the ranking.
The data that goes in the vector might be related to text that appears in certain parts of the document, such as the product name, brand, or short description, and to the search query itself. Therefore, for each document-query pair, a feature vector can be created which shows the numerical relationship between the two.
Since LTR is a supervised learning problem, a score is required to tell the model how the numeric feature vector relates to the relevance of the document to the search query. Historically, this supervised learning data would have been created by manual labeling by humans checking the relevance of results, though it can be partly inferred using the click model approach.
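As a rough sketch of what such a feature vector might contain, the hypothetical function below computes a few simple counts for one query-document pair; a production system would use much richer features such as TF-IDF or BM25 scores.

```python
import math

# A minimal, hypothetical sketch of building a numeric feature vector for one
# query-document pair. Field names and the sales signal are illustrative.
def features(query: str, doc: dict) -> list:
    terms = query.lower().split()
    feats = []
    for field in ("name", "brand", "short_description"):
        text = doc.get(field, "").lower()
        feats.append(sum(text.count(t) for t in terms))   # total term frequency
        feats.append(sum(1 for t in terms if t in text))  # query-term coverage
    feats.append(math.log1p(doc.get("recent_sales", 0)))  # popularity signal
    return feats

doc = {"name": "MacBook Pro 13", "brand": "Apple",
       "short_description": "Powerful Apple laptop", "recent_sales": 250}
print(features("apple laptop", doc))
```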
As Learning to Rank represents a family of models, there’s more than one way to generate predictions of the best ordering for search results. There are now a number of different algorithms for implementing LTR in ecommerce site search, but in general, they each fall into one of three main categories: Pointwise Ranking, Pairwise Ranking, and Listwise Ranking.
Pointwise Ranking uses a linear or parametric regression approach and aims to predict the score a human would assign to the relevance of a document-query pair, using the feature vector containing the numeric data on the search query used and the document itself. For example, based on a search query of “ultra portable laptop”, the model predicts the relevance scores it would expect a human to assign for each document (or product page).
You simply feed the model your feature vectors, get the predicted scores back from the regression model, and order the search results from the highest score to the lowest. To assess accuracy, the predicted scores can be compared to the actual scores, with the Mean Squared Error (MSE) commonly used as the measure.
The other approach you can use is to predict the rank that each document would achieve, if they were ranked by a human moderator. For example, if I search for “apple laptop”, I’d expect to see “MacBook Pro 13"” at the top of the results, rather than “apple laptop sleeve”. The model would then be used to predict the rank (1, 2, 3 etc.) for each document or product and this could be compared to the ranks assigned by the human moderator.
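A minimal sketch of the score-prediction flavour of Pointwise Ranking, assuming invented feature vectors and human-assigned scores: fit an ordinary least squares regression, order documents by predicted score, and evaluate with MSE.

```python
import numpy as np

# Invented feature vectors (one per query-document pair) and the relevance
# scores a human moderator might have assigned to each pair.
X = np.array([[2.0, 1.0], [0.5, 0.0], [1.5, 3.0], [0.1, 0.2]])
y = np.array([3.0, 0.0, 4.0, 1.0])

# Fit ordinary least squares with an intercept column appended to X.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

preds = A @ coef
ranking = np.argsort(-preds)        # order results from highest predicted score
mse = np.mean((preds - y) ** 2)     # pointwise accuracy measure
print(ranking, round(float(mse), 3))
```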
Pairwise Ranking looks at pairs of documents and predicts which document should be considered more relevant and which less relevant, then uses the outputs to help improve the model. Two common Pairwise Ranking algorithms are RankNet and LambdaRank, both developed by Christopher Burges and his team at Microsoft.
RankNet is a feed-forward neural network that is given pairs of documents to predict whether one will appear before the other in search results. Although it’s a bit of a black box, under the hood it adjusts weights in the neural network so relevant documents become more relevant and irrelevant documents become even less relevant.
LambdaRank was based on RankNet and also uses a feed-forward neural net, but aims to tackle the issue around relevance at the top of the search results in a way which RankNet didn’t do. As it uses an improved distance metric for determining its accuracy, it gives extra weight to improved accuracy at the top of the result set, where most people are looking and judging the search result performance. Results from LambdaRank are therefore stronger than those of RankNet.
The latest model from Christopher Burges and his team is LambdaMART, which uses a newer technique called Listwise Ranking, instead of Pairwise Ranking. LambdaMART is essentially a version of LambdaRank which uses boosted decision trees to move documents to leaf nodes and obtain the scores which are used to determine ranks. Back in 2010, the model won the Yahoo! Learning to Rank Challenge, so it was state-of-the-art back then.
The Pointwise Ranking style of approach has historically used a distance metric based on the number of changes required to the ordering of the results to infer accuracy of the predicted results against the human-moderated ranking order.
The downside is that there’s no weighting, so mistakes made at the top of the list of rankings are considered the same as those at the bottom. In reality, if the first two documents were returned out of sequence (i.e. Apple Laptop Sleeve in position 1, and Apple MacBook Pro 13” in position 2), this would be seen as less relevant by users.
To work around the distance metric issue, more recent models incorporate something called Normalised Discounted Cumulative Gain (NDCG) as a means of measuring accuracy. NDCG uses the predicted rank, the actual rank, and the true score, to give you a distance metric that considers the rankings at the top to be more important than those at the bottom.
The normalising bit comes from dividing the discounted cumulative gain by the optimal discounted cumulative gain achieved by returning perfectly relevant results. Since NDCG is so important at the top of the results set, it’s often calculated only for the top set of results, i.e. NDCG@K.
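A short sketch of NDCG@K using the linear-gain formulation (note that some implementations use 2^relevance − 1 as the gain instead):

```python
import math

# NDCG@K: discounted cumulative gain of the predicted ordering, normalised by
# the gain of the ideal (perfectly sorted) ordering of the same documents.
def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# True relevance labels, listed in the order the model returned the documents.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))   # < 1.0: an imperfect ordering
print(ndcg_at_k([3, 3, 2, 1, 0], k=5))   # this ordering is ideal, so NDCG = 1.0
```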
Click models are a comparatively recent introduction to the field of LTR. They aim to remove the need for the extremely laborious process of human-ranking of document relevance in search results by inferring this from the items that users click when they appear within search results.
There are now quite a few innovative approaches to the construction of these models, which take more into consideration than merely the first click-through from the search result to infer user satisfaction. Some researchers have shown that models fed by click data can match the performance of those fed by human-generated ranking data.
As Microsoft’s research team has been at the forefront of LTR algorithms, they’ve produced an open source dataset aimed to make it easier for data scientists to train, implement, and compare LTR algorithms.
The Microsoft Learning to Rank Datasets are specifically designed for LETOR and come in two sizes: MSLR-WEB30K includes 30,000 queries, while MSLR-WEB10K includes 10,000. They’re both quite weighty datasets, coming in at 3.7GB and 1.2GB respectively, so they do require some powerful hardware for modeling. The downside is that they need some data munging to prepare them for use within LETOR models.
Each row in the datasets comprises a query-URL pair, which is represented by a 136-dimensional feature vector. There’s a full list of the features and where they occur (i.e. body, anchor, title, url, whole document) on the Microsoft Research page. For each stream, a numeric representation is given for each feature, so you can see a count of the number of times the query term appears in the body, anchor, title, url, or whole document.
The features cover various numeric representations of the text, from IDF (Inverse Document Frequency), to sums, minimum and maximum counts of frequency, to TF-IDF (Term Frequency-Inverse Document Frequency) to specialist language model scores such as Okapi-BM25, LMIR.ABS, LMIR.DIR, and LMIR.JM, as well as things like PageRank, QualityScore, click counts, and dwell time. Many similar metrics could be calculated for an ecommerce site, with additional metrics such as add-to-carts, checkouts, basket sizes, and conversion rates added to suit.
Most of the metrics included in the MSLR datasets, such as IDF and TF-IDF, and other vectorizations of text-based data, can be calculated using Python’s many NLP packages. In addition, the ecommerce metrics can also be extracted from Google Analytics using its Reporting API, if you’ve got a properly set-up GA implementation which utilises the newer Enhanced Ecommerce tracking functionality.
The basic concept is that you use `ga:productListName` to assign a value to each unique category landing page or search term, and then record the position of each product in the results using `ga:productListPosition`. Once set up, you’ll get a set of data showing the position in which each product appeared on a given category page or for a given search term. With a Google Analytics API query, you can then extract this data and use it to power a click model for Learning to Rank.
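As an illustration of the first step, the sketch below turns invented impression and click counts into a crude relevance signal; a real click model would also correct for position bias, since items shown higher get clicked more regardless of relevance.

```python
# A very simple click-model starting point: infer a relevance signal from
# impressions and clicks logged per (search term, product, position).
# The records below are invented; real ones would come from your analytics data.
records = [
    # (search term, product, position, impressions, clicks)
    ("apple laptop", "MacBook Pro 13", 1, 1000, 320),
    ("apple laptop", "MacBook Air",    2, 1000, 180),
    ("apple laptop", "Laptop Sleeve",  3, 1000, 15),
]

# Raw click-through rate per (term, product) pair as a naive relevance proxy.
relevance = {
    (term, product): clicks / impressions
    for term, product, position, impressions, clicks in records
}

# Rank products for a query by inferred relevance.
ranked = sorted((p for t, p in relevance if t == "apple laptop"),
                key=lambda p: relevance[("apple laptop", p)], reverse=True)
print(ranked)
```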
My first choice would probably be XGBoost, the extreme gradient boosting library. The benefit here (apart from the fact that it’s nearly always brilliant) is that you can easily configure it to behave like the models explained above by setting the `objective` parameter in your `params` dictionary: `'objective': 'rank:pairwise'` gives you a RankNet-style pairwise loss, while `'objective': 'rank:ndcg'` and `'objective': 'rank:map'` give you LambdaRank- and LambdaMART-style objectives that optimise NDCG and Mean Average Precision respectively.
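A hedged configuration sketch: the objective strings below are real XGBoost ranking objectives, but the dataset loading shown in the comments is illustrative, not a working pipeline.

```python
# Configuring XGBoost for Learning to Rank. Hyperparameter values are
# illustrative defaults, not tuned recommendations.
params = {
    "objective": "rank:ndcg",   # LambdaRank-style objective optimising NDCG
    "eval_metric": "ndcg@10",   # evaluate NDCG over the top 10 results
    "eta": 0.1,
    "max_depth": 6,
}

# Training would then look something like this (requires the xgboost package,
# and ranking data grouped by query):
# import xgboost as xgb
# dtrain = xgb.DMatrix("mslr.train.txt")  # feature vectors with relevance labels
# dtrain.set_group(group_sizes)           # number of documents per query
# model = xgb.train(params, dtrain, num_boost_round=100)
```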
If you’re running Elasticsearch, the easiest way to implement LTR is to use a plugin. There’s an existing Learning to Rank plugin for Elasticsearch which makes it very simple to get up and running, without the requirement to be a data scientist or data engineer. You can find out more about the plugin <a href="https://elasticsearch-learning-to-rank.readthedocs.io/en/latest/#" rel="nofollow noopener" target="_blank">here</a>. It’s used to power search results at places like the Wikimedia Foundation.
If you want to build your own model, it’s not actually as difficult to develop as you might imagine. The hardest bit is arguably the creation of the feature vectors to generate decent results. The actual creation of the regression or tree models is similar to most other models you’ll have built.
I would highly recommend checking out Sophie Watson’s excellent presentations and her GitHub notebooks covering the practical application of the models on the MSLR datasets. Sophie covers how to perform LTR using regression, and uses XGBoost to implement LambdaMART and LambdaRank and explains how you can evaluate which model is best for your needs.
Don’t forget though, Learning to Rank can improve the performance of your on-site search, but it’s not the only thing you need to consider. Since its features are derived from the content on your product pages, it’s equally important to ensure that content is written well enough for the search tool to find the right products in the first place.
Anwaar, MA., Rybalko, D., and M. Kleinsteuber (2019) - Mend The Learning Approach, Not the Data: Insights for Ranking ecommerce Products. DOI: 10.13140/RG.2.2.27807.71842
Burges, C. (2010) - From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82
Wang, D., Chen, W., Wang, G., Zhang, Y. and B. Hu (2010) - Explore click models for search ranking. Proceedings of the 19th ACM international conference on Information and knowledge management, October 2010, Pages 1417–1420
Qin, T., and Liu, Y. (2013) - Introducing LETOR 4.0 Datasets. http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13 https://www.microsoft.com/en-us/research/project/mslr/
Zoghi, M., Tunys, T., Ghavamzadeh, M., Kveton, B., Szepesvari, C. and Zheng Wen (2017) - Online Learning to Rank in Stochastic Click Models. Proceedings of the 34th International Conference on Machine Learning. arXiv:1703.02527
Matt Clarke, Wednesday, March 03, 2021