Visual Search Engine — the Future of Search
Modern eCommerce Solutions Based on Deep Learning Algorithms (Part 2)
Series index:
- How to Improve Product Search Algorithm to Meet Growing Customer Expectations
- Visual Search Engine — the Future of Search
- Retrieval-Ranking Recommendation System for H&M Dataset (publication soon)
Introduction
This article is part of the series “Modern eCommerce Solutions Based on Deep Learning Algorithms,” in which we explore how state-of-the-art algorithms can grow an eCommerce business, meet rising customer expectations and gain a market advantage.
In the previous article in this series, we showed that a product image contains much more information about a product’s features than any phrase, description or set of tags, and is extremely useful in text-based product search (the problem of searching for a product using user-provided text). By using modern deep learning methods, it is possible to identify similarities between image and text, and consequently quickly and easily find the product the customer is looking for.
If you haven’t read it yet, we encourage you to read the previous article about the modern approach to text search — How to Improve Product Search Algorithm to Meet Growing Customer Expectations.
Now we would like to go one step further and discuss the concept of visual search engines. What are they? Visual search allows you to search using images, not just for them. Visual search engines work similarly to classic text search engines, but instead of text, the client can use an image as a query. These engines allow users to snap or upload an image of an item and receive recommendations of similar items from the store’s offer.
Major players have been developing this trend intensively in recent years. You’ve probably heard of (or maybe even already used) such search solutions as Google Lens or Pinterest Lens. It’s time to take advantage of this amazing technology in the e-commerce industry and make it available to a wider group of companies. So let’s cut to the chase.
Here we will focus on the fashion industry, but all the conclusions and observations can be easily transferred to any e-commerce business line.
To prepare the examples for this article, we built our own visual search engine based on fashion images from the H&M Kaggle competition; the images from this dataset are returned as search results in the figures presented here.
In this article, we will briefly discuss:
- whether visual search engines are really necessary
- how cool visual search engines are
- our approach to the problem of visual product search
- further challenges in this field
- conclusions from all the above points
Do you really need a visual search engine in your app?
Let’s start with a quote from Clay Bavor, vice president of Labs at Google:
“In the English language there’s something like 180,000 words, and we only use 3,000 to 5,000 of them. (…) Think about how many objects there are in the world, distinct objects, billions, and they all come in different shapes and sizes. (…)” — Clay Bavor, Google
This comparison perfectly illustrates the problem of searching for specific products in the huge range of online stores. The decision to buy usually does not come out of thin air. Customers are first inspired by something (a picture of a celebrity’s outfit seen in a magazine, a great product seen on a friend or in a store window) and very often know exactly what they would like to buy (how it looks). The problem, however, is naming this need correctly — in order to find the right product quickly and efficiently.
Phrases users type into search bars are usually quite general and return hundreds or thousands of products. On the other hand, if a user types in an exact description of a product, most text-based search engines do not handle such a query well and return a ‘no results found’ page or results based on only part of the text query. A visual search engine is the perfect answer to this challenge! It frees users from trying to guess the right keyword to describe the object they are looking for. It’s a new, time-saving way to shop that allows customers to browse and buy conveniently and quickly.
How cool are visual search engines?
Before we get into the technical details, we can’t resist showing how fantastic visual search engines are. Let’s take a look at some examples that are search results in the tool we created.
Similar product search — catalog photos
You see a picture of a beautiful handbag on Instagram, but it’s a luxury brand product that you can’t afford right now. You can search for a similar, cheaper version of this product in the offer of your favorite store.
The results are not an identical bag match, but they are similar alternatives considering product features such as color, shape, gold details, size, etc.
Similar product search — real images
Or maybe your favorite sweater has its best years behind it and you can’t part with it. The search engine also effectively recognizes objects from real photos (photos with lower quality than catalog photos: busy background, badly cropped photo, other products in the photo, bad lighting, etc.).
Similar pattern search
Some customers choose a particular garment from a store’s offer mainly because they are attracted to a certain element/detail of the product, such as a favorite dot pattern. So let’s find that kind of product.
How does it work?
This innovative shopping experience is powered by deep learning techniques. We have created the following solution architecture:
The user uploads a photo and specifies what the engine should look for when recommending products:
- finding the most similar product
- or finding only the fabric pattern in the uploaded photo
In the background, the appropriate pre-processing of the photo is carried out (if required), and then the photo is transformed into a form understandable to the computer/algorithms (numeric embedding) — this is where the magic happens.
In our solution, we focused on the most modern types of models — transformers. These models have been pre-trained on huge amounts of data, so they are very effective and versatile in a wide range of applications. How do they work? We will describe it using the example of the first transformer built for computer vision — the Vision Transformer (ViT). Take a look at Figure 8, which illustrates the use of this model in our scheme.
The following steps are performed:
1. Pre-processing and feature extraction from a given image:
  - resizing/scaling
  - normalization
  - returning pixel values
2. Using the transformer model:
  - dividing the image into patches (smaller images)
  - linear projection of the flattened patches
  - adding position embeddings
  - feeding the resulting sequence to the transformer encoder
  - returning hidden states (numerical embeddings from the last layer)
3. Measuring the similarity (cosine similarity) of the given image (its numerical embedding) to all images in the database (their embeddings)
4. Final recommendations of the most similar images
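To make these steps concrete, below is a minimal sketch of this flow built on the Hugging Face transformers library. The ViT checkpoint and the file names are illustrative assumptions, not the exact configuration of our engine.

```python
# A minimal sketch of the ViT embedding + cosine-similarity flow.
# The checkpoint and file names are illustrative assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

CHECKPOINT = "google/vit-base-patch16-224-in21k"  # one of many possible ViT checkpoints
processor = ViTImageProcessor.from_pretrained(CHECKPOINT)
model = ViTModel.from_pretrained(CHECKPOINT)

def embed(path: str) -> torch.Tensor:
    image = Image.open(path).convert("RGB")
    # step 1: resizing/scaling, normalization, pixel values
    inputs = processor(images=image, return_tensors="pt")
    # step 2: patching, linear projection, position embeddings and the
    # encoder all happen inside the model's forward pass
    with torch.no_grad():
        outputs = model(**inputs)
    # mean-pool the last hidden states into a single embedding vector
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

# step 3: cosine similarity between the query and the catalog embeddings
query = embed("query.jpg")                                           # hypothetical query photo
catalog = torch.stack([embed(p) for p in ["bag1.jpg", "bag2.jpg"]])  # hypothetical catalog
scores = F.cosine_similarity(query.unsqueeze(0), catalog)
# step 4: indices of catalog images, most similar first
recommendations = scores.argsort(descending=True)
```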
We tested various transformer models (44 architectures differing in size, type, image resolution, patch size and the data used for pre-training and fine-tuning) to find the best numerical representation of the input images and effectively measure the similarity between them. All were downloaded from https://huggingface.co/.
The models had varying levels of effectiveness in identifying individual product features:
- some were very good at picking up the color and pattern of the product
- others paid more attention to similarity in product shape
However, the overall effectiveness of the best models turned out to be very high and definitely met business expectations, as the examples from the introduction to this article already show. Let’s take a look at a summary of the results for different types of products, divided into two groups (catalog photos and real photos), and evaluate for ourselves the correctness of the indicated products.
Main challenges in visual search
Our R&D process made us aware of several additional challenges related to visual search. We briefly discuss some of them in the sections below.
Some product features are not visually noticeable
We have observed that some visual features of a product are extremely difficult to capture in a photograph of it. Difficult to the point that even the best tool invented by nature (i.e., the human eye) cannot handle it. This led us to revise some of the assumptions of this project.
A great example of such a challenge is evaluating the gender of the user for whom the pants were designed (we will discuss this further using the example of jeans). The algorithm we chose does a great job of identifying such a product (jeans) because the product has a very specific shape, material texture and color. However, when numerically evaluating the correctness of the indicated recommendations, we noticed a significant discrepancy between the low numerical results and the great visual results (similarity between images) for this type of product. The numerical relevance measure we used was based on features from the product table: information about (among other things) product type, color, pattern, gender and age of the user. In the case of some types of pants, the assessment of the user’s gender proved to be very unsuccessful. Sample recommendations are shown in Figure 11.
We see that when searching for similar women’s pants, recommendations for men, children and teenagers are returned. This does not change the fact that the selected recommendations actually look similar to the reference product, and without access to the product description, it was impossible to clearly determine for whom a given pair was created (women, men or children).
After analyzing more examples, we decided to revise the expectations placed on the algorithm and implement a complementary solution: filtering the results, which the user can apply if the returned results are not satisfactory. Below are the results for the same query, but filtered to show only products for women.
Similar challenges can arise for other product features, such as the type of product. In Figure 13, we showed the search results for children’s pajamas. The model successfully recognized the fabric pattern from the reference product, the pastel color palette, the lace details in the clothing, etc. Unfortunately, none of the top 5 suggested products were pajamas.
Therefore, we applied a filter based on the type of product and presented the customer with those products from the children’s pajamas collection that are most similar to the product in the photo they submitted (Figure 14).
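As a rough illustration of how such a filter can sit in front of the similarity ranking, here is a minimal sketch; the catalog structure and the metadata keys (gender, product_type) are assumptions for illustration, not our production schema.

```python
# A minimal sketch of metadata filtering before cosine-similarity ranking.
# The catalog structure and metadata keys are illustrative assumptions.
import torch
import torch.nn.functional as F

def search(query_emb, catalog, filters=None, top_k=5):
    # keep only items whose metadata matches every requested filter,
    # e.g. filters={"gender": "women"} or filters={"product_type": "pajamas"}
    items = [item for item in catalog
             if not filters
             or all(item["meta"].get(key) == value for key, value in filters.items())]
    embeddings = torch.stack([item["embedding"] for item in items])
    scores = F.cosine_similarity(query_emb.unsqueeze(0), embeddings)
    best = scores.argsort(descending=True)[:top_k]
    return [items[i]["id"] for i in best]
```

The same function serves both the unfiltered and the filtered case, so the filter remains a purely optional refinement on top of the visual ranking.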
Removing background from real images
We showed earlier that the chosen algorithm works effectively for real photos. This is true. However, we have identified cases in which the use of appropriate additional pre-processing significantly improves the quality of the returned recommendations.
An example of such a situation is photos of clothes taken against a highly patterned background. The algorithm correctly identifies the shape and type of clothing in the photo, but it sometimes also picks up the color or pattern of the background, returning products of the correct type but in the background’s color (instead of the color of the reference product in the photo).
We performed the background removal based on the algorithm presented in the article: https://levelup.gitconnected.com/remove-background-from-images-using-python-and-ai-149a6985e478.
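For readers who want to try this pre-processing step themselves, a similar effect can be achieved with the open-source rembg library; the sketch below is an illustrative choice, not necessarily the exact method from the linked article.

```python
# A minimal background-removal sketch using the rembg library
# (an illustrative choice, not necessarily the linked article's exact method).
from rembg import remove
from PIL import Image

photo = Image.open("sweater_photo.jpg")  # hypothetical real-world photo
foreground = remove(photo)               # RGBA image with a transparent background
# composite the cut-out garment onto a plain white background before embedding
clean = Image.new("RGB", foreground.size, (255, 255, 255))
clean.paste(foreground, mask=foreground.split()[-1])
clean.save("sweater_clean.jpg")
```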
The improvement in search results for several real images after removing the background is shown in Figure 15.
Outfit search
All the examples we presented in the previous part of this article were based on searching for a single product, and that was the task of our project. The next step in this type of project could be to generalize the solution to identify individual items in an image. This would only require adding one module to our solution architecture: a model that segments the image and extracts individual objects from it, as the first step of searching for similar items. The revised solution architecture for such an extended search is shown in Figure 16.
Adding this element will lead to the functionality shown in Figure 17.
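A hedged sketch of what that extra module could look like is shown below: a Hugging Face image-segmentation pipeline cuts the photo into per-garment crops, and each crop is then passed to the similarity search described earlier. The checkpoint name and the label set are assumptions for illustration.

```python
# A minimal sketch of the added segmentation module.
# The checkpoint and its label names are illustrative assumptions.
from transformers import pipeline
from PIL import Image

segmenter = pipeline("image-segmentation", model="mattmdjaga/segformer_b2_clothes")

image = Image.open("outfit.jpg").convert("RGB")  # hypothetical full-outfit photo
for segment in segmenter(image):
    # skip non-garment regions; exact labels depend on the chosen checkpoint
    if segment["label"] in {"Background", "Hair", "Face"}:
        continue
    box = segment["mask"].getbbox()  # bounding box of this garment's mask
    crop = image.crop(box)
    # feed `crop` into the embedding + cosine-similarity search shown earlier
```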
Conclusions
Nothing discourages an online store customer more than a fruitless search for a product that is probably in the store’s offer but that they don’t know how to reach. It is a waste of time. So give customers what they want quickly, with the help of modern deep learning methods. Faster results increase the likelihood of completed purchases, and a satisfied customer will come back for more. Visual search technologies change users’ search behavior: they are cool, they increase customers’ interest in the store’s application (and consequently its offerings), and they build loyalty to the brand.
We’ve shown above that currently available models do a great job with visual search and provide new opportunities for a company to promote and sell its products. The e-commerce industry was the first to see the potential benefits of visual search technology, but it is a great solution for other industries and business lines as well.
Words by Monika Sikorska, Data Scientist at Altimetrik Poland
https://www.linkedin.com/in/monika-sikorska-215738a7/
Editing by Kinga Kuśnierz, Content Writer at Altimetrik Poland