Fashion Clothes Recommender

This is my second Medium blog post. I hope you enjoy reading this case study.

Recommender systems have become hugely popular, and many companies have grown their business because of them. In this case study I show how to build a recommender system that suggests fashion clothes for top wear, bottom wear and foot wear, given a full image of a person wearing clothes. The case study has been solved from scratch. Before discussing my solution, let us look at the problem statement, or business problem, of this case study.

Business problem:

As online shopping for fashion products grows rapidly, a good recommendation system may become key to the business. In this case study we focus on the problem of recommending similar fashion items for multiple fashion items at once. The majority of existing work in this domain focuses on retrieving similar products for an individual item present in a query, but this type of recommendation system has a limitation. For example, a user might have searched for a particular primary article type (e.g. men’s shorts), while the human model in the full-shot look image would usually be wearing secondary fashion items as well (e.g. T-shirts, shoes). Upon looking at the full-shot look image on the PDP (product detail page), the user might also be interested in viewing similar items for the secondary article types. To address this requirement, one can develop a recommendation system that gives recommendations for multiple fashion items at once.

Objective:

To build a fashion clothes recommender that recommends fashion products for top wear, bottom wear and foot wear, given a full shot image of a person in a complete outfit including foot wear.

Dataset Details:

Dataset1 is prepared by scraping images from a fashion product website such as Myntra, using Selenium. The prepared CSV file (file1) of the dataset consists of the following columns:

  1. Image path
  2. Product type (e.g. T-shirts, shorts, jeans)
  3. Type of wear (top wear, bottom wear, foot wear)
  4. Fullshot / Not a Fullshot (whether the image shows the full body of a person wearing clothes)

This dataset is used for training the YOLO V3 model. Once trained, the YOLO V3 model is used to divide a full shot image into top wear, bottom wear and foot wear using bounding boxes. More details about YOLO V3 are shared in later sections. This dataset can be found in my GitHub repo.

Dataset2 is also prepared by scraping images from the Myntra website using Selenium. Different types of fashion products such as shorts, T-shirts, skirts and shoes are scraped, but none of these images should be a full shot image, i.e. none should contain the full body of a person. Each image should belong to exactly one of top wear, bottom wear or foot wear. The prepared CSV file (file2) of the dataset consists of the following columns:

  1. Image path
  2. Labels (each dress type is given a unique number such as 0, 1, 2)

This dataset is used for finding fashion products similar to those in a given full shot image containing all three wears (top, bottom and foot). This dataset can be found in my GitHub repo.

Metric Used:

To train the Siamese network, we use a weighted triplet loss. Equation (1) defines the total loss as the sum of the triplet loss and α times the embedding loss, where α > 0 is a trade-off hyper-parameter that weights the embedding loss.

The triplet loss, equation (2), is built on the squared Euclidean distance between a pair of examples xᵢ and xⱼ. The objective of (2) is to constrain the squared Euclidean distance of the anchor-negative pair to be larger than the squared Euclidean distance of the anchor-positive pair by a margin m > 0.
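Written out, the quantities described above can be sketched as follows. This is a reconstruction from the surrounding description in standard triplet-loss notation: the symbol names for the three losses and for the distance δ² are assumed here, and the hinge form of (2) is the usual way of expressing a margin constraint. The embedding loss is left symbolic because its exact formula is not stated in the text.

```latex
\begin{align}
\mathcal{L}_{total} &= \mathcal{L}_{triplet} + \alpha \, \mathcal{L}_{emb},
  \qquad \alpha > 0 \tag{1} \\
\mathcal{L}_{triplet} &= \max\bigl(0,\; m + \delta^2(x_a, x_p) - \delta^2(x_a, x_n)\bigr),
  \qquad m > 0 \tag{2}
\end{align}
\[
\text{such that} \quad \delta^2(x_i, x_j) = \lVert x_i - x_j \rVert_2^2
\]
```

With this form, the triplet term is zero as soon as the anchor-negative distance exceeds the anchor-positive distance by at least m, so only triplets that violate the margin contribute to the gradient.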

Furthermore, the embedding loss in equation (1) is defined over the embedding vectors xₐ, xₚ and xₙ of the triplet, each of size d, where d is the embedding size. Essentially, the embedding loss normalizes the representations of the examples in the triplet so that the image embeddings remain within the radius range of the margin value.

Existing Approaches:

  • Sarah Ibrahimi et al. [1] applied deep metric learning and ensemble models to the cross-domain fashion object retrieval problem. While existing approaches target the retrieval of shop instances given a consumer instance, their focus was on bidirectional retrieval, which includes the reverse direction as well.
  • Quintino et al. [2] proposed a model that uses convolutional neural networks to jointly attain fashion categories and attributes via an attention mechanism.
  • Kinli et al. [3] explored the in-shop clothes retrieval problem by employing Triplet-based Capsule Networks with Stacked Convolutional (SC) and Residual Connected (RC) blocks as input to the capsule layers.
  • In addition, Park et al. [4] studied different training strategies and combinations of loss functions (e.g. cross-entropy for classification and triplet loss for similarity) for the fashion image retrieval problem.

However, the core part of all the discussed approaches is to match a single detected fashion item. This is in contrast to our approach, which focuses on simultaneously retrieving similar image products corresponding to multiple fashion products present in a full shot image.

Current approach for building fashion clothes recommender:

The fashion clothes recommender that we are going to build has the following components.

  1. Simple pose estimation using OpenCV: It is used for detecting whether the input image is a full shot image or not.
  2. YOLO V3: It is used for creating bounding boxes on the input full shot image so that one can retrieve the top wear, bottom wear and foot wear from it. These separated bounding boxes are then used for retrieving similar fashion items from the fashion clothes collection set.
  3. Siamese network with triplet loss: It is a neural network consisting of three identical networks with shared weights. To train it, one requires triplets of examples of the form (xₐ, xₚ, xₙ). Here, xₐ is called the query or anchor image. The example xₚ is called the positive image and is semantically similar to the anchor. The example xₙ is called the negative image and is semantically dissimilar to both the anchor and the positive. The objective of training is to bring the embeddings of xₐ and xₚ closer together while pushing xₙ away. The Siamese network is trained until the triplet loss is as low as possible, and the embeddings of the training set are then obtained. These embeddings are compared with the embedding of the test image, and from that comparison one can determine whether the fashion items are similar or not using similarity methods.
  4. Similarity methods (cosine similarity, Euclidean distance etc.): These methods compare the embedding of a test image with the embeddings of the fashion clothes in the collection set. The higher the cosine similarity, or the lower the Euclidean distance, the more similar the fashion clothes are. Faiss is used for faster recommendation of similar fashion products.
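The two similarity methods point in opposite directions, which is easy to mix up. The following minimal sketch, assuming NumPy and made-up 3-dimensional embeddings (the function names are illustrative, not taken from the original code), shows that a similar item scores a higher cosine similarity and a lower Euclidean distance than a dissimilar one.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; higher means more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; lower means more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


# Made-up embeddings, purely for illustration.
query = [1.0, 0.0, 1.0]
similar_item = [0.9, 0.1, 1.1]
different_item = [-1.0, 1.0, 0.0]

sim_close = cosine_similarity(query, similar_item)
sim_far = cosine_similarity(query, different_item)
dist_close = euclidean_distance(query, similar_item)
dist_far = euclidean_distance(query, different_item)
```

When ranking recommendations, cosine similarities would therefore be sorted in descending order and Euclidean distances in ascending order.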

Exploratory Data Analysis(EDA):

Let us import the prepared CSV data file.

Let us check the number of categories in the dataset and the number of images in each category.
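As a rough illustration of these two steps, here is a minimal sketch that uses only the Python standard library. The column names and the sample rows are assumptions modelled on the Dataset1 columns listed earlier, and the in-memory string stands in for the real CSV file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the prepared CSV file; the column
# names are assumptions based on the columns listed for Dataset1.
SAMPLE_CSV = """image_path,product_type,wear_type,fullshot
images/0001.jpg,Men T-shirts,Top wear,Fullshot
images/0002.jpg,Men T-shirts,Top wear,Not a Fullshot
images/0003.jpg,Men Shorts,Bottom wear,Fullshot
images/0004.jpg,Men Casual Shirt,Top wear,Not a Fullshot
"""


def count_by(rows, column):
    """Count how many rows fall into each distinct value of `column`."""
    return Counter(row[column] for row in rows)


# Read every row as a dictionary keyed by the header names.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

images_per_category = count_by(rows, "product_type")
images_per_wear_type = count_by(rows, "wear_type")
```

For a file on disk, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="")`. The same `count_by` helper also covers the later checks on wear type and on full shot versus non-full shot images.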

Examples for each dress type:

  1. Men Casual Shirt
  2. Men Casual Trouser
  3. Men Shorts
  4. Men T-shirts
  5. Women shorts and skirts

Let us check the number of images for top wear and bottom wear.

Examples for top wear and bottom wear:

  1. Top wear
  2. Bottom wear

Let us check the number of full shot and non-full shot images in the dataset.

Let us check the number of full shot and non-full shot images for each dress type, with an example full shot image for each:

  1. Men casual shirts
  2. Men casual trousers
  3. Men shorts
  4. Men T-shirts
  5. Women shorts and skirts

Let’s get started:

We start by scraping images of fashion clothes from websites such as Myntra using Selenium. Scraping only full shot images, i.e. images of the complete body of a person wearing clothes, is a non-trivial task. The workaround is to scrape all the images for the different dress types, both full shot and non-full shot, and then use a simple pose estimator to separate the full shot images from the rest of the scraped images.

Simple pose estimation using OpenCV for detecting full shot images in the dataset

The first step is to assign a unique value to each key point and to define the joints that connect them.
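One way to set up these key points and joints is sketched below. It assumes the 18-point COCO layout that OpenCV's OpenPose sample uses; the names `BODY_PARTS` and `POSE_PAIRS` and the exact list of joints are assumptions and may differ from the original code.

```python
# Key point names mapped to the channel index of the pose network output.
# Assumption: the 18-point COCO layout used by OpenCV's OpenPose sample.
BODY_PARTS = {
    "Nose": 0, "Neck": 1,
    "RShoulder": 2, "RElbow": 3, "RWrist": 4,
    "LShoulder": 5, "LElbow": 6, "LWrist": 7,
    "RHip": 8, "RKnee": 9, "RAnkle": 10,
    "LHip": 11, "LKnee": 12, "LAnkle": 13,
    "REye": 14, "LEye": 15, "REar": 16, "LEar": 17,
}

# Joints: each pair names the two key points that a limb connects.
POSE_PAIRS = [
    ("Neck", "RShoulder"), ("Neck", "LShoulder"),
    ("RShoulder", "RElbow"), ("RElbow", "RWrist"),
    ("LShoulder", "LElbow"), ("LElbow", "LWrist"),
    ("Neck", "RHip"), ("RHip", "RKnee"), ("RKnee", "RAnkle"),
    ("Neck", "LHip"), ("LHip", "LKnee"), ("LKnee", "LAnkle"),
    ("Neck", "Nose"), ("Nose", "REye"), ("REye", "REar"),
    ("Nose", "LEye"), ("LEye", "LEar"),
]
```

Keeping the joints as pairs of names rather than raw indices makes it harder to draw a limb between the wrong two points.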

Initially, a full shot image is taken for the experiment.

Now let us create a function that finds the key points and joints and returns the initial image with the key points and joints drawn on it.

Applying the above function to the initial image gives the key points and joints on it.

Now let us use this simple pose estimator for separating full shot images from rest of the scraped images of fashion clothes.

So the list full_shot_imagepaths consists of paths of all full shot image files.
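The separation step can be sketched as below. The rule that decides what counts as a full shot is not spelled out here, so this sketch assumes one: an image is a full shot when the nose and both ankles are detected. The helper names and the sample detections are hypothetical; only `full_shot_imagepaths` comes from the text.

```python
# Assumed rule: an image is a full shot when the head and both feet are
# visible, i.e. the pose estimator found the nose and both ankles.
REQUIRED_PARTS = ("Nose", "LAnkle", "RAnkle")


def is_full_shot(keypoints, required=REQUIRED_PARTS):
    """Return True if every required key point was detected.

    `keypoints` maps a key point name to its (x, y) pixel position, or to
    None when the estimator's confidence for that point was too low.
    """
    return all(keypoints.get(name) is not None for name in required)


def select_full_shots(detections):
    """Keep the paths of images whose key points pass `is_full_shot`.

    `detections` maps an image path to that image's key point dictionary.
    """
    return [path for path, kps in detections.items() if is_full_shot(kps)]


# Made-up detections for two images, for illustration only.
detections = {
    "images/full.jpg": {
        "Nose": (120, 40), "LAnkle": (100, 580), "RAnkle": (140, 585),
    },
    "images/cropped.jpg": {
        "Nose": (118, 35), "LAnkle": None, "RAnkle": None,
    },
}
full_shot_imagepaths = select_full_shots(detections)
```

A stricter rule could also require the hips and knees, at the cost of rejecting full shots in which one leg is hidden.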

We are going to use YOLO V3 to create bounding boxes on a full shot image and recognize the different dress types present in it. Given a full shot image, we can divide it into bounding boxes that represent different dress types.

For this, we need to annotate all the separated full shot images and train the YOLO V3 model on them. After that, we save the trained weights. Using these weights, the model creates bounding boxes on the query full shot image. We can retrieve these bounding boxes separately and later use them for similar fashion clothes recommendations.

Bounding boxes on the full shot images to recognize different dress types (using YOLO V3)

Initially, a full shot image without bounding boxes is taken.

Now we create the yolo_v3_boundingbox function which, when applied to a full shot image, returns the image with bounding boxes and their labels.

After applying the YOLO V3 function to the full shot image

Retrieved bounding box images:
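Retrieving the boxes amounts to slicing the image array. The sketch below assumes NumPy, an image stored as a height x width x channels array, and boxes given as (label, x, y, width, height) in pixels; the function name and the sample boxes are hypothetical.

```python
import numpy as np


def crop_boxes(image, boxes):
    """Cut each labelled bounding box out of `image`.

    `image` is an H x W x C array and `boxes` is a list of
    (label, x, y, width, height) tuples in pixel coordinates.
    Boxes are clipped to the image so slicing never goes out of range.
    Returns a list of (label, crop) pairs in the order given.
    """
    height, width = image.shape[:2]
    crops = []
    for label, x, y, w, h in boxes:
        x0, y0 = max(0, x), max(0, y)
        x1, y1 = min(width, x + w), min(height, y + h)
        if x1 > x0 and y1 > y0:  # skip boxes entirely outside the image
            crops.append((label, image[y0:y1, x0:x1].copy()))
    return crops


# A blank 600 x 300 image and made-up boxes, for illustration only.
image = np.zeros((600, 300, 3), dtype=np.uint8)
boxes = [
    ("top wear", 60, 80, 180, 220),
    ("bottom wear", 70, 290, 160, 230),
    ("foot wear", 80, 520, 150, 100),  # extends past the bottom edge
]
crops = crop_boxes(image, boxes)
```

Each crop can then be resized to the input size of the embedding model and used as a query image.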

Before proceeding further, we scrape images along with their links again from websites such as Myntra using Selenium. While scraping, we keep in mind that no full shot images should be scraped. We scrape different fashion items such as T-shirts, shorts, casual shoes and skirts. After scraping the images along with their web links, we categorize them into six broad categories:

  1. Men’s top wear (e.g. shirts, T-shirts)
  2. Men’s bottom wear (e.g. shorts, jeans)
  3. Men’s foot wear (e.g. formal shoes, casual shoes, sports shoes)
  4. Women’s top wear (e.g. tops, shirts, T-shirts)
  5. Women’s bottom wear (e.g. shorts, skirts)
  6. Women’s foot wear (e.g. heels, flats, boots)

The retrieved bounding box images are also segregated into the above categories. A CSV file is prepared for each category with the respective image paths and web links, and a unique number (1, 2, 3 etc.) is allotted to each dress type in the CSV file. Each of these CSV files is then used for training a Siamese model with triplet loss, after which the weights of the trained model are stored. The stored weights are then used to create an inference model that gives us the embeddings of a query image, which is simply one of the bounding box images. These embeddings are then compared with the embeddings of the scraped image dataset using similarity techniques such as cosine similarity or Euclidean distance. We can also use Faiss to find similar embeddings in the recently scraped image dataset.

Creating a Siamese Network with Triplet Loss in Keras

Preprocessing the data

Plotting the examples

Creating a batch of Triplets

The first image is the anchor or query image. The second image is the positive or similar image. The third image is the negative or dissimilar image.
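Sampling such triplets from labelled data can be sketched as follows. This assumes NumPy, at least two images per label and at least two distinct labels; the function name and the toy labels are hypothetical and may differ from the original batch function.

```python
import numpy as np


def sample_triplet_indices(labels, batch_size, rng):
    """Sample indices for `batch_size` (anchor, positive, negative) triplets.

    The positive shares the anchor's label but is a different image;
    the negative has a different label.
    """
    labels = np.asarray(labels)
    positions = np.arange(len(labels))
    anchors, positives, negatives = [], [], []
    for _ in range(batch_size):
        a = int(rng.integers(len(labels)))
        same_label = positions[(labels == labels[a]) & (positions != a)]
        other_label = positions[labels != labels[a]]
        anchors.append(a)
        positives.append(int(rng.choice(same_label)))
        negatives.append(int(rng.choice(other_label)))
    return np.array(anchors), np.array(positives), np.array(negatives)


# Toy labels: three dress types with 3, 2 and 3 images respectively.
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
rng = np.random.default_rng(seed=0)
a_idx, p_idx, n_idx = sample_triplet_indices(labels, batch_size=4, rng=rng)
```

Indexing the image array with `a_idx`, `p_idx` and `n_idx` then gives the three image batches fed to the three branches of the network.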

Embedding Model

Siamese network

Triplet loss
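To make the triplet term concrete, here is a minimal NumPy sketch rather than the Keras layer itself. It computes the hinge on squared Euclidean distances as described in the metric section; the margin value of 0.2, the averaging over the batch and the function names are assumptions, and the embedding-loss term is left out because its formula is not given here.

```python
import numpy as np


def squared_distance(x, y):
    """Squared Euclidean distance between matching rows of x and y."""
    return np.sum((x - y) ** 2, axis=-1)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge triplet loss over a batch of embeddings (one row each)."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return float(np.mean(np.maximum(0.0, margin + d_ap - d_an)))


# Made-up 2-dimensional embeddings, for illustration only.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])
negative = np.array([[1.0, 0.0]])

# The negative is already far away, so the margin is satisfied.
easy = triplet_loss(anchor, positive, negative)
# Swapping the roles violates the margin and gives a positive loss.
hard = triplet_loss(anchor, negative, positive)
```

In a Keras model the same arithmetic would be written with backend tensor operations so that gradients can flow through it.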

Data Generator and callbacks

Training the Model

Top 5 Recommendations of fashion clothes with their links using Similarity

Faiss implementation

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. It is developed by Facebook AI Research.

Function that retrieves top 5 recommendations
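The retrieval itself can be sketched with plain NumPy as a brute-force search over squared L2 distances. An exact flat L2 index in Faiss would return the same neighbours, only faster on large collections. The function name and the toy collection are hypothetical.

```python
import numpy as np


def top_k_similar(query, collection, k=5):
    """Return indices and squared L2 distances of the k nearest embeddings.

    `query` has shape (d,) and `collection` has shape (n, d).
    Results are ordered from most to least similar.
    """
    distances = np.sum((collection - query) ** 2, axis=1)
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]


# Made-up collection: item i sits at distance i from the origin along the
# first axis, so the expected ranking is easy to see.
collection = np.zeros((10, 4))
collection[:, 0] = np.arange(10)
query = np.zeros(4)

indices, distances = top_k_similar(query, collection, k=5)
```

The returned indices can be used to look up the image paths and web links stored in the category CSV file.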

Top 5 recommendations for women’s top wear

Links For top 5 recommendations for women’s top wear

A similar procedure is applied to all other types of wear (e.g. bottom wear and foot wear). A similar model could also be applied to men’s fashion clothes. Apart from Faiss, one can also use cosine similarity and Euclidean distance to find the similarity between images. Based on these scores, one can give recommendations.

Inference times of different similarity techniques used for fashion product recommendation

Finally, we make this fashion clothes recommender interactive through a web user interface built with Streamlit. Here is a video of the web UI with the fashion clothes recommender.

Thanks to the Applied AI team for guiding me throughout this case study. While some code snippets are included in the blog, the full code can be found in my GitHub repository. I hope this blog was helpful to you in one way or another!


References:

  1. Sarah Ibrahimi, Nanne van Noord, Zeno Geradts, and Marcel Worring. 2019. Deep Metric Learning for Cross-Domain Fashion Instance Retrieval. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW).
  2. Beatriz Quintino Ferreira, Joao P Costeira, Ricardo G Sousa, Liang-Yan Gui, and Joao P Gomes. 2019. Pose Guided Attention for Multi-Label Fashion Image Classification. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW).
  3. Furkan Kinli, Baris Ozcan, and Furkan Kirac. 2019. Fashion Image Retrieval with Capsule Networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW).
  4. Sanghyuk Park, Minchul Shin, Sungho Ham, Seungkwon Choe, and Yoohoon Kang. 2019. Study on Fashion Image Retrieval Methods for Efficient Fashion Visual Search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
  5. Abhinav Ravi, Sandeep Repakula, Ujjal Kr Dutta, and Maulik Parmar. 2020. Buy Me That Look: An Approach for Recommending Similar Fashion Products. In Proceedings of Preprint. ACM, New York, NY, USA, 9 pages.

You can also find and connect with me on LinkedIn and GitHub



