
I'm working on building an application called Shopfluence.
Its main lead magnet is a feature where you take a screenshot of clothes you like and it finds them for you.
In the research space this problem is called Fashion Re-identification.
For fashion re-identification you must extract and embed clothing attributes such as materials and logos, and still match the same item across different poses.
I've taken several approaches on this problem.
Initially I used a pre-built model called OpenCLIP, with a flat index built on local image data.
OpenCLIP is an open-source reproduction of OpenAI's CLIP, built on a Vision Transformer (ViT).
ViTs apply the Transformer encoder from NLP directly to images, treated as sequences of patches, instead of using CNNs.
ViT models aren't necessarily better than CNNs from a pure quality perspective; where they win is scale and global attention.
Scale means that diminishing returns arrive later for ViTs: as you increase parameters they keep extracting value up to billions of training images. Global attention means that ViTs can learn long-range patterns across the whole image, which helps at scale.
Using these learned patterns the model can then generalize well to new inputs.
This approach was a decent start. As a proof of concept I created an internal database of about 30K products,
created an embedding for each product with OpenCLIP, then added them to a FAISS flat index for vector search.
FAISS (Facebook AI Similarity Search) is a library for fast vector similarity search and clustering. It supports both exact and approximate nearest neighbor (ANN) search.
Exact search is brute force: it compares against every vector, giving it 100% recall.
Recall asks "did we find all the relevant things?", while precision asks "are the things we found actually right?"
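
As a rough sketch, the PoC pipeline looked something like this. The checkpoint name is one public OpenCLIP option (not necessarily the exact one I used), and `product_image_paths` is an illustrative placeholder:

```python
import faiss
import open_clip
import torch
from PIL import Image

# Minimal sketch of the first iteration: OpenCLIP ViT embeddings
# in an exact (brute-force) FAISS flat index.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def embed(path: str):
    img = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(img)
    feat /= feat.norm(dim=-1, keepdim=True)  # unit-normalize for cosine
    return feat.numpy().astype("float32")

index = faiss.IndexFlatIP(512)  # inner product == cosine on unit vectors
for path in product_image_paths:  # hypothetical list of the ~30K product images
    index.add(embed(path))

distances, ids = index.search(embed("screenshot.png"), 5)  # top-5 matches
```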
The first iteration worked fairly well; the only issue was that building the index (embedding all 30K products) took a full day, and the result was gigabytes in size.
It wasn't scalable at all.
It was a good first attempt.
For the second iteration I found a model called FashionCLIP, which is OpenCLIP fine-tuned on fashion data, released as open source by Farfetch/Coveo. I was still using a flat FAISS index.
Fine-tuning a pretrained model gives better results in the target domain, but there is a risk of losing general knowledge from the original pretraining, and of overfitting, which reduces how well the model handles new data: it starts assuming everything looks like the training set.
Since it was fine tuned on fashion data the embeddings created were more relevant to clothes.
This approach worked quite well for the images stored in the index. It successfully found the most appropriate image even across different poses of the people wearing the clothing.
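
Swapping models was mostly a change to the embedding function. A sketch using the public Hugging Face checkpoint `patrickjohncyh/fashion-clip` (the rest of the flat-index pipeline stays as above):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# FashionCLIP from the Hugging Face hub; only the embedder changes,
# the FAISS indexing and search code stay the same.
model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

def embed(path: str):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        feat = model.get_image_features(**inputs)
    feat /= feat.norm(dim=-1, keepdim=True)  # unit-normalize for cosine
    return feat.numpy().astype("float32")
```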
I then created a frontend and hooked it up to this model via an API. Hosting it was difficult because the index was massive even at 30K products. For a quick test I pushed it to an AWS Lambda. This was a mess: from a cold start the Lambda took too long to get up and running. I then used provisioned concurrency, and that worked.
It cost a lot though, so a different approach was needed. And while the PoC worked well, it only covered those 30K products, which wasn't useful for a production use case.
Which led me to the third iteration.
For clothing re-identification, recall and precision need to be as high as possible, but to run in production with 1 million+ products a flat index is not feasible.
IVF (Inverted File) is a clustering approach: it divides all the vectors into clusters called Voronoi cells, and at search time only the nearest clusters are searched. This gives roughly 5-10x faster search, since you only scan around 20% of the data.
The vectors can also be compressed (with product quantization), going from 512 full dimensions down to a much smaller code, at the expense of recall.
Without a re-ranker, recall is quite poor, roughly 60-85%.
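
In FAISS terms this is an IVF index with product quantization. A sketch, where the cell and code-size counts are illustrative rather than tuned values, and `train_vectors`/`all_vectors` stand in for the product embeddings:

```python
import faiss

d, nlist, m = 512, 1024, 64  # dims, Voronoi cells, PQ sub-vectors (illustrative)

quantizer = faiss.IndexFlatL2(d)                      # assigns vectors to cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector

index.train(train_vectors)  # sample of embeddings to learn cells and codebooks
index.add(all_vectors)      # each 512-float vector stored as a 64-byte code
index.nprobe = 32           # search only the 32 nearest cells

distances, ids = index.search(query_vectors, 10)
```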
To get recall higher I introduced a re-ranker. It does what it says: it re-ranks the candidates using the full uncompressed vectors, similar to a flat FAISS index.
The way it works is: the compressed index quickly returns the top-K candidates, then the re-ranker re-scores those candidates with the full vectors and re-orders them.
This comes with a whole bunch of other factors to get it working, as sketched below.
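
FAISS ships this pattern as `IndexRefineFlat`: the wrapper keeps the full vectors alongside the compressed codes and re-scores a widened shortlist. A sketch under the same illustrative placeholders as above:

```python
import faiss

d, nlist, m = 512, 1024, 64
base = faiss.IndexIVFPQ(faiss.IndexFlatL2(d), d, nlist, m, 8)
base.nprobe = 32

# Train/add through the wrapper so the flat stage also stores the
# full uncompressed vectors used for re-scoring.
refined = faiss.IndexRefineFlat(base)
refined.train(train_vectors)
refined.add(all_vectors)
refined.k_factor = 4  # shortlist 4*k from the fast stage before re-scoring

distances, ids = refined.search(query_vectors, 10)  # re-ranked top-10
```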
So I pivoted to a GPU machine on Google Cloud Run, an approach that allows production-level scaling of products.
To keep indexing from taking too long, especially as the product count scales up, I shard the work across EC2 instances: each machine is told which products to embed, the processing is distributed across them, and at the end I merge the shards into one FAISS index, as in the sketch below.
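
A sketch of the merge step, assuming every worker wrote its shard with the same pre-trained IVF quantizer so the cluster assignments line up (file names are illustrative):

```python
import faiss

shard_paths = ["shard_0.faiss", "shard_1.faiss", "shard_2.faiss"]  # illustrative

merged = faiss.read_index(shard_paths[0])
for path in shard_paths[1:]:
    shard = faiss.read_index(path)
    merged.merge_from(shard, merged.ntotal)  # shift ids past existing entries

faiss.write_index(merged, "products.faiss")
```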
For this I used the same FashionCLIP model both for getting the top-K and for the rerank. That isn't ideal, as FashionCLIP is built for semantic search.
The approach that works best for our use case, keeping the FAISS index reasonable in size and maintaining query latency, is compressed FashionCLIP embeddings for the top-K stage, embedding only one image per clothing item, and then a precision model like DINOv2 for the re-ranker.
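
A sketch of the rerank stage. Stage one is the FashionCLIP FAISS search above; `facebook/dinov2-base` is the public checkpoint, and the wiring here is illustrative:

```python
import torch
from transformers import AutoImageProcessor, AutoModel

# DINOv2 re-scores the FashionCLIP top-K candidates by cosine similarity.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")

def rerank(query_img, candidate_imgs, candidate_ids):
    inputs = processor(images=[query_img] + candidate_imgs, return_tensors="pt")
    with torch.no_grad():
        feats = model(**inputs).last_hidden_state[:, 0]  # CLS embeddings
    feats = torch.nn.functional.normalize(feats, dim=-1)
    sims = feats[1:] @ feats[0]            # query vs each candidate
    order = sims.argsort(descending=True)
    return [candidate_ids[i] for i in order]
```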
Using DINOv2 for everything did not work well: if a person's hand was raised in the photo, it would match on that pose rather than on the clothing.
At this point the image-to-image matching precision isn't as good as it needs to be, so I'm now looking into building a custom model and reviewing the latest research on the re-identification problem.
DeepFashion2, from 2019, seems to be the latest usable paper. In it they discuss an approach of masking each clothing item, then doing a similarity search.
As a PoC I have been using Google Vision to isolate each clothing piece, with the pre-trained models on top.
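
A sketch of the isolation step with the Cloud Vision object localizer; the clothing label set here is an illustrative subset of what Vision can return:

```python
import io
from google.cloud import vision
from PIL import Image

client = vision.ImageAnnotatorClient()
CLOTHING_LABELS = {"Top", "Pants", "Dress", "Outerwear", "Skirt", "Shoe"}

def crop_clothing(path):
    # Detect objects, keep the clothing ones, crop each bounding box out.
    content = open(path, "rb").read()
    response = client.object_localization(image=vision.Image(content=content))
    img = Image.open(io.BytesIO(content))
    w, h = img.size
    crops = []
    for obj in response.localized_object_annotations:
        if obj.name in CLOTHING_LABELS:
            xs = [v.x * w for v in obj.bounding_poly.normalized_vertices]
            ys = [v.y * h for v in obj.bounding_poly.normalized_vertices]
            crops.append(img.crop((min(xs), min(ys), max(xs), max(ys))))
    return crops
```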
But with the current setup, using pre-trained models and embedding only one picture per clothing item to avoid sharding the index, FashionCLIP for retrieval plus DINOv2 for the rerank is the best, most accurate choice.