
How We Built a Visual Product Search Engine with CLIP

A deep dive into building Trooply Search — a GPU-accelerated visual search engine using OpenAI's CLIP model, Qdrant vector database, and FastAPI. From architecture decisions to production deployment.

Trooply Engineering

The Problem

Most product search relies on text — keywords, filters, categories. But what if a customer has a photo of what they want and no words to describe it? What if they see a pair of shoes on the street and want to find where to buy them?

Text search fails here. Visual search doesn't.

We built Trooply Search to solve this: upload any product image, and instantly find matching or similar products from a catalog of millions. No keywords needed.

Why CLIP?

OpenAI's CLIP (Contrastive Language-Image Pre-training) is a neural network that understands both images and text in the same embedding space. This means:

  • An image of red sneakers and the text "red running shoes" produce similar vectors
  • You can search by image, by text, or by both simultaneously
  • No need to manually tag or categorize products

We use the ViT-L/14 variant — the largest CLIP model OpenAI has released. It produces 768-dimensional embeddings that capture fine-grained visual details. The tradeoff is size (~900MB in memory), but on our RTX GPU, inference takes under 50ms per image.
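Concretely, "similar vectors" means high cosine similarity between the embeddings. A minimal pure-Python sketch of the metric — the 4-dim vectors below are toy stand-ins for 768-dim CLIP embeddings, and `cosine` is an illustrative helper, not part of our API:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy 4-dim stand-ins for 768-dim CLIP embeddings.
img_red_sneakers = [0.9, 0.1, 0.3, 0.2]   # image embedding
txt_red_shoes    = [0.8, 0.2, 0.4, 0.1]   # matching text embedding
txt_blue_jacket  = [0.1, 0.9, 0.1, 0.8]   # unrelated text embedding

# The matching text scores far higher than the unrelated one.
print(cosine(img_red_sneakers, txt_red_shoes) >
      cosine(img_red_sneakers, txt_blue_jacket))  # True
```

Because CLIP puts images and text in the same space, this one metric powers image-to-image, text-to-image, and mixed queries alike.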

Architecture

Our search pipeline looks like this:

  1. Ingestion: Client uploads product images via REST API
  2. Embedding: CLIP encodes each image into a 768-dim vector (GPU-accelerated, FP16)
  3. Storage: Vectors are stored in Qdrant with product metadata
  4. Search: Query image → CLIP embedding → Qdrant nearest-neighbor search → ranked results
  5. Serving: FastAPI serves results with sub-200ms latency
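Stripped of the GPU and Qdrant specifics, steps 3-4 reduce to nearest-neighbor lookup over stored vectors with attached metadata. A toy in-memory stand-in for the Qdrant collection (`ProductIndex` and its methods are illustrative, not our production code):

```python
import math

class ProductIndex:
    """Toy in-memory stand-in for a Qdrant collection."""
    def __init__(self):
        self.items = []  # (vector, metadata) pairs

    def upsert(self, vector, metadata):
        self.items.append((vector, metadata))

    def search(self, query, top_k=5):
        # Rank stored vectors by cosine similarity to the query embedding.
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(y * y for y in b)))
        scored = [(cos(query, vec), meta) for vec, meta in self.items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return scored[:top_k]

index = ProductIndex()
index.upsert([0.9, 0.1, 0.2], {"sku": "SNK-001", "name": "red sneaker"})
index.upsert([0.1, 0.9, 0.3], {"sku": "JKT-042", "name": "blue jacket"})

# A query embedding close to the sneaker returns the sneaker first.
results = index.search([0.8, 0.2, 0.1], top_k=1)
print(results[0][1]["sku"])  # SNK-001
```

Qdrant replaces the brute-force scan with an approximate index (HNSW), which is what keeps this fast at millions of vectors.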

The Stack

  • FastAPI — async Python API framework, handles 500+ concurrent requests
  • CLIP (ViT-L/14) — image/text encoding, runs on NVIDIA RTX GPU
  • Qdrant — purpose-built vector database, handles billion-scale collections
  • PostgreSQL — tenant management, API keys, usage tracking
  • Redis — rate limiting, caching, session management
  • Docker — isolated per-tenant deployment
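The Redis-backed rate limiting in the stack above can be as simple as a fixed-window counter per API key. A sketch with a plain dict standing in for Redis (the logic mirrors an `INCR` + `EXPIRE` pattern; class and parameter names here are illustrative):

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter; a dict stands in for Redis INCR/EXPIRE."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # (api_key, window_id) -> request count

    def allow(self, api_key, now=None):
        now = time.time() if now is None else now
        window_id = int(now // self.window)   # which window this request lands in
        key = (api_key, window_id)
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        return count <= self.limit

limiter = FixedWindowLimiter(limit=2, window_seconds=60)
print(limiter.allow("tenant-a", now=0))   # True
print(limiter.allow("tenant-a", now=1))   # True
print(limiter.allow("tenant-a", now=2))   # False (limit hit)
print(limiter.allow("tenant-a", now=61))  # True (new window)
```

In production the counter lives in Redis so every API replica sees the same per-key counts.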

Multi-Tenancy

Trooply Search is a SaaS product — multiple customers share the same infrastructure but their data is completely isolated.

Each tenant gets:

  • Their own Qdrant collection (no cross-tenant access)
  • Scoped API keys with rate limits
  • Separate usage tracking and billing
  • Isolated search results

We use Qdrant's collection-level isolation rather than namespace-level — this gives stronger security guarantees and independent scaling per tenant.
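With collection-level isolation, tenant routing is mostly a naming convention plus a strict API-key lookup — an unknown key can never fall through to another tenant's collection. A minimal sketch (`TenantRegistry` and the `products_<tenant>` naming are illustrative, not our actual schema):

```python
class TenantRegistry:
    """Maps API keys to per-tenant Qdrant collection names."""
    def __init__(self):
        self.keys = {}  # api_key -> tenant_id

    def register(self, api_key, tenant_id):
        self.keys[api_key] = tenant_id

    def collection_for(self, api_key):
        # Reject unknown keys outright — no default collection to leak into.
        tenant_id = self.keys.get(api_key)
        if tenant_id is None:
            raise PermissionError("invalid API key")
        return f"products_{tenant_id}"

registry = TenantRegistry()
registry.register("key-abc", "acme")
print(registry.collection_for("key-abc"))  # products_acme
```

Every search and upsert goes through this lookup, so cross-tenant access is impossible by construction rather than by filtering.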

GPU Optimization

Running CLIP on CPU takes ~2 seconds per image. On our RTX GPU with FP16 inference:

  • Single image: 45ms
  • Batch of 32 images: 180ms (5.6ms per image)
  • Memory: ~3.9GB total (model + CUDA runtime + framework overhead)

Key optimizations:

  • FP16 inference: Half-precision floats on GPU — 2x speed, half the VRAM
  • Batch processing: Group incoming images and encode in batches
  • Pre-computed category embeddings: Common search queries are pre-encoded at startup
  • Background removal: rembg strips backgrounds before encoding, improving match quality by 15-20%
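The batch-processing step above is plain request grouping: collect queued images until a batch fills (or a timeout fires), then run one GPU forward pass. The grouping itself is simple — `batched` is an illustrative helper, with the batch size of 32 taken from the benchmark numbers above:

```python
def batched(items, batch_size=32):
    """Yield successive fixed-size batches; the last one may be short."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 70 queued images -> two full batches of 32 plus a remainder of 6.
queue = [f"img_{i}.jpg" for i in range(70)]
sizes = [len(batch) for batch in batched(queue)]
print(sizes)  # [32, 32, 6]
```

Each yielded batch becomes a single FP16 forward pass, which is where the 45ms-per-image cost drops to ~5.6ms.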

Results

In production with real customer catalogs:

  • Search latency: p50 = 89ms, p99 = 180ms
  • Relevance: 87% of top-5 results rated "relevant" by human evaluators
  • Throughput: 500+ concurrent searches on a single GPU
  • Uptime: 99.9% over 6 months

What We Learned

  1. Vector search is only as good as your embeddings. CLIP is excellent for general products but struggles with very similar items (e.g., different shades of the same shoe). Fine-tuning on domain-specific data helps significantly.
  2. Background matters more than you think. A product on a white background vs. a lifestyle photo produces very different embeddings. Background removal before encoding improved relevance by 15-20%.
  3. GPU memory is the bottleneck, not compute. The CLIP model + PyTorch + CUDA runtime takes 3.9GB. On a 16GB GPU, that leaves room for ~3 more concurrent model instances. Plan your GPU memory budget carefully.
  4. Self-hosting wins for privacy-sensitive clients. Several customers chose Trooply Search specifically because their product images never leave their infrastructure. This is a real competitive advantage over cloud-only solutions.

Try It

Trooply Search is live at search.trooply.in. Upload an image, get results in under 200ms.

Want to integrate visual search into your product? Contact us — we can have you running in a day.

CLIP · visual search · Qdrant · FastAPI · GPU · PyTorch

Want to build something similar?

We help companies build and deploy AI products. Let's talk about your project.