Engineering · 9 min read · Feb 12, 2026

AI in Recruitment: Architecture Breakdown

What's actually running under the hood of an AI recruitment platform? From NLP pipelines and embedding models to vector similarity search: a technical deep-dive for curious minds.

InnoHire Editorial Team

InnoHire.ai

Tags: Architecture, Vector DB, ML, FastAPI

Pipeline stages: 7
Embedding dimensions: 1536
P95 API latency: <200ms
Uptime target: 99.9%

System Design Philosophy

InnoHire.ai is built on a microservice-oriented API architecture where every core capability (resume parsing, embedding generation, similarity scoring, ranking, content generation) is a discrete, independently scalable service. This design allows each component to scale based on its own load profile rather than scaling a monolith uniformly.

The system is stateless at the request layer. Every API call carries its own context, ensuring horizontal scalability without session-state complexity. Persistent state lives in purpose-built data stores: PostgreSQL for structured data, a vector database for embeddings, and blob storage for raw document files.

The Core Components

Document Ingestion Service

Accepts resumes in PDF, DOCX, and plain text. Uses format-aware parsers to extract raw text without losing structural context, handling multi-column layouts, tables, and embedded headers common in professional resumes.
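A minimal sketch of the ingestion dispatch, assuming a per-suffix parser registry; the names `PARSERS` and `ingest` are illustrative, and production deployments would register real PDF and DOCX parsers in the same table:

```python
from pathlib import Path

def parse_plain_text(data: bytes) -> str:
    """Simplest parser: decode raw bytes as UTF-8 text."""
    return data.decode("utf-8", errors="replace")

# Hypothetical parser registry; PDF/DOCX parsers would plug in here.
PARSERS = {".txt": parse_plain_text}

def ingest(path: str) -> str:
    """Route an uploaded resume to a format-aware parser by file suffix."""
    suffix = Path(path).suffix.lower()
    parser = PARSERS.get(suffix)
    if parser is None:
        raise ValueError(f"unsupported resume format: {suffix}")
    return parser(Path(path).read_bytes())
```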

NLP Parsing Engine

Processes raw text through a Named Entity Recognition (NER) pipeline that identifies skills, job titles, companies, durations, certifications, and education. The model is fine-tuned on recruitment-domain corpora, not generic text.
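As a rough illustration of the pipeline's output shape, here is a toy rule-based extractor. The lexicon, regex, and function names are invented for this sketch; the production system uses a fine-tuned transformer NER model rather than pattern matching:

```python
import re

# Hypothetical skill lexicon; the real pipeline learns entities from data.
SKILL_LEXICON = {"python", "react", "redux", "postgresql", "fastapi"}
TITLE_PATTERN = re.compile(r"(senior|junior)?\s*(software|data)\s+engineer", re.I)

def extract_entities(text: str) -> dict:
    """Toy stand-in for the recruitment-domain NER stage."""
    tokens = re.findall(r"[a-zA-Z+#]+", text.lower())
    skills = sorted({t for t in tokens if t in SKILL_LEXICON})
    titles = [m.group(0).strip() for m in TITLE_PATTERN.finditer(text)]
    return {"skills": skills, "titles": titles}
```

A call like `extract_entities("Senior Software Engineer with Python, FastAPI and PostgreSQL experience.")` yields a structured record of skills and titles, which downstream stages consume instead of raw text.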

Embedding Service

Converts parsed entities and full document contexts into dense vector representations using transformer-based embedding models. Each resume and job description becomes a set of semantic vectors that encode meaning in 1,536-dimensional space.

Scoring Engine

Computes cosine similarity between resume and JD embeddings across multiple dimensions. Applies recruiter-configured factor weights to produce a composite match score and ranked candidate list.
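The scoring step can be sketched as a weighted sum of per-factor cosine similarities. The factor names and weights below are hypothetical examples of a recruiter configuration, not the platform's actual defaults:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical recruiter-configured factor weights (sum to 1.0).
WEIGHTS = {"skills": 0.5, "experience": 0.3, "education": 0.2}

def composite_score(resume_vecs, jd_vecs, weights=WEIGHTS):
    """Weighted sum of per-factor cosine similarities between resume and JD."""
    return sum(w * cosine(resume_vecs[f], jd_vecs[f]) for f, w in weights.items())
```

Because each factor is scored independently before weighting, recruiters can tune how much, say, education counts without re-embedding anything.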

"

Moving from keyword matching to vector similarity is like upgrading from a map to GPS โ€” the destination looks the same but the routing intelligence is incomparable.

NLP & Embedding Layer

The NLP layer uses a domain-adapted transformer model. Unlike general-purpose embeddings, the model understands that SDE II and Software Engineer L4 are equivalent at a conceptual level, and that Redux is a relevant entity when a JD mentions "React state management" without naming it explicitly.

Model architecture

The embedding model uses a bi-encoder architecture: resumes and JDs are encoded independently and compared post-hoc. This enables the pre-computation of resume embeddings, dramatically reducing per-request latency when new jobs are posted.
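A toy sketch of the bi-encoder flow, with a deterministic stand-in for the transformer embedder; `embed`, `resume_index`, and `rank_for_job` are illustrative names, not the production API:

```python
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedder; stands in for the transformer bi-encoder."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# Resume embeddings are computed once at ingestion time and cached.
resume_index = {
    "r1": embed("python backend engineer"),
    "r2": embed("graphic designer"),
}

def rank_for_job(jd_text: str) -> list[str]:
    q = embed(jd_text)  # the only embedding computed at request time
    return sorted(resume_index,
                  key=lambda r: -sum(a * b for a, b in zip(q, resume_index[r])))
```

The key property is in `rank_for_job`: posting a new job costs one embedding call plus cheap dot products against the cached index, rather than re-encoding every resume.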

Vector Similarity Search

Embedding vectors are stored in a vector database (Pinecone or pgvector depending on deployment tier). When a new job description is posted, its embedding is computed and used to perform an approximate nearest-neighbor (ANN) search across the candidate embedding index.

This retrieval step returns the top-K semantically similar candidates in milliseconds, before any weighted scoring is applied. The scoring layer then re-ranks these candidates using structured factor weights.
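The two-stage flow can be sketched as below, with brute-force cosine standing in for the ANN index (Pinecone or pgvector in production); the vectors and structured scores are toy data:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(jd_vec, index, k):
    """Stage 1: coarse top-K candidates by raw semantic similarity."""
    return sorted(index, key=lambda cid: -cosine(jd_vec, index[cid]))[:k]

def rerank(candidates, structured_score):
    """Stage 2: re-order the shortlist by a weighted structured score."""
    return sorted(candidates, key=structured_score, reverse=True)

index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
shortlist = retrieve([1.0, 0.0], index, k=2)
ranked = rerank(shortlist, {"a": 0.4, "b": 0.7}.get)
```

Splitting retrieval from re-ranking keeps the expensive weighted scoring confined to a small shortlist instead of the whole candidate pool.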

Scalability & Infrastructure

The frontend is deployed on Vercel. The API layer runs on serverless FastAPI containers orchestrated by AWS Lambda or GCP Cloud Run depending on region. Database reads are served from read replicas; writes go to a primary Supabase (PostgreSQL) instance. Expensive NLP tasks (embedding generation and content synthesis) are queued through a job queue (BullMQ) to decouple heavy processing from request completion.
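A minimal Python stand-in for that queue-decoupling pattern (production uses BullMQ; the handler and worker names here are illustrative):

```python
import queue
import threading

jobs: "queue.Queue[str | None]" = queue.Queue()
done: list[str] = []

def worker():
    """Background worker: drains the queue; None is the shutdown signal."""
    while True:
        doc_id = jobs.get()
        if doc_id is None:
            break
        done.append(f"embedded:{doc_id}")  # stand-in for embedding generation

threading.Thread(target=worker, daemon=True).start()

def handle_upload(doc_id: str) -> dict:
    """Request handler: enqueue the expensive job and return immediately."""
    jobs.put(doc_id)
    return {"status": "accepted", "doc_id": doc_id}

resp = handle_upload("resume-42")
jobs.put(None)  # shut the worker down for this demo
```

The handler's latency is just the enqueue cost; embedding throughput becomes a worker-scaling concern rather than a request-latency concern.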

Observability

Every API call produces structured logs. Metrics are collected per-service covering latency percentiles (P50, P95, P99), error rates, queue depths, and embedding generation throughput. Alerting fires when P95 latency for scoring exceeds 500ms or when error rates exceed 0.5% over a 5-minute window.
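The alert thresholds above can be expressed as a small check. The percentile helper uses nearest-rank rounding, and the function names are illustrative, not the monitoring stack's API:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    s = sorted(samples)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

def should_alert(latencies_ms, errors, total):
    """Fire when scoring P95 > 500ms or the window's error rate > 0.5%."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / total if total else 0.0
    return p95 > 500 or error_rate > 0.005
```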
