Building CandidAI: An AI Interviewer That Actually Sees You





How I built a real-time AI technical interviewer that watches your body language, listens to your answers, evaluates your code, and generates a full performance report — in one week.

"What if your mock interviewer could actually see you fidgeting, notice when you're confused, and adapt in real-time?"

When the Vision Possible: Agent Protocol hackathon dropped, the brief was clear: build multi-modal AI agents that watch, listen, and understand video in real-time. Most people went for security cameras or sports coaching. I went for something that every developer dreads — technical interviews.

Vision Agents SDK by Stream

The hackathon was sponsored by Vision Agents — Stream's open-source SDK for building real-time Vision AI agents. It gives you WebRTC transport, video frame capture, event systems, and native LLM APIs out of the box. The challenge: use it to build something that pushes the boundaries of real-time video intelligence.

The Problem

Technical interviews are broken. Companies spend thousands on hiring pipelines, candidates spend weeks grinding LeetCode, and at the end of it all, the evaluation is still subjective. A human interviewer might miss that the candidate's hands were shaking, or that they solved the problem in half the expected time but stumbled on communication.

What if we could build an interviewer that never misses a signal?

CandidAI Demo — Live interview with body language analysis, avatar, and transcript

CandidAI: The Idea

CandidAI is a full-stack AI technical interviewer that:

  • Sees you — YOLO pose estimation tracks body language every 3 seconds (posture, fidgeting, eye contact)
  • Hears you — Deepgram STT transcribes speech in real-time with smart turn detection
  • Tests you — Presents coding challenges with a Monaco editor, evaluates solutions, asks MCQs
  • Talks to you — Edge TTS gives the AI a natural voice over WebRTC
  • Knows the material — PostgreSQL + pgvector RAG over 169 interview question files
  • Reports everything — Generates radar charts, timelines, and dimensional scores after each session

CandidAI Architecture

The Architecture

Here's where it gets interesting. The entire system runs over WebRTC using Stream's Vision Agents SDK:

Browser (Next.js)  <--WebRTC (Stream)--> Python Agent (Vision Agents SDK)
     |                                        |
     |-- Convex Cloud (DB)                    |-- OpenAI GPT-5.2 (LLM)
     |-- Monaco Editor                        |-- PostgreSQL + pgvector (RAG)
     |-- SVG Avatar (Framer Motion)           |-- Deepgram STT
     |-- Recharts (reports)                   |-- Edge TTS (free)
                                              |-- YOLO Pose (yolo11n-pose.pt)

The agent joins the call as a WebRTC participant — it receives the candidate's audio and video tracks through Stream's SFU (Selective Forwarding Unit), processes them through the AI pipeline, and sends back audio + custom events.

Vision Pipeline (The Core of It)

This is where Vision Agents SDK shines. When the candidate joins:

  1. Continuous capture: 1 FPS from the active video track (webcam or screen share)
  2. Frame buffer: 5-second rolling buffer = ~5 frames
  3. Send trigger: When Deepgram detects the user finished speaking
  4. To the LLM: Last 5 buffered frames sent as base64 JPEGs with conversation context
  5. Screen share priority: When the candidate shares their screen, the SDK automatically switches from webcam to screen share (priority=1 vs 0)
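The capture-and-flush logic above fits in a few lines. Here is an illustrative stand-in (class and method names like `FrameBuffer` and `flush` are mine, not the SDK's) for what Vision Agents wires into the video track subscription:

```python
import time
from collections import deque
from typing import Optional

class FrameBuffer:
    """Rolling frame buffer: keep the last N seconds of frames at a fixed
    sample rate. Illustrative sketch, not the SDK's implementation."""

    def __init__(self, seconds: int = 5, fps: float = 1.0):
        self.interval = 1.0 / fps                      # min gap between kept frames
        self.frames: deque = deque(maxlen=int(seconds * fps))
        self._last_capture = float("-inf")

    def add_frame(self, frame: bytes, now: Optional[float] = None) -> bool:
        """Sample at ~1 FPS: frames arriving sooner than the interval are dropped."""
        now = time.monotonic() if now is None else now
        if now - self._last_capture < self.interval:
            return False
        self._last_capture = now
        self.frames.append(frame)
        return True

    def flush(self) -> list:
        """On end-of-turn from STT: hand the buffered snapshot to the LLM."""
        snapshot = list(self.frames)
        self.frames.clear()
        return snapshot
```

Sampling at capture time keeps memory flat regardless of the source frame rate, and the deque's `maxlen` enforces the 5-frame window automatically.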
# The agent factory wires everything together
agent = CandidAIAgent(
    llm=openrouter_vlm,          # GPT-5.2 with vision (frame buffering + tool calling)
    stt=deepgram_stt,            # Real-time transcription
    tts=edge_tts,                # Free Microsoft neural voice
    transport=stream_edge,       # WebRTC via Stream
    processors=[pose_processor], # YOLO body language
)

Body Language Analysis

Every 3 seconds, the YOLO pose processor analyzes the candidate's webcam feed:

from dataclasses import dataclass

# Pose processor emits events with body language metrics
@dataclass
class BodyLanguageEvent:
    posture: float      # 0-1 score
    fidgeting: float    # 0-1 score
    eye_contact: float  # 0-1 score
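For a flavor of how such scores can fall out of pose estimation: Ultralytics YOLO pose models return keypoints in COCO order per frame, and a simplified posture heuristic (my guess at the kind of signal involved, not CandidAI's actual formula) might be:

```python
# COCO keypoint indices used by YOLO pose models
L_SHOULDER, R_SHOULDER = 5, 6

def posture_score(keypoints, frame_height: float) -> float:
    """1.0 when the shoulders are level, falling toward 0 as tilt grows.
    keypoints: list of (x, y) pairs in COCO order."""
    left_y = keypoints[L_SHOULDER][1]
    right_y = keypoints[R_SHOULDER][1]
    tilt = abs(left_y - right_y) / frame_height   # normalized shoulder tilt
    return max(0.0, 1.0 - 10.0 * tilt)            # ~10% tilt bottoms out the score

# In the real processor the keypoints come from the video track, roughly:
#   results = YOLO("yolo11n-pose.pt")(frame)
#   keypoints = results[0].keypoints.xy[0]
```

Fidgeting and eye-contact scores would be analogous heuristics over wrist movement between frames and head/eye keypoint geometry.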

These metrics stream to the frontend as custom events through Stream's SFU — the browser never talks to the AI pipeline directly. Everything flows through WebRTC.

10 Function Tools

The LLM doesn't just talk — it acts. I gave it 10 function tools:

| Tool | What It Does |
| --- | --- |
| `search_knowledge_base` | Queries pgvector RAG for relevant interview questions |
| `set_expression` | Controls the SVG avatar's face (10 expressions) |
| `nod_head` | Avatar nods in agreement |
| `raise_eyebrows` | Avatar shows surprise |
| `score_response` | Scores candidate on 5 dimensions (0-10) |
| `present_mcq` | Shows multiple choice questions |
| `present_coding_challenge` | Opens Monaco editor with a problem |
| `evaluate_code` | Reviews submitted code |
| `transition_phase` | Moves through interview phases |
| `generate_report` | Creates the final assessment |
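Tools are registered with the SDK's @llm.register_function decorator. A minimal stand-in registry shows the shape of that wiring (the real decorator and its schema plumbing will differ):

```python
# Stand-in for the SDK's tool registry; illustrative only.
TOOLS = {}

def register_function(description: str):
    """Decorator that records a function as an LLM-callable tool."""
    def wrap(fn):
        TOOLS[fn.__name__] = {"fn": fn, "description": description}
        return fn
    return wrap

@register_function("Score the candidate's last answer on five dimensions (0-10).")
def score_response(communication: int, problem_solving: int,
                   technical_depth: int, code_quality: int,
                   confidence: int) -> dict:
    # In CandidAI the result is also pushed to the frontend as a custom event
    return {
        "communication": communication,
        "problem_solving": problem_solving,
        "technical_depth": technical_depth,
        "code_quality": code_quality,
        "confidence": confidence,
    }
```

The LLM sees each tool's name, description, and parameter schema, and the agent dispatches its tool calls back through the registry.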

The Data Flow

CandidAI Data Flow

The frontend is built with Next.js 15, React 19, and Tailwind v4. The interview room has:

  • Animated SVG avatar — 12 animatable properties (eyebrows, pupils, mouth, head tilt, cheek blush) driven by Framer Motion with lerp interpolation
  • Monaco code editor — Full syntax highlighting with a custom dark theme
  • Real-time transcript — Shows the conversation as it happens
  • Body language indicator — Visual feedback on posture, fidgeting, eye contact
  • MCQ cards — For quick technical knowledge checks

The avatar is the star. Each expression defines 12 properties — when the LLM calls set_expression("thinking"), the custom event flows through WebRTC → useAvatarEvents hook → lerp interpolation smoothly transitions all 12 properties. It feels alive.
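The frontend does this interpolation in TypeScript with Framer Motion; the core idea, sketched here in Python to match the rest of the post's snippets, is just repeated lerping of every property toward its target:

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation: t=0 gives a, t=1 gives b."""
    return a + (b - a) * t

def step_toward(current: dict, target: dict, t: float = 0.25) -> dict:
    """Move every avatar property a fraction t toward its target.
    Run once per animation frame, this yields a smooth ease-out."""
    return {key: lerp(current[key], target[key], t) for key in current}
```

Because each frame closes a fixed fraction of the remaining gap, the motion decelerates naturally as it approaches the target expression.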

The RAG Knowledge Base

I didn't want the interviewer to ask generic questions. So I built a RAG system over 169 curated markdown files covering:

  • Behavioral questions with rubrics and senior-level variants
  • Coding problems across 17 categories (arrays, graphs, DP, trees...)
  • System design (Twitter, web crawler, scaling AWS...)
  • Frontend deep-dives (company-specific questions from Meta, Google, Amazon...)
  • Technical concepts (closures, event delegation, virtual DOM, CORS...)

PostgreSQL + pgvector with fastembed (BAAI/bge-small-en-v1.5) handles the embeddings. The agent queries relevant questions based on the interview context.
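The production path embeds the query with fastembed and lets pgvector rank by cosine distance. The same nearest-neighbour lookup, runnable in-memory without a database (the corpus and top_k shape here are illustrative), looks like:

```python
import math

def cosine_sim(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_knowledge_base(query_vec, corpus, top_k: int = 3):
    """corpus: list of (question_text, embedding) pairs.
    pgvector equivalent:
        SELECT content FROM questions ORDER BY embedding <=> %s LIMIT %s;
    (<=> is pgvector's cosine-distance operator.)"""
    ranked = sorted(corpus, key=lambda item: cosine_sim(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]
```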

Five-Phase Interview Flow

The interview follows a structured 5-phase flow:

  1. Intro — Quick greeting, language preference
  2. Behavioral — STAR-method questions, scored on communication
  3. Technical — Deep-dive concepts, MCQs
  4. Coding — Live coding challenge with real-time evaluation
  5. Wrapup — Final report generation with radar chart

Each phase transition is a function call from the LLM. The agent decides when to move on based on the conversation flow.
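One way to keep LLM-driven transitions honest is a strictly-forward guard around the transition_phase tool. This enum-based sketch is my illustration, not CandidAI's actual code:

```python
from enum import Enum

class Phase(Enum):
    INTRO = 0
    BEHAVIORAL = 1
    TECHNICAL = 2
    CODING = 3
    WRAPUP = 4

def transition_phase(current: Phase, requested: Phase) -> Phase:
    """Accept only the immediate next phase; anything else is ignored,
    so a confused tool call can't skip ahead or rewind the interview."""
    if requested.value == current.value + 1:
        return requested
    return current
```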

How I Used Vision Agents SDK

Vision Agents SDK was the backbone of this project. Here's specifically what it provided:

  • getstream.Edge() transport — WebRTC connection through Stream's SFU with sub-30ms audio/video latency
  • Video track subscription — Automatic capture of webcam and screen share at 1 FPS with priority-based switching
  • VideoProcessor base class — Extended for YOLO pose estimation with the Warmable pattern for model loading
  • Agent class — Core agent lifecycle with function tool registration via @llm.register_function
  • EventManager — Custom event system for body language metrics, avatar control, and scores
  • send_custom_event() — Bidirectional communication between agent and frontend through Stream's SFU
  • Native LLM integration — OpenAI GPT-5.2 VLM with vision (sends buffered frames as base64 JPEGs + tool calling)
  • STT/TTS pipeline — Deepgram transcription + Edge TTS with smart turn detection

Without Vision Agents, I would have had to build the entire WebRTC pipeline, video frame capture, audio routing, and event system from scratch. The SDK handled all of that, letting me focus on the interview logic.

Bugs That Nearly Killed Me

1. The Silent Video Track Bug

The YOLO pose processor wasn't receiving any frames. Turns out, the SDK's EventManager.subscribe() uses typing.get_type_hints() to route events. My event handler was missing a type hint:

# BROKEN — handler never registered
async def _on_track_published(self, event):
    ...

# FIXED — SDK can now route the event
async def _on_track_published(self, event: TrackPublishedEvent):
    ...

One missing type hint = complete silence. Took me hours to figure out.

2. OpenAI API SSL Errors

Sending 5 base64 images per LLM call over a long-running connection caused transient SSL errors (SSLV3_ALERT_BAD_RECORD_MAC). Fixed with retry logic — 3 attempts with exponential backoff.
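A minimal version of that retry wrapper, with hypothetical names (the real agent wraps its OpenAI chat-completion call):

```python
import asyncio

async def call_with_retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Retry a transient failure with exponential backoff (1s, 2s, 4s).
    Illustrative sketch; in practice you'd catch the specific SSL/network
    exceptions rather than bare Exception."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise                                   # out of retries
            await asyncio.sleep(base_delay * (2 ** attempt))
```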

The Stack

| Layer | Technology |
| --- | --- |
| LLM | OpenAI GPT-5.2 (vision + function calling) |
| STT | Deepgram (flux-general-en) |
| TTS | Edge TTS (free Microsoft neural, en-US-AriaNeural) |
| Pose | YOLO 11n-pose via Ultralytics |
| Transport | Vision Agents SDK + Stream (WebRTC) |
| Frontend | Next.js 15 + React 19 + Tailwind v4 |
| Database | Convex Cloud (7 tables) |
| RAG | PostgreSQL + pgvector + fastembed |
| Editor | Monaco (@monaco-editor/react) |
| Charts | Recharts (RadarChart) |
| Deploy | Docker Compose |

What I Learned

Building CandidAI in a week pushed me to my limits. Some takeaways:

  1. Vision Agents SDK is genuinely powerful — The amount of infrastructure it abstracts away (WebRTC, SFU, video capture, event routing) is massive. I went from zero to a working real-time video AI agent in days.

  2. Type hints matter in event-driven systems — The SDK uses runtime type introspection for event routing. Miss a hint, lose an event. Document your types.

  3. Real-time AI is about buffering, not streaming — You don't send every frame to the LLM. You buffer, sample, and send snapshots at the right moment (when the user finishes speaking).

  4. Custom events over WebRTC are underrated — Stream's send_custom_event() let me build an entire UI control protocol (avatar expressions, scores, phases, challenges) without any additional WebSocket connections.

  5. Body language adds a dimension — Even basic posture/fidgeting/eye-contact scores provide surprisingly useful interview insights. The radar chart at the end tells a story that transcripts alone can't.

Try It

CandidAI is open source. Clone it, set up your API keys, and run your own AI technical interviews:

git clone https://github.com/aryan877/candidai.git
cd candidai
docker compose -f docker-compose.dev.yml up

The AI will greet you, ask your preferred language, and take you through a full 5-phase technical interview — behavioral, technical deep-dive, live coding, and a final report with dimensional scores.

Built for the Vision Possible: Agent Protocol hackathon. Powered by Vision Agents SDK by Stream.
