Building CandidAI: An AI Interviewer That Actually Sees You





How I built a real-time AI technical interviewer that watches your body language, listens to your answers, evaluates your code, and generates a full performance report — in one week.

"What if your mock interviewer could actually see you fidgeting, notice when you're confused, and adapt in real-time?"

When the Vision Possible: Agent Protocol hackathon dropped, the brief was clear: build multi-modal AI agents that watch, listen, and understand video in real-time. Most people went for security cameras or sports coaching. I went for something that every developer dreads — technical interviews.

Vision Agents SDK by Stream

The hackathon was sponsored by Vision Agents — Stream's open-source SDK for building real-time Vision AI agents. It gives you WebRTC transport, video frame capture, event systems, and native LLM APIs out of the box. The challenge: use it to build something that pushes the boundaries of real-time video intelligence.

The Problem

Technical interviews are broken. Companies spend thousands on hiring pipelines, candidates spend weeks grinding LeetCode, and at the end of it all, the evaluation is still subjective. A human interviewer might miss that the candidate's hands were shaking, or that they solved the problem in half the expected time but stumbled on communication.

What if we could build an interviewer that never misses a signal?

CandidAI Demo — Live interview with body language analysis, avatar, and transcript

CandidAI: The Idea

CandidAI is a full-stack AI technical interviewer that:

  • Sees you — YOLO pose estimation tracks body language every 3 seconds (posture, fidgeting, eye contact)
  • Hears you — Deepgram STT transcribes speech in real-time with smart turn detection
  • Tests you — Presents coding challenges with a Monaco editor, evaluates solutions, asks MCQs
  • Talks to you — Edge TTS gives the AI a natural voice over WebRTC
  • Knows the material — PostgreSQL + pgvector RAG over 169 interview question files
  • Reports everything — Generates radar charts, timelines, and dimensional scores after each session

CandidAI Architecture

The Architecture

Here's where it gets interesting. The entire system runs over WebRTC using Stream's Vision Agents SDK:

Browser (Next.js)  <--WebRTC (Stream)--> Python Agent (Vision Agents SDK)
     |                                        |
     |-- Convex Cloud (DB)                    |-- OpenAI GPT-5.2 (LLM)
     |-- Monaco Editor                        |-- PostgreSQL + pgvector (RAG)
     |-- SVG Avatar (Framer Motion)           |-- Deepgram STT
     |-- Recharts (reports)                   |-- Edge TTS (free)
                                              |-- YOLO Pose (yolo11n-pose.pt)

The agent joins the call as a WebRTC participant — it receives the candidate's audio and video tracks through Stream's SFU (Selective Forwarding Unit), processes them through the AI pipeline, and sends back audio + custom events.

Vision Pipeline (The Core of It)

This is where Vision Agents SDK shines. When the candidate joins:

  1. Continuous capture: 1 FPS from the active video track (webcam or screen share)
  2. Frame buffer: 5-second rolling buffer = ~5 frames
  3. Send trigger: When Deepgram detects the user finished speaking
  4. To the LLM: Last 5 buffered frames sent as base64 JPEGs with conversation context
  5. Screen share priority: When the candidate shares their screen, the SDK automatically switches from webcam to screen share (priority=1 vs 0)
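The capture-and-flush logic above fits in a few lines. Here is an illustrative stand-in (class and method names like `FrameBuffer` and `flush` are mine, not the SDK's) for what Vision Agents wires into the video track subscription:

```python
import time
from collections import deque
from typing import Optional

class FrameBuffer:
    """Rolling frame buffer: keep the last N seconds of frames at a fixed
    sample rate. Illustrative sketch, not the SDK's implementation."""

    def __init__(self, seconds: int = 5, fps: float = 1.0):
        self.interval = 1.0 / fps                      # min gap between kept frames
        self.frames: deque = deque(maxlen=int(seconds * fps))
        self._last_capture = float("-inf")

    def add_frame(self, frame: bytes, now: Optional[float] = None) -> bool:
        """Sample at ~1 FPS: frames arriving sooner than the interval are dropped."""
        now = time.monotonic() if now is None else now
        if now - self._last_capture < self.interval:
            return False
        self._last_capture = now
        self.frames.append(frame)
        return True

    def flush(self) -> list:
        """On end-of-turn from STT: hand the buffered snapshot to the LLM."""
        snapshot = list(self.frames)
        self.frames.clear()
        return snapshot
```

Sampling at capture time keeps memory flat regardless of the source frame rate, and the deque's `maxlen` enforces the 5-frame window automatically.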
# The agent factory wires everything together
agent = CandidAIAgent(
    llm=openrouter_vlm,          # GPT-5.2 with vision (frame buffering + tool calling)
    stt=deepgram_stt,            # Real-time transcription
    tts=edge_tts,                # Free Microsoft neural voice
    transport=stream_edge,       # WebRTC via Stream
    processors=[pose_processor], # YOLO body language
)

Body Language Analysis

Every 3 seconds, the YOLO pose processor analyzes the candidate's webcam feed:

from dataclasses import dataclass

# Pose processor emits events with body language metrics
@dataclass
class BodyLanguageEvent:
    posture: float      # 0-1 score
    fidgeting: float    # 0-1 score
    eye_contact: float  # 0-1 score
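For a flavor of how such scores can fall out of pose estimation: Ultralytics YOLO pose models return keypoints in COCO order per frame, and a simplified posture heuristic (my guess at the kind of signal involved, not CandidAI's actual formula) might be:

```python
# COCO keypoint indices used by YOLO pose models
L_SHOULDER, R_SHOULDER = 5, 6

def posture_score(keypoints, frame_height: float) -> float:
    """1.0 when the shoulders are level, falling toward 0 as tilt grows.
    keypoints: list of (x, y) pairs in COCO order."""
    left_y = keypoints[L_SHOULDER][1]
    right_y = keypoints[R_SHOULDER][1]
    tilt = abs(left_y - right_y) / frame_height   # normalized shoulder tilt
    return max(0.0, 1.0 - 10.0 * tilt)            # ~10% tilt bottoms out the score

# In the real processor the keypoints come from the video track, roughly:
#   results = YOLO("yolo11n-pose.pt")(frame)
#   keypoints = results[0].keypoints.xy[0]
```

Fidgeting and eye-contact scores would be analogous heuristics over wrist movement between frames and head/eye keypoint geometry.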

These metrics stream to the frontend as custom events through Stream's SFU — the browser never talks to the AI pipeline directly. Everything flows through WebRTC.

10 Function Tools

The LLM doesn't just talk — it acts. I gave it 10 function tools:

| Tool | What It Does |
| --- | --- |
| `search_knowledge_base` | Queries pgvector RAG for relevant interview questions |
| `set_expression` | Controls the SVG avatar's face (10 expressions) |
| `nod_head` | Avatar nods in agreement |
| `raise_eyebrows` | Avatar shows surprise |
| `score_response` | Scores candidate on 5 dimensions (0-10) |
| `present_mcq` | Shows multiple choice questions |
| `present_coding_challenge` | Opens Monaco editor with a problem |
| `evaluate_code` | Reviews submitted code |
| `transition_phase` | Moves through interview phases |
| `generate_report` | Creates the final assessment |
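Tools are registered with the SDK's @llm.register_function decorator. A minimal stand-in registry shows the shape of that wiring (the real decorator and its schema plumbing will differ):

```python
# Stand-in for the SDK's tool registry; illustrative only.
TOOLS = {}

def register_function(description: str):
    """Decorator that records a function as an LLM-callable tool."""
    def wrap(fn):
        TOOLS[fn.__name__] = {"fn": fn, "description": description}
        return fn
    return wrap

@register_function("Score the candidate's last answer on five dimensions (0-10).")
def score_response(communication: int, problem_solving: int,
                   technical_depth: int, code_quality: int,
                   confidence: int) -> dict:
    # In CandidAI the result is also pushed to the frontend as a custom event
    return {
        "communication": communication,
        "problem_solving": problem_solving,
        "technical_depth": technical_depth,
        "code_quality": code_quality,
        "confidence": confidence,
    }
```

The LLM sees each tool's name, description, and parameter schema, and the agent dispatches its tool calls back through the registry.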

The Data Flow

CandidAI Data Flow

The frontend is built with Next.js 15, React 19, and Tailwind v4. The interview room has:

  • Animated SVG avatar — 12 animatable properties (eyebrows, pupils, mouth, head tilt, cheek blush) driven by Framer Motion with lerp interpolation
  • Monaco code editor — Full syntax highlighting with a custom dark theme
  • Real-time transcript — Shows the conversation as it happens
  • Body language indicator — Visual feedback on posture, fidgeting, eye contact
  • MCQ cards — For quick technical knowledge checks

The avatar is the star. Each expression defines 12 properties — when the LLM calls set_expression("thinking"), the custom event flows through WebRTC → useAvatarEvents hook → lerp interpolation smoothly transitions all 12 properties. It feels alive.
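The frontend does this interpolation in TypeScript with Framer Motion; the core idea, sketched here in Python to match the rest of the post's snippets, is just repeated lerping of every property toward its target:

```python
def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation: t=0 gives a, t=1 gives b."""
    return a + (b - a) * t

def step_toward(current: dict, target: dict, t: float = 0.25) -> dict:
    """Move every avatar property a fraction t toward its target.
    Run once per animation frame, this yields a smooth ease-out."""
    return {key: lerp(current[key], target[key], t) for key in current}
```

Because each frame closes a fixed fraction of the remaining gap, the motion decelerates naturally as it approaches the target expression.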

The RAG Knowledge Base

I didn't want the interviewer to ask generic questions. So I built a RAG system over 169 curated markdown files covering:

  • Behavioral questions with rubrics and senior-level variants
  • Coding problems across 17 categories (arrays, graphs, DP, trees...)
  • System design (Twitter, web crawler, scaling AWS...)
  • Frontend deep-dives (company-specific questions from Meta, Google, Amazon...)
  • Technical concepts (closures, event delegation, virtual DOM, CORS...)

PostgreSQL + pgvector with fastembed (BAAI/bge-small-en-v1.5) handles the embeddings. The agent queries relevant questions based on the interview context.
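The production path embeds the query with fastembed and lets pgvector rank by cosine distance. The same nearest-neighbour lookup, runnable in-memory without a database (the corpus and top_k shape here are illustrative), looks like:

```python
import math

def cosine_sim(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search_knowledge_base(query_vec, corpus, top_k: int = 3):
    """corpus: list of (question_text, embedding) pairs.
    pgvector equivalent:
        SELECT content FROM questions ORDER BY embedding <=> %s LIMIT %s;
    (<=> is pgvector's cosine-distance operator.)"""
    ranked = sorted(corpus, key=lambda item: cosine_sim(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]
```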

Five-Phase Interview Flow

The interview follows a structured 5-phase flow:

  1. Intro — Quick greeting, language preference
  2. Behavioral — STAR-method questions, scored on communication
  3. Technical — Deep-dive concepts, MCQs
  4. Coding — Live coding challenge with real-time evaluation
  5. Wrapup — Final report generation with radar chart

Each phase transition is a function call from the LLM. The agent decides when to move on based on the conversation flow.
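One way to keep LLM-driven transitions honest is a strictly-forward guard around the transition_phase tool. This enum-based sketch is my illustration, not CandidAI's actual code:

```python
from enum import Enum

class Phase(Enum):
    INTRO = 0
    BEHAVIORAL = 1
    TECHNICAL = 2
    CODING = 3
    WRAPUP = 4

def transition_phase(current: Phase, requested: Phase) -> Phase:
    """Accept only the immediate next phase; anything else is ignored,
    so a confused tool call can't skip ahead or rewind the interview."""
    if requested.value == current.value + 1:
        return requested
    return current
```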

How I Used Vision Agents SDK

Vision Agents SDK was the backbone of this project. Here's specifically what it provided:

  • getstream.Edge() transport — WebRTC connection through Stream's SFU with sub-30ms audio/video latency
  • Video track subscription — Automatic capture of webcam and screen share at 1 FPS with priority-based switching
  • VideoProcessor base class — Extended for YOLO pose estimation with the Warmable pattern for model loading
  • Agent class — Core agent lifecycle with function tool registration via @llm.register_function
  • EventManager — Custom event system for body language metrics, avatar control, and scores
  • send_custom_event() — Bidirectional communication between agent and frontend through Stream's SFU
  • Native LLM integration — OpenAI GPT-5.2 VLM with vision (sends buffered frames as base64 JPEGs + tool calling)
  • STT/TTS pipeline — Deepgram transcription + Edge TTS with smart turn detection

Without Vision Agents, I would have had to build the entire WebRTC pipeline, video frame capture, audio routing, and event system from scratch. The SDK handled all of that, letting me focus on the interview logic.

Bugs That Nearly Killed Me

1. The Silent Video Track Bug

The YOLO pose processor wasn't receiving any frames. Turns out, the SDK's EventManager.subscribe() uses typing.get_type_hints() to route events. My event handler was missing a type hint:

# BROKEN — handler never registered
async def _on_track_published(self, event):
    ...

# FIXED — SDK can now route the event
async def _on_track_published(self, event: TrackPublishedEvent):
    ...

One missing type hint = complete silence. Took me hours to figure out.

2. OpenAI API SSL Errors

Sending 5 base64 images per LLM call over a long-running connection caused transient SSL errors (SSLV3_ALERT_BAD_RECORD_MAC). Fixed with retry logic — 3 attempts with exponential backoff.
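A minimal version of that retry wrapper, with hypothetical names (the real agent wraps its OpenAI chat-completion call):

```python
import asyncio

async def call_with_retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Retry a transient failure with exponential backoff (1s, 2s, 4s).
    Illustrative sketch; in practice you'd catch the specific SSL/network
    exceptions rather than bare Exception."""
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise                                   # out of retries
            await asyncio.sleep(base_delay * (2 ** attempt))
```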

The Stack

| Layer | Technology |
| --- | --- |
| LLM | OpenAI GPT-5.2 (vision + function calling) |
| STT | Deepgram (flux-general-en) |
| TTS | Edge TTS (free Microsoft neural, en-US-AriaNeural) |
| Pose | YOLO 11n-pose via Ultralytics |
| Transport | Vision Agents SDK + Stream (WebRTC) |
| Frontend | Next.js 15 + React 19 + Tailwind v4 |
| Database | Convex Cloud (7 tables) |
| RAG | PostgreSQL + pgvector + fastembed |
| Editor | Monaco (@monaco-editor/react) |
| Charts | Recharts (RadarChart) |
| Deploy | Docker Compose |

What I Learned

Building CandidAI in a week pushed me to my limits. Some takeaways:

  1. Vision Agents SDK is genuinely powerful — The amount of infrastructure it abstracts away (WebRTC, SFU, video capture, event routing) is massive. I went from zero to a working real-time video AI agent in days.

  2. Type hints matter in event-driven systems — The SDK uses runtime type introspection for event routing. Miss a hint, lose an event. Document your types.

  3. Real-time AI is about buffering, not streaming — You don't send every frame to the LLM. You buffer, sample, and send snapshots at the right moment (when the user finishes speaking).

  4. Custom events over WebRTC are underrated — Stream's send_custom_event() let me build an entire UI control protocol (avatar expressions, scores, phases, challenges) without any additional WebSocket connections.

  5. Body language adds a dimension — Even basic posture/fidgeting/eye-contact scores provide surprisingly useful interview insights. The radar chart at the end tells a story that transcripts alone can't.

Try It

CandidAI is open source. Clone it, set up your API keys, and run your own AI technical interviews:

git clone https://github.com/aryan877/candidai.git
cd candidai
docker compose -f docker-compose.dev.yml up

The AI will greet you, ask your preferred language, and take you through a full 5-phase technical interview — behavioral, technical deep-dive, live coding, and a final report with dimensional scores.

Built for the Vision Possible: Agent Protocol hackathon. Powered by Vision Agents SDK by Stream.
