Privacy · February 2026

Why Local Speech-to-Text with Whisper AI Matters for Privacy

Your interview audio contains sensitive information — proprietary questions, your personal responses, company details. Where that audio gets processed matters.

Every AI interview tool needs speech-to-text. The interviewer speaks, the tool transcribes, and the transcription feeds into the language model. This step is non-negotiable — it's how the tool understands what's being asked.

The question is: where does the transcription happen? Most tools send your audio to a cloud service. Some tools — including Shadow Claude — run transcription entirely on your device using OpenAI's Whisper model. The difference matters more than you might think.

The Problem with Cloud Transcription

Cloud transcription is the default approach because it's easy to implement. Make an API call, get text back. But during a live interview, this creates four significant problems:

01

Your audio is transmitted to third-party servers

Cloud transcription services like Google Speech-to-Text, AWS Transcribe, and OpenAI Whisper API all require sending raw audio data over the internet. Even with encryption in transit, your audio exists on someone else's server for processing.

02

Audio may be stored and used for model training

Most cloud providers retain audio data for quality improvement unless you explicitly opt out. Some services include fine print allowing audio to be used for training future models. Your interview conversations could become training data.

03

Network requests create a forensic trail

Every API call to a cloud transcription service creates network traffic that can be logged by corporate proxies, VPNs, or network monitoring tools. IT departments can see that you're sending audio to an external service during a video call.

04

Latency is unpredictable

Cloud transcription adds a network round trip to every request. On a good connection, that's 100-300ms; on a bad one, it can spike to seconds. During a live interview, consistent, predictable transcription latency matters.

What Is Whisper and How Does It Work Locally?

Whisper is a speech recognition model created by OpenAI and released as open-source in 2022. It was trained on 680,000 hours of multilingual audio and can transcribe speech in over 90 languages. Unlike OpenAI's paid Whisper API (which runs in their cloud), the open-source model weights can be downloaded and run locally on any machine.

The model comes in several sizes — tiny, base, small, medium, and large — each trading accuracy for speed. For real-time interview transcription, the base model hits the sweet spot: accurate enough for conversational English, fast enough to transcribe in under 2 seconds on a modern CPU.
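The size trade-off can be made concrete with the approximate parameter counts published for the open-source checkpoints (figures from the openai/whisper repository; the comparison below is illustrative):

```python
# Approximate parameter counts for the open-source Whisper checkpoints,
# as published in the openai/whisper repository README.
WHISPER_MODELS = {
    "tiny":   39_000_000,
    "base":   74_000_000,
    "small":  244_000_000,
    "medium": 769_000_000,
    "large":  1_550_000_000,
}

# "base" is roughly 2x the size of "tiny" but about 21x smaller than
# "large" — which is why it is a common choice for real-time CPU use.
ratio = WHISPER_MODELS["large"] / WHISPER_MODELS["base"]
print(f"large has ~{ratio:.0f}x the parameters of base")
```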

When running locally, Whisper processes audio entirely within your machine's RAM and CPU. No network call is made. No audio data leaves the device. The transcription is generated in-process and passed directly to the next stage of the pipeline.

How Shadow Claude's Audio Pipeline Works

Shadow Claude captures system audio (what you hear through your speakers or headphones) using the operating system's loopback interface. This audio goes through a multi-stage local pipeline — no network involved until the final question is sent to Claude for answering.

01

Audio Capture

System audio is captured directly from your meeting software on Windows. This picks up whatever the interviewer is saying through Zoom, Teams, or Meet.

02

Resampling

Raw audio is resampled to 16kHz mono PCM — the format Whisper expects. This runs in a dedicated audio thread with <50ms latency.
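A minimal sketch of this resampling step, using linear interpolation on a mono float stream (illustrative only — a production resampler would use a windowed-sinc filter to avoid aliasing, and this is not Shadow Claude's actual code):

```python
def resample_to_16k(samples, src_rate, dst_rate=16_000):
    """Linearly interpolate a mono PCM float stream to the target rate."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two nearest source samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# System audio is typically 48 kHz; one second in -> one second out.
one_second = [0.0] * 48_000
print(len(resample_to_16k(one_second, 48_000)))  # 16000
```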

03

Voice Activity Detection

An energy-based VAD detects when someone is speaking and when they stop. Silence triggers transcription of the latest speech segment.
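The idea behind an energy-based VAD can be sketched in a few lines: a frame counts as speech when its RMS energy exceeds a threshold, and speech is considered finished after a few consecutive quiet frames, so short pauses don't trigger transcription. The threshold and hangover values here are illustrative, not Shadow Claude's actual tuning:

```python
def rms(frame):
    """Root-mean-square energy of one audio frame."""
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def detect_speech_end(frames, threshold=0.01, hangover=3):
    """Return the index of the first quiet frame after speech ends,
    or None if the speaker hasn't finished yet."""
    quiet = 0
    speaking = False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speaking = True
            quiet = 0
        elif speaking:
            quiet += 1
            if quiet >= hangover:
                return i - hangover + 1  # first quiet frame after speech
    return None

# Five loud frames, then silence: speech ends at frame index 5.
frames = [[0.5] * 10] * 5 + [[0.0] * 10] * 4
print(detect_speech_end(frames))  # 5
```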

04

Rolling Transcription

A 30-second sliding window of audio is transcribed by Whisper every 6 seconds (or immediately on silence). New text is diffed against the accumulated transcript to extract only what's new.
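Because each Whisper pass re-transcribes an overlapping window, its output mostly repeats text already captured. One way to extract only what's new — a simplified sketch, not the actual diffing code — is to find the longest suffix of the accumulated transcript that is a prefix of the new window:

```python
def extract_new_text(accumulated, latest_window):
    """Return the portion of `latest_window` not already in `accumulated`."""
    max_overlap = min(len(accumulated), len(latest_window))
    # Try the longest possible overlap first, shrinking until one matches.
    for n in range(max_overlap, 0, -1):
        if accumulated.endswith(latest_window[:n]):
            return latest_window[n:]
    return latest_window  # no overlap: the whole window is new

transcript = "tell me about a time you"
window = "about a time you solved a conflict"
print(extract_new_text(transcript, window))  # " solved a conflict"
```

In practice the diff would operate on Whisper's word or segment timestamps rather than raw characters, which is more robust when the model revises earlier words between passes.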

05

Question Detection

New text is classified as a question, context, or noise. Questions trigger the AI response pipeline. Context is accumulated for additional background.
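A rough heuristic for this classification step might look like the following — a simplified stand-in for the real classifier, with hypothetical thresholds and phrase lists:

```python
INTERROGATIVES = ("what", "why", "how", "when", "where", "who", "which",
                  "can you", "could you", "tell me", "walk me through",
                  "describe", "explain")

def classify(text):
    """Label a transcript fragment as 'question', 'context', or 'noise'."""
    t = text.strip().lower()
    if len(t.split()) < 3:
        return "noise"      # filler like "uh, okay"
    if t.endswith("?") or t.startswith(INTERROGATIVES):
        return "question"   # triggers the AI response pipeline
    return "context"        # accumulated as background

print(classify("How would you design a rate limiter?"))         # question
print(classify("Uh, okay."))                                    # noise
print(classify("We use Kubernetes for most of our services."))  # context
```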

The only network call in the entire pipeline is the final step: sending the transcribed text question to Claude for answer generation. Your raw audio never leaves your machine.

Whisper Model Specifications

Model: Whisper Base (English)
File size: ~148 MB (bundled)
Inference: CPU-only (no GPU required)
Latency: <2 seconds for 5s of audio
Languages: 12+ (English-only model for best accuracy)
Audio format: 16kHz mono PCM (resampled from system audio)
Network usage: Zero — all processing on-device

Cloud vs. Local Transcription: Side by Side

| Factor | Cloud | Local (Whisper) |
|---|---|---|
| Audio leaves device | Yes | No |
| Network required | Yes | No |
| Latency | 100-500ms + processing | <2s total (CPU) |
| Data retention risk | Depends on provider | None |
| Network forensics | Detectable | No trace |
| Works offline | No | Yes |
| Accuracy | Slightly higher (large models) | Good (base model) |
| Cost per request | $0.006/min (Whisper API) | Free (bundled) |

The accuracy trade-off is minimal for interview transcription. Conversational English at normal speaking pace is well within the base model's capabilities. Where cloud models have an edge — heavy accents, noisy environments, specialized vocabulary — the difference rarely affects interview performance.

Why This Matters Specifically for Interviews

Interview audio is uniquely sensitive. It often contains proprietary coding problems that companies invest significant effort in creating and keeping confidential. Interviewers share internal project details when describing the role. You discuss your current employer, compensation, and career plans.

Sending this audio to a cloud service means trusting that provider with all of this information. Local processing eliminates the trust requirement entirely. Your audio is processed, transcribed, and discarded — all within your machine's memory.

It also eliminates the network footprint. Corporate VPNs and network monitoring tools can detect unusual API traffic during a video call. Local Whisper generates zero network traffic during transcription — the only external calls are the text-based requests to the answer generation API.

Try Privacy-First Transcription

Shadow Claude runs Whisper locally on your machine. Your audio never leaves your device. Free plan available — no credit card required.