Deepfake Detector - from crawler to cloud interface
TL;DR
We designed and shipped an end-to-end pipeline: custom short-video crawler → cloud storage → face-aware frame sampling → multi-model voting → auditable API and web UI. It’s pragmatic, scalable, and easy to extend with better models or new sources.
Introduction
This started as a uni project that we treated like a product. Goal: ingest short-form videos, analyze them with multiple ML models, and expose trustworthy results through an API and a small web app. Human reviewers can verify detections to keep the system honest and continuously improve it.
What made it hard
- Short-video platforms are hostile to automation (cookie walls, infinite scroll, dynamic markup). We built a resilient crawler instead of fighting brittle scrapers.
- Models disagree. We needed an ensemble and a clear aggregation strategy that works per frame and per video.
- Inference is spiky. The system had to scale up quickly during crawls and scale down to near-zero cost the rest of the time.
- Auditability. Every detection had to be reproducible with model version, parameters, and timestamps.
Architecture at a glance
High-level flow:
1. Crawler discovers and downloads videos, uploads them to object storage, and creates metadata records.
2. Analyzer samples frames, detects faces, runs deepfake models, aggregates results, and writes back a verdict.
3. API + Web expose users, videos, detections, and analytics with role-based access.
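To make the hand-offs concrete, here is a rough sketch of the records that move between stages. The field names are illustrative stand-ins based on the MongoDB data model described further down, not the exact production types.

```python
# Illustrative stage-boundary records; field names mirror the data model
# described below but are assumptions, not the production schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class VideoRecord:
    """Posted by the crawler after a successful upload."""
    source_url: str
    storage_url: str            # gs:// URL of the uploaded file
    duration_s: float
    hashtags: list[str] = field(default_factory=list)
    crawled_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class DetectionRecord:
    """Written by the analyzer once all sampled frames have been scored."""
    video_id: str
    model_id: str
    frame_scores: list[float]   # per-frame probability of the "fake" class
    verdict: str                # "fake" | "real"
    confidence: float           # mean probability of the winning class
    analyzer_version: str
```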
Pipeline, in practice
Crawler
- Stack: Headless Chrome (Selenium) for navigation and banner handling, custom scroll logic, and link extraction; yt-dlp for robust media download.
- Naming & storage: Filenames are sanitized and prefixed with ISO timestamps. Videos land in Google Cloud Storage; metadata (source URL, duration, hashtags, storage URL) is posted to the API.
- Backoff & retries: Exponential backoff around fetches, checksum validation, and idempotent uploads keep the pipeline stable during bursts (see the sketch below).
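A condensed sketch of that download → sanitize → upload loop, assuming yt-dlp and the google-cloud-storage client. The bucket name, retry count, filename pattern, and the `download_and_upload` helper are placeholders rather than the production values.

```python
# Minimal sketch of the download -> sanitize -> upload path; bucket name and
# retry limits are placeholders, and error handling is trimmed for brevity.
import re
import time
from datetime import datetime, timezone
from pathlib import Path

from yt_dlp import YoutubeDL
from google.cloud import storage

BUCKET = "deepfake-videos"        # placeholder bucket name
MAX_RETRIES = 4

def sanitized_name(title: str) -> str:
    """ISO-timestamp prefix + filesystem-safe title."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    return f"{stamp}_{safe}.mp4"

def download_and_upload(url: str, workdir: Path) -> str:
    """Download one video with yt-dlp and upload it to GCS with retries."""
    with YoutubeDL({"outtmpl": str(workdir / "%(id)s.%(ext)s"), "quiet": True}) as ydl:
        info = ydl.extract_info(url, download=True)
        local_path = Path(ydl.prepare_filename(info))

    blob_name = sanitized_name(info.get("title") or info["id"])
    bucket = storage.Client().bucket(BUCKET)

    for attempt in range(MAX_RETRIES):
        try:
            # Same blob name on every attempt keeps the upload idempotent.
            bucket.blob(blob_name).upload_from_filename(str(local_path))
            return f"gs://{BUCKET}/{blob_name}"
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```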
Analyzer
- Frame policy: Decode with OpenCV and sample at fixed intervals (e.g., every 5th frame). This balances speed and coverage for short clips.
- Face focus: Detect faces using a fast Haar cascade; crop with a safety margin; convert BGR→RGB for model processors. If no face is found, we still evaluate the raw frame to avoid blind spots.
- Models: Plug-and-play classifiers loaded from Hugging Face (e.g., ViT-based and CNN-based detectors). Each returns logits → softmax per frame.
- Aggregation: Majority vote across frames; confidence = mean probability of the winning class. We store per-frame traces for audits (see the sketch after this list).
- Packaging: Shipped as a Google Cloud Function (Python 3.10) with increased memory for inference. Concurrency gives elastic scale without babysitting servers.
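Putting those bullets together, a minimal single-model sketch could look like the following; the Hub model id, sampling step, crop margin, and the `analyze_video`/`face_crop` helpers are placeholders. The same frame-voting pattern repeats per model before the ensemble verdicts are combined.

```python
# Sketch of the per-video analysis loop under the stated frame policy; the
# Hugging Face model id, crop margin, and sampling step are placeholders.
from collections import Counter

import cv2
from PIL import Image
from transformers import pipeline

MODEL_ID = "org/deepfake-detector"   # placeholder Hub id
FRAME_STEP = 5                       # sample every 5th frame
MARGIN = 0.2                         # 20% safety margin around the face box

classifier = pipeline("image-classification", model=MODEL_ID)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def face_crop(frame):
    """Return the first detected face (with margin) or the full frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return frame                 # no face found: evaluate the raw frame
    x, y, w, h = faces[0]
    dx, dy = int(w * MARGIN), int(h * MARGIN)
    return frame[max(0, y - dy): y + h + dy, max(0, x - dx): x + w + dx]

def analyze_video(path: str) -> dict:
    """Sample frames, classify each face crop, and majority-vote a verdict."""
    cap = cv2.VideoCapture(path)
    labels, confidences, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % FRAME_STEP == 0:
            rgb = cv2.cvtColor(face_crop(frame), cv2.COLOR_BGR2RGB)
            top = classifier(Image.fromarray(rgb), top_k=1)[0]
            labels.append(top["label"])
            confidences.append(top["score"])
        idx += 1
    cap.release()

    if not labels:
        return {"verdict": "unknown", "confidence": 0.0, "frames": []}
    verdict, _ = Counter(labels).most_common(1)[0]
    confidence = (sum(c for l, c in zip(labels, confidences) if l == verdict)
                  / labels.count(verdict))
    return {"verdict": verdict, "confidence": confidence,
            "frames": list(zip(labels, confidences))}
```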
API & Web
- API: FastAPI served by Uvicorn. Endpoints cover users, videos, detections, models, and analytics. All writes are atomic; responses include model id, confidence, and timestamps for traceability (a minimal endpoint sketch follows this list).
- Data model (MongoDB):
  - videos: source URL, storage URL, duration, counts, hashtags, uploader, crawl timestamps.
  - detections: video id, model id, per-frame scores, aggregated verdict, confidence, analyzer version.
  - users & reviews: roles, verification decisions, and comments that close the human-in-the-loop feedback cycle.
- UI: A small Flask app lists videos, shows verdicts and confidences, and lets admins filter by model, date range, or hashtag.
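For flavor, a stripped-down sketch of the detections write path with FastAPI and Pydantic v2. The route, field names, and Mongo connection are illustrative, not the exact production API.

```python
# Minimal FastAPI sketch of the detections write path; route, field names,
# and the Mongo connection are illustrative placeholders.
from datetime import datetime, timezone

from fastapi import FastAPI
from pydantic import BaseModel, Field
from pymongo import MongoClient

app = FastAPI()
detections = MongoClient()["deepfake"]["detections"]   # placeholder connection

class DetectionIn(BaseModel):
    video_id: str
    model_id: str
    frame_scores: list[float]
    verdict: str = Field(pattern="^(real|fake)$")       # Pydantic v2 constraints
    confidence: float = Field(ge=0.0, le=1.0)
    analyzer_version: str

@app.post("/detections")
def create_detection(payload: DetectionIn) -> dict:
    """Persist a detection and echo back the traceability fields."""
    doc = payload.model_dump()
    doc["created_at"] = datetime.now(timezone.utc)
    detections.insert_one(doc)
    return {
        "video_id": doc["video_id"],
        "model_id": doc["model_id"],
        "confidence": doc["confidence"],
        "created_at": doc["created_at"].isoformat(),
    }
```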
Reliability, ops, and cost
- Stateless workers: Crawler and analyzer are stateless; progress lives in the DB. That makes retries safe and parallelization easy.
- Cold starts & throughput: Analyzer runs well as a Cloud Function; for heavier models we can drop it into Cloud Run to tame cold starts and pin CPU/RAM.
- Observability: Structured logs with request ids, per-stage timings, and model versions. Basic metrics: crawl rate, analyzer latency, success/error ratios, and queue depth.
- Security: API keys with scoped permissions, signed GCS URLs, and strict validation (Pydantic) on all incoming payloads.
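As an example of the signed-URL piece, a short sketch using the google-cloud-storage client; the `signed_video_url` helper, bucket name, and 15-minute expiry are assumptions.

```python
# Sketch of issuing a short-lived signed URL for a stored video; the bucket
# name and 15-minute expiry are assumptions.
from datetime import timedelta
from google.cloud import storage

def signed_video_url(blob_name: str, bucket_name: str = "deepfake-videos") -> str:
    """Return a V4 signed GET URL so the web UI can stream without public access."""
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=15),
        method="GET",
    )
```

The web UI embeds the returned URL directly, so the bucket itself never needs public read access.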
Results that mattered
- On our short-form test set the ensemble produced a sub-20% false-accept rate, with room to improve as we add harder negatives and temporal cues.
- The pipeline kept up with bursty crawls thanks to object storage + elastic compute, without us managing servers.
Engineering notes
- Small conventions compound: timestamped filenames, idempotent uploads, and strict schemas saved hours of debugging.
- Ensemble > single model for this domain. Disagreement is normal; the system is designed to quarantine uncertainty rather than over-promise (see the triage sketch after this list).
- Human review is first-class. A single click can flip a verdict and mark a sample for retraining.
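One way such quarantining can look in code; the 0.7 confidence floor, status names, and `triage` helper below are assumptions for illustration, not the production thresholds.

```python
# One possible triage rule for model disagreement; the 0.7 confidence floor
# and status names are assumptions, not the production thresholds.
def triage(verdicts: dict[str, str], confidences: dict[str, float]) -> str:
    """Map per-model verdicts to a status: auto-accept only on confident agreement."""
    labels = set(verdicts.values())
    if len(labels) > 1:
        return "needs_review"            # models disagree: quarantine for humans
    if min(confidences.values()) < 0.7:
        return "needs_review"            # agreement, but low confidence
    return f"auto_{labels.pop()}"        # e.g. "auto_fake" or "auto_real"
```

For example, two models agreeing on "fake" at 0.91 and 0.84 would auto-accept, while any split vote lands in the human review queue.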
What’s next
- Model quality: Curate tougher negatives, add lightweight per-creator adapters, and explore temporal signals (blink rate, head-pose drift) to reduce single-frame bias.
- Throughput: Batch inference on Cloud Run for GPU bursts when needed; keep the API on App Engine for simplicity.
- Coverage: Expand beyond a single platform, normalize metadata across sources, and schedule re-checks when models improve.