Methodology
How Bullsift scores YouTube video credibility. Every metric explained.
Last updated: April 29, 2026
Analysis Pipeline Overview
Bullsift uses a two-pass AI pipeline to analyze YouTube videos. Both passes feed into a single public-facing metric — the Sift Score (0–100, higher = more trustworthy) — which is what gets shown on share cards, the explore feed, and the Chrome extension popup.
Pass 1: Quick Sift
- What it does: Extracts the transcript, generates an AI summary, identifies 10–40 individual claims with per-claim sourcing & speculation flags, and produces a Quick Sift Score
- Speed: 2–3 seconds
- Availability: All tiers including Free
- Web search: No — uses AI general knowledge only
- Sift Score badge: “Quick Sift estimate” (Truth subscore not included)
Pass 2: Deep Sift
- What it does: Takes the most critical verifiable claims from Pass 1, verifies each one against the open web, and produces per-claim verdicts plus a Truth subscore that weights the final Sift Score
- Speed: 15–30 seconds
- Availability: Pro (50/month) and Power (150/month) tiers
- Web search: Yes — 3–10 targeted web searches per claim batch, cross-referencing multiple independent sources
- Sift Score badge: “Deep Sift verified” (full formula with Truth subscore at 55% weight)
Sift Score
The Sift Score is Bullsift's primary public metric: a single 0–100 number that answers “is this video worth watching?” Higher = more trustworthy.
It blends five independent subscores so no single signal can dominate. Truth carries the most weight when available (Deep Sift only); for Quick Sift the formula leans on sourcing, balance, and channel trust.
Formula — Deep Sift
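The page publishes only one weight from this blend: Truth at 55% (see the Deep Sift badge above). A minimal sketch in Python, where the split of the remaining 45% across the other four subscores is an illustrative assumption, not the official weighting:

```python
# Deep Sift blend (sketch). Truth's 55% weight is documented above;
# the split of the remaining 45% is an illustrative assumption.
DEEP_WEIGHTS = {
    "truth": 0.55,          # documented
    "sourcing": 0.15,       # assumed
    "balance": 0.10,        # assumed
    "originality": 0.05,    # assumed
    "channel_trust": 0.15,  # assumed
}

def blend(subscores: dict[str, float | None], weights: dict[str, float]) -> int:
    """Weighted blend of 0-100 subscores; a null subscore blends as a neutral 50."""
    total = 0.0
    for name, weight in weights.items():
        value = subscores.get(name)
        total += weight * (value if value is not None else 50.0)
    return round(total)
```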
Formula — Quick Sift (estimate)
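Quick Sift has no Truth subscore, so the weight shifts to sourcing, balance, and channel trust, per the note above. This sketch reuses blend() from the Deep Sift sketch; the weights are equally illustrative:

```python
# Quick Sift blend (sketch): no Truth subscore, weights assumed.
QUICK_WEIGHTS = {
    "sourcing": 0.35,       # assumed
    "balance": 0.25,        # assumed
    "originality": 0.10,    # assumed
    "channel_trust": 0.30,  # assumed
}

# Example: a null channel_trust blends as a neutral 50.
quick = blend(
    {"sourcing": 80, "balance": 65, "originality": 40, "channel_trust": None},
    QUICK_WEIGHTS,
)  # -> 63
```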
Quick Sift videos are tagged with a “Quick Sift estimate” badge in the UI so users know the score didn't include claim verification. There's no cap on Quick Sift — calibration showed capping erased meaningful differences between videos.
Color Bands
The Five Subscores
Every video page shows these subscores in a drilldown so you can see why the headline number landed where it did. Each is 0–100 (or “—” when not computed). They're also persisted to the database so the displayed breakdown can never drift from what the formula actually used.
Truth
Weighted average of factual + statistic claim verdicts (Deep Sift only). Verdict weights:
- Supported / True → 1.0
- Partially Supported → 0.7
- Unverifiable → 0.5 (with up to −10% penalty when more than 40% of factual claims are unverifiable)
- Misleading / Needs Context → 0.2
- Unsupported / False → 0.0
Opinion, recommendation, and prediction claims are excluded — they aren't fact-checkable. Truth is null on Quick Sift videos and on Deep Sift videos that contained no factual claims.
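A minimal sketch of the Truth computation. The verdict weights are documented above; the linear ramp of the unverifiable penalty (from 0 at a 40% share up to the full 10%) is an assumption, since the exact ramp isn't published:

```python
VERDICT_WEIGHTS = {
    "supported": 1.0, "true": 1.0,
    "partially_supported": 0.7,
    "unverifiable": 0.5,
    "misleading": 0.2, "needs_context": 0.2,
    "unsupported": 0.0, "false": 0.0,
}

def truth_subscore(verdicts: list[str]) -> float | None:
    """0-100 weighted average over factual/statistic verdicts; None when empty."""
    if not verdicts:
        return None  # Truth stays null with no fact-checkable claims
    score = 100 * sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)
    share = verdicts.count("unverifiable") / len(verdicts)
    if share > 0.4:
        # Assumed: penalty ramps linearly up to the documented 10% maximum.
        score *= 1 - 0.10 * (share - 0.4) / 0.6
    return score
```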
Sourcing
The fraction of factual claims accompanied by a named source or citation in the transcript. Pass 1 emits a per-claim has_named_source bool: true when the speaker explicitly names a specific source — an institution, publication, dataset, study author, government agency, or court filing.
When per-claim flags aren't available (older analyses), a transcript keyword fallback applies and is capped at 70 because keyword presence is a weaker signal than a structured per-claim flag.
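A sketch of both paths. The claim fields mirror what Pass 1 emits; keyword_score is a hypothetical stand-in for the transcript fallback signal, and the neutral default for videos with no checkable claims is an assumption:

```python
def sourcing_subscore(claims: list[dict], keyword_score: float = 0.0) -> float:
    """Share of factual claims with a named source, as a 0-100 score."""
    flagged = [c for c in claims if "has_named_source" in c]
    if flagged:
        factual = [c for c in flagged if c["category"] == "factual"]
        if not factual:
            return 50.0  # assumed neutral when nothing is checkable
        return 100 * sum(c["has_named_source"] for c in factual) / len(factual)
    # Older analyses: transcript keyword fallback, capped at 70.
    return min(70.0, keyword_score)
```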
Balance
Does the video hedge appropriately and acknowledge counterpoints? Combines a hedging-quality signal (transcript-level) with a per-video acknowledges_counterpoints score from Pass 2. Specifically: 0.6 × hedge_quality + 0.4 × acknowledges_counterpoints. On Quick Sift the hedge signal alone drives this. A speaker who explicitly engages with the strongest opposing case scores higher; strawmanning scores lower.
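As a one-function sketch (both inputs assumed to already be 0–100 signals):

```python
def balance_subscore(hedge_quality: float, counterpoints: float | None) -> float:
    """Documented blend: 0.6 * hedge quality + 0.4 * counterpoint score."""
    if counterpoints is None:   # Quick Sift: hedge signal alone
        return hedge_quality
    return 0.6 * hedge_quality + 0.4 * counterpoints
```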
Originality
The fraction of this video's claims not already well-represented in the global claim graph. A claim that has appeared in more than five other analyzed videos counts as “recycled”. Penalizes content that just repeats what's already been said; rewards videos that surface new claims.
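A sketch, assuming each claim arrives with a count of the other analyzed videos it has appeared in:

```python
def originality_subscore(appearance_counts: list[int]) -> float:
    """Share of claims seen in five or fewer other videos, 0-100."""
    if not appearance_counts:
        return 50.0  # assumed neutral when there are no claims
    fresh = sum(1 for n in appearance_counts if n <= 5)  # >5 means "recycled"
    return 100 * fresh / len(appearance_counts)
```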
Channel Trust
The most trustworthy channel-level signal we have. Resolution order: real ContentItem trust (when community votes exist or Deep Sift has set an AI trust score) → Creator baseline trust (Gemini Flash channel assessment) → Creator community-blended trust score → null. See Channel Trust & Baseline Scoring below for the full breakdown.
Claim Verdicts
How Claims Are Extracted
During Pass 1, the AI extracts 10–40 individual claims depending on video length. Each claim is classified by category (factual, statistic, opinion, prediction, recommendation), tagged with a timestamp, scored for speaker confidence, marked as verifiable or non-verifiable, and flagged with has_named_source and is_speculative bools that feed the Sift sourcing subscore. Advertising claims are automatically filtered out, duplicates are removed, and vague pronouns are resolved to named entities.
How Claims Are Prioritized
Not all claims are sent to Deep Sift. A criticality scoring system ranks claims by importance. Statistics and health/financial claims score highest. Suspicious or low-confidence claims get a boost. Opinions and very short claims are deprioritized. The top-ranked verifiable claims are sent for web verification — 5 claims for Pro users, 10 for Power users.
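The actual criticality weights aren't published. This sketch only shows the shape of the ranking; every boost value and field name below is illustrative:

```python
def criticality(claim: dict) -> float:
    """Illustrative criticality ranking for Deep Sift candidate selection."""
    score = 1.0
    if claim["category"] == "statistic":
        score += 2.0                                 # statistics score highest
    if claim.get("topic") in ("health", "financial"):
        score += 2.0                                 # high-stakes domains
    if claim.get("is_speculative") or claim.get("confidence", 1.0) < 0.5:
        score += 1.0                                 # suspicious/low-confidence boost
    if claim["category"] == "opinion" or len(claim["text"]) < 30:
        score -= 2.0                                 # deprioritized
    return score

def select_for_deep_sift(claims: list[dict], tier: str) -> list[dict]:
    verifiable = [c for c in claims if c.get("verifiable")]
    top_n = 10 if tier == "power" else 5             # Power: 10, Pro: 5
    return sorted(verifiable, key=criticality, reverse=True)[:top_n]
```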
Verdict Categories
Each verified claim receives one of seven verdicts. The vocabulary matches the Truth weights listed above, ranging from Supported / True at full credit down to Unsupported / False at zero credit.
Anti-Hallucination Safeguards
Bullsift enforces strict rules to prevent AI hallucination in verdicts. The AI is prohibited from citing the video itself as evidence (circular reasoning). Only external sources — news articles, official websites, research papers, government records — count as evidence. If a person tells the same story on multiple podcasts, that counts as circular repetition, not independent corroboration. When no external evidence exists, the claim is marked Unverifiable rather than given a false verdict. Verdicts of Supported / True / Partially Supported are required to include at least one source URL; verdicts that don't are post-validated and downgraded to Unverifiable.
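The source-URL post-validation reduces to a simple rule; a minimal sketch:

```python
def post_validate(verdict: str, source_urls: list[str]) -> str:
    """Downgrade positive verdicts that arrive without an external source URL."""
    needs_source = {"supported", "true", "partially_supported"}
    if verdict in needs_source and not source_urls:
        return "unverifiable"
    return verdict
```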
Channel Trust & Baseline Scoring
Channel Trust is the input that powers the Sift Score's channel_trust subscore. It's resolved from a four-step fallback chain so the formula always uses the most meaningful signal available:
- Per-video community + AI trust — the blended ContentItem trust score, used only when real signal exists (community votes > 0 or Deep Sift has set an AI trust score)
- Channel Baseline Trust — an LLM-generated assessment of the channel itself, computed once per creator and refreshed every 30 days
- Creator community-blended trust — only when it's been touched by community votes (otherwise it's the default 50)
- Null — the formula uses a neutral 50 in the blend; the displayed subscore reads as “—”
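A sketch of the chain, with hypothetical attribute names standing in for the real ContentItem and Creator schemas:

```python
def resolve_channel_trust(item, creator) -> float | None:
    """Four-step fallback chain behind the channel_trust subscore (sketch)."""
    # 1. Per-video blended trust, only when real signal exists.
    if item.community_votes > 0 or item.ai_trust_score is not None:
        return item.blended_trust
    # 2. Channel Baseline Trust, refreshed every 30 days.
    if creator.baseline_trust is not None:
        return creator.baseline_trust
    # 3. Creator community-blended trust, only once votes have touched it.
    if creator.community_votes > 0:
        return creator.community_blended_trust
    # 4. Null: the formula blends a neutral 50; the UI displays a dash.
    return None
```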
Channel Baseline Trust Score
For channels with fewer than 5 community votes, Bullsift generates a Baseline Trust Score using three components:
- Channel metadata (40% weight) — subscriber count, channel age, video count, and verification status, scored deterministically
- AI channel assessment (45% weight) — a single Gemini Flash call evaluates the channel's name, description, and metadata to assess overall credibility (~$0.00015/channel)
- Anti-slop heuristic (15% weight) — the inverted channel-heuristics score (see Channel Heuristics below)
When Gemini grounding is unavailable, the formula falls back to 0.55 × metadata + 0.45 × anti_slop. Baselines are auto-refreshed every 30 days, and skipped entirely once a channel has 5+ community votes.
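The weights and the no-grounding fallback are both documented above; as a sketch (all three inputs assumed normalized to 0–100):

```python
def baseline_trust(metadata: float, ai_assessment: float | None,
                   anti_slop: float) -> float:
    """Channel Baseline Trust blend, with the Gemini-grounding fallback."""
    if ai_assessment is None:  # Gemini grounding unavailable
        return 0.55 * metadata + 0.45 * anti_slop
    return 0.40 * metadata + 0.45 * ai_assessment + 0.15 * anti_slop
```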
Community Voting
Pro members get 1× vote weight; Power members get 2×. The community trust score is the proportion of weighted trust votes to total weighted votes, scaled to 0–100. Once a content item accumulates real votes, the per-video trust score takes priority over the channel baseline — community signal is treated as ground truth for that specific video.
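A sketch of the weighted tally (the vote representation is hypothetical; the neutral 50 default matches the untouched-creator default above):

```python
def community_trust(votes: list[tuple[bool, str]]) -> float:
    """Weighted trust-vote share, 0-100. Pro votes count 1x, Power votes 2x."""
    weight = {"pro": 1, "power": 2}
    total = sum(weight[tier] for _, tier in votes)
    trusted = sum(weight[tier] for is_trust, tier in votes if is_trust)
    return 100 * trusted / total if total else 50.0  # default when unvoted
```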
Deepfake & Vision AI Detection
Bullsift's Vision AI analyzes sampled frames from the video to detect AI-generated or manipulated visual content. The system samples 4 frames at different points in the video (10%, 30%, 50%, and 70% of total duration) and analyzes them for artifacts. Results feed the AI Visuals production tag (triggered above 60).
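The sampling positions reduce to four fixed fractions of the runtime:

```python
def frame_timestamps(duration_seconds: float) -> list[float]:
    """Timestamps of the four sampled frames: 10%, 30%, 50%, 70% of runtime."""
    return [duration_seconds * p for p in (0.10, 0.30, 0.50, 0.70)]
```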
What It Detects
- Generative AI imagery — morphing, unnatural textures, structural inconsistencies in faces, hands, and backgrounds
- Stock footage slop — disjointed random stock footage with grainy overlays, light leaks, and large text boxes typical of faceless content farms
- AI slideshows — still images with pan/zoom effects or AI-warping animation
What It Does Not Flag
- Professional motion graphics and recorded interviews
- Financial dashboard screenshots (compression artifacts are normal)
- Designed YouTube thumbnails with bold text and branding
- Channel branding elements like logo animations and end screens
Fakeness Probability Scale
Results are reported as a probability score from 0 to 100. Scores of 0–15 indicate clearly human-produced content. Scores of 16–30 suggest minor concerns but likely human production. Scores above 60 trigger the AI Visuals production tag. The system accounts for channel context — professional verified channels are evaluated with awareness that high-end motion graphics differ from generic AI slop, though established status does not grant a free pass.
Channel Heuristics & Content Farm Detection
Bullsift runs a separate heuristic analysis on each channel to detect content-farm and bot-farm behavior. This score (0.0 to 1.0) feeds the Faceless / High-Volume Channel production tag (triggered above 0.7) and the anti-slop component of the Channel Baseline Trust calculation.
Detection Signals
- Upload velocity — channels posting more than 2 videos per day receive the highest penalty (this upload rate is a signature of automated content generation)
- Age/volume mismatch — a channel less than 90 days old with over 100 videos is flagged as suspicious
- Low subscriber-to-video ratio — fewer than 5 subscribers per video (with 50+ videos) indicates mass-produced content with no audience retention
- Engagement anomalies — abnormally low like-to-view ratios on high-view videos, or suspiciously high ratios that suggest manipulation
Authority Balancing
To prevent false positives on legitimate high-output publishers, authority signals like channel verification, high subscriber counts, and long channel age reduce the heuristic score. A verified channel with 1M+ subscribers receives substantial authority reduction even if upload velocity is high.
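None of the increments or reductions are published; this sketch only shows the shape of the calculation, with illustrative values and hypothetical channel attributes throughout:

```python
def channel_heuristic(ch) -> float:
    """Content-farm heuristic, 0.0-1.0 (higher = more farm-like). Sketch only."""
    score = 0.0
    if ch.uploads_per_day > 2:                        # highest-penalty signal
        score += 0.4
    if ch.age_days < 90 and ch.video_count > 100:     # age/volume mismatch
        score += 0.3
    if ch.video_count >= 50 and ch.subscribers < 5 * ch.video_count:
        score += 0.2                                  # low sub-to-video ratio
    if ch.engagement_anomalous:                       # like/view outliers
        score += 0.2
    # Authority balancing: verification, scale, and age pull the score down.
    if ch.verified and ch.subscribers >= 1_000_000:
        score *= 0.3                                  # illustrative reduction
    return min(score, 1.0)
```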
Global Claims Database
Bullsift maintains a global database of claims extracted from analyzed videos. When the same claim appears across multiple videos, it's canonicalized and tracked — similar to how Snopes tracks recurring claims. This is also what powers the Sift Originality subscore.
Claim Matching
Claims are matched using a three-tier approach: text similarity matching catches obvious duplicates, semantic vector matching (using embeddings) catches claims that say the same thing in different words, and a cache layer prevents redundant re-verification of recently checked claims. Claims that have been seen in more than five other videos count as “recycled” for the originality calculation.
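A sketch of the three tiers, with a hypothetical claim store and an assumed similarity threshold:

```python
def match_claim(text: str, store) -> "Claim | None":
    """Three-tier matcher: cache, exact text, then embedding similarity."""
    if cached := store.cache_get(text):    # recently checked: skip re-verification
        return cached
    if exact := store.find_by_normalized_text(text):
        return exact                       # obvious duplicate
    candidate = store.nearest_by_embedding(store.embed(text))
    if candidate and candidate.similarity >= 0.90:  # assumed threshold
        return candidate                   # same claim, different words
    return None
```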
Temporal Truth Decay
Not all claims age equally. Bullsift categorizes claims by freshness — stable facts rarely need re-verification, while event-driven or fast-changing claims are automatically flagged for periodic re-checking. A background system monitors claim expiry and triggers re-verification when a claim becomes stale, ensuring that verdicts stay current as new evidence emerges.
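The freshness categories and re-check windows aren't enumerated on this page; a sketch with illustrative TTLs:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_TTL = {           # illustrative windows, not official values
    "stable": None,         # rarely re-verified
    "slow_changing": timedelta(days=180),
    "event_driven": timedelta(days=7),
}

def is_stale(verified_at: datetime, freshness: str) -> bool:
    """True once a verdict has outlived its freshness window."""
    ttl = FRESHNESS_TTL[freshness]
    return ttl is not None and datetime.now(timezone.utc) - verified_at > ttl
```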
Slop Score (legacy)
Slop Score was Bullsift's original primary metric (0.0–1.0, lower = better). It measures production quality — AI-generated voice, recycled stock footage, formulaic structure, clickbait titling — rather than whether the content is trustworthy. Calibration against real exports surfaced its core limitation: a polished, human-made conspiracy video could score low-slop and look “good”, while a well-researched AI-narrated explainer could score high-slop and look “bad”.
For that reason, Slop Score has been replaced by the Sift Score as the primary public metric. It is still computed and exposed on the API for backwards compatibility with installed Chrome extensions (v0.1.x), but it's no longer rendered in any new Bullsift UI.
The API field slop_score is marked deprecated per RFC 8594 with a Sunset header of 2027-04-26. It will not be removed from the API before that date; the actual removal will be gated on legacy-extension install share dropping below 1%.