Methodology
How Bullsift scores YouTube video credibility. Every metric explained.
Last updated: April 29, 2026
Analysis Pipeline Overview
Bullsift uses a two-pass AI pipeline to analyze YouTube videos. Both passes feed into a single public-facing metric — the Sift Score (0–100, higher = more trustworthy) — which is what gets shown on share cards, the explore feed, and the Chrome extension popup.
Pass 1: Quick Sift
- What it does: Extracts the transcript, generates an AI summary, identifies 10–40 individual claims with per-claim sourcing & speculation flags, and produces a Quick Sift Score
- Speed: 2–3 seconds
- Availability: All tiers including Free
- Web search: No — uses AI general knowledge only
- Sift Score badge: “Quick Sift estimate” (Truth subscore not included)
Pass 2: Deep Sift
- What it does: Takes the most critical verifiable claims from Pass 1, verifies each one against the open web, and produces per-claim verdicts plus a Truth subscore that weights the final Sift Score
- Speed: 15–30 seconds
- Availability: Pro (50/month) and Power (150/month) tiers
- Web search: Yes — 3–10 targeted web searches per claim batch, cross-referencing multiple independent sources
- Sift Score badge: “Deep Sift verified” (full formula with Truth subscore at 55% weight)
Sift Score
The Sift Score is Bullsift's primary public metric: a single 0–100 number that answers “is this video worth watching?” Higher = more trustworthy.
It blends five independent subscores so no single signal can dominate. Truth carries the most weight when available (Deep Sift only); for Quick Sift the formula leans on sourcing, balance, and channel trust.
Formula — Deep Sift
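The page publishes only one weight from this blend: Truth at 55% (see the Deep Sift badge above). A minimal sketch in Python, where the split of the remaining 45% across the other four subscores is an illustrative assumption, not the official weighting:

```python
# Deep Sift blend (sketch). Truth's 55% weight is documented above;
# the split of the remaining 45% is an illustrative assumption.
DEEP_WEIGHTS = {
    "truth": 0.55,          # documented
    "sourcing": 0.15,       # assumed
    "balance": 0.10,        # assumed
    "originality": 0.05,    # assumed
    "channel_trust": 0.15,  # assumed
}

def blend(subscores: dict[str, float | None], weights: dict[str, float]) -> int:
    """Weighted blend of 0-100 subscores; a null subscore blends as a neutral 50."""
    total = 0.0
    for name, weight in weights.items():
        value = subscores.get(name)
        total += weight * (value if value is not None else 50.0)
    return round(total)
```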
Formula — Quick Sift (estimate)
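Quick Sift has no Truth subscore, so the weight shifts to sourcing, balance, and channel trust, per the note above. This sketch reuses blend() from the Deep Sift sketch; the weights are equally illustrative:

```python
# Quick Sift blend (sketch): no Truth subscore, weights assumed.
QUICK_WEIGHTS = {
    "sourcing": 0.35,       # assumed
    "balance": 0.25,        # assumed
    "originality": 0.10,    # assumed
    "channel_trust": 0.30,  # assumed
}

# Example: a null channel_trust blends as a neutral 50.
quick = blend(
    {"sourcing": 80, "balance": 65, "originality": 40, "channel_trust": None},
    QUICK_WEIGHTS,
)  # -> 63
```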
Quick Sift videos are tagged with a “Quick Sift estimate” badge in the UI so users know the score didn't include claim verification. There's no cap on Quick Sift — calibration showed capping erased meaningful differences between videos.
Color Bands
The Five Subscores
Every video page shows these subscores in a drilldown so you can see why the headline number landed where it did. Each is 0–100 (or “—” when not computed). They're also persisted to the database so the displayed breakdown can never drift from what the formula actually used.
Truth
Weighted average of factual + statistic claim verdicts (Deep Sift only). Verdict weights:
- Supported / True → 1.0
- Partially Supported → 0.7
- Unverifiable → 0.5 (with up to −10% penalty when more than 40% of factual claims are unverifiable)
- Misleading / Needs Context → 0.2
- Unsupported / False → 0.0
Opinion, recommendation, and prediction claims are excluded — they aren't fact-checkable. Truth is null on Quick Sift videos and on Deep Sift videos that contained no factual claims.
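A minimal sketch of the Truth computation. The verdict weights are documented above; the linear ramp of the unverifiable penalty (from 0 at a 40% share up to the full 10%) is an assumption, since the exact ramp isn't published:

```python
VERDICT_WEIGHTS = {
    "supported": 1.0, "true": 1.0,
    "partially_supported": 0.7,
    "unverifiable": 0.5,
    "misleading": 0.2, "needs_context": 0.2,
    "unsupported": 0.0, "false": 0.0,
}

def truth_subscore(verdicts: list[str]) -> float | None:
    """0-100 weighted average over factual/statistic verdicts; None when empty."""
    if not verdicts:
        return None  # Truth stays null with no fact-checkable claims
    score = 100 * sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)
    share = verdicts.count("unverifiable") / len(verdicts)
    if share > 0.4:
        # Assumed: penalty ramps linearly up to the documented 10% maximum.
        score *= 1 - 0.10 * (share - 0.4) / 0.6
    return score
```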
Sourcing
The fraction of factual claims accompanied by a named source or citation in the transcript. Pass 1 emits a per-claim has_named_source bool: true when the speaker explicitly names a specific source — an institution, publication, dataset, study author, government agency, or court filing.
When per-claim flags aren't available (older analyses), a transcript keyword fallback applies and is capped at 70 because keyword presence is a weaker signal than a structured per-claim flag.
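A sketch of both paths. The claim fields mirror what Pass 1 emits; keyword_score is a hypothetical stand-in for the transcript fallback signal, and the neutral default for videos with no checkable claims is an assumption:

```python
def sourcing_subscore(claims: list[dict], keyword_score: float = 0.0) -> float:
    """Share of factual claims with a named source, as a 0-100 score."""
    flagged = [c for c in claims if "has_named_source" in c]
    if flagged:
        factual = [c for c in flagged if c["category"] == "factual"]
        if not factual:
            return 50.0  # assumed neutral when nothing is checkable
        return 100 * sum(c["has_named_source"] for c in factual) / len(factual)
    # Older analyses: transcript keyword fallback, capped at 70.
    return min(70.0, keyword_score)
```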
Balance
Does the video hedge appropriately and acknowledge counterpoints? Combines a hedging-quality signal (transcript-level) with a per-video acknowledges_counterpoints score from Pass 2. Specifically: 0.6 × hedge_quality + 0.4 × acknowledges_counterpoints. On Quick Sift the hedge signal alone drives this. A speaker who explicitly engages with the strongest opposing case scores higher; strawmanning scores lower.
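As a one-function sketch (both inputs assumed to already be 0–100 signals):

```python
def balance_subscore(hedge_quality: float, counterpoints: float | None) -> float:
    """Documented blend: 0.6 * hedge quality + 0.4 * counterpoint score."""
    if counterpoints is None:   # Quick Sift: hedge signal alone
        return hedge_quality
    return 0.6 * hedge_quality + 0.4 * counterpoints
```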
Originality
The fraction of this video's claims not already well-represented in the global claim graph. A claim that has appeared in more than five other analyzed videos counts as “recycled”. Penalizes content that just repeats what's already been said; rewards videos that surface new claims.
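A sketch, assuming each claim arrives with a count of the other analyzed videos it has appeared in:

```python
def originality_subscore(appearance_counts: list[int]) -> float:
    """Share of claims seen in five or fewer other videos, 0-100."""
    if not appearance_counts:
        return 50.0  # assumed neutral when there are no claims
    fresh = sum(1 for n in appearance_counts if n <= 5)  # >5 means "recycled"
    return 100 * fresh / len(appearance_counts)
```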
Channel Trust
The most trustworthy channel-level signal we have. Resolution order: real ContentItem trust (when community votes exist or Deep Sift has set an AI trust score) → Creator baseline trust (Gemini Flash channel assessment) → Creator community-blended trust score → null. See Channel Trust & Baseline Scoring below for the full breakdown.
Claim Verdicts
How Claims Are Extracted
During Pass 1, the AI extracts 10–40 individual claims depending on video length. Each claim is classified by category (factual, statistic, opinion, prediction, recommendation), tagged with a timestamp, scored for speaker confidence, marked as verifiable or non-verifiable, and flagged with has_named_source and is_speculative bools that feed the Sift sourcing subscore. Advertising claims are automatically filtered out, duplicates are removed, and vague pronouns are resolved to named entities.
How Claims Are Prioritized
Not all claims are sent to Deep Sift. A criticality scoring system ranks claims by importance. Statistics and health/financial claims score highest. Suspicious or low-confidence claims get a boost. Opinions and very short claims are deprioritized. The top-ranked verifiable claims are sent for web verification — 5 claims for Pro users, 10 for Power users.
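The actual criticality weights aren't published. This sketch only shows the shape of the ranking; every boost value and field name below is illustrative:

```python
def criticality(claim: dict) -> float:
    """Illustrative criticality ranking for Deep Sift candidate selection."""
    score = 1.0
    if claim["category"] == "statistic":
        score += 2.0                                 # statistics score highest
    if claim.get("topic") in ("health", "financial"):
        score += 2.0                                 # high-stakes domains
    if claim.get("is_speculative") or claim.get("confidence", 1.0) < 0.5:
        score += 1.0                                 # suspicious/low-confidence boost
    if claim["category"] == "opinion" or len(claim["text"]) < 30:
        score -= 2.0                                 # deprioritized
    return score

def select_for_deep_sift(claims: list[dict], tier: str) -> list[dict]:
    verifiable = [c for c in claims if c.get("verifiable")]
    top_n = 10 if tier == "power" else 5             # Power: 10, Pro: 5
    return sorted(verifiable, key=criticality, reverse=True)[:top_n]
```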
Verdict Categories
Each verified claim receives one of seven verdicts. The vocabulary matches the Truth weights listed above, ranging from Supported / True at full credit down to Unsupported / False at zero credit.
Anti-Hallucination Safeguards
Bullsift enforces strict rules to prevent AI hallucination in verdicts. The AI is prohibited from citing the video itself as evidence (circular reasoning). Only external sources — news articles, official websites, research papers, government records — count as evidence. If a person tells the same story on multiple podcasts, that counts as circular repetition, not independent corroboration. When no external evidence exists, the claim is marked Unverifiable rather than given a false verdict. Verdicts of Supported / True / Partially Supported are required to include at least one source URL; verdicts that don't are post-validated and downgraded to Unverifiable.
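The source-URL post-validation reduces to a simple rule; a minimal sketch:

```python
def post_validate(verdict: str, source_urls: list[str]) -> str:
    """Downgrade positive verdicts that arrive without an external source URL."""
    needs_source = {"supported", "true", "partially_supported"}
    if verdict in needs_source and not source_urls:
        return "unverifiable"
    return verdict
```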
Channel Trust & Baseline Scoring
Channel Trust is the input that powers the Sift Score's channel_trust subscore. It's resolved from a four-step fallback chain so the formula always uses the most meaningful signal available:
- Per-video community + AI trust — the blended ContentItem trust score, used only when real signal exists (community votes > 0 or Deep Sift has set an AI trust score)
- Channel Baseline Trust — an LLM-generated assessment of the channel itself, computed once per creator and refreshed every 30 days
- Creator community-blended trust — only when it's been touched by community votes (otherwise it's the default 50)
- Null — the formula uses a neutral 50 in the blend; the displayed subscore reads as “—”
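A sketch of the chain, with hypothetical attribute names standing in for the real ContentItem and Creator schemas:

```python
def resolve_channel_trust(item, creator) -> float | None:
    """Four-step fallback chain behind the channel_trust subscore (sketch)."""
    # 1. Per-video blended trust, only when real signal exists.
    if item.community_votes > 0 or item.ai_trust_score is not None:
        return item.blended_trust
    # 2. Channel Baseline Trust, refreshed every 30 days.
    if creator.baseline_trust is not None:
        return creator.baseline_trust
    # 3. Creator community-blended trust, only once votes have touched it.
    if creator.community_votes > 0:
        return creator.community_blended_trust
    # 4. Null: the formula blends a neutral 50; the UI displays a dash.
    return None
```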
Channel Baseline Trust Score
For channels with fewer than 5 community votes, Bullsift generates a Baseline Trust Score using three components:
- Channel metadata (40% weight) — subscriber count, channel age, video count, and verification status, scored deterministically
- AI channel assessment (45% weight) — a single Gemini Flash call evaluates the channel's name, description, and metadata to assess overall credibility (~$0.00015/channel)
- Anti-slop heuristic (15% weight) — the inverted channel-heuristics score (see Channel Heuristics below)
When Gemini grounding is unavailable, the formula falls back to 0.55 × metadata + 0.45 × anti_slop. Baselines are auto-refreshed every 30 days, and skipped entirely once a channel has 5+ community votes.
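The weights and the no-grounding fallback are both documented above; as a sketch (all three inputs assumed normalized to 0–100):

```python
def baseline_trust(metadata: float, ai_assessment: float | None,
                   anti_slop: float) -> float:
    """Channel Baseline Trust blend, with the Gemini-grounding fallback."""
    if ai_assessment is None:  # Gemini grounding unavailable
        return 0.55 * metadata + 0.45 * anti_slop
    return 0.40 * metadata + 0.45 * ai_assessment + 0.15 * anti_slop
```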
Community Voting
Pro members get 1× vote weight; Power members get 2×. The community trust score is the proportion of weighted trust votes to total weighted votes, scaled to 0–100. Once a content item accumulates real votes, the per-video trust score takes priority over the channel baseline — community signal is treated as ground truth for that specific video.
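A sketch of the weighted tally (the vote representation is hypothetical; the neutral 50 default matches the untouched-creator default above):

```python
def community_trust(votes: list[tuple[bool, str]]) -> float:
    """Weighted trust-vote share, 0-100. Pro votes count 1x, Power votes 2x."""
    weight = {"pro": 1, "power": 2}
    total = sum(weight[tier] for _, tier in votes)
    trusted = sum(weight[tier] for is_trust, tier in votes if is_trust)
    return 100 * trusted / total if total else 50.0  # default when unvoted
```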
Deepfake & Vision AI Detection
Bullsift's Vision AI analyzes sampled frames from the video to detect AI-generated or manipulated visual content. The system samples 4 frames at different points in the video (10%, 30%, 50%, and 70% of total duration) and analyzes them for artifacts. Results feed the AI Visuals production tag (triggered above 60).
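The sampling positions reduce to four fixed fractions of the runtime:

```python
def frame_timestamps(duration_seconds: float) -> list[float]:
    """Timestamps of the four sampled frames: 10%, 30%, 50%, 70% of runtime."""
    return [duration_seconds * p for p in (0.10, 0.30, 0.50, 0.70)]
```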
What It Detects
- Generative AI imagery — morphing, unnatural textures, structural inconsistencies in faces, hands, and backgrounds
- Stock footage slop — disjointed random stock footage with grainy overlays, light leaks, and large text boxes typical of faceless content farms
- AI slideshows — still images with pan/zoom effects or AI-warping animation
What It Does Not Flag
- Professional motion graphics and recorded interviews
- Financial dashboard screenshots (compression artifacts are normal)
- Designed YouTube thumbnails with bold text and branding
- Channel branding elements like logo animations and end screens
Fakeness Probability Scale
Results are reported as a probability score from 0 to 100. Scores of 0–15 indicate clearly human-produced content. Scores of 16–30 suggest minor concerns but likely human production. Scores above 60 trigger the AI Visuals production tag. The system accounts for channel context — professional verified channels are evaluated with awareness that high-end motion graphics differ from generic AI slop, though established status does not grant a free pass.
Channel Heuristics & Content Farm Detection
Bullsift runs a separate heuristic analysis on each channel to detect content-farm and bot-farm behavior. This score (0.0 to 1.0) feeds the Faceless / High-Volume Channel production tag (triggered above 0.7) and the anti-slop component of the Channel Baseline Trust calculation.
Detection Signals
- Upload velocity — channels posting more than 2 videos per day receive the highest penalty (this upload rate is a signature of automated content generation)
- Age/volume mismatch — a channel less than 90 days old with over 100 videos is flagged as suspicious
- Low subscriber-to-video ratio — fewer than 5 subscribers per video (with 50+ videos) indicates mass-produced content with no audience retention
- Engagement anomalies — abnormally low like-to-view ratios on high-view videos, or suspiciously high ratios that suggest manipulation
Authority Balancing
To prevent false positives on legitimate high-output publishers, authority signals like channel verification, high subscriber counts, and long channel age reduce the heuristic score. A verified channel with 1M+ subscribers receives substantial authority reduction even if upload velocity is high.
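None of the increments or reductions are published; this sketch only shows the shape of the calculation, with illustrative values and hypothetical channel attributes throughout:

```python
def channel_heuristic(ch) -> float:
    """Content-farm heuristic, 0.0-1.0 (higher = more farm-like). Sketch only."""
    score = 0.0
    if ch.uploads_per_day > 2:                        # highest-penalty signal
        score += 0.4
    if ch.age_days < 90 and ch.video_count > 100:     # age/volume mismatch
        score += 0.3
    if ch.video_count >= 50 and ch.subscribers < 5 * ch.video_count:
        score += 0.2                                  # low sub-to-video ratio
    if ch.engagement_anomalous:                       # like/view outliers
        score += 0.2
    # Authority balancing: verification, scale, and age pull the score down.
    if ch.verified and ch.subscribers >= 1_000_000:
        score *= 0.3                                  # illustrative reduction
    return min(score, 1.0)
```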
Global Claims Database
Bullsift maintains a global database of claims extracted from analyzed videos. When the same claim appears across multiple videos, it's canonicalized and tracked — similar to how Snopes tracks recurring claims. This is also what powers the Sift Originality subscore.
Claim Matching
Claims are matched using a three-tier approach: text similarity matching catches obvious duplicates, semantic vector matching (using embeddings) catches claims that say the same thing in different words, and a cache layer prevents redundant re-verification of recently checked claims. Claims that have been seen in more than five other videos count as “recycled” for the originality calculation.
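A sketch of the three tiers, with a hypothetical claim store and an assumed similarity threshold:

```python
def match_claim(text: str, store) -> "Claim | None":
    """Three-tier matcher: cache, exact text, then embedding similarity."""
    if cached := store.cache_get(text):    # recently checked: skip re-verification
        return cached
    if exact := store.find_by_normalized_text(text):
        return exact                       # obvious duplicate
    candidate = store.nearest_by_embedding(store.embed(text))
    if candidate and candidate.similarity >= 0.90:  # assumed threshold
        return candidate                   # same claim, different words
    return None
```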
Temporal Truth Decay
Not all claims age equally. Bullsift categorizes claims by freshness — stable facts rarely need re-verification, while event-driven or fast-changing claims are automatically flagged for periodic re-checking. A background system monitors claim expiry and triggers re-verification when a claim becomes stale, ensuring that verdicts stay current as new evidence emerges.
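The freshness categories and re-check windows aren't enumerated on this page; a sketch with illustrative TTLs:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_TTL = {           # illustrative windows, not official values
    "stable": None,         # rarely re-verified
    "slow_changing": timedelta(days=180),
    "event_driven": timedelta(days=7),
}

def is_stale(verified_at: datetime, freshness: str) -> bool:
    """True once a verdict has outlived its freshness window."""
    ttl = FRESHNESS_TTL[freshness]
    return ttl is not None and datetime.now(timezone.utc) - verified_at > ttl
```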
Slop Score (legacy)
Slop Score was Bullsift's original primary metric (0.0–1.0, lower = better). It measures production quality — AI-generated voice, recycled stock footage, formulaic structure, clickbait titling — rather than whether the content is trustworthy. Calibration against real exports surfaced its core limitation: a polished, human-made conspiracy video could score low-slop and look “good”, while a well-researched AI-narrated explainer could score high-slop and look “bad”.
For that reason, Slop Score has been replaced by the Sift Score as the primary public metric. It is still computed and exposed on the API for backwards compatibility with installed Chrome extensions (v0.1.x), but it's no longer rendered in any new Bullsift UI.
The API field slop_score is marked deprecated per RFC 8594 with a Sunset header of 2027-04-26. It will not be removed from the API before that date; the actual removal will be gated on legacy-extension install share dropping below 1%.