Our Methodology: How Naly Finds Polymarket Mispricings
Naly publishes one Polymarket mispricing roundup a day. A mispricing is an event where the top answer from our independent 8-step Bayesian probability estimate flips away from the market's top answer and the answer-level component score is at least 20 points. We never hide losses, our calibration is public, and while resolved counts are small we report Brier score instead of accuracy.
What counts as a "mispricing"
The cheap version of a mispricing — "market is 60%, I say 50%" — is not a useful signal. Most of those gaps are recalibration noise. To filter that out we require two structural conditions on every published event:
- Answer flip. Our top answer differs from Polymarket's top answer. If the market says YES and we say YES with a lower confidence, that is a same-side recalibration, not a disagreement, and we do not write it up.
- Component score ≥ 20. On the flipped answer, the probability gap between Polymarket's component and our component is at least 20 percentage points. This cutoff is tracked per event on the /predictions/scorecard page so anyone can re-derive it.
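As a rough sketch of the two structural conditions above (the function name and signature are illustrative, not the production code), the filter amounts to:

```python
def is_qualified_mispricing(market_top: str, our_top: str,
                            market_component: float,
                            our_component: float) -> bool:
    """Both conditions must hold: the top answers differ (an answer flip)
    AND the component gap on that answer is at least 20 points."""
    answer_flip = market_top != our_top
    component_gap = abs(our_component - market_component)
    return answer_flip and component_gap >= 20.0

# Same-side recalibration (market YES 69%, us YES 59%): no flip, so it
# never qualifies regardless of the gap.
```

Note that the flip condition is checked first for a reason: a large gap on the same side is still just recalibration noise under this methodology.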
The pipeline, end to end
The causal-inference pipeline runs twice a day (00:00 UTC AM batch, 12:00 UTC PM batch). Each batch performs the following steps:
- Fetch candidates. Pull active Polymarket events expiring within ~60 days, with enough liquidity to be worth scoring.
- Score each candidate. The LLM reasoner, backed by real-time web search, runs an 8-step Bayesian analysis: fetch market, gather evidence, set a base rate, weigh positive factors, weigh negative factors, combine, stress-test, and record a confidence interval.
- Select qualified events. Keep only rows where signal_agreement is answer_flip and component_score is at least 20. Rank by component_score × research_confidence.
- Deep research & write. The top-k qualified events go into a roundup article with a 50–70 word answer-first lead, event-by-event Bayesian reasoning, and explicit "what would make us wrong" paragraphs.
- Publish with dedupe. The save step refuses a second roundup for the same calendar date, so the feed stays clean. See the daily archive.
- Resolve & score. When an event resolves, the prediction-verifier compares the actual outcome with our top answer and updates /track-record and the calibration curve on the scorecard.
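The selection step above can be sketched in a few lines. The field names (`signal_agreement`, `component_score`, `research_confidence`) come from the description of the pipeline; the dict-of-rows shape is an assumption, not the actual schema:

```python
def select_qualified(candidates, top_k=5):
    """Keep answer-flip rows whose component_score is at least 20,
    then rank by component_score x research_confidence, highest first."""
    qualified = [
        c for c in candidates
        if c["signal_agreement"] == "answer_flip" and c["component_score"] >= 20
    ]
    qualified.sort(
        key=lambda c: c["component_score"] * c["research_confidence"],
        reverse=True,
    )
    return qualified[:top_k]
```

Ranking by the product means a 25-point gap scored with high research confidence can outrank a 40-point gap scored with low confidence, which is the intended behavior: the roundup leads with the disagreements we are most sure of, not merely the widest ones.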
Why calibration, not accuracy
Accuracy is a bad metric for a small-sample forecaster. With 2 resolved predictions, a single outcome moves the headline number by 50 points, and the result can read as 0% or 100%. Neither is honest. That is why, while qualified resolved N is below 10, the scorecard hides the accuracy headline and leads with two things that remain meaningful at small N:
- Calibration curve. Are the 70% calls resolving ~70% of the time? A well-calibrated forecaster's predicted bars match the actual-frequency bars across buckets.
- Brier score. Mean of (probability − outcome)². A single number that penalizes both overconfidence and underconfidence, and behaves sensibly even at small N.
We publish Naly's Brier and Polymarket's Brier side by side on the same set of resolved qualified events. If our Brier is lower, we add value. If theirs is lower, readers should know that too.
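For concreteness, both metrics can be computed from the same list of (predicted probability, outcome) pairs. This is a minimal sketch, not the scorecard implementation; the ten-bucket width is an assumption:

```python
from collections import defaultdict

def brier(preds):
    """Mean of (probability - outcome)^2 over resolved predictions,
    where each pair is (predicted probability, outcome 0 or 1)."""
    return sum((p - o) ** 2 for p, o in preds) / len(preds)

def calibration_buckets(preds):
    """Group predictions into ten 10%-wide buckets and return, per bucket,
    (mean predicted probability, actual resolution frequency)."""
    buckets = defaultdict(list)
    for p, o in preds:
        buckets[min(int(p * 10), 9)].append((p, o))
    return {
        b: (sum(p for p, _ in rows) / len(rows),   # mean predicted
            sum(o for _, o in rows) / len(rows))   # actual frequency
        for b, rows in sorted(buckets.items())
    }
```

Running both Naly's probabilities and the market-implied probabilities through `brier` on the same resolved events is exactly the side-by-side comparison described above: whichever series scores lower was, on that sample, the better-priced forecast.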
What a roundup contains
Every daily mispricing roundup follows the same structure:
- A 50–70 word answer-first lead paragraph naming the 1–2 sharpest disagreements with concrete cents-prices. Written to be lifted verbatim by AI Overview / Perplexity / ChatGPT.
- A summary comparison table with the same set of answer-level fields for every event.
- Per event: market vs. our view, top-answer labels, component breakdown, a causal chain, 3–6 evidence bullets, a Bayesian calculation block, an alternative explanation, and an explicit "what would make us wrong" paragraph.
- Fresh-source links so readers can verify the evidence themselves.
How we handle being wrong
We expect to be wrong. Prediction markets have decades of evidence that the aggregate price is informed, and a model that disagrees with the market is volunteering for the hard case. Two rules keep us honest about it:
- We resolve every qualified published call against the actual outcome. Losses are not silently dropped. The resolution runs as part of the same cron batch that publishes the next roundup.
- When we are wrong, the calibration curve and the Brier score reflect it on the next scorecard refresh. Nothing is retroactively rationalized.
Where this method fails
The method is weakest on (a) markets with thin liquidity where the market price itself is noisy, (b) events with private information we cannot access (insider leaks, private deal negotiations), and (c) multi-outcome markets with more than ~6 answers where the base-rate prior becomes unwieldy. We exclude categories we know we cannot score responsibly rather than publish dubious predictions.
Go deeper
- /track-record — per-reporter accuracy and resolution history.
- /predictions/scorecard — Naly vs. Polymarket Brier, calibration curve, qualified event log.
- /category/stock — archive of every published roundup.
- /faq — answers to the most common reader questions.
Frequently asked questions
- What is a Polymarket mispricing?
- A mispricing is an event where our independent probability estimate disagrees with Polymarket's market-implied probability by a material, structural amount. We only publish disagreements where our top answer flips away from the market's top answer and the answer-level component score is at least 20 points — not mere recalibrations.
- How does the 8-step Bayesian model work?
- For each candidate event we (1) fetch the Polymarket market and outcomes, (2) gather recent web evidence, (3) set a base rate from history or reference class, (4) score positive factors with magnitude and direction, (5) score negative factors, (6) combine factors into an estimate, (7) stress-test via alternative explanations, and (8) record a confidence interval. The final output is a per-answer probability distribution, not a single point.
- Why is the "answer flip + 20-point component score" filter important?
- Raw probability gaps conflate genuine disagreement with recalibration. A market at 69% YES vs. our 59% YES looks like a disagreement but is actually a same-side recalibration. We only count events where our top answer differs from the market's and the component-level gap is substantial. This rule excludes weak signals and keeps the track record honest at small N.
- Why hide the accuracy percentage while N is small?
- With fewer than 10 resolved qualified predictions, a single outcome can swing the headline accuracy by 10–50 percentage points. That kind of noise is actively misleading. Until we have at least 10 resolved qualified calls, we lead with calibration and Brier score — both are meaningful at small N — and hide the raw accuracy headline.
- What is the Brier score and why is it our primary metric?
- Brier = mean((probability − outcome)²) over resolved predictions. A perfect forecast scores 0, always predicting 50% scores 0.25, and the worst case is 1. Unlike accuracy, Brier penalizes both over- and under-confidence and behaves sensibly at small N. We publish Naly's Brier and Polymarket's Brier on the same resolved events side by side.
- How often do you publish?
- Once per day. The pipeline runs twice, at 00:00 and 12:00 UTC, but we cap publishing at one roundup per calendar day to avoid scaled-content issues, and a pipeline-level dedupe refuses to save a second roundup with the same title or the same calendar date.
- Is this financial advice?
- No. We publish our reasoning and our track record openly so you can evaluate us as a source. Prediction markets are real-money instruments with real risk. Polymarket can (and often does) disagree with us and be right. Never trade an event contract based solely on our take — read the full analysis, the calibration record, and the methodology, and decide for yourself.
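To make the small-N point from the FAQ concrete, here is a toy calculation with illustrative numbers only: with two resolved calls, one flipped outcome moves headline accuracy by half its range, while the Brier score of the same two calls moves by a fifth of its range.

```python
def accuracy_pct(outcomes):
    """Headline accuracy in percent over resolved calls (1 = correct)."""
    return 100 * sum(outcomes) / len(outcomes)

def brier(preds):
    """Mean of (probability - outcome)^2 over (probability, outcome) pairs."""
    return sum((p - o) ** 2 for p, o in preds) / len(preds)

# Two resolved calls: flipping one outcome swings accuracy by 50 points
# on its 0-100 scale.
swing = accuracy_pct([1, 1]) - accuracy_pct([1, 0])

# The same flip moves the Brier score of two 70%-confidence calls by
# roughly 0.20 on its 0-1 scale: noticeable, but a far smaller share of
# the metric's range.
delta = brier([(0.7, 1), (0.7, 0)]) - brier([(0.7, 1), (0.7, 1)])
```

This is the arithmetic behind hiding the accuracy headline below N = 10: the accuracy swing shrinks as 100/N, so only past that threshold does a single resolution stop dominating the number.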