Background: Metaculus Scores

Metaculus evaluates forecasts using log scores. For a question with possible outcomes $\omega$, where a forecaster submits a predictive distribution $p$ and the question resolves to outcome $\omega^*$:

Baseline score. Compares the forecast to a uniform (chance) baseline $q$:

$$ \text{Baseline}(p, \omega^*) = 100 \cdot \log_2 \frac{p(\omega^*)}{q(\omega^*)} $$

For binary questions, $q = 0.5$. For continuous questions discretised into $K$ bins, $q(\omega) = 1/K$ for each bin.

A positive baseline score means you did better than chance; negative means worse.
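As a minimal sketch (the function name and interface here are illustrative, not Metaculus API), the baseline score reduces to a one-liner once you know the probability the forecaster assigned to the outcome that actually occurred:

```python
import math

def baseline_score(p_outcome: float, k: int = 2) -> float:
    """Baseline score for the probability assigned to the resolved outcome.

    p_outcome: probability the forecaster gave to the outcome that occurred.
    k: number of possible outcomes (2 for binary; K bins for continuous).
    """
    q = 1.0 / k  # uniform "chance" baseline
    return 100.0 * math.log2(p_outcome / q)

# A binary forecast of 0.8 on the outcome that occurs scores
# 100 * log2(0.8 / 0.5) ≈ 67.8; a forecast of exactly 0.5 scores 0.
print(round(baseline_score(0.8), 1))  # 67.8
print(baseline_score(0.5))            # 0.0
```

Note the asymmetry of the log: a confident forecast of 0.9 that resolves against you (so `p_outcome = 0.1`) scores about −232, far below the +85 it earns when right.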

Peer score. Compares a forecaster to all other forecasters on the same question. For forecaster $i$ among $N$ total forecasters:

$$ \text{Peer}(i, \omega^*) = \text{Baseline}(p_i, \omega^*) - \frac{1}{N} \sum_{j=1}^{N} \text{Baseline}(p_j, \omega^*) $$

Peer scores sum to zero across all forecasters on any given question. A positive peer score means you outperformed the field; negative means the field outperformed you. Tournament rankings on Metaculus are determined by peer scores.
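A short sketch of the peer score for a binary question (function names are illustrative): each forecaster's baseline score minus the average baseline score across all forecasters, which makes the zero-sum property immediate.

```python
import math

def baseline(p: float, q: float = 0.5) -> float:
    """Baseline score for probability p assigned to the resolved outcome."""
    return 100.0 * math.log2(p / q)

def peer_scores(probs: list[float]) -> list[float]:
    """Peer score for each forecaster, given the probability each assigned
    to the outcome that actually occurred (binary question).

    Subtracting the mean guarantees the scores sum to zero."""
    scores = [baseline(p) for p in probs]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

# Three forecasters; the question resolved Yes and they gave Yes
# probabilities 0.9, 0.7, 0.4 respectively.
ps = peer_scores([0.9, 0.7, 0.4])
print([round(s, 1) for s in ps])
print(round(sum(ps), 10))  # 0.0 — peer scores always sum to zero
```

The most confident correct forecaster gets the largest positive peer score; the forecaster below the field's average gets a negative one, even though their own baseline score may be positive.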

The Problem

We want a measure of disagreement among a set of forecasters on a single question. This measure should:

  1. Be zero when all forecasters agree.
  2. Be large when forecasters disagree strongly.
  3. Live in the same units as Metaculus scores, so the magnitude is interpretable.
  4. Work for all Metaculus question types (binary, continuous, multiple choice).

Key Insight

Before a question resolves, we don’t know the outcome $\omega^*$. But we can compute what every forecaster’s peer score would be under each possible outcome. When forecasters disagree, the outcome matters a lot for who wins — different resolutions would produce very different peer scores. When they agree, the outcome barely matters.

This suggests a natural measure: how much do peer scores vary across forecasters, in expectation over outcomes?
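The insight above can be sketched for a binary question: compute the hypothetical peer scores under each possible resolution and watch how their spread depends on disagreement. The helper below is an illustrative assumption, not the final measure.

```python
import math

def baseline(p: float, q: float = 0.5) -> float:
    """Baseline score for probability p assigned to the resolved outcome."""
    return 100.0 * math.log2(p / q)

def peer_scores_given_outcome(probs_yes: list[float], yes: bool) -> list[float]:
    """Hypothetical peer scores if a binary question resolved Yes (or No).

    probs_yes: each forecaster's probability of Yes."""
    ps = [p if yes else 1.0 - p for p in probs_yes]
    scores = [baseline(p) for p in ps]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

# Strong disagreement (0.9 vs 0.1): the resolution decides who wins big.
# Agreement (0.6 vs 0.6): peer scores are zero regardless of the outcome.
for probs in ([0.9, 0.1], [0.6, 0.6]):
    for yes in (True, False):
        scores = peer_scores_given_outcome(probs, yes)
        print(probs, yes, [round(s, 1) for s in scores])
```

For the disagreeing pair, the peer scores swing between roughly ±158 depending on the resolution; for the agreeing pair they are identically zero. That outcome-dependent spread is exactly what the measure below captures.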

The Measure

For each possible outcome $\omega$, we can compute the variance of peer scores across forecasters: