Metaculus evaluates forecasts using log scores. For a question with possible outcomes $\omega$, where a forecaster submits a predictive distribution $p$ and the question resolves to outcome $\omega^*$:
Baseline score. Compares the forecast to a uniform (chance) baseline $q$:
$$ \text{Baseline}(p, \omega^*) = 100 \cdot \log_2 \frac{p(\omega^*)}{q(\omega^*)} $$
For binary questions, $q = 0.5$. For continuous questions discretised into $K$ bins, $q(\omega) = 1/K$ for each bin.
A positive baseline score means you did better than chance; negative means worse.
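As a minimal sketch, the baseline score for a binary question is a one-liner (the helper name `baseline_score` is mine, not Metaculus's):

```python
import math

def baseline_score(p_outcome: float, q_outcome: float) -> float:
    """100 * log2 of the forecast probability assigned to the realised
    outcome, relative to the chance probability q for that outcome."""
    return 100 * math.log2(p_outcome / q_outcome)

# Binary question that resolved Yes, so q = 0.5.
print(baseline_score(0.8, 0.5))  # confident and correct: positive
print(baseline_score(0.3, 0.5))  # assigned less than chance: negative
print(baseline_score(0.5, 0.5))  # exactly chance: zero
```

A forecast of 0.8 on a question that resolves Yes scores about +68; assigning only 0.3 scores about −74, illustrating the asymmetry of the log score around chance.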
Peer score. Compares a forecaster to all other forecasters on the same question. For forecaster $i$ among $N$ total forecasters:
$$ \text{Peer}(i, \omega^*) = \text{Baseline}(p_i, \omega^*) - \frac{1}{N} \sum_{j=1}^{N} \text{Baseline}(p_j, \omega^*) $$
Peer scores sum to zero across all forecasters on any given question. A positive peer score means you outperformed the field; negative means the field outperformed you. Tournament rankings on Metaculus are determined by peer scores.
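The peer score is just each forecaster's baseline score minus the field's mean baseline score, so the zero-sum property falls out immediately. A sketch, assuming a binary question and the helper names `baseline_score` and `peer_scores` (both mine):

```python
import math

def baseline_score(p: float, q: float = 0.5) -> float:
    # Baseline score on a binary question with chance probability q = 0.5.
    return 100 * math.log2(p / q)

def peer_scores(probs: list[float]) -> list[float]:
    """Each forecaster's baseline score minus the mean baseline score
    across all N forecasters on the question."""
    baselines = [baseline_score(p) for p in probs]
    mean = sum(baselines) / len(baselines)
    return [b - mean for b in baselines]

# Three forecasters' probabilities for Yes, on a question that resolved Yes.
scores = peer_scores([0.9, 0.6, 0.4])
print(scores)       # most confident correct forecaster gets the top score
print(sum(scores))  # zero (up to floating point)
```

Subtracting the mean is what makes the scores sum to zero: the field as a whole can neither gain nor lose peer points on a question.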
We want a measure of disagreement among a set of forecasters on a single question. This measure should:
Before a question resolves, we don’t know the outcome $\omega^*$. But we can compute what every forecaster’s peer score would be under each possible outcome. When forecasters disagree, the outcome matters a lot for who wins — different resolutions would produce very different peer scores. When they agree, the outcome barely matters.
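This outcome-dependence can be made concrete by computing hypothetical peer scores under both resolutions of a binary question. A sketch under that assumption (the helper names are mine):

```python
import math

def baseline(p: float) -> float:
    # Baseline score on a binary question, chance probability 0.5.
    return 100 * math.log2(p / 0.5)

def peer_scores(probs_yes: list[float], outcome_yes: bool) -> list[float]:
    """Hypothetical peer scores given each forecaster's P(Yes),
    under the supposed resolution of the question."""
    probs = probs_yes if outcome_yes else [1 - p for p in probs_yes]
    baselines = [baseline(p) for p in probs]
    mean = sum(baselines) / len(baselines)
    return [b - mean for b in baselines]

# Strong disagreement: who wins flips entirely with the resolution.
disagree = [0.9, 0.1]
print(peer_scores(disagree, True), peer_scores(disagree, False))

# Near-agreement: peer scores are small under either resolution.
agree = [0.72, 0.70]
print(peer_scores(agree, True), peer_scores(agree, False))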
This suggests a natural measure: how much do peer scores vary across forecasters, in expectation over outcomes?
For each possible outcome $\omega$, we can compute the variance of peer scores across forecasters: