If you’re not already familiar with Brier scoring, you should read the first two parts of our forecast scoring series before continuing.
In our previous articles, we discussed a basic forecasting question: Will
the Cubs win the World Series in 2017? Once the season is over, there’s an
unequivocal answer to the question (either they won or they did not), and we
score each allotment of probability in a forecast as correct or incorrect. But
how should we score questions when some answers are "closer" to being
correct than others?
One example would be the question "How many games will the Cubs win in
the 2017 regular season?" with the answer options:
Answer |
---|
Less than 50 |
50-75 |
76-100 |
More than 100 |
Say I make the following forecast:
Answer | Forecast |
---|---|
Less than 50 | 15% |
50-75 | 45% |
76-100 | 25% |
More than 100 | 15% |
Now say they win 80 games, making the "76-100" bucket correct. The 45% I allocated to the 50-75 bucket was not technically correct, but it was closer to being correct than the "Less than 50" bucket. Ideally, our scoring system should penalize forecasters less for allocating probabilities closer to the correct outcome. This is exactly what the ordinal scoring system does.
Similar to a normal Brier score, we calculate a daily score for each day
that the forecast was active and then average the daily scores to calculate an
overall score for the question. The difference from a normal Brier score is in
the method used for calculating the daily score.
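As a minimal sketch of the averaging step, assuming the daily scores have already been calculated: the function name `question_score` and the daily values below are hypothetical and chosen only for illustration.

```python
from statistics import mean


def question_score(daily_scores):
    """Average the daily scores over every day the forecast was active."""
    if not daily_scores:
        raise ValueError("at least one daily score is required")
    return mean(daily_scores)


# Hypothetical daily scores for a forecast that was active for three days.
daily_scores = [0.27, 0.27, 0.12]
print(round(question_score(daily_scores), 4))  # 0.22
```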
Using a standard Brier score for this question (i.e., not an ordinal score), the daily score for my forecast would be:
Answer | Forecast | Score |
---|---|---|
Less than 50 | 0.15 | (0.15 - 0)² = 0.0225 |
50-75 | 0.45 | (0.45 - 0)² = 0.2025 |
76-100 | 0.25 | (0.25 - 1)² = 0.5625 |
More than 100 | 0.15 | (0.15 - 0)² = 0.0225 |
Daily score (sum of answer scores) | | 0.81 |
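The standard calculation in the table above can be sketched in a few lines of Python. The function name `brier_daily_score` and the `correct_index` parameter are assumptions made for this illustration, not part of any published API.

```python
def brier_daily_score(forecast, correct_index):
    """Standard Brier daily score: sum of squared errors over all answers.

    forecast      -- probabilities for each answer option, in order
    correct_index -- position of the answer that turned out to be correct
    """
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


# Less than 50, 50-75, 76-100, More than 100
forecast = [0.15, 0.45, 0.25, 0.15]

# The Cubs win 80 games, so "76-100" (index 2) is correct.
print(round(brier_daily_score(forecast, correct_index=2), 4))  # 0.81
```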
In ordinal questions, we use a different method for calculating daily error. To start, we create successive groupings of the answer options as follows:
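Grouping 1 | Grouping 2 |
---|---|
Less than 50 | 50-75, 76-100, More than 100 |
Less than 50, 50-75 | 76-100, More than 100 |
Less than 50, 50-75, 76-100 | More than 100 |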
For each of these groupings, sum the probabilities in that group and calculate the squared error using that sum:
(forecast_probability_sum - final_outcome)²
The final_outcome is 1 for the group that contains the correct answer (76-100) and 0 for the other group.
Grouping 1 | Grouping 1 Score | Grouping 2 | Grouping 2 Score | Total Score |
---|---|---|---|---|
Less than 50 | (0.15 - 0)² = 0.0225 | 50-75, 76-100, More than 100 | (0.45 + 0.25 + 0.15 - 1)² = 0.0225 | 0.0225 + 0.0225 = 0.045 |
Less than 50, 50-75 | (0.15 + 0.45 - 0)² = 0.36 | 76-100, More than 100 | (0.25 + 0.15 - 1)² = 0.36 | 0.36 + 0.36 = 0.72 |
Less than 50, 50-75, 76-100 | (0.15 + 0.45 + 0.25 - 1)² = 0.0225 | More than 100 | (0.15 - 0)² = 0.0225 | 0.0225 + 0.0225 = 0.045 |
Daily Score (average of the 3 total score values): | | | | 0.27 |
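The grouping procedure can also be written as a short Python sketch. It assumes the answer options are listed in their natural order and that exactly one is correct. The function name `ordinal_daily_score` and its parameters are hypothetical, chosen for this example.

```python
def ordinal_daily_score(forecast, correct_index):
    """Ordinal daily score: average squared error over successive groupings.

    forecast      -- probabilities for each answer option, in ranked order
    correct_index -- position of the answer that turned out to be correct
    """
    n = len(forecast)
    if n < 2:
        raise ValueError("an ordinal question needs at least two answers")

    totals = []
    # Each split point divides the ordered answers into a lower and upper group.
    for split in range(1, n):
        lower_sum = sum(forecast[:split])
        upper_sum = sum(forecast[split:])
        # The outcome is 1 for the group containing the correct answer, else 0.
        lower_outcome = 1.0 if correct_index < split else 0.0
        upper_outcome = 1.0 - lower_outcome
        totals.append(
            (lower_sum - lower_outcome) ** 2 + (upper_sum - upper_outcome) ** 2
        )

    # Average the per-grouping totals (n - 1 of them) to get the daily score.
    return sum(totals) / len(totals)


# Less than 50, 50-75, 76-100, More than 100; "76-100" (index 2) is correct.
forecast = [0.15, 0.45, 0.25, 0.15]
print(round(ordinal_daily_score(forecast, correct_index=2), 4))  # 0.27
```

With only two answer options there is a single grouping, so this sketch returns the same value as the standard Brier calculation.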
As you can see, the ordinal score (0.27) penalizes my forecast much less than the standard Brier score (0.81). Remember, lower score = less error = better score. This
better reflects the fact that I allocated most of the probability (70%) to the
correct bucket or to a neighboring bucket that was "close" but not quite correct.