first edition adventures: Statistical differences between rolling methods

Sean and I have been discussing rolling methods for ability scores, particularly in the context of PCs vs. henchmen. Consider the following two methods for generating a set of abilities (sourced from the 2e PH, p. 13):

Method II. Roll 3d6 twice and take the higher of the two values.
Method V. Roll 4d6, discard the lowest roll, and add up the other three.
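For anyone who wants to follow along, here is a minimal Python sketch of the two methods as described above. It is not the app used for the results in this post; the function names (roll_3d6, method_ii_score, method_v_score, roll_set) are hypothetical, and the only assumption is fair six-sided dice.

```python
import random

def roll_3d6(rng=random):
    """Sum of three six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(3))

def method_ii_score(rng=random):
    """Method II: roll 3d6 twice and keep the higher total."""
    return max(roll_3d6(rng), roll_3d6(rng))

def method_v_score(rng=random):
    """Method V: roll 4d6, discard the lowest die, add up the other three."""
    dice = sorted(rng.randint(1, 6) for _ in range(4))
    return sum(dice[1:])  # dice[0] is the lowest die

def roll_set(score_fn, rng=random):
    """Six scores from one method, arranged from highest to lowest."""
    return sorted((score_fn(rng) for _ in range(6)), reverse=True)

if __name__ == "__main__":
    print("Method II:", roll_set(method_ii_score))
    print("Method V: ", roll_set(method_v_score))
```

Passing a seeded random.Random instance as rng makes a run repeatable, which is handy when comparing notes with another DM.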

For player characters, I've long employed Method V (4d6 drop lowest) with a caveat that the player can roll up to three sets; if the player discards one set to roll another, the discarded set cannot be used. This gives the player the ability to discard an underwhelming set or swing for the fences to try to attain a great set, but also adds an element of risk, especially if the third and final set (which must be accepted) turns out to be poor.

For most NPCs and hirelings, I employ Method II, which tends to generate more average sets. I don't use any re-roll caveat for NPCs. They get what they get.

Because, in our campaign, henchmen have the ability to replace fallen PCs over the long term, we want to make sure we're in agreement on the rolling method to use. Most AD&D players intuitively know that Method V is more likely to generate high scores (15+) than Method II, but what if we evaluate the methods more deeply? Let's say, for example, that we're looking to create a ranger, which requires two scores of 14 or higher and two additional scores of 13 or higher.
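The ranger test is easy to express in code. Here is a small Python sketch (the function name qualifies_for_ranger is hypothetical) that assumes the player is free to arrange the six scores as needed, so only the four best scores matter.

```python
def qualifies_for_ranger(scores):
    """True if the set has two scores of 14+ and two additional scores of 13+."""
    top = sorted(scores, reverse=True)
    # Once sorted, it is enough to check the second-best and fourth-best scores:
    # if top[1] >= 14 then top[0] is too, and if top[3] >= 13 then top[2] is too.
    return top[1] >= 14 and top[3] >= 13

print(qualifies_for_ranger([16, 14, 13, 13, 11, 11]))  # two 14+, two more 13+
print(qualifies_for_ranger([16, 15, 14, 12, 11, 11]))  # fourth-best score is only 12
```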

I wrote a quick app to generate random sets of ability scores using both methods. Scores within each set are arranged from highest to lowest, with sets that qualify for the ranger class marked with a letter r. Here are results for ten sets of scores for each method:

Method II results (3d6 twice, take higher):
16 14 13 13 11 11 (r)
16 13 12 12 10 8
15 14 13 11 9 8
16 15 14 12 11 11
14 14 13 13 12 11 (r)
13 11 11 11 9 5
15 15 15 13 12 8 (r)
15 14 14 12 12 8
13 12 12 11 10 8
16 14 14 13 12 11 (r)

Method V results (4d6 drop lowest):
15 13 13 12 12 12
14 14 12 10 10 8
16 13 12 12 9 4
17 15 13 11 11 7
14 14 12 9 9 8
14 14 12 11 8 7
17 14 13 12 11 7
15 12 11 10 9 8
16 14 13 13 10 10 (r)
15 15 14 13 10 8 (r)

Four rangers for Method II, only two for Method V. These are very small sample sizes, so let's run them again to observe the variance:

Method II results (3d6 twice, take higher):
13 12 11 11 11 6
14 13 13 12 12 12
14 13 13 12 12 11
17 16 12 11 10 9
13 13 13 12 11 10
15 15 12 10 9 8
17 16 12 11 9 8
15 14 14 10 9 9
16 16 15 15 14 9 (r)
17 13 13 13 12 12

Method V results (4d6 drop lowest):
15 12 11 11 10 9
15 14 13 12 12 7
15 13 11 11 11 9
16 14 12 11 10 8
14 14 13 13 12 12 (r)
15 15 14 13 11 10 (r)
17 14 14 14 13 11 (r)
16 16 15 12 10 8
15 15 14 13 12 5 (r)
18 14 13 12 9 9

This time, one ranger for Method II, four rangers for Method V (the opposite result). Obviously, we need more data. I'll have the application roll 100 sets for each method and average each position in the sorted sets, from highest score to lowest (rounded down to whole numbers):

Method II averages:
15 13 12 11 10 9

Method V averages:
15 14 12 11 10 8

These are actually really close; in fact, the total number of ability points is the same with both methods. Through another few runs, I was able to verify that these exact averages still hold even with a very high (10,000) number of sets.
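Here is one way averages like these might be computed, as a hedged Python sketch rather than the actual app: sort each set from highest to lowest, total each of the six positions across all sets, then divide and round down. The helper names and the seed are hypothetical.

```python
import random

def method_ii_score(rng):
    """Method II: higher of two 3d6 totals."""
    return max(sum(rng.randint(1, 6) for _ in range(3)) for _ in range(2))

def method_v_score(rng):
    """Method V: 4d6, drop the lowest die."""
    dice = sorted(rng.randint(1, 6) for _ in range(4))
    return sum(dice[1:])

def positional_averages(score_fn, num_sets, rng):
    """Average each sorted position across num_sets sets, rounded down."""
    totals = [0] * 6
    for _ in range(num_sets):
        scores = sorted((score_fn(rng) for _ in range(6)), reverse=True)
        for position, score in enumerate(scores):
            totals[position] += score
    return [total // num_sets for total in totals]

rng = random.Random(2019)  # arbitrary seed, only so a run can be repeated
print("Method II averages:", positional_averages(method_ii_score, 10_000, rng))
print("Method V averages: ", positional_averages(method_v_score, 10_000, rng))
```

Note that rounding down hides fractions of a point, so two methods can show the same rounded total while still differing slightly.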

But the averages don't tell the full story. Again, we know from intuition that there's going to be tangible variance between the methods. If it's not in the total number of points, then where?

For starters, we know that, in order to get stuck with a score of 3 using Method II, we need to roll six 1s in a row. The odds of that are one in 6^6, or one in 46,656 scores. To get equally unlucky with Method V, we only need to roll four straight 1s; the odds of that are one in 6^4, or one in 1,296 scores.

That's a major difference: you're 36 times as likely to end up with a score of 3 using Method V (the 4d6 method) as with Method II (the 3d6-twice method).
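The arithmetic can be double-checked with exact fractions. This is just a quick Python sketch with hypothetical variable names, assuming fair dice:

```python
from fractions import Fraction

p_one = Fraction(1, 6)            # chance of a 1 on a single d6

p_three_method_ii = p_one ** 6    # both 3d6 rolls come up 1-1-1
p_three_method_v = p_one ** 4     # all four dice come up 1

print(p_three_method_ii)                      # 1/46656
print(p_three_method_v)                       # 1/1296
print(p_three_method_v / p_three_method_ii)   # 36
```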

Let's see what the distribution of scores is over ten sets of scores using each rolling method:

Method II totals:
3s: 0
4s: 0
5s: 0
6s: 0
7s: 0
8s: 2
9s: 6
10s: 4
11s: 5
12s: 12
13s: 7
14s: 12
15s: 6
16s: 6
17s: 0
18s: 0

Method V totals:
3s: 0
4s: 0
5s: 0
6s: 2
7s: 2
8s: 2
9s: 6
10s: 3
11s: 3
12s: 12
13s: 9
14s: 5
15s: 10
16s: 3
17s: 3
18s: 0

That looks reasonable: fewer high but also fewer low scores when using Method II. How about for 1,000 sets?

Method II totals:
3s: 0
4s: 2
5s: 11
6s: 36
7s: 101
8s: 257
9s: 424
10s: 629
11s: 839
12s: 979
13s: 926
14s: 727
15s: 513
16s: 313
17s: 184
18s: 59

Method V totals:
3s: 3
4s: 25
5s: 51
6s: 99
7s: 194
8s: 269
9s: 451
10s: 585
11s: 697
12s: 794
13s: 740
14s: 737
15s: 586
16s: 430
17s: 240
18s: 99

Now we're starting to see the numbers at work. A thousand sets contain 6,000 individual scores; we have three scores of 3 using Method V, and none using Method II. Given that we were expecting only one in 46,656 scores with Method II but about one in 1,296 scores with Method V to result in a lowly 3, these results look pretty solid, though our sample sizes are still small enough that we're hitting a fair degree of variance.
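Tallies like the ones above can be produced with a simple counter. The following Python sketch is an illustration under the same fair-dice assumption, not the original app; tally_scores and the seed value are hypothetical.

```python
import random
from collections import Counter

def method_v_score(rng):
    """Method V: 4d6, drop the lowest die."""
    dice = sorted(rng.randint(1, 6) for _ in range(4))
    return sum(dice[1:])

def tally_scores(score_fn, num_sets, rng):
    """Count how often each score from 3 to 18 appears across num_sets sets."""
    counts = Counter(score_fn(rng) for _ in range(num_sets * 6))
    # Report every value from 3 to 18, including those that never came up.
    return {value: counts.get(value, 0) for value in range(3, 19)}

rng = random.Random(2019)  # arbitrary seed; results vary from run to run
print("Method V totals:")
for value, count in tally_scores(method_v_score, 1000, rng).items():
    print(f"{value}s: {count}")
```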

Here are the totals for 100,000 sets:

Method II totals:
3s: 15
4s: 184
5s: 1056
6s: 3819
7s: 10642
8s: 24747
9s: 43524
10s: 65439
11s: 84711
12s: 94523
13s: 92145
14s: 72943
15s: 51490
16s: 32525
17s: 16626
18s: 5611

Method V totals:
3s: 473
4s: 1917
5s: 4722
6s: 9581
7s: 17370
8s: 29011
9s: 41925
10s: 56347
11s: 68252
12s: 77520
13s: 79661
14s: 74389
15s: 60690
16s: 43464
17s: 25012
18s: 9666

For Method II, fifteen out of 600,000 scores ended up as 3, or one in 40,000, which is very close to the one in 46,656 ratio that we expect over the long term. For Method V, we had 473 scores of 3, which is about one in 1,268... extremely close to the expected ratio of one in 1,296.

If we add up all the scores of 7 or lower, Method II only generated 15,716 while Method V produced a whopping 34,063. If we add up all the scores of 15 or higher, Method II gave us 106,252 while Method V resulted in 138,832.

The takeaways are that you're more than twice as likely to get bad scores (7 or lower) with the 4d6 method, but only about 30% more likely to get high scores (15 and above). Method V, however, is about 56% more likely to generate very high scores (17 or 18: 34,678 versus 22,237), while Method II is about 15% more likely to land in the average range of 10 through 14 (409,761 versus 356,169).
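Because both methods use only a handful of dice, the simulated counts can also be checked against exact probabilities by enumerating every possible roll. This Python sketch assumes fair dice; the function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def exact_method_ii():
    """Exact distribution of the higher of two 3d6 totals (all 6**6 outcomes)."""
    counts = Counter(max(sum(d[:3]), sum(d[3:]))
                     for d in product(range(1, 7), repeat=6))
    return {value: Fraction(n, 6**6) for value, n in counts.items()}

def exact_method_v():
    """Exact distribution of 4d6 drop lowest (all 6**4 outcomes)."""
    counts = Counter(sum(d) - min(d) for d in product(range(1, 7), repeat=4))
    return {value: Fraction(n, 6**4) for value, n in counts.items()}

def prob(dist, lo, hi):
    """Probability that a single score falls between lo and hi, inclusive."""
    return sum(p for value, p in dist.items() if lo <= value <= hi)

ii, v = exact_method_ii(), exact_method_v()
for label, lo, hi in [("7 or lower", 3, 7), ("10 through 14", 10, 14),
                      ("15 or higher", 15, 18), ("17 or 18", 17, 18)]:
    print(f"{label}: Method II {float(prob(ii, lo, hi)):.4f}, "
          f"Method V {float(prob(v, lo, hi)):.4f}")
```

Multiplying any of these probabilities by 600,000 gives the count to expect in a 100,000-set run.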

Finally, here's a plot graph of the 100,000 set results, which makes everything nice and clear:

[Graph: score distributions for Methods II and V over 100,000 sets]

Though Sean and I still haven't decided exactly how to handle scores for new henchmen, the above data definitely provides the right ammunition to help us make the best decision for our campaign. As an extra bonus, here are score distributions for two additional rolling methods described in the PH, along with an additional graph that charts all four methods.

Method I. Roll 3d6 for each score.
Method IV. Roll 3d6 twelve times, and take the six highest values.
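These two methods are just as easy to sketch. The Python below is illustrative only, with hypothetical function names, and assumes fair dice:

```python
import random

def roll_3d6(rng):
    """Sum of three six-sided dice."""
    return sum(rng.randint(1, 6) for _ in range(3))

def method_i_set(rng):
    """Method I: roll 3d6 once for each of the six scores."""
    return sorted((roll_3d6(rng) for _ in range(6)), reverse=True)

def method_iv_set(rng):
    """Method IV: roll 3d6 twelve times and keep the six highest totals."""
    rolls = sorted((roll_3d6(rng) for _ in range(12)), reverse=True)
    return rolls[:6]

rng = random.Random()  # unseeded, so every run differs
print("Method I: ", method_i_set(rng))
print("Method IV:", method_iv_set(rng))
```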

Here are the side-by-side averages of all four methods:

Method II averages:
15 13 12 11 10 9

Method V averages:
15 14 12 11 10 8

Method I averages:
14 12 11 9 8 6

Method IV averages:
15 13 12 12 11 10

...along with the score distributions of the two new methods:

Method I totals:
3s: 2817
4s: 8353
5s: 16620
6s: 27915
7s: 41657
8s: 58653
9s: 68882
10s: 74966
11s: 74945
12s: 69277
13s: 58329
14s: 41804
15s: 27916
16s: 16701
17s: 8340
18s: 2825

Method IV totals:
3s: 0
4s: 0
5s: 0
6s: 1
7s: 129
8s: 2063
9s: 14860
10s: 50573
11s: 98940
12s: 123988
13s: 114935
14s: 83276
15s: 55540
16s: 33591
17s: 16596
18s: 5508

...and the final graph depicting all four methods:

[Graph: score distributions for all four methods]

(If you made it to the end of this post, congratulations!)


Friday, November 22, 2019
