D&D General Data from a million DnDBeyond character sheets?

Oofta · Jul 21, 2023

FrogReaver said:
Please have it on my desk by EOD. Thanks.

Really wish I could, but I have to get these TPS reports out.

FrogReaver · Jul 21, 2023

Oofta said:
Really wish I could, but I have to get these TPS reports out.

You got to prioritize better…

ichabod · Jul 21, 2023

Oofta said:
Looking at the link @ichabod provided above, there's a ton of data in there. Far more than what the data dump has. However, without a schema definition it would take significant effort to decipher. It seems to have basically all the data you need to recreate your character sheet, including all the class descriptive text.

Yeah, there's a lot of data there. It seems to have all the class features to 20th level even if you don't have a 20th level character. From what I can see, though, I think we can reconstruct the character's original abilities, all classes and levels for multiclass characters, and which blank backgrounds are custom backgrounds that D&D Beyond allows.

ichabod · Jul 21, 2023

FrogReaver said:
Maybe it will help some to understand what I’m setting up on my end.

I’m starting with the dup-removed set; I just use it as the starting point for further trims.

I then want the next dataset to be what we broadly agree on for trimming. Let’s call this the ‘Type 1 error dataset’. The goal is for it not to exclude any data that should be there, which basically means: when in doubt, include.

I’m also good with a ‘Type 2 error dataset’ where we trim the data to the point where we are more or less certain what’s remaining is valid.

I’m good posting results based on the Type 2 dataset unless I want to talk about some of the data we excluded from it. For example, it would be interesting to know that, hypothetically, 200 of 300,000 characters had all 18 stats.

Does that work for a compromise?

That sounds like a good compromise. I think before we start we should clearly define "data that should be there." I was thinking about another dataset while at lunch, a "by the rules" dataset. This may end up being the Type 2 dataset, but if it isn't, I think it would be a good dataset to make as well.

I see no problems with discussing odd subsets of whatever data as long as it's clear what's going on. I did a double check and found 1,623 characters with all 18's in the UID data.

I think we should take a few more days poking around in the data for potential issues before working out the Type 1 & 2 criteria.
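To make the Type 1 / Type 2 idea concrete, here is a minimal sketch of the two-tier trim. The actual criteria haven't been worked out yet, so both rules below are placeholders, and the record layout (`id`, `level`, `stats`) is a hypothetical one rather than the real CSV or JSON schema. The lenient pass drops only records that are clearly invalid; the strict pass additionally sets aside doubtful ones (here, all-18 characters) so they can still be counted and discussed.

```python
# Hypothetical record layout: {"id": ..., "level": int, "stats": [six ability scores]}.
# Both rules are illustrative placeholders, not agreed-upon criteria.

def all_eighteens(character):
    """True when all six ability scores are exactly 18."""
    stats = character["stats"]
    return len(stats) == 6 and all(score == 18 for score in stats)

def clearly_invalid(character):
    """Lenient (Type 1) rule: drop only what cannot be a real character."""
    return not 1 <= character["level"] <= 20

def doubtful(character):
    """Strict (Type 2) rule: also set aside anything we are unsure about."""
    return all_eighteens(character)

def build_datasets(uid_data):
    """Return (type1, type2, excluded) from the de-duplicated data."""
    type1 = [c for c in uid_data if not clearly_invalid(c)]
    type2 = [c for c in type1 if not doubtful(c)]
    excluded = [c for c in type1 if doubtful(c)]  # kept for separate discussion
    return type1, type2, excluded

# Made-up sample records, for illustration only.
sample = [
    {"id": 1, "level": 3, "stats": [15, 14, 13, 12, 10, 8]},
    {"id": 2, "level": 1, "stats": [18] * 6},
    {"id": 3, "level": 25, "stats": [10] * 6},
]
type1, type2, excluded = build_datasets(sample)
print(len(type1), len(type2), len(excluded))
```

Keeping `excluded` as its own list is what lets us report on the odd subsets without mixing them into the Type 2 results.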

ichabod · Jul 21, 2023

Oh, and the JSON also tells us the sub-race, which is not in the CSV data.

FrogReaver · Jul 21, 2023

ichabod said:
That sounds like a good compromise. I think before we start we should clearly define "data that should be there." I was thinking about another dataset while at lunch, a "by the rules" dataset. This may end up being the Type 2 dataset, but if it isn't, I think it would be a good dataset to make as well.

I see no problems with discussing odd subsets of whatever data as long as it's clear what's going on. I did a double check and found 1,623 characters with all 18's in the UID data.

I think we should take a few more days poking around in the data for potential issues before working out the Type 1 & 2 criteria.

Agreed on all!

Lanefan · Jul 21, 2023

Hussar said:
True, but since they are houseruling right out of the gate like that, then aren’t they automatically outliers?

Depends how common that houserule or variant turns out to be, doesn't it?

I mean, personally I'd say the true outliers would be those who play exactly by the rules as written with no variance whatsoever.

Hussar said:
IOW if we’re looking at this data to see trends in how people are playing the game, shouldn’t we start out by ignoring people who aren’t actually using the rules of the game?

Absolutely not!

If you want to see trends in how people are playing the game then you need to look at how people are actually playing the game, which includes houserules and kitbashes they might have applied in order to make the game their own.

Lanefan · Jul 21, 2023

Ferrousbones said:
That is my point about the inventory. It is reasonable to conclude that a character intended for play, not just as an experiment, will have all the starting bases covered: race, background (even if custom), class, maybe subclass (depending on class and level), proficiencies, starting equipment, starting gold, bio stats (height, weight, etc.).

The question is: how many characters not meeting the above criteria are experiments, and how many are future frameworks for existing characters?

Another thing to keep in mind is that some of those incomplete-looking character sheets might be bare-bones versions for online-play purposes only, or for the DM's quick reference, with the real character sheets kept physically by their players.

We don't use DDB but with roll20 that's what some of us have: quickie online sheets for the DM to reference while the full-ride sheets are on paper with the players.

Lanefan · Jul 21, 2023

ichabod said:
I think it's good to look at house rule characters and see if they deviate significantly from the other data in other ways. One thing I am really concerned with is the 44.9% of characters who are level one. I think that's where your mass of unplayed characters is. Fifth edition is not lethal enough to kill half of all starting characters.

If we look at the full data set, but trimmed of duplicate character IDs (which I am going to call the UID data from now on, and I think would be a good baseline for discussions), 44.9% are level 1. If we look at the ones without backgrounds, 50.5% are level 1. That's suggestive, but not a huge difference. However, if we look at characters with the default name ("<username>'s Character"), 61.9% are level 1. I think that's enough of a variation to exclude those characters, at least if they are level 1.

There's no way of knowing whether those were true experiments or whether they were actual characters rolled up for campaigns that then never got off the ground. Or, less likely but still possible, the campaigns were specifically intended as one-shots. I'd say that characters rolled up for campaigns that quickly collapsed are legit for data purposes.

IME anyway, if for whatever reason a campaign's going to collapse, that collapse happens within the first few sessions, before the characters have got past 1st level.
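For anyone wanting to reproduce the level-1 comparison ichabod describes in the quoted post, here is a rough sketch. The field names (`char_id`, `username`, `name`, `level`, `background`) are assumptions for illustration, not the real column names, and the sample rows are made up. It trims duplicate character IDs to get the UID data, then compares the share of level 1 characters overall, among those with no background, and among those still using the default name.

```python
# Field names and sample rows are hypothetical; adjust to the real data dump.

def dedupe_by_id(rows):
    """Keep the first row seen for each character ID (the 'UID data')."""
    seen = set()
    unique = []
    for row in rows:
        if row["char_id"] not in seen:
            seen.add(row["char_id"])
            unique.append(row)
    return unique

def level_one_share(rows):
    """Percentage of rows that are level 1 (0.0 for an empty subset)."""
    if not rows:
        return 0.0
    return 100.0 * sum(1 for r in rows if r["level"] == 1) / len(rows)

def has_default_name(row):
    """True when the name is still the default "<username>'s Character"."""
    return row["name"] == f"{row['username']}'s Character"

rows = [
    {"char_id": "a", "username": "kim", "name": "kim's Character", "level": 1, "background": ""},
    {"char_id": "a", "username": "kim", "name": "kim's Character", "level": 1, "background": ""},
    {"char_id": "b", "username": "lee", "name": "Thorin", "level": 5, "background": "Soldier"},
    {"char_id": "c", "username": "sam", "name": "sam's Character", "level": 1, "background": ""},
    {"char_id": "d", "username": "pat", "name": "Mira", "level": 1, "background": "Sage"},
    {"char_id": "e", "username": "ana", "name": "ana's Character", "level": 4, "background": ""},
]

uid = dedupe_by_id(rows)
no_background = [r for r in uid if not r["background"]]
default_name = [r for r in uid if has_default_name(r)]

for label, subset in [("UID data", uid),
                      ("no background", no_background),
                      ("default name", default_name)]:
    print(f"{label}: {level_one_share(subset):.1f}% level 1 (n={len(subset)})")
```

Reporting `n` next to each percentage matters here, since a large gap on a small subset says a lot less than the same gap on a big one.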