Multiquote isn't working for me, so here are my thoughts on various things:
I'd want to look more closely at keeping all the classes. There's a lot of munchkin stuff in the data. There are 1,000+ characters with point buy scores high enough to get all 18's as starting abilities. I assumed the odd classes were part of that. But maybe that could be taken care of by trimming out the munchkin stuff directly. And remember: all the abilities in the data are starting abilities. Racial bonuses, items, and ASIs are not included. I don't think class features are either, but I'd have to check to be sure. That's why I was using point buy score to trim the data. I was trying to find an upper bound for a 4d6 drop lowest character. So if you are going to trim by ability scores, I would actually use 3-18, since that covers both rolling and point buy.
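For what it's worth, here is a rough sketch of that 3-18 trim in Python. It assumes each character is a dict with an "abilities" list holding the six starting scores; that layout and the names in_legal_range and trim are made up for illustration, and the real data may be organized differently.

```python
# Hypothetical layout: each character is a dict whose "abilities" entry
# holds the six starting scores. The real data may be organized differently.
MIN_SCORE, MAX_SCORE = 3, 18  # range covering both rolling and point buy


def in_legal_range(character):
    """Return True if all six starting abilities fall within 3-18."""
    scores = character["abilities"]
    return len(scores) == 6 and all(MIN_SCORE <= s <= MAX_SCORE for s in scores)


def trim(characters):
    """Keep only characters whose starting abilities are all within 3-18."""
    return [c for c in characters if in_legal_range(c)]


# Made-up example rows, not taken from the data set.
sample = [
    {"name": "a", "abilities": [15, 14, 13, 12, 10, 8]},
    {"name": "b", "abilities": [0, 0, 0, 0, 0, 0]},
    {"name": "c", "abilities": [20, 18, 18, 18, 18, 18]},
]
kept = trim(sample)
```

Note that an all-8's character passes this filter, since 8 is a legal score; only the all-0's ones (and anything above 18) drop out.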
I have a list of subclasses I used for subsetting the data. Since there are concerns about piracy I will not post it here.
I believe you can have a character without race and class, but such a character is not viewable, so it should not be part of this data set. D&D Beyond does not force you to set abilities, which is how so many characters in the data have all 0's or all 8's.
I could try looking into the backgrounds more, by finding no-background leveled characters and checking the JSON. Maybe over the weekend.