All data is wrong

Vicki Boykis, riffing on the George Box quote “All models are wrong, some are useful”:

The point is that, whatever data you dig into, at any given point in time, that looks solid on the surface, will be a complete mess underneath, plagued by undefined values, faulty studies, small sample problems, plagiarism, and all of the rest of the beautiful mess that is human life.

Just as all deep learning NLP models are really grad students reading phone books, if you dig deep enough, you’ll get to a place where your number is wrong or calculated differently than you’ve assumed.

I think of statistics as uncertainty management. It’s about estimates and figuring out how much you can trust them. Working with data is rarely about getting an exact truth.
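
As a rough sketch of what "how much you can trust them" can look like in practice, here is a percentile bootstrap around a sample mean. The numbers are made up for illustration, the function name `bootstrap_mean_interval` is my own, and the bootstrap is just one of several reasonable ways to put a range around an estimate.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Return (estimate, low, high): the sample mean and a percentile bootstrap interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample the data with replacement many times and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Take the middle `level` share of the resampled means as the interval.
    tail = (1.0 - level) / 2.0
    low = means[int(tail * n_resamples)]
    high = means[int((1.0 - tail) * n_resamples) - 1]
    return statistics.fmean(values), low, high


# Hypothetical measurements, not real data.
sample = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.3, 11.8]
estimate, low, high = bootstrap_mean_interval(sample)
print(f"mean = {estimate:.2f}, 95% interval roughly [{low:.2f}, {high:.2f}]")
```

The single number is the estimate; the interval is the part that says how far to trust it. With only eight values the interval is wide, and it says nothing about whether the values were measured or defined the way you assumed.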