• Reviewing Deborah Stone’s Counting and Tim Harford’s The Data Detective, Hannah Fry discusses the usefulness of data and its limitations for The New Yorker:

    Numbers are a poor substitute for the richness and color of the real world. It might seem odd that a professional mathematician (like me) or economist (like Harford) would work to convince you of this fact. But to recognize the limitations of a data-driven view of reality is not to downplay its might. It’s possible for two things to be true: for numbers to come up short before the nuances of reality, while also being the most powerful instrument we have when it comes to understanding that reality.

    This builds on Fry’s similarly themed article from a couple of years ago, as well as her book Hello World.

    Data is limited, and the better we understand those limitations, the better use we can get out of what’s there.

  • For ProPublica, Ken Schwencke reports on a poor data system that relies on local law enforcement to voluntarily enter data:

    Local law enforcement agencies reported a total of 6,121 hate crimes in 2016 to the FBI, but estimates from the National Crime Victimization Survey, conducted by the federal government, pin the number of potential hate crimes at almost 250,000 a year — one indication of the inadequacy of the FBI’s data.

    “The current statistics are a complete and utter joke,” said Roy Austin, former deputy assistant attorney general in the Department of Justice’s civil rights division. Austin also worked at the White House on data and civil rights and helped develop an open data plan for police data.

    Garbage in, garbage out.
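
    To put the gap in perspective, here is a quick back-of-the-envelope sketch using only the two figures quoted above. The variable and function names are mine, and the survey figure is "almost 250,000," so treat the results as rough.

```python
# Figures quoted in the ProPublica excerpt above.
reported_to_fbi = 6_121    # hate crimes reported to the FBI by local agencies in 2016
survey_estimate = 250_000  # "almost 250,000" per year, so this is an approximation


def coverage(reported: int, estimated: int) -> float:
    """Return the share of estimated incidents that appear in the reported count."""
    return reported / estimated


share = coverage(reported_to_fbi, survey_estimate)
ratio = survey_estimate / reported_to_fbi

print(f"Reported share of the survey estimate: {share:.1%}")
print(f"Survey estimate is roughly {ratio:.0f} times the reported count")
```

    By this rough math, the FBI count covers only about 2.4% of the survey estimate, which is about 41 times larger.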

  • I was curious who played for a single team over their entire career, who skipped around, and how the patterns changed over the decades.

  • In what seems to have become a trend of making more and more detailed election maps, NYT’s The Upshot mapped results down to the addresses of 180 million voters:

    The maps above — and throughout this article — show their estimates of partisanship down to the individual voter, colored by the researchers’ best guess based on public data like demographic information, voter registration and whether voters participated in party primaries.

    We can’t know how any individual actually voted. But these maps show how Democrats and Republicans can live in very different places, even within the same city, in ways that go beyond the urban-suburban-rural patterns visible in aggregated election results.

    The estimates are based on research by Jacob Brown and Ryan Enos, recently published in Nature. You can also look at their data via the Harvard Dataverse.

  • Russell Jeung, chair of the Asian American Studies Department at San Francisco State University, spoke to NPR about the recent rise in hate incidents against Asian Americans:

    What we’ve discovered isn’t that we’ve just had a spike, but we’ve had a surge over the entire year last year with COVID-19 and with the president’s political rhetoric in the last administration. We now have over 3,000 incidents and hate-filled incidents where people are tormenting Asian Americans. I can’t describe the actual amount of hate that Asian American community is experiencing now. We have over 11% of our cases where we’re getting pushed and shoved and actually physically assaulted.

    Ugh.

  • For Kontinentalist, Isabella Chua took a dive into the evolution of Chinese names:

    Put simply, names encode the wishes parents have for their children.

    So, what were these wishes? For answers, I turned to the Chinese name database, which covers the surname and given-name characters for almost all 1.2 billion Han Chinese—the ethnic majority in China—individuals born between 1930 and 2008. I’ve focused only on given names here rather than surnames; given names are subject to parents’ discretion, whereas surnames are inherited.

    If you’re unfamiliar with Chinese names, Chua provides good explanations and audio pronunciations to make it easier to follow along.

  • For Quartz, Amanda Shendruk and Marc Bain analyzed skin tones that appeared in beauty and fashion ads on Instagram. The graphics use Blackout Tuesday on June 2, 2020, when many brands vowed to improve diversity to better reflect the world, as a point of comparison. Using median skin color as the main metric, some companies shifted more than others.

  • As a lead-in and backdrop to a timeline of the past year by The Washington Post, an animated dot density map represents Covid-19 deaths. “Every point of light is a life lost to coronavirus.”

  • As part of their Citizen Browser project to inspect Facebook, The Markup shows a side-by-side comparison between Facebook feeds for different groups, based on the feeds of 1,000 paid participants.

    There are pretty big differences in news sources and group suggestions, but the differences in news stories don’t seem as big as you might think, with a median difference of 3 percentage points between groups. The distribution shows a wider spread, though.
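
    A minimal sketch of why a small median can sit alongside a wide spread. The numbers below are made up for illustration and are not The Markup's data; only the 3-point median is chosen to echo the figure above.

```python
from statistics import median

# Hypothetical percentage-point gaps between two groups, one value per news story.
# These values are invented for illustration and are NOT from Citizen Browser.
gaps = [0.5, 1.0, 2.0, 3.0, 3.0, 4.0, 9.0, 15.0, 22.0]


def summarize(values: list[float]) -> dict[str, float]:
    """Return the median along with the minimum, maximum, and range of the values."""
    return {
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = summarize(gaps)
print(summary)
```

    In this made-up list, the median gap is 3 points, but individual stories range from 0.5 to 22 points, so the median alone hides the stories where the groups saw very different things.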

  • Alicia Parlapiano and Josh Katz, reporting for NYT’s The Upshot, plotted the average aid for different groups, as outlined in the March 2021 stimulus bill. The estimates come from a new analysis by the Tax Policy Center, and the distribution of aid contrasts sharply with that of the 2017 Tax Cuts and Jobs Act.

    Check out the full Upshot chart, which shows single and married households with up to three or more children. There are a few visual encodings going on here with the axes, bubble size, color, and income group labels.

  • Seeing CO2, by design studio Extraordinary Facility, is a playable data visualization that imagines what the world would look like if carbon dioxide were visible. You drive a car around collecting bits of information about carbon dioxide in our environment, and along the way, you’ll see volumes of CO2 compared against well-known structures. Pretty great.

  • Members Only

    Bad charts get made. It’s inevitable. Sometimes I wonder how it happens though.

  • BirdCast, from Colorado State University and the Cornell Lab of Ornithology, shows current forecasts for where birds are headed over the United States:

    Bird migration forecasts show predicted nocturnal migration 3 hours after local sunset and are updated every 6 hours. These forecasts come from models trained on the last 23 years of bird movements in the atmosphere as detected by the US NEXRAD weather surveillance radar network. In these models we use the Global Forecasting System (GFS) to predict suitable conditions for migration occurring three hours after local sunset.

  • Chris Ume, with the help of Tom Cruise impersonator Miles Fisher, created highly believable deepfakes of Tom Cruise and posted the videos to TikTok. Ume showed the breakdown of the arduous process of training the A.I. model and editing each frame.

    The Verge talked to Ume more about the process:

    “You can’t do it by just pressing a button,” says Ume. “That’s important, that’s a message I want to tell people.” Each clip took weeks of work, he says, using the open-source DeepFaceLab algorithm as well as established video editing tools. “By combining traditional CGI and VFX with deepfakes, it makes it better. I make sure you don’t see any of the glitches.”

    The results are both entertaining and worrisome.

  • We already looked at minimum wage over time, but when it comes to geography and income, you also have to consider the cost of living for a fair comparison.

  • Speaking of A.I. and fiction, Adam Epstein for Quartz reported on how Wattpad, the platform for people to share stories, uses machine learning to find potential movies:

    Wattpad uses a machine-learning program called StoryDNA to scan all the stories on its platform and surface the ones that seem like candidates for TV or film development. It works on both macro and micro levels, analyzing big-picture audience engagement trends to identify the genres picking up steam, while also looking at the specific stories that got popular quickly and calculating what made them so appealing.

    The tool can break stories down to their vocabularies and sentence structures (a story’s “DNA,” if you will) and then compare those to other stories to deduce what really makes a work of fiction popular. It also looks at how often users comment on stories and, when they do, what exactly they’re saying. Its goal is to examine all these clues to uncover the precise combination of story elements—genre, emotion, grammar, the list goes on—that hooks audiences to the point they’ll follow its journey onto a visual medium.

    Maybe I’m just getting old, but this sounds terrible.

  • Pamela Mishkin, in collaboration with The Pudding, wrote a love story. It’s not just any love story though. The text is based on Mishkin’s own experiences and input from GPT-3, the language prediction model that produces text that sounds like it came from a human.

    The result resembles a cross between choose-your-own-adventure and Mad Libs.

  • The Royal Statistical Society published ten lessons governments should take away from this year, which should naturally apply to standard data practice:

    1. Invest in public health data – which should be regarded as critical national infrastructure and a full review of health data should be conducted 
    2. Publish evidence – all evidence considered by governments and their advisers must be published in a timely and accessible manner
    3. Be clear and open about data – government should invest in a central portal, from which the different sources of official data, analysis protocols and up-to-date results can be found
    4. Challenge the misuse of statistics – the Office for Statistics Regulation should have its funding augmented so it can better hold the government to account
    5. The media needs to step up its responsibilities – government should support media institutions that invest in specialist scientific and medical reporting
    6. Build decision makers’ statistical skills – politicians and senior officials should seek out statistical training
    7. Build an effective infectious disease surveillance system to monitor the spread of disease – the government should ensure that a real-time surveillance system is ready for future pandemics
    8. Increase scrutiny and openness for new diagnostic tests – similar steps to those adopted for vaccine and pharmaceutical evaluation should be followed for diagnostic tests
    9. Health data is incomplete without social care data – improving social care data should be a central part of any review of UK health data
    10. Evaluation should be put at the heart of policy – efficient evaluations or experiments should be incorporated into any intervention from the start.

    See the full report here.

  • There was a lot of uncertainty at the beginning of the pandemic, so forecasts varied across sources, and there were many of them. Youyang Gu provided one of those forecasts, and it predicted well. Ashlee Vance reported for Bloomberg on Gu’s Covid-19 forecasting work:

    The novel, sophisticated twist of Gu’s model came from his use of machine learning algorithms to hone his figures. After MIT, Gu spent a couple years working in the financial industry writing algorithms for high-frequency trading systems in which his forecasts had to be accurate if he wanted to keep his job. When it came to Covid, Gu kept comparing his predictions to the eventual reported death totals and constantly tuned his machine learning software so that it would lead to ever more precise prognostications. Even though the work required the same hours as a demanding full-time job, Gu volunteered his time and lived off his savings. He wanted his data to be seen as free of any conflicts of interest or political bias.

    Reading this, it felt a little bit like cherry-picking the forecast that was best, but I don’t know enough to decide. It does seem to highlight, though, some of the limitations of larger organizations, which don’t always have the best point of view.