• Statistics professor Di Cook was one of the first people I ever talked to about visualization. She has a short Q&A over at StatsChat.

    I spent a few years doing that [a research assistant] and then realised I’d really like to make art, because some of the research-assistant work I was doing was computer graphics for data online. It fed into my art instincts from teenage years, so I spent some time as an artist before finding a graduate programme in statistics in the US that focused on data visualisation.

  • If you want to analyze bodies of text, it’s good to know how to use regular expressions. That way you can programmatically extract complex text patterns instead of marking and encoding items manually. Thomas Nield for O’Reilly provides an introduction:

    Many data science, analyst, and technology professionals have encountered regular expressions at some point. This esoteric, miniature language is used for matching complex text patterns, and looks mysterious and intimidating at first. However, regular expressions (also called “regex”) are a powerful tool that only require a small time investment to learn. They are almost ubiquitously supported wherever there is data.

    Nield says it isn’t a steep learning curve, which I agree with, but I would suggest not trying to learn every part of the syntax. Learn it piecewise, and it’ll seem like less of a jumble of brackets, periods, and question marks.

    See also RegExr. It’s an interactive tool that lets you paste a body of text and then enter regular expressions to see what matches your pattern in real time.
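    To make the "learn it piecewise" idea concrete, here is a minimal sketch in Python using the standard `re` module. The sample text and both patterns are my own assumptions for illustration, not taken from Nield's introduction, and the email pattern is deliberately simplified.

```python
import re

# Hypothetical sample text, made up for illustration.
text = (
    "Contact ana@example.com by 2017-12-14, "
    "or reach bo.li@example.org after 2018-01-02."
)

# Dates in YYYY-MM-DD form: four digits, a dash, two digits, a dash, two digits.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# A deliberately simple email pattern (an assumption for this sketch);
# real-world addresses can be more complicated than this allows.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract(pattern, body):
    """Return every non-overlapping match of pattern in body, left to right."""
    return pattern.findall(body)


dates = extract(DATE_PATTERN, text)
emails = extract(EMAIL_PATTERN, text)

print(dates)
print(emails)
```

    Each pattern is built from a few small pieces (`\d` for a digit, `{4}` for a repeat count, `\b` for a word boundary), which is about as much syntax as you need to get started.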

  • The New York Times is back at it in explaining the creative process. A couple of years ago they explained the making of a Justin Bieber song. This time they talked to Ed Sheeran and his collaborators about the making of their hit song Shape of You. The musicians talk and the visualization serves as a backdrop.

  • Hearing about machine learning and algorithms a lot recently and not sure what that means? CGP Grey explains:

  • For the past few months, Cards Against Humanity polled the American public to ask important questions such as whether or not it is okay to pee in the shower.

    To conduct our polls in a scientifically rigorous manner, we’ve partnered with Survey Sampling International — a professional research firm — to contact a nationally representative sample of the American public. For the first three polls, we interrupted people’s dinners on both their cell phones and landlines, and a total of about 3,000 adults didn’t hang up immediately. We examined the data for statistically significant correlations, and boy did we find some stuff.

    The poll is in the context of political leanings, which leads to some interesting cross-sections.

    Maybe the best part though is that CAH will continue to poll for a full year, and you can download the data, which I am sure makes for a fun class project. They are also asking social scientists for question suggestions that would otherwise go unasked by more traditionally funded public polling.

  • Here’s a different look at tax cuts and increases from Reuben Fischer-Baum for The Washington Post. As Fischer-Baum points out, keep in mind that these are just estimates and the calculations vary:

    Analyses that use data from real taxpayers as their starting point – like the calculator put together by the New York Times – produce lower estimates. Other calculators like the one put together by the Wall Street Journal produce similar results to ours. For example, a household in D.C. filing jointly with two kids under 17, earning a total of $150,000 and itemizing $20,000 gets a tax cut of $3,796 in our analysis. Roughly equivalent inputs to the New York Times calculator produces an estimated range of a $1,020 to $3,280 cut, while the Wall Street Journal calculator – which is based on the less generous House bill – produces a cut of $3,230.

  • I’m pretty sure this is all that most people want to know. The Upshot provides a tax calculator that considers the Republican tax bill and the variation of taxes between households that earn similar incomes. Punch in some information like income range and marital status, and you get a range of tax cuts or increases for households similar to yours.

  • The David Rumsey Map Collection, known for its many browsable historical maps, now has a “data visualization” subject tag. This means you can now quickly access over 1,000 charts that date back centuries. I’m not sure how long the browser has had the filter available, but I’m glad it’s there. [via @srendgen]

  • I heard you like maps. Jim Vallandingham put together a collection of maps that show multiple variables, for inspiration and perusal.

  • Disney is set to buy 21st Century Fox for $52.4 billion. I honestly don’t have the mental capacity or imagination to comprehend such a large sum, much less figure out how such a deal works. At least Youyou Zhou, reporting for Quartz, provides breakdowns of market share for the two companies, which makes things a bit more understandable. If the deal goes through, Disney is going to be (an even bigger) behemoth.

  • Introducing yourself to R as an Excel user can be tricky, especially when you don’t have much programming experience. It requires that you switch from one mental model of the data that exists in an interactive spreadsheet to one that exists in vectors and lists. Steph de Silva provides a translation of these data structures for Excel users.

  • Research group Euphrates experimented with lines and a ballet dancer’s movements in Ballet Rotoscope:

    By the way, rotoscoping is an old technique used by animators to capture movement. Pictures or video are taken and lines are traced for use in different contexts. [via @Rainmaker1973]

  • Doug Mills, reporting for The New York Times:

    Echoing his days as a real estate developer with the flair of a groundbreaking, Mr. Trump used an oversize pair of scissors to cut a ribbon his staff had set up in front of two piles of paper, representing government regulations in 1960 (20,000 pages, he said), and today — a pile that was about six feet tall (said to be 185,000 pages).

    Interpret as you like.

  • Statistician Kristian Lum described her experiences with harassment as a graduate student at stat conferences. She held back on talking about it for many of the same reasons others have, but then there was a shift and she began warning colleagues.

    I started doing this because I heard that S (for the second time to my knowledge) had taken advantage of a junior person who had had too much to drink. This time, his act had been witnessed first-hand by several professors at the conference. Since then, I have heard one professor who witnessed the incident openly lament that he’ll have to find a way to delicately advise his female students on “how not to get raped by S” so as not to lose promising students.

    What the hell? Unacceptable.

  • As everyone has already checked out for the rest of the year, I’m going to mess around with R to the tune of The Twelve Days of Christmas and maybe throw down a few tips. You’re welcome.

  • Democrat Doug Jones won the Senate race against Republican Roy Moore last night. The Washington Post shows how different demographic groups voted, based on a poll “conducted by Edison Research for the National Election Pool, The Washington Post and other media organizations.”

  • Enrico Bertini, a professor at New York University, delves into the less flashy but equally important branch of visualization: analysis. Much of what Enrico describes applies to the other branches too, so it’s worth the full read:

    One aspect of data visualization I have been discovering over the years is that when we talk about data visualization we often think that the choice of which graphical representation to use is the most important one to make. However, deciding what to visualize is often equally, if not more, important, than deciding how to visualize it. Take this simple example. Sometime a graph provides better answers to a question when the information is expressed in terms of percentages than absolute values. I think it would be extremely helpful if we could better understand and characterize the role data transformation plays in visualization. My impression is that we tend to overemphasize graphical perception when content is what really makes a difference in many cases.

    Getting to that “what” often requires iteration between the analysis and presentation facets of visualization. I spend about the same time on the analysis side as on presentation, and that’s only because I’m more fluent with my analysis tools. I don’t have to spend a lot of time reading documentation. The amount of production during the analysis phase is definitely much higher.
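    The percentages-versus-absolute-values point in the quote is easy to see with a tiny example. This is a minimal sketch in Python, assuming made-up counts for two hypothetical groups. None of the numbers or names come from Bertini's piece.

```python
# Hypothetical counts, made up for illustration only.
counts = {
    "Group A": {"yes": 120, "no": 80},
    "Group B": {"yes": 30, "no": 10},
}


def to_percentages(row):
    """Convert one group's raw counts to percentages of that group's total."""
    total = sum(row.values())
    return {key: round(100 * value / total, 1) for key, value in row.items()}


percentages = {group: to_percentages(row) for group, row in counts.items()}

for group in counts:
    print(group, counts[group], percentages[group])
```

    In absolute terms Group A has four times as many “yes” answers, but as a share of each group, “yes” is more common in Group B. The chart type stays the same in both cases. What changes is what you choose to plot.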

  • Michael Wines, reporting for The New York Times:

    “The politicization of the census would erode what is already fragile trust and confidence in the integrity of the count,” said Vanita Gupta, the president of the Leadership Conference on Civil and Human Rights, which has worked for years on census issues.

    The Trump administration’s heated rhetoric on immigration, race and the trustworthiness of government is fueling fears that minorities, legal and undocumented immigrants and others — from asylum-seekers to victims of the opioid crisis — will be even harder to locate and count. The 2010 census actually overcounted non-Hispanic whites by 0.8 percent and undercounted African-Americans by 2.1 percent and Hispanics by 1.5 percent.

    For context, the overcount and undercount numbers aren’t statistically different from those of the 2000 Census. The Census has always had to account for some groups reporting more than others.

    But much of this comes from a general distrust of government, more so among some than others, and that trust level isn’t exactly on the rise these days. With that, in tandem with an administration not above swaying the numbers, the upcoming census could get messy. As the census approaches, I hope everyone exercises their right to be counted in this country.

  • Data for police shootings is usually the subset that only includes fatalities. Vice News made requests nationwide to get data on people who were shot but not killed by police. To accompany their story, Vice News made the data and code available for download:

    Ultimately, we obtained some data from 47 departments — with 4,099 incidents in all. Departments in New York’s Suffolk and Nassau Counties didn’t provide us with any data. Maryland’s Montgomery County Police Department gave us only partial incident-level information and no total number of police shootings, so we excluded them from the analysis.

    We put all this information together to analyze trends across the departments and to compare them with one another — the first time this has ever been done for both fatal and nonfatal shootings.

    Get the data and look for yourself.

  • NASA. Data. Good.

    Tracking the aerosols carried on the winds let scientists see the currents in our atmosphere. This visualization follows sea salt, dust, and smoke from July 31 to November 1, 2017, to reveal how these particles are transported across the map.

    The first thing that is noticeable is how far the particles can travel. Smoke from fires in the Pacific Northwest gets caught in a weather pattern and pulled all the way across the US and over to Europe. Hurricanes form off the coast of Africa and travel across the Atlantic to make landfall in the United States. Dust from the Sahara is blown into the Gulf of Mexico. To understand the impacts of aerosols, scientists need to study the process as a global system.

    Read more here.