• Emotional dynamics of literary classics

    September 1, 2014  |  Statistics

    Happiness meter for Huck Finn

    As a demonstration of efforts in estimating happiness from language, Hedonometer charts emotion over time for literary classics. The above is the collection of charts for Adventures of Huckleberry Finn by Mark Twain.

    I wish I could say this meant something to me, but comparative literature in high school was never my strong suit. From a totally superficial point of view though, the chart in the top left shows happiness metrics — based on the research of Peter Dodds and Chris Danforth — through the entirety of the book. The chart on the right shows a comparison of book sections, which you can select in the first chart.

  • Unintentional Venn diagram suggests opposite meaning

    August 19, 2014  |  Mistaken Data

    Unintentional Venn Diagram

    Most people probably wouldn't think much about this poster that shows the values of Thomson Reuters. But when you think of the graphic as a Venn diagram, it's hard to see much else.

  • Not automatic

    August 19, 2014  |  Statistics

    It's an absolute myth that you can send an algorithm over raw data and have insights pop up.

    — Jeffrey Heer in For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights by Steve Lohr

  • Crisis Text Line releases trends and data

    August 19, 2014  |  Data Sources

    Crisis Trends

    Crisis Text Line is a service that troubled teens can use to find help with suicidal thoughts, depression, anxiety, and other issues via text messaging. The long-term hope was to anonymize and encode these text messages so that researchers and policy-makers could better understand something typically kept private to the individuals.

    Following through, the organization recently released a look into their data and a sample of encoded messages. (There's a link to download the data at the bottom of the page.)

    The visual part of the release shows when text messages typically come in, and you can subset by issue, state, and days. It could use some work, but it's a good start. Hopefully they keep working on it and release more data as the set grows. It could potentially do a lot of good.

  • How charity: water uses data to do more good

    August 15, 2014  |  Statistics

    Water map

    Scott Harrison, the founder and CEO of charity: water, describes how the organization uses data to improve what they do, both on the ground and internally.

    The first is program data, which is the information we collect on the water programs and projects we fund in 22 countries around the world. The second is data on our donors and supporters — for example, how much time or money they've given, which projects or aspects of our work they are most interested in and how they interact with our website. And finally, we collect internal data on our work in order to increase our effectiveness and efficiency as an organization.

  • Variance be damned

    August 15, 2014  |  Statistics

    Daniel Colman won $15.3 million in The Big One for One Drop poker tournament, but he seems annoyed about it. What he had to say after the win:

    First off, I don't owe poker a single thing. I've been fortunate enough to benefit financially from this game, but I have played it long enough to see the ugly side of this world. It is not a game where the pros are always happy and living a fulfilling life. To have a job where you are at the mercy of variance can be insanely stressful and can lead to a lot of unhealthy habits.

    Clearly we're missing some details here — stuff that would make a person disgusted with winning — but how about that response. Stability and normality for the win. Or, well, in this case, variance for the win. Nevermind, forget it. [via Deadspin]

  • Geography.

    August 8, 2014  |  Mistaken Data

    Geography

    By way of David Kennerr, something in this CNN frame seems off.

  • Markov Chains explained visually

    August 8, 2014  |  Statistics

    Markov chain

    Adding on to their series of graphics to explain statistical concepts, Victor Powell and Lewis Lehe use a set of interactives to describe Markov Chains. Even if you already know what Markov Chains are or use them regularly, you can use the full-screen version to enter your own set of transition probabilities. Then let the simulation run.

    Nice. Should be especially useful for educators.
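
    For readers who want to try the same idea outside the browser, here is a minimal sketch of simulating a chain from a set of transition probabilities. The two states and their probabilities are made up for illustration and are not taken from Powell and Lehe's interactives.

```python
import random


def simulate_markov_chain(transitions, start, steps, seed=None):
    """Walk a Markov chain and return the list of visited states.

    transitions maps each state to a dict of {next_state: probability}.
    """
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(steps):
        next_states = list(transitions[state].keys())
        weights = list(transitions[state].values())
        # Pick the next state using only the current state's row.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


# Hypothetical two-state chain; each row of probabilities sums to 1.
transitions = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.5, "B": 0.5},
}

path = simulate_markov_chain(transitions, start="A", steps=10_000, seed=42)
share_a = path.count("A") / len(path)
print(f"Fraction of time spent in A: {share_a:.3f}")
```

    For this particular chain, the long-run share of time in A should settle near 5/6, since the stationary probability of A is 0.5 / (0.1 + 0.5).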

  • Visual Microphone estimates sound from vibrations in objects

    August 5, 2014  |  Statistics

    A group of researchers from MIT, Microsoft Research, and Adobe Research are experimenting with seemingly inanimate objects as a proxy for sound in the vicinity. They call it the Visual Microphone.

    When sound hits an object, it causes small vibrations of the object's surface. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects—a glass of water, a potted plant, a box of tissues, or a bag of chips—into visual microphones.

    See the demo in the video above. It's impressive. It's also great that there's another use for high-speed video other than watching water balloons pop and guns fire on the Discovery Channel.

    Find more details on the project here.

  • This is Statistics

    July 31, 2014  |  Statistics

    Statistics has an image problem. To the general public, it's old, out of touch, and boring. It's a problem because we place stock in a younger generation that we want to (1) be more data literate and (2) eventually lead the way, or at least participate, in all data-related realms. It's beneficial for everyone.

    This is Statistics is a new push by the American Statistical Association to provide a new perspective that doesn't dwell on sheets of equations.

    From the about:

    We want students and parents to have a better understanding of a field that is often unknown or misunderstood. Statistics is not just a collection of numbers or formulas. It's not just lines, bars or points on a graph. It's not just computing. Statistics is so much more. It's an exciting—even fun—way of looking at the world and gaining insights through a scientific approach that rewards creative thinking.

    In brief: Statistics is not lame.

    If you're reading this, you already know the benefits of learning statistics, but for those who question, at least you have somewhere to send them.

    When I told my parents that I wanted to go to graduate school for statistics, they were concerned. They never pushed me in any direction career-wise, just as long as I tried my best and enjoyed what I did. But, this was the one time they sat me down for a talk.

    Was I sure about this statistics thing? What do people do after? Was I pursuing statistics for the right reasons? It's so much easier to answer those questions now than it was ten years ago. I mean, careers in data are in the news all the time now. I'm glad the ASA is working on making the statistics portion of the data push more obvious.

  • How well we don’t understand probability

    July 29, 2014  |  Statistics

    All Things Considered on NPR ran a fine series on how we interpret probability and uncertainty. It came in five bits (plus one follow-up), each five to ten minutes long. They explore explanations of risk in different areas such as national security, health, and the daily weather and how people interpret the numbers and words.

    A recurring theme was experts who use alternative descriptions for the seemingly concrete numbers.

    Doctors, including Leigh Simmons, typically prefer words. Simmons is an internist and part of a group practice that provides primary care at Mass General. "As doctors we tend to often use words like, 'very small risk,' 'very unlikely,' 'very rare,' 'very likely,' 'high risk,' " she says.

    Not that words always make understanding numeric probability easier. From the social scientist for the National Weather Service:

    And it's not just a numbers game — words used to describe weather can be just as confusing. Take "watch" and "warning," for example.

    "'Watch' means that conditions are ripe for something to happen. 'Warning' means that it is happening — it is imminent," Brown says. "It's easy to get them confused."

    Both the doctor and the social scientist agree that a combination of numbers, words, and a visual explanation could be the best route.

    Some people think we should forgo trying to explain uncertainty to a general public that doesn't understand, but the rejectors themselves don't recognize the importance. Just because you don't understand something doesn't mean you should ignore it.

    Listen to the full series. [via Dart-Throwing Chimp]

  • A more visual world data portal

    July 24, 2014  |  Data Sources

    OECD data portal

    One of the most annoying parts of downloading data from large portals is that you never quite know what you're gonna get. It's a box of chocolates. It's government data sites. It's lists of datasets with vague or unhelpful titles with links to download. Of course, I'd rather have a hodgepodge than nothing at all, but as with most things, there's room for improvement.

    The OECD, which maintains and provides data on the country level, takes steps towards a more helpful portal that makes data grabs less of a headache. With the help of Raureif, 9elements, and Moritz Stefaner, the new portal is still in beta, but there's plenty to like.

  • Large-ish data packages in R

    July 24, 2014  |  Data Sources

    If you've played around with R enough, there comes a time when you just need some data to mess around with. Maybe it's to learn a new method or to make one of your own. R offers some small-ish, clean datasets to poke at, but sometimes you need bigger, messier data. Hadley Wickham from RStudio released four popular large-ish datasets in package form to help you with that.

    I've released four new data packages to CRAN: babynames, fueleconomy, nasaweather and nycflights13. The goal of these packages is to provide some interesting, and relatively large, datasets to demonstrate various data analysis challenges in R. The package source code (on github, linked above) is fully reproducible so that you can see some data tidying in action, or make your own modifications to the data.

    Good.

  • Polling for stress

    July 9, 2014  |  Statistics

    Stress test

    NPR, the Robert Wood Johnson Foundation and the Harvard School of Public Health conducted a survey about people's stress levels and the factors contributing to that stress. It took place for about a month. NPR has started summarizing the findings as part of what will be a two-week segment on the air and online.

    The above shows the percent of respondents in the age brackets who said the factors (the rows in this case) contributed to their current stress. It looks like I might be in a less stressful stage of my life, between the age of 30 and 39.

    It's just an early summary of poll responses right now, so I'm hoping they go into more detail about statistically significant differences between demographics and how the 2,500-person sample correlates to the US population.
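
    As a rough point of reference, here is a back-of-the-envelope sketch of the sampling margin of error for a sample of 2,500. It assumes simple random sampling and a 95 percent confidence level, which may not match how the poll was actually conducted or weighted, and the function name is just for illustration.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    n: sample size
    p: assumed proportion (0.5 gives the widest margin)
    z: critical value (1.96 corresponds to roughly 95% confidence)
    """
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(2500)
print(f"Margin of error: +/- {moe * 100:.2f} percentage points")
```

    Under those assumptions the margin comes out to about two percentage points for the full sample. The age brackets in the chart are smaller subsets, so their margins would be wider.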

  • Data science, big data, and statistics – all together now

    July 2, 2014  |  Statistics

    Terry Speed, an emeritus professor in statistics at the University of California, Berkeley, gave an excellent talk on how statisticians can play nice with big data and data science. Usually these talks go in the direction of saying data science is statistics. This one is more on the useful, non-snarky side.

  • Subpar Captain America

    July 1, 2014  |  Statistics

    Animation Domination High-Def has a Captain America video of things that America is not so good at, relative to other countries. And they even cited their data source, the CIA World Factbook. How about that.

  • Test your statistical wits about stuff in the world

    June 24, 2014  |  Statistics

    How wrong you are

    Many of us aren't aware of how one country compares to others or public policy that has been around for decades. How Wrong You Are is a simple quiz game by Moiz Syed and Juliusz Gonera that tests such knowledge.

    How Wrong You Are is a collection of important questions that people are sometimes misinformed about. We poll you to measure how right — or how wrong — the public is about these important questions.

    Every week, we will add a new question. These are all questions that we hope you already know. But if you don't, don't worry! You learned something. Share your results, successful or not. Chances are, if you didn't know this question, other people might not, either.

    Play the game here. At the very least, you'll learn something new.

  • Lessons from improperly anonymized taxi logs

    June 23, 2014  |  Statistics

    Through a Freedom of Information request, Chris Whong received and eventually released NYC taxi logs starting in 2013 (about 173 million trips). Vijay Pandurangan looked at the data a little closer and deanonymized the logs to link hashed license numbers to the driver names. It didn't take much to do it. Pandurangan described the process and lessons organizations can learn when they release data.

    Someone on Reddit pointed out that one specific driver seemed to be doing an incredible amount of business. When faced with anomalous data like that, it's good practice to weed out data error before jumping to conclusions about cheating taxi drivers. Also, I couldn't shake the feeling that there was something about that encoded id number: "CFCD208495D565EF66E7DFF9F98764DA." After a little bit of poking around, I realised that that code is actually the MD5 hash of the character '0'. This proved my suspicion that this was actually a data collection error, but also made me immediately realise that the entire anonymization process was flawed and could easily be reversed.

    He also provided the code snippet he used to do it.
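
    To see why an unsalted hash of a small set of possible IDs is easy to reverse, here is a minimal sketch. It is not Pandurangan's code, and the ID format (numeric strings from 0 to 99,999) is a hypothetical stand-in for the real license formats.

```python
import hashlib


def md5_hex(text):
    """Return the uppercase hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()


# The suspicious ID from the logs is the MD5 hash of the character '0'.
print(md5_hex("0"))


def build_lookup(candidates):
    """Hash every candidate ID once and map digest -> original ID."""
    return {md5_hex(c): c for c in candidates}


# Hypothetical ID space: when the set of possible inputs is this small,
# hashing all of them takes well under a second.
lookup = build_lookup(str(n) for n in range(100_000))

# Reversing a "hashed" ID is then just a dictionary lookup.
print(lookup["CFCD208495D565EF66E7DFF9F98764DA"])
```

    The point is that hashing alone does not anonymize values drawn from a small, guessable set, because anyone can precompute the hash of every possible input.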

  • Data grab bag

    June 12, 2014  |  Statistics

    — When you deal with data, you can think like a statistician, even if you don't know the math (although it will certainly help a lot). Jonathan Stray brings up fine points to draw conclusions from data, as does Jacob Harris in a detailed case study on distrusting your data.

    — Learning data science still seems like a fuzzy, abstract idea. Trey Causey offers advice on getting started with the bubbling field.

    — Is college still worth it? Yes.

    — Coding isn't easy. If it were, everyone would do it.

    — R gotchas.

  • Government Data

    How to Make Government Data Sites Better

    Accessing government data from the source is frustrating. If you've done it, or at least tried to, you know the pain that is oddly formatted…
Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.