• Jake Porway, executive director of DataKind, on data hackathons and why they require careful planning to actually work:

    Any data scientist worth their salary will tell you that you should start with a question, NOT the data. Unfortunately, data hackathons often lack clear problem definitions. Most companies think that if you can just get hackers, pizza, and data together in a room, magic will happen. This is the same as if Habitat for Humanity gathered its volunteers around a pile of wood and said, “Have at it!” By the end of the day you’d be left with a half of a sunroom with 14 outlets in it.

    Without subject matter experts available to articulate problems in advance, you get results like those from the Reinvent Green Hackathon. Reinvent Green was a city initiative in NYC aimed at having technologists improve sustainability in New York. Winners of this hackathon included an app to help cyclists “bikepool” together and a farmer’s market inventory app. These apps are great on their own, but they don’t solve the city’s sustainability problems. They solve the participants’ problems because as a young affluent hacker, my problem isn’t improving the city’s recycling programs, it’s finding kale on Saturdays.

    Without clear direction on what to do with the data or questions worth answering, hackathons can end up being a bust from all angles. From the organizer side, you end up with a hodgepodge of projects that vary a lot in quality and purpose. From the participant side, you're left to your own devices and have to approach the data blind, without a clear starting point. From the judging side, you almost always end up having to pick a winner when there isn't a clear one, because the criteria of the contest were fuzzy to begin with.

    This also applies to hiring freelancers for visualization work. You should have a clear goal or story to tell with your data. If you expect the hire to analyze your data and produce a graphic, you better get someone with a statistics background. Otherwise, you end up with a design-heavy piece with little substance.

    Basically, the more specific you can be about what you’re looking for, the better.

  • Self-tracking devices are all the rage these days. I went to the Apple store, and there was practically a whole wall of them. They were all uni-taskers, though. There was one for cycling, another for running, and one for golfing. Amiigo, an Indiegogo campaign with four days left to contribute (but funded to completion five times over as of this writing), aims to track multiple exercises and figure out which exercise you're doing automatically.

  • Lois Beckett for ProPublica has a thorough piece on data brokers — companies that collect and sell information about you — and what they know and where they get the data from.

    They start with the basics, like names, addresses and contact information, and add on demographics, like age, race, occupation and “education level,” according to consumer data firm Acxiom’s overview of its various categories.

    But that’s just the beginning: The companies collect lists of people experiencing “life-event triggers” like getting married, buying a home, sending a kid to college — or even getting divorced.

    Credit reporting giant Experian has a separate marketing services division, which sells lists of “names of expectant parents and families with newborns” that are “updated weekly.”

    The companies also collect data about your hobbies and many of the purchases you make. Want to buy a list of people who read romance novels? Epsilon can sell you that, as well as a list of people who donate to international aid charities.

    So if you’re wondering why you received that catalog in the mail, it was probably because a store sold your purchase data to a broker.

  • My many thanks to FlowingData sponsors who help keep the lights on around here. Check ’em out. They help you do stuff with data.

    InstantAtlas — Enables information analysts and researchers to create highly-interactive online reporting solutions that combine statistics and map data to improve data visualization, enhance communication, and engage people in more informed decision making.

    Tableau Software — Helps people see and understand data. Ranked by Gartner in 2011 as the world’s fastest growing business intelligence company, Tableau helps anyone quickly and easily analyze, visualize and share information.

    Periscopic — A socially conscious data visualization firm that specializes in using technology to help companies and organizations facilitate information transparency and public awareness. They do good with data.

    Column Five Media — Whether you are a startup that is just beginning to get the word out about your product, or a Fortune 500 company looking to be more social, they can help you create exciting visual content – and then ensure that people actually see it.

    Want to sponsor FlowingData? Send interest to [email protected] for more details.

  • When we build models of the world, we often think of it as broken down into pieces, such as cities, counties, and countries. In their newly funded project The City of 7 Billion, architects Joyce Hsiang and Bimal Mendis aim to model the world as one city, to study the impact of population growth on the environment and natural resources on a larger scale.

    Every corner of the planet, they argue, is “urban” in some sense, touched by farming that feeds cities, pollution that comes out of them, industrialization that has made urban centers what they are today. So why not think of the world as a single urban entity?

    Hsiang and Mendis don’t yet know exactly what this will look like (that is the question, Mendis says). But they are planning to seed their geo-spatial model with worldwide data on population growth, economic and social indicators, topography, ecology and more. Ultimately, they hope, other researchers will be able to use the open-source platform for research on development patterns or air quality; the public will be able to use it to grasp the implications of building in a flood plain or implementing an energy policy; and architects will be able to use it to view the world as if it were a single project site.

    Along with a slew of other challenges, I'm sure, one of the big ones is finding comparable data at high granularity. Large cities tend to track (and hopefully release) data about what's going on, but once you step out of the major areas, data grows scarce.

    They started with population, which was transformed into the physical installation above.

  • Along the same lines as Google Flu Trends, researchers at Microsoft, Stanford and Columbia University are investigating whether search data can be used to find interactions between drugs. They recently found an interaction.

    Using automated software tools to examine queries by six million Internet users taken from Web search logs in 2010, the researchers looked for searches relating to an antidepressant, paroxetine, and a cholesterol lowering drug, pravastatin. They were able to find evidence that the combination of the two drugs caused high blood sugar.

    The idea is that people search for symptoms and medications, and those queries are stored in anonymized search logs. The researchers then followed a suspicion that taking the two drugs at the same time might cause hyperglycemia. Those who searched for both drugs were more likely to also search for hyperglycemia than a control group, most likely users who didn't search for the drug pair. A rough sketch of that kind of comparison is below.

    The work is still in its infancy, but it’ll be interesting to see how this sort of data can be used to supplement existing work by the Food and Drug Administration.
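
    To make the comparison concrete, here is a minimal Python sketch of that kind of co-occurrence check. Everything in it is made up for illustration (the toy log, the grouping, the terms); the actual study ran over logs from six million users and is far more careful about confounders.

        from collections import defaultdict

        # Hypothetical anonymized search log as (user id, query) pairs.
        log = [
            ("u1", "paroxetine dosage"),
            ("u1", "pravastatin side effects"),
            ("u1", "hyperglycemia symptoms"),
            ("u2", "pravastatin"),
            ("u2", "blurry vision"),
            ("u3", "paroxetine"),
            ("u3", "pravastatin"),
        ]

        DRUG_A, DRUG_B, SYMPTOM = "paroxetine", "pravastatin", "hyperglycemia"

        # Group each user's queries together.
        queries_by_user = defaultdict(list)
        for user, query in log:
            queries_by_user[user].append(query.lower())

        def searched(queries, term):
            return any(term in q for q in queries)

        # Users who searched for both drugs vs. everyone else (the control).
        exposed, control = [], []
        for queries in queries_by_user.values():
            if searched(queries, DRUG_A) and searched(queries, DRUG_B):
                exposed.append(queries)
            else:
                control.append(queries)

        def symptom_rate(group):
            return sum(searched(q, SYMPTOM) for q in group) / len(group) if group else 0.0

        # A noticeably higher rate in the exposed group is the kind of signal
        # that would flag a possible drug interaction for follow-up.
        print("searched both drugs:", symptom_rate(exposed))
        print("control group:      ", symptom_rate(control))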

  • Members Only

    Although time series plots and small multiples can go a long way, animation can make your data feel more real and relatable. Here is how to do it in R via the animated GIF route.
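
    The tutorial itself is for members, but the general shape of the approach is: draw one plot per time step, save each as an image, and stitch the images into a GIF. Here is a rough Python analogue of that frame-by-frame route, with placeholder data rather than the tutorial's code.

        import io

        import matplotlib.pyplot as plt
        import numpy as np
        from PIL import Image

        # Placeholder time series; swap in your own data.
        rng = np.random.default_rng(0)
        values = np.cumsum(rng.normal(size=60))

        # Render one frame per step of the series.
        frames = []
        for i in range(2, len(values) + 1):
            fig, ax = plt.subplots(figsize=(4, 3))
            ax.plot(values[:i], color="steelblue")
            ax.set_xlim(0, len(values))
            ax.set_ylim(values.min() - 1, values.max() + 1)
            ax.set_title("Step %d" % i)

            buf = io.BytesIO()
            fig.savefig(buf, format="png", dpi=80)
            plt.close(fig)
            buf.seek(0)
            frames.append(Image.open(buf).convert("P"))

        # Stitch the frames into a looping animated GIF.
        frames[0].save(
            "timeseries.gif",
            save_all=True,
            append_images=frames[1:],
            duration=80,  # milliseconds per frame
            loop=0,
        )

    In R the idea is the same: write out a plot per frame and let a tool like ImageMagick or the animation package assemble the GIF.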

  • These days it’s relatively easy to figure out connections between people via email, Twitter, Facebook, etc. However, it’s harder to decipher relationships between people in the 17th century. Researchers at Carnegie Mellon and Georgetown University aim to figure that out in the Six Degrees of Francis Bacon.

    Historians and literary critics have long studied the way that early modern people associated with each other and participated in various kinds of formal and informal groups. Yet their scholarship, published in countless books and articles, is scattered and unsynthesized. By data-mining existing scholarship that describes relationships between early modern persons, documents, and institutions, we have created a unified, systematized representation of the way people in early modern England were connected.
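
    The result is essentially a big social graph, mined from scholarship instead of server logs. As a loose illustration only (not the project's data or code), here is how a few relationships of that kind could be represented and queried with Python's networkx:

        import networkx as nx

        # A few illustrative relationships of the kind the project mines from
        # existing scholarship: (person, person, type of association).
        relations = [
            ("Francis Bacon", "William Cecil", "kinship"),
            ("Francis Bacon", "Thomas Hobbes", "employment"),
            ("Thomas Hobbes", "William Cavendish", "patronage"),
            ("William Cecil", "Elizabeth I", "service"),
        ]

        G = nx.Graph()
        for a, b, kind in relations:
            G.add_edge(a, b, relation=kind)

        # "Six degrees"-style questions: how far apart are two people?
        print(nx.shortest_path(G, "Francis Bacon", "William Cavendish"))
        # ['Francis Bacon', 'Thomas Hobbes', 'William Cavendish']
        print(nx.shortest_path_length(G, "Francis Bacon", "Elizabeth I"))  # 2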

  • The United States Census Bureau just released county-level commute estimates for 2011, based on the American Community Survey (that thing so many people seem to be against).

    About 8.1 percent of U.S. workers have commutes of 60 minutes or longer, 4.3 percent work from home, and nearly 600,000 full-time workers had “megacommutes” of at least 90 minutes and 50 miles. The average one-way daily commute for workers across the country is 25.5 minutes, and one in four commuters leave their county to work.

    The Bureau graphic isn’t very good [PDF], but WNYC plugged the data into a map, which is a lot more informative.

    There's also a link in the bottom left of the WNYC map to download the data in CSV format, in case you want to try your hand at making a choropleth map. Or you can grab some flow data from the Census Bureau.
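
    If you go the roll-your-own route, a county choropleth is mostly a join and a plot. Here is a minimal sketch with geopandas; the file names and columns (a five-digit "fips" key and a "mean_commute_minutes" value) are assumptions about how you might prepare the CSV and a county shapefile, not a description of the WNYC download as-is.

        import geopandas as gpd
        import matplotlib.pyplot as plt
        import pandas as pd

        # Hypothetical inputs: the commute CSV keyed by county FIPS code, and a
        # county boundary shapefile (the Census Bureau publishes these).
        commutes = pd.read_csv("county_commutes.csv", dtype={"fips": str})
        counties = gpd.read_file("us_counties.shp")

        # Build a five-digit FIPS code from the shapefile's state and county
        # fields, then join the commute estimates onto the geometries.
        counties["fips"] = counties["STATEFP"] + counties["COUNTYFP"]
        merged = counties.merge(commutes, on="fips", how="left")

        # Shade each county by its average one-way commute time.
        ax = merged.plot(
            column="mean_commute_minutes",
            cmap="YlOrRd",
            legend=True,
            figsize=(12, 7),
            missing_kwds={"color": "lightgray"},
        )
        ax.set_axis_off()
        ax.set_title("Average one-way commute by county (minutes)")
        plt.savefig("commute-choropleth.png", dpi=150, bbox_inches="tight")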

  • An old one from xkcd. I’m not sure whether to laugh or cry, but I think he’s implying that people who make graphs on weekends are super dateable.

  • Who’s going to be the next pope? I know all of you are sitting on the edge of your seats. Luckily, an analytical research manager who goes by the name AJ hacked together a pope tracker.

    Despite not being Catholic, the papal election fascinates me. Not sure if it’s the old rituals, the world-wide interest, or simply the fact that the Catholic Church has left a huge mark on history.

    There’s no way I know enough about the inner workings of the Catholic Church to have any idea on who the next Pope may be.

    Since domain knowledge is out, the next best option?

    Follow the money!

    He’s scraping odds of possible candidates becoming pope from a betting site, and the above shows the numbers over time. The odds were bumpy at first, but there seems to be some convergence, and as of this writing, Cardinal Peter Turkson from Ghana is the heavy favorite. [via Revolutions]
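
    The mechanics of a tracker like this are straightforward: scrape the odds on a schedule, append them to a file with a timestamp, and plot the history. Here is a bare-bones Python sketch of that loop; the URL and page markup are invented, so the parsing would need to match whatever site you actually scrape.

        import csv
        from datetime import datetime, timezone

        import requests
        from bs4 import BeautifulSoup

        # Made-up page and markup; the real tracker's source and parsing differ.
        URL = "https://example.com/papal-odds"

        def scrape_odds():
            """Return {candidate: odds} parsed from a (hypothetical) odds table."""
            soup = BeautifulSoup(requests.get(URL, timeout=30).text, "html.parser")
            odds = {}
            for row in soup.select("table.odds tr"):
                cells = [c.get_text(strip=True) for c in row.find_all("td")]
                if len(cells) >= 2:
                    odds[cells[0]] = cells[1]
            return odds

        def append_snapshot(path="papal-odds.csv"):
            """Append one timestamped snapshot per run (say, hourly via cron)."""
            stamp = datetime.now(timezone.utc).isoformat()
            with open(path, "a", newline="") as f:
                writer = csv.writer(f)
                for candidate, odds in scrape_odds().items():
                    writer.writerow([stamp, candidate, odds])

        if __name__ == "__main__":
            append_snapshot()

    Plot the accumulated snapshots as one line per candidate and you get something like the chart above.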

  • I'm not into video games, and my experience has been near zero since high school, but I'm excited about SimCity 2013 coming out tomorrow. I think my excitement comes from one part nostalgia and one part GlassBox — the game engine that drives the simulations of the city you build and its citizens.

    All the glowing reviews probably have something to do with interest, too. But that memory of installing SimCity 2000 from two floppy disks in my 486 totally brings back happy thoughts.

    Apparently, the game makers were inspired by Google Maps and information graphics to display the data generated during gameplay. I hope Maxis releases some of that data. It could be fun to compare SimCity demographics to the real world. Then again, who’s going to have time to look at the data, when we’ll be too busy building arcologies?

  • Andrew Leonard for Salon fears what might become of the creative process if movies are based on algorithms and data, and that we might turn into puppets.

    For years Netflix has been analyzing what we watched last night to suggest movies or TV shows that we might like to watch tomorrow. Now it is using the same formula to prefabricate its own programming to fit what it thinks we will like. Isn’t the inevitable result of this that the creative impulse gets channeled into a pre-built canal?

    Because tastes never change? We don’t have any choice but to watch what is handed to us? Will creators stop making things that go against the norm? Leonard concludes with us stuck in a trance, in front of our televisions.

    The companies that figure out how to generate intelligence from that data will know more about us than we know ourselves, and will be able to craft techniques that push us toward where they want us to go, rather than where we would go by ourselves if left to our own devices. I’m guessing this will be good for Netflix’s bottom line, but at what point do we go from being happy subscribers, to mindless puppets?

    Again, the assumption is that we have no say in the matter. But when a company or service suggests that we buy or watch something, we don’t have to follow.

    Netflix in particular thrives by providing a service that shows us what they think we might want to watch from a selection of thousands of options. Part of that algorithm depends on our own movie ratings and preferences. If Netflix offers poor suggestions, you can leave the service. Yeah. You can stop paying 8 bucks a month.

    Let's turn it around. What if Netflix analyzed viewing data not to offer their best viewing suggestions or to make shows and movies that people like, but to expand people's viewing horizons? Let's say the data shows that you watch a lot of "witty, critically acclaimed comedies", so Netflix suggests you watch more "romantic dramas" to make you more well-rounded. Are you a mindless puppet if you take the suggestion, even if you end up hating the movie? Are you a mindless puppet if you ignore the suggestion and continue watching what you know you like?

    From the production perspective, it makes sense to try to make something a lot of people like. From the consumer perspective, we still get to decide what we want to spend our money on.

    It's good to be concerned about how companies use personal data. Data privacy, ownership, and ethics are important issues, but that concern shouldn't turn into a fear of all things data.