  • Putting Analysis Online With StatCrunch and Covariable [Review]

    Are online statistical tools sufficient to analyze our complex datasets?
  • Grandma, Thank You For Giving Us Something to Smile About

    February 18, 2008  |  Announcements

    My grandma, Jane Yau, passed away a couple of weeks ago, and I attended her funeral this past weekend. It was tough at first seeing her lying there lifeless, because the last time I saw her was about 8 months ago, healthy and smiling. I had to walk away with eyes full of tears. I wondered how in the world I was going to deliver her eulogy.

    I went up again, though, and just looked at her for a long time. She was peaceful, almost like she was sleeping, and I felt this calm come over me. My heartbeat slowed and the sadness left. That was the effect my grandma always had on me.

    I'll miss you, grandma. I hope I can make you proud.

  • How to Read (and Use) a Box-and-Whisker Plot

    February 15, 2008  |  Statistical Visualization

    The box-and-whisker plot is an exploratory graphic, created by John W. Tukey, used to show the distribution of a dataset (at a glance). Think of the type of data you might use a histogram with, and the box-and-whisker (or box plot, for short) could probably be useful.

    The box plot, although very useful, seems to get lost in areas outside of Statistics, but I'm not sure why. It could be that people don't know about it or maybe are clueless on how to interpret it. In any case, here's how you read a box plot.

    Reading a Box-and-Whisker Plot

    Let's say we ask 2,852 people (and they miraculously all respond) how many hamburgers they've consumed in the past week. We'll sort those responses from least to greatest and then graph them with our box-and-whisker.

    Take the top 50% of the group (1,426) who ate more hamburgers; they are represented by everything above the median (the white line). Those in the top 25% of hamburger eating (713) are shown by the top "whisker" and dots. Dots represent those who ate a lot more than normal or a lot less than normal (outliers). If more than one outlier ate the same number of hamburgers, dots are placed side by side.
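
    For anyone who wants the numbers behind the picture, here is a minimal sketch of the summary a box plot draws. The hamburger counts are made up, and the cutoff for dots (1.5 times the height of the box, the usual rule of thumb) is an assumption, since the plot above doesn't state which rule it uses.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def find_outliers(values, k=1.5):
    """Flag values more than k box-heights (IQRs) beyond the box edges.

    k=1.5 is the common convention, assumed here for illustration.
    """
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical weekly hamburger counts for 12 respondents (invented data).
burgers = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 6, 15]

summary = five_number_summary(burgers)
outliers = find_outliers(burgers)
print("min, Q1, median, Q3, max:", summary)
print("outliers (drawn as dots):", outliers)
```

    The median splits the group in half, and Q1 and Q3 split each half again, which is where the four groups come from.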

    Find Skews in the Data

    The box-and-whisker of course shows you more than just four split groups. You can also see which way the data sways. For example, if there are more people who eat a lot of burgers than eat a few, the median is going to be higher or the top whisker could be longer than the bottom one. Basically, it gives you a good overview of the data's distribution.
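
    A rough way to check the sway without drawing anything is to compare the two whiskers. This is only a sketch: it assumes the whiskers run all the way to the smallest and largest values (no separate outlier dots), and the data are invented.

```python
import statistics


def whisker_lengths(values):
    """Return (bottom, top) whisker lengths, assuming whiskers reach the extremes."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1 - min(values), max(values) - q3


def sway(values):
    """Say which whisker is longer, as a quick hint about skew."""
    bottom, top = whisker_lengths(values)
    if top > bottom:
        return "top whisker longer"
    if top < bottom:
        return "bottom whisker longer"
    return "whiskers equal"


# Hypothetical burger counts: a few heavy eaters stretch the top whisker.
heavy_eaters = [0, 1, 1, 2, 2, 3, 3, 4, 6, 9]
print(whisker_lengths(heavy_eaters), sway(heavy_eaters))

# An evenly spread group for comparison.
print(sway([1, 2, 3, 4, 5]))
```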

    That's all there is to it, so the next time you're thinking of making a bar graph or a histogram, think about using Tukey's beloved box-and-whisker plot too.

    Want to learn more about making data graphics? Become a member.

  • Mapping Manhattan’s Skyscraper Districts Through Time

    February 14, 2008  |  Mapping

    Manhattan Timeformations looks like a series of interactive schematics from a video game, but really it's a computer model that allows you to look at the relationships between the developments of the lower Manhattan skyline and other urban factors like farms, urban renewal, subways, and commercial zones. The visualization provides different views in the form of the traditional 2-dimensional map views as well as rotations, fly-throughs, and layers.

    It's nice to step out of that Google mashup look every once in a while.

  • Spamology From Visualizar is Available for Exploration

    February 13, 2008  |  Data Art

    Spamology, by Irad Lee, was one of my favorite projects at the Visualizar Workshop, and it's now available online for others to play with. I talked about Spamology a little bit when the showcase was officially opened in Madrid, but the piece wasn't online yet.
    Continue Reading

  • Headed to Computational Journalism at Georgia Tech

    February 12, 2008  |  Announcements

    I'm headed to Journalism 3G: The Future of Technology in the Field, February 22-23.

    The spreadsheet, word processor, web browser, digital audio and video, blogs–each an example of the vaunted killer software application–have all become valuable, some would say essential, tools of journalism. Now Web 2.0 has forever altered the nature of software innovation, while at the very same time the news industry undergoes historic change. Those two points taken together mean one thing: Time lags which used to buffer innovations in computation from their inevitable impacts on newsrooms are poised to disappear. Who’s ready for this? We plan to see.

    Participating organizations include Digg, The New York Times, and Reuters, with some really interesting-looking panels over the two days:

    • Advances in News Gathering
    • Improving Journalism Workflow: Automation & Productivity
    • Social Computing and Journalism
    • Ubiquitous Journalism
    • Participant Journalism & Journalism Participation: Authoring and Interacting in New Media
    • Sensemaking & Visualization
    • Information Mashups: Aggregation, Syndication, and Web Services
    • 21st Century Editor in Chief

    Naturally, I'm most excited about Sensemaking & Visualization. Is anyone else planning on going?

  • A Lesson in Recycling Chartjunk as Junk Art

    February 12, 2008  |  Miscellaneous

    This guest post is by Kaiser Fung, from Junk Charts and Data Matter. He answers my question - "What is data and why should we care about it?"

    Who's got more data? The largest retailer in the world or the largest library in the world?

    Walmart tends to over 500 terabytes of data (see here, here, etc.) while the Library of Congress, largest according to the Guinness Book of World Records, has a paltry 20 terabytes, dwarfed by comparison.

    To hear it from data warehouse vendors, data mining academics, data savvy politicians, or data fixated citizens, Walmart versus the LOC is like New World versus Old World, the future versus the past, fast versus slow, wired versus tired.

    The more things change, the more they stay the same. The flood of data has not washed away these two age-old truisms.
    Continue Reading

  • Understanding Data, Not Just the Realm of Scientists in Ivory Towers

    February 11, 2008  |  Miscellaneous

    This guest post is by Hadley Wickham, a Statistics PhD candidate and a part of the GGobi team. He answers my question -- "What is data and why should we care about it?"

    For me, most data comes in the form of a data frame: a rectangular set of values with observations in rows and variables in columns. Most values are continuous (e.g. real numbers) or categorical (e.g. colours, treatments, subject ids), but are sometimes more esoteric (images, sounds, intervals). Each variable contains values of only one type and may also contain missing values. Missing values are particularly important for statisticians, and are often encoded as . or NA (encoding them as special numeric values, like 99, is generally a bad idea). Most data is "messy" and cleaning it up requires you to ensure that observations are in rows and variables in columns, as well as spending plenty of time to make sure that the values actually make sense (visualisation is really useful for this!).
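
    To see why a numeric stand-in like 99 is a bad idea, here is a small sketch. It is not from the guest post: the rows, variable names, and scores are all invented, and plain Python dicts stand in for a real data frame.

```python
from statistics import mean

# A tiny "data frame": each dict is one observation (row), each key a variable (column).
# All values are invented for illustration.
rows = [
    {"subject": "a", "treatment": "drug", "score": 12.0},
    {"subject": "b", "treatment": "placebo", "score": None},  # missing, marked explicitly
    {"subject": "c", "treatment": "drug", "score": 15.0},
    {"subject": "d", "treatment": "placebo", "score": 9.0},
]


def column(rows, name):
    """Pull one variable out of the rows."""
    return [row[name] for row in rows]


def mean_skipping_missing(values):
    """Average only the values that were actually observed."""
    return mean(v for v in values if v is not None)


scores = column(rows, "score")
honest_mean = mean_skipping_missing(scores)

# The same column with 99 used as a "missing" code: nothing stops it being averaged in.
coded_scores = [99 if v is None else v for v in scores]
distorted_mean = mean(coded_scores)

print("mean with missing value skipped:", honest_mean)
print("mean with 99 treated as data:", distorted_mean)
```

    An explicit missing marker forces you to decide what to do with the gap, while a numeric code quietly passes for a real measurement.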

    Data Helps Illuminate Patterns

    To me, caring about the message in data is the essence of science, where we perform some action on the world and record its response in our data. This isn't just the realm of scientists in ivory towers, but something that we do every day, whether it's trying to understand the impact of a new marketing campaign, figuring out which house to buy or exploring why a new cancer drug isn't working. Recording and examining the data that matters not only supports rational decision making, but also reveals the unexpected and helps illuminate underlying patterns.

  • Comparing Roger Clemens to Hall of Fame Pitchers

    February 11, 2008  |  Statistics

    Andrew had some comments about the graphs on Freakonomics that showed a seemingly odd "change of fortune" for Roger Clemens.

    Roger Clemens - NYT

    You can see that Clemens almost followed an opposite pattern from all other pitchers in the league. As Andrew notes though, there seems to be a lot riding on the quadratic fit and average values when we know that Clemens has been anything but ordinary throughout his long career.

    Graphing Without Smoothing

    For fun, I tried graphing the ERA data for Clemens against the ERAs for the 16 most recent hall of fame pitchers (that I could get data for). My thinking was the hall-of-famer performances might be a better indicator of what should be "normal" for great pitchers. The results are a little less compelling. However, one thing to note is that most players who played past age 40 saw an increase in ERA while Clemens had a pretty significant improvement in ERA from age 40 to 43.

    Whether this is due to performance-enhancing drugs or just a change in pitching strategy, coaching, or some other factor, I can't say. There are probably only a few people who can know for sure.

    Anyways, if anyone has a different take on the data, I'd love to hear it in the comments.

  • Weekend Minis – Design Paradigms, Colbert Bump, and Bullet Graphs

    February 9, 2008  |  Visualization

    Weekend Treats

    There Is No Single View... - Jock D. Mackinlay and Chris Stolte argue that there is no "holy grail" of data visualization, and that to truly understand our data, we need multiple graphical views.

    Seek or Show: Two Design Paradigms for Lots of Data - Ask a user what he wants or show him everything up front.

    The Colbert Bump is Real, Colbert’s Nation Not What He Thinks it is - An analysis to show the true effect on book sales after an appearance on The Colbert Report.

    Bullet Graphs for Not-to-Exceed Targets - A graphical widget becoming more popular in dashboards.

  • Showing Historical & Cultural Connections and Mapping Influence

    February 8, 2008  |  Miscellaneous

    This guest post is by Mike Love, and he answers my question -- "What is data and why should we care about it?"

    Instead of answering in the general case, I'd be better off trying to answer it for an area of my interest.

    Historical Connections

    I think cultural history can be presented as data, and that we could get some benefit out of standardizing some atomic properties of cultural history. There are a couple of good efforts at doing this: Artandculture.com is an "interconnected guide to the arts," where you can see what movement artists and others belonged to. The Knowledge Web is a project of James Burke of the television show 'Connections'. They are working to encode tens of thousands of historical connections into a database. I have been working on a similar dataset at the open database project Freebase. Each of these projects has moved beyond text (and hypertext) and into the realm of data.

    One seemingly trivial advantage of data over text or text with hyperlinks: you can specify that making a connection between person A and person B implies a connection in the reverse direction. This cuts the workload in half: Wikipedians entering relationships into an infobox in Wikipedia have to do twice the work of a person working in a database framework.
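
    The idea can be sketched in a few lines. This is only an illustration, not how Freebase or the Knowledge Web actually store their data; the ConnectionStore class and its methods are hypothetical names.

```python
from collections import defaultdict


class ConnectionStore:
    """Hypothetical store where one entry implies the reverse connection too."""

    def __init__(self):
        self._links = defaultdict(set)

    def connect(self, a, b):
        # A single call records both directions, so nobody enters the link twice.
        self._links[a].add(b)
        self._links[b].add(a)

    def connections(self, person):
        """List everyone connected to this person, in alphabetical order."""
        return sorted(self._links.get(person, set()))


store = ConnectionStore()
store.connect("Person A", "Person B")  # entered once, from A's side only

print(store.connections("Person A"))
print(store.connections("Person B"))
```

    With plain hyperlinked text, the second lookup would come back empty until someone edited Person B's page as well.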

    Apply Relationships

    Influence Graph

    The more exciting advantage is the kind of applications that are possible once you have settled on a set of relationships. The team working on the Knowledge Web built a graph browser which embeds historical figures in their century and draws lines between these figures. Mousing over a line brings up some descriptive information about the relationship. A team at Metaweb built a graph browser which pulls up pictures of historical figures and lays out their influences and influencees in a circle surrounding them. You can imagine filtering in other ways: show all the connections between artists and writers; show all the cross-cultural connections between China and Europe. You could plug historical data into a recommendation system as well.
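
    Filters like these become one-liners once the relationships are structured. The sketch below is an assumption about how such data might be laid out, with placeholder people rather than real historical figures; the Person and Influence types are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    field: str   # e.g. "artist" or "writer"
    region: str  # e.g. "China" or "Europe"


@dataclass(frozen=True)
class Influence:
    source: Person  # the one doing the influencing
    target: Person  # the one being influenced


def between_fields(links, field_a, field_b):
    """Keep links that join the two fields, in either direction."""
    wanted = {field_a, field_b}
    return [l for l in links if {l.source.field, l.target.field} == wanted]


def cross_region(links, region_a, region_b):
    """Keep links that cross between the two regions, in either direction."""
    wanted = {region_a, region_b}
    return [l for l in links if {l.source.region, l.target.region} == wanted]


# Placeholder people and links, invented purely for illustration.
painter = Person("Painter X", "artist", "Europe")
poet = Person("Poet Y", "writer", "China")
novelist = Person("Novelist Z", "writer", "Europe")
links = [Influence(poet, painter), Influence(novelist, painter), Influence(novelist, poet)]

artist_writer = between_fields(links, "artist", "writer")
china_europe = cross_region(links, "China", "Europe")
print(len(artist_writer), "artist-writer links;", len(china_europe), "China-Europe links")
```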

    There is nothing new about documenting cultural connections. There are many better, probably more reliable books that serve this purpose. (For Western history, I recommend Richard Tarnas' The Passion of the Western Mind, and Peter Watson's The Modern Mind.) But to design a dynamic interface to these books would require parsing the English language. Maybe we can do this too.

  • Increasing Data Literacy Across the General Public With Truth and Beauty

    February 7, 2008  |  Miscellaneous

    Matthew Hurst, from Microsoft Live Labs and the co-creator of BlogPulse, answers my question - "What is data and why should we care about it?"

    In writing this brief article, I tried to answer the following: what attracts me to data?

    An Abundant Resource

    Data is everywhere - from the streams of posts in the blogosphere to stock trading graphics spilling from news media to science projects in kindergarten. It permeates our modern world, and yet few of us are equipped to interpret it critically. More importantly, few of us are protected against the misuse and manipulation of the truth via data. Users of databases (who include the millions of users of search engines) are slowly but surely becoming exposed to more sophisticated views of data and thus the average data literacy will, hopefully, increase.

    Working in the field of data mining is very exciting at this time as it has the potential to truly impact the perception and understanding of the world-as-data. Sites like Swivel and Many Eyes are in some sense at the cutting edge of this progression, with major public databases (like search engines) nervously following their lead.

    A fundamental challenge in empowering users with data is the legacy of impoverished tools. Currently, one is required to make many low level interactions in order to synthesize a result required for a task. Consequently, the tools and infrastructure around data interactions have moved towards high volume, immediate response paradigms. However, the added value, increased accuracy and relevance of more sophisticated processes, and the additional investment required on the part of the user to learn how to consume and manipulate enhanced data displays come at a cost. To make the jump, users will have to be convinced of the value of enhanced interactions and displays, spend a little more time working with the data and so on.

    A Vector of Truth

    Data, if collected and analysed correctly, can support or refute our intuitions and beliefs. In addition, the analysis of data can hint at some very human structures such as those found in language and in the ways in which we conceptualize the world. Data may be used to help us understand our environment. By working with data, we can grasp better models of ourselves and our world.

    Beauty in Exploration

    Visualization is an essential tool for understanding data and drawing inferences from it. The last ten years of advances in computer performance and graphical displays have opened up the possibilities for displaying data in rich and dynamic ways. This has led practitioners down a dangerous path balanced between aesthetics - the visual impact and design of data display, and utility - the capability of a visualization to intuitively and efficiently assist the user. That being said, the aesthetics of data visualization can play a huge part in attracting users to the topic being visually described, to encourage them to ask 'it's pretty, but what is it?' Hopefully the answers to that question will lead to better understanding on all fronts.

  • Speed Dating Data – Attractiveness, Sincerity, Intelligence, Hobbies

    February 6, 2008  |  Data Sources

    In their paper Gender Differences in Mate Selection: Evidence from a Speed Dating Experiment, Fisman et al. had a bit of fun with a speed dating dataset. Here's what they found:

    Women put greater weight on the intelligence and the race of partner, while men respond more to physical attractiveness. Moreover, men do not value women's intelligence or ambition when it exceeds their own. Also, we find that women exhibit a preference for men who grew up in affluent neighborhoods. Finally, male selectivity is invariant to group size, while female selectivity is strongly increasing in group size.

    The dataset is substantial, with over 8,000 observations covering answers to twenty-something survey questions. With questions like How do you measure up? and What do you look for in the opposite sex?, this dataset is definitely high on human element and should be fun to play with.

    [via Statistical Modeling]

  • Data Makes Reasonable Decision-making Possible

    February 6, 2008  |  Miscellaneous

    This guest post is by Andrew Gelman from Statistical Modeling, Causal Inference, and Social Science. He answers the question - "What is data and why should we care about it?"

    Good data are better than bad data, but worst of all are data whose quality you can't assess. Beyond this, we want to use statistical methods that allow us to combine data from many sources. I'm comfortable with regression and multilevel models, but other methods are out there too. In any case, we have to care about our data because inferences and decisions are just about always data-based, implicitly if not explicitly. Being the person in the room with the hard data gives you authority, as well it should.

  • Tap Into the Wisdom of Crowds, Make Money by Predicting Future Events

    February 5, 2008  |  Social Data Analysis

    Predictify takes James Surowiecki's The Wisdom of Crowds to heart. Surowiecki argues that when certain factors are present (for example, group diversity), then the group is always smarter than the individual. Predictify has turned this "principle" into a money-making platform.
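
    Here is one small piece of that idea you can check yourself. This is a sketch with made-up guesses, not how Predictify scores anything: it compares the error of the crowd's average guess with the average error of the individual guessers. The crowd's average can never do worse than the typical individual, although a lucky individual can still beat it.

```python
from statistics import mean


def crowd_vs_individuals(guesses, truth):
    """Return (error of the averaged guess, average error of individual guesses)."""
    crowd_error = abs(mean(guesses) - truth)
    typical_error = mean(abs(g - truth) for g in guesses)
    return crowd_error, typical_error


# Hypothetical guesses at a quantity whose true value is 100 (invented numbers).
guesses = [70, 85, 95, 110, 130]
truth = 100

crowd_error, typical_error = crowd_vs_individuals(guesses, truth)
print("crowd's average guess is off by:", crowd_error)
print("a typical individual is off by:", typical_error)
```

    The overestimates and underestimates cancel when averaged, which is why diversity in the group matters.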
    Continue Reading

  • May the Data Be With You, Young Skywalker

    February 4, 2008  |  Miscellaneous

    In response to my question, "What is data and why should we care about it?" - Zach Gemignani from Juice Analytics answered:

    Obi-Wan Kenobi could have been speaking about data in businesses when he said: "It's an energy field created by all living things. It surrounds us, and penetrates us. It binds the galaxy together."

    Data is the residue of every action and interaction that takes place in a company, with customers, and in the marketplace. Businesses have created complicated and effective nets to capture this data as it flies off in all directions. Unfortunately, mountains of data mean nothing. Like young Luke Skywalker's inability to control The Force, a company's inability to make use of data is nothing more than frustration and untapped potential.

    Making use of data takes a subtle combination of capabilities. It takes experience and context about the business, speed and skill to manipulate data, and an ability to visualize and communicate results. Data in the wrong hands is useless if not dangerous; in the right hands data can transform into new insights and informed decisions.

  • What is Data and Why Do We Care About it So Much?

    February 4, 2008  |  Miscellaneous

    I've been fortunate to have worked with people from lots of different fields - statistics, ecology, computer science, engineering, design, etc. If I've learned anything, it's that everyone has a different idea of what data is and why it matters.

    I've found that until I've understood what my collaborators mean by data and what they (and I) are trying to get out of a dataset, it's nearly impossible to get anything useful done.

    To make things a bit more clear (and for my own enjoyment), I asked a select group of people a single question:

    What is data and why should we care about it?

    Those who responded are from different areas of expertise, ranging from statistics, to business, to computer science, to design. Some names you'll recognize while others will be new to you. All are doing interesting things with data.

    I've been looking forward to this series for a couple of weeks now, and my hope is that you will gain a better understanding about what data is and how people are putting it to use. Keep an eye out for posts with the black square image above.

    Here is who has answered so far:

    If you'd like to answer the question yourself, I'd love to see your response too, or if you write an answer on your own blog, please do post the link in the comments below.

  • Who’s Going to Win Super Bowl XLII?

    February 3, 2008  |  Statistics

    I just put down $20 on today's game for the New York Giants to cover the 12-point spread. Of course, knowing me, I got to thinking how that betting line is decided. Is there one person who calculates the spread? Do Las Vegas casinos just put up numbers based on past experiences? I did a little bit of research, and here's what I found.
    Continue Reading

  • Weekend Minis – Government, Environment & Angry Employee

    February 2, 2008  |  Data Sources

    FedStats - Provides access to the full range of official statistical information produced by the Federal Government, including population, education, crime, and health care.

    MAPLight - A detailed database that brings together information on campaign contributions and votes in the California legislature. Check out the video tour.

    EarthTrends - A collection of information regarding the environmental, social, and economic trends that shape our world.

    Angry Employee Deletes All of Company's Data - A woman about to "lose" her job goes to the office at night and deletes 7 years' worth of data. Can we say backup, please?

  • Bad Statistics Leads to Poor Results and a Questionable Trial Verdict

    February 1, 2008  |  Mistaken Data

    Peter Donnelly talks about the misuse of statistics in his TED talk from a couple of years back. The first two-thirds of the talk is an introduction to probability and its role in genetics, which, admittedly, didn't hold much of my interest. The last third, however, gets a lot more interesting.

    Donnelly talks about a British woman who was wrongly convicted in large part because of a misuse of statistics. A so-called expert cited how improbable it would be for two children to die of sudden infant death syndrome, but it turns out that "expert" was making incorrect assumptions about the data. This doesn't surprise me since it happens all the time.

    Lesson Learned

    People misuse statistics every day (intentionally and unintentionally), and oftentimes it doesn't hurt much (which doesn't make it any better), but in this case improper use directly affected someone's life in a very big way. One of the most common assumptions I see is that every observation is independent, which often is not the case. As a simple example, if it's raining today, does that change the probability that it will rain tomorrow? What if it didn't rain today?
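
    To put rough numbers on the rain example, here is a sketch with invented probabilities (they are not real weather figures, and they have nothing to do with the figures in the trial). It compares the chance of two rainy days in a row under the independence assumption with the chance when today's weather affects tomorrow's.

```python
# Hypothetical probabilities, chosen only to illustrate the point.
P_RAIN_AFTER_RAIN = 0.6  # chance of rain tomorrow if it rained today
P_RAIN_AFTER_DRY = 0.2   # chance of rain tomorrow if it was dry today


def long_run_rain_probability(p_after_rain, p_after_dry):
    """Overall share of rainy days implied by the two conditional probabilities."""
    return p_after_dry / (1 - p_after_rain + p_after_dry)


p_rain = long_run_rain_probability(P_RAIN_AFTER_RAIN, P_RAIN_AFTER_DRY)

# Wrong: treats the two days as independent and just squares the probability.
two_days_if_independent = p_rain ** 2

# Right (under these assumed numbers): uses the chance of rain given rain the day before.
two_days_actual = p_rain * P_RAIN_AFTER_RAIN

print(f"chance of rain on any given day: {p_rain:.3f}")
print(f"two rainy days, assuming independence: {two_days_if_independent:.3f}")
print(f"two rainy days, allowing for dependence: {two_days_actual:.3f}")
```

    With these numbers, squaring the probability understates the chance of back-to-back rain by almost half, and the gap only grows as the events get rarer and more strongly linked.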

    In other words, the next time you're thinking of making up or tweaking data, don't; and the next time you need to analyze some data but aren't sure how, ask for some help. Statisticians are nice and oh so awesome.

    Here's Donnelly's talk:

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.