Jeffrey Heer et al. write in Design Considerations for Collaborative Visual Analytics about a couple of models for social visualization -- the information visualization reference model and the sensemaking model. The former is a simpler, more straightforward pipeline: raw data -> processed data -> visual structures -> actual visualization. The latter is a bit more complicated, with similar stages but with feedback loops. My main reflections weren't so much about the ideas proposed by the paper. Rather, I'm more interested in what was not mentioned -- not only in this paper but in other social data analysis papers.
I love to look at how the current week's movies are doing at the box office. I'm not really sure what it is. I think it's kind of like a gauge for what good movies are out; or maybe I'm just constantly amazed by the millions of dollars that movies make; or I think it could be my addiction to numbers?
Something that always strikes me as interesting is how movies are always breaking records at the box office. So and so movie just broke the record for most money made over a single weekend or a month or a long holiday weekend or for a Thursday when there was at least 2 inches of rain and a dog skateboarded two miles.
I took a look at the 25 highest grossing American films, adjusted for inflation. I'm so tired of hearing statistics for money comparisons over time that don't adjust for inflation. Wow, gasoline prices are at an all time high. Well guess what -- so are milk, bread, burgers, televisions, light bulbs, paper, cars, and everything else on the planet. Sorry, slight tangent.
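Since I'm ranting about it, the adjustment itself is just a ratio of price levels. Here's a minimal sketch, assuming you have some price index (CPI or similar) for the release year and for today. The function name and all of the numbers are made up for illustration; they are not real index values or real box office figures.

```python
def adjust_for_inflation(amount, index_then, index_now):
    """Convert a past dollar amount into today's dollars using a price index ratio."""
    return amount * index_now / index_then


# Hypothetical values for illustration only, not real CPI or box office data.
nominal_gross = 100_000_000   # gross in release-year dollars
index_release_year = 50.0     # price index in the release year
index_today = 200.0           # price index today

adjusted_gross = adjust_for_inflation(nominal_gross, index_release_year, index_today)
print(f"${adjusted_gross:,.0f} in today's dollars")
```

With prices four times higher than in the release year, the old gross counts for four times as much. That's why the raw record-breaking headlines don't mean much on their own.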
Download the Wallpaper
As an early birthday gift to you, here are my results in wallpaper form:
The movie titles are color coded for genre and the higher grossing films are in a larger font. Drama and action/adventure clearly dominate -- The hills are alive. Luke, I am your father. Phone home. I'll never go hungry again.
Surprisingly (at least to me), only 7 of the top 25 films won the Oscar for best picture and of the top 50, only 9 won best picture.
With the start of a new year, it only seems right to open with John Tukey and his work with interactive graphics. In 1972, when computers were giant and screens were green, John Tukey came up with PRIM-9, the first program to use interactive dynamic graphics to explore multivariate data. PRIM-9 allowed picturing, rotation, isolation, and masking. In other words, PRIM-9 allowed users to see multivariate data from different angles and identify structures in a dataset that might otherwise have gone undiscovered (kind of like the more recent GGobi).
To fully appreciate the revolutionary nature of PRIM-9 one has to view it against the backdrop of its time. When Statistics was widely taken to be synonymous with inference and hypotheses testing, PRIM-9 was a purely descriptive instrument designed for data exploration. When statistics research meant research in statistical theory, employing the tools of mathematics, the research content of PRIM-9 was in the area of computer-human interfaces, drawing on tools from computer science. When the product of statistical research was theorems published in journals, PRIM-9 was a program documented in a movie.
John W. Tukey's Work on Interactive Graphics. The Annals of Statistics, Vol. 30 No. 6. 2002.
Luckily, you can appreciate Tukey's work here at the ASA video library. It's even more amazing when you consider where computers and technology were back then. Who knows where Statistics would be if it weren't for Tukey and his brilliance and creativity. I can't imagine, or maybe I just don't want to.
Tukey was someone who truly understood data -- structure, patterns, and what to look for -- and because of that, he was able to create something amazing.
I stumbled across this article about Aili Malm, a GIS specialist (I think) who uses social network analysis to find the most probable locations of organized crime.
"I look at where organized crime groups are located and I study how these groups are linked to one another," she explained. "I can chart their cell phone use or e-mail communication or with whom they co-offend. Based on these connections, I try to isolate the important players. Then I take the social and make it spatial. I look at individuals important to the criminal network and map where they live and where they commit their crimes."
It's just like that show Numb3rs on CBS. Granted, the math and statistics are a bit glorified on the show, but hey, at least it's loosely based on reality.
Baseball (or all sports for that matter) statistics are all over the place. You can easily find data for pretty much whatever sport and for whichever player you want at any given time. The problem is that if you want to download all of the data at once, you usually have to write a script and do some parsing. Who wants to do that? I don't.
For our humanflows visualization, we used data from the United Nations Common Database and the Migration Information Source. The great thing about these types of sources is that they are publicly available so that everyone gets to have fun with the data. The downside is that the data is accessible via a user interface that often makes it a chore to get all of the data.
Hence, to save you some time, you can now download the migration database that we used. I don't see any reason why you have to go through the whole data importing process when we already did it. Enjoy!
Disclaimer: Keep in mind that the data is from the United Nations and Migration Information Source, so you should refer to the two sites for any documentation. In a nutshell, the inflows table is from MIS and the rest is from United Nations. If you're looking for more, you might also want to check out OECD. I really wanted to use their data at the time, but was having trouble accessing it from Spain.
Studies on names and performance seem to be all the rage right now:
We like our names. And that preference can have negative repercussions, according to research published last month. Major leaguers with "K" initials tend to strike out more, perhaps reflecting the batters' unconscious pull to appear next to the strikeout symbol "K" on scorecards. Students with initials C and D have worse grades than the A's and B's and everyone else, gravitating toward the grades their initials represent.
Of course, I'm a little skeptical about all of these studies, and with tiny effects like 0.02, these studies probably deserve it. In any case, they're still interesting to read about. I wonder how one could get their hands on such data. The data's probably just an email away, but in my current half-asleep stupor, I'll leave that for another time. I'm sure it'd be really interesting to play with though.
Have you read Freakonomics? If you have, all of these name studies remind me of that chapter about the two brothers named Winner and Loser. If you haven't read the book, uh, there's a chapter on two brothers named Winner and Loser.
Most are familiar with the Netflix Prize. If you're not, Netflix has offered a one million dollar prize to whoever improves the accuracy of their movie recommendation system by a certain amount. It's been going on for a little over a year with still no grand prize winner. The dataset is 100 million ratings.
The above is a visualization of the Netflix dataset. Each dot represents a movie, and the closer two dots are the more similar the two corresponding movies are based on Netflix ratings. I'm guessing the orientation of the dots was decided by some variant of multidimensional scaling.
It's kind of fun to scroll over the clusters. Like in the bottom right we see Babylon 5, Buffy the Vampire Slayer, Alias, and Battlestar Galactica clumped together. The giant blob in the middle, however, is pretty useless; it'd probably benefit from some zoom functionality.
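To make my multidimensional scaling guess a little more concrete, here's a rough sketch of classical MDS using NumPy. To be clear, I don't know what the visualization's author actually did. The `classical_mds` function and the tiny dissimilarity matrix are my own made-up illustration, not anything derived from the Netflix data.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Place points in `dims` dimensions from a symmetric distance matrix."""
    n = dist.shape[0]
    # Double-center the squared distances to get an inner-product matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the eigenvectors that go with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dims]
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale


# Made-up dissimilarities between four hypothetical movies (0 = identical).
# Movies 0 and 1 are rated alike, and so are movies 2 and 3.
dist = np.array([
    [0.0, 1.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 4.0],
    [4.0, 4.0, 0.0, 1.0],
    [4.0, 4.0, 1.0, 0.0],
])

coords = classical_mds(dist)
for movie, (x, y) in enumerate(coords):
    print(f"movie {movie}: ({x:.2f}, {y:.2f})")
```

Movies that are rated alike end up as neighboring dots, which is the clumping you see with the sci-fi shows. The hard part with the real data is coming up with a sensible dissimilarity between two movies in the first place, and this sketch skips that entirely.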
The Need to Explore
I'm kind of surprised that I haven't seen more Netflix visualizations like this (or ones better than this), because I'm pretty sure it would help see some relationships that typical analysis won't provide. I was browsing the forum and saw someone ask if others had had success loading the 100 million observation dataset into R. Silly undergrad.
A computer scientist, designer, and statistician walk into a bar; they discuss how they would boost the Netflix recommendation system. The punchline is that they win a million dollars, but I'm not sure what happens in between.
The New York Times recently put up a cool data exploration tool to sift through the transcript of the most recent Republican debate. They call it the transcript analyzer. There are three key features:
- View where candidates put in their two cents, indicated by the blue, highlighted rectangles
- Read the actual chunks of transcript for each block
- Search the transcript to see when specific words and phrases were used, indicated by the smaller gray highlighted rectangles
My particular favorite is the search feature because it really allows readers to dig into the transcript. A reader can find out which candidate is (or isn't) talking about his or her point of interest and when in the debate the topic was discussed. The intuitive text scrolling is pretty awesome too. Good job, New York Times!
[via Jon Udell]
I'm staying in a hostel here in Madrid and am currently in the "Internet Room." I'm on my laptop, but there are six desktop computers in front of me, all of which are occupied. Three of the six people have Facebook open, and so do I. It's come to the point that Facebook has so many ways to share information that almost everyone can find some use for it. Is there a way to share data in a similarly social way?
I know there's some data blogging available and a few social data sites, but they don't have the same feel as Facebook. I think the main reason people like Facebook (other than an entertaining way to waste a few hours) is because they personally relate to the information displayed and there's some kind of connection between friends and strangers.
We all know fast food is incredibly bad for us and yet we still eat it. Why? Because it has tons of fat and tastes delicious. Never mind that we will die a few days earlier for every French fry we eat.
Over at Calorie Counter, they try to make us feel guilty with numbers. Check out the Carl's Jr. Double Six Double Dollar Burger with 1,520 Calories and a delicious 111 grams of fat. I'm a little surprised that it beat out the Burger King Triple Whopper with cheese. I shudder just thinking about eating one of those.
Anyways, there's a whole lot of numbers here but not an incredible amount of meaning. How bad is bad? How much fat should I consume per day? Is 111 grams of fat bad? If yes, how will it directly affect me? Yes, 111 grams of fat is bad for you. You will directly feel the effects as you sit on the toilet in the morning wondering why it is taking you so long to take a dump. Now that's context.
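For a slightly less graphic kind of context, here's a quick back-of-the-envelope sketch. It uses the usual approximation of 9 calories per gram of fat. The 78-gram daily reference is an assumption on my part (roughly what a 2,000-calorie nutrition label uses), so swap in whatever number applies to you. The function name is just something I made up.

```python
KCAL_PER_GRAM_FAT = 9  # standard approximation for the energy in a gram of fat


def fat_context(total_kcal, fat_grams, daily_fat_grams):
    """Return (share of calories from fat, share of a daily fat budget)."""
    fat_kcal = fat_grams * KCAL_PER_GRAM_FAT
    return fat_kcal / total_kcal, fat_grams / daily_fat_grams


# Burger numbers are the ones quoted above; the daily reference is an assumption.
daily_reference_grams = 78
share_of_calories, share_of_day = fat_context(1520, 111, daily_reference_grams)

print(f"{share_of_calories:.0%} of the burger's calories come from fat")
print(f"{share_of_day:.0%} of the assumed daily fat budget in one burger")
```

Putting the 111 grams next to a reference point is the whole trick. A number only starts to mean something once there's something to compare it to.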
Also, with all the numbers, I bet all the tables would benefit from some kind of chart or, at the least, a simple infographic. Any takers? We should have a contest for who can make fast food the least appealing using nutritional data and without bending the truth.
When I talk about data, people often zone out or don't really see the interest. Why does this happen? People just don't understand the wonder that is data and how much of their life is led by data. With that in mind, why would people share their data? You can't share something you don't know exists. Off the top of my head, here's 100 reasons to be interested in, want to share, and get excited about data.
On Last.fm, someone took snapshots of some Linkin Park songs, compared them, and concluded that all Linkin Park songs look the same. I guess at a glance, the songs might appear the same because of the dark chunk toward the middle left, but it kind of stops there. Sure, there's some loud to soft and soft to loud alternation, but who likes songs that are loud (or soft) throughout?
The beginning of the post:
Each image above shows the audio level in (roughly) the first 90 seconds of a Linkin Park song. The tempo has been adjusted for a few tracks for better visual alignment.
Wait a minute. The tempo was adjusted for better visual alignment? If you're adjusting the tempo, then really, all songs can be made to look the same. On top of that, we don't know the x-axis or y-axis units. Finally, there's a lot more to a song other than dynamics -- such as key, tempo, rhythm, and lyrics.
I just found this in my draft folder from a while back. It's kind of old news, but I think it's still worth mentioning.
Gun control advocates failed to win access to gun sales data for local governments and law enforcement agencies.
The House Appropriations Committee defeated two attempts by gun control advocates to strip four-year-old restrictions on the use of information from Bureau of Alcohol, Tobacco, Firearms and Explosives tracing gun sales. The votes were a victory for the National Rifle Association and came despite the Democratic takeover of Congress in January.
One side argues that gun sales data will help law enforcement agencies track gun dealers who sell guns illegally. The other side argues that there's privacy at stake, and there's a chance that police officers' identities could be inferred. A big victory for gun rights advocates, or so the article might suggest.
My opinion -- even if gun sales data were given to law enforcement, how could anyone guarantee data integrity? I think it's fair to say that dealers selling guns illegally aren't going to provide accurate reports. Sell a gun under the table with cash, don't report it, and the data doesn't reveal much. Am I missing something here?
Technology Innovations in Statistics Education (TISE) is a new e-journal that was just announced yesterday. The use of technology (e.g., data visualization) has become extremely important in teaching statistical concepts to newbies, and so this new journal will be really useful; computers have allowed students to explore and experiment in ways they couldn't with just paper and pencil. TISE explores these alternatives.
Technology Innovations in Statistics Education (TISE) publishes scholarship on the intersection between technology and statistics education. The current issue includes papers by George Cobb (who challenges the introductory statistics curriculum to radically innovate to adapt to new technology), Beth Chance et al. (who provide an overview of the use of technology to improve student learning), William Finzer et al. (who describe software innovations for improving student access to data), Dani Ben-Zvi (preliminary research results on using Wiki in statistics teaching), Daniel Kaplan (on the role of computation in introductory statistics), and Andee Rubin (an historical overview of technology in statistics education).
These papers can be read at http://tise.stat.ucla.edu. Please click on the "subscribe" button to join the mailing list to be informed of future releases.
TISE is seeking scholarly papers for Volume 2 that address any of these themes:
- Designing technology to improve statistics education
- Using technology to develop conceptual understanding
- Teaching the use of technology to gain insight into and access to data
The first issue is already online. Take a look. I've had the opportunity to work with some of the knowledgeable and active members of the editorial board, so TISE looks to be very promising.
Raw, fine-grain data is still a bit hard to come by. Summary statistics (i.e. data that came from some analysis), on the other hand, are often easy to find. A lot of the time the data is already online or just a simple phone call away.
The National Center for Education Statistics, a part of the U.S. Department of Education, offers a bunch of data including, but not at all limited to, poverty and math achievement, average science scores overall and by grade level, and quantitative literacy.
I've never really been interested in baseball. I've always been more of a basketball and football fan. However, my summer roommate was a die-hard baseball fan, and I'm convinced that he brainwashed me into rooting for the New York Mets. Just a couple of weeks ago, someone told me he was a Phillies fan, and I let out a blech of disgust without even thinking about it.
So with the Mets' most recent loss, I'm a bit disgruntled, and I'm sure my old roommate is pissed as can be. The Mets are no longer leading the Phillies for the number one spot in the NL East.
What better way to see how poorly the Mets are playing than with a graphic? I decided to compare this year's Mets season with the 1986 Mets World Series winning season, because that should probably be what they're shooting for. As my roommate would angrily exclaim, "If they can't get their #%&$ act together, they don't deserve to go to the playoffs!"
I saw this map of the average snow levels in Buffalo. I think I just glanced at it and that was about it. When you first look at the map, what do you make of the colors? When I see green for snow levels, I think no snow. Am I crazy? What do you think?
So the image was kind of in my head all this summer while I was in NYC. When I told people that I was going back to Buffalo after my internship, they always gave this look that said, "Ha, have fun during the winter," and then they would actually say it and then go into how they measure the snow level by comparing it against a giant pole.
When I tell people that I'm a graduate student in Statistics, there are two responses that I get more than any others. The most popular of the two usually goes something like this.
Oh man, I hated statistics in college. The professor totally sucked and I never knew what was going on. All I remember is mean and some... curve thing? I don't know. What's standard deviation anyways?
I threw that standard deviation bit in for effect. No one actually asks about it, and I'm pretty sure most people don't even remember ever hearing about it. It's that whole selective memory thing -- blocking out the bad and remembering only the happies.
So anyways, every time someone tells me they absolutely hated statistics in college, I die a little inside and start bawling like a two-year-old who's lost her bottle. No, no, I'm kidding, but the first thing I think is, "Gee, thanks for letting me know that! Like I really wanted to know that you hate what I study. You know what? I think I hate you a little bit now." I'm exaggerating a tad, but it's slightly frustrating after hearing it so many times.
But why do so many people hate statistics?
In a previous life, I thought anything published in an academic journal was legit, but as a stat student, the story is quite the opposite. Whenever I hear results or see data from some study, I become an instant skeptic.
Were there really that many deaths from 1998 to 2007? Did housing prices really increase that much over the past decade? Do that many people really support that presidential candidate?