Continuing the Many Eyes fun, Aron Pilhofer has put up part 2 of his original post. I was pleased to see the first post get 56 comments, but I think part 2 might have gotten lost in the high post frequency, with the U.S. Open in full swing. Still worth a look though.
Twitter Blocks is now available for viewing and use in their new exploration section. The viz is built in Flash and is supposed to let you explore your neighbors as well as your neighbors' neighbors. I think the higher up a block sits, the more recent it is, but it's hard to say. Other than that, I'm not really sure what I'm looking at. I thought it might be because I don't follow that many people, but I viewed the blocks for the public timeline and still had trouble deciphering them. Maybe others will have better luck.
Update: Michal posted about the feedback they've been getting on Twitter Blocks, and it's certainly worth reading:
So we get this a lot: "Beautiful! But useless!". We've heard it in response to most projects we've done over the past few years (one exception has been Oakland Crimespotting, whose stock yokel response is: "no way am I moving to Oakland!").
This kinda surprises me. I think their other projects are pretty useful and informative.
I'm not even going to pretend I know anything about how Statistics and vision go together. That's not to say they don't go together, because they do. Otherwise there wouldn't be a whole center at UCLA, the Center for Image and Vision Science, made up of statisticians, computer scientists, and psychologists. There's lots of modeling involved, lots of data, and lots of applications, from security to medical imaging to assisting the visually impaired.
With that said, I came across Face of the Future, which was set up by a computer science group at the University of St. Andrews. They have a face transformer, an averager, a morpher, and face detection. You can upload your own images for the transformer and averager. (The averager wasn't working when I tried it.) The transformer does some image processing on your face, and from there you can see what you might look like as a baby, a teenager, an older adult, and as different races. Fun stuff. I would show all the pictures from my little experiment, but they're kind of creepy.
On a somewhat related note: have you ever wondered what you look like as a Simpsons character? Well now you can see for yourself. Burger King and The Simpsons have joined forces to provide you with the Simpsonizer. Undoubtedly, there's some image processing and statistics flowing around in that black box. My Simpsons character actually looks quite a bit like me.
I don't want my credit card numbers floating around, because then I'd be screwed. That kind of data needs to be locked up tight behind a billion firewalls, a locked safe, five armed guards, another locked safe, and then one more guard plus another safe. However, there are lots of other kinds of data that should be online and publicly available, or at least accessible via a phone call.
The well-known college rankings are now available for your viewing pleasure. Whether the ranking system is legit or not, I'll let you be the judge, but I think everyone should take note that UC Berkeley was again the number one ranked public national university and UCLA was ranked number three. Go Calee-forn-ee-ah! In a nutshell, here's how U.S. News weighs the universities:
- Peer Assessment - 25%
- Retention - 20% for national universities and liberal arts colleges; 25% for master's and baccalaureate colleges
- Faculty Resources - 20%
- Student Selectivity - 15%
- Financial Resources - 10%
- Graduation Rate Performance - 5%; national universities and liberal arts colleges only
- Alumni Giving Rate - 5%
I wonder how much bias is in peer assessment.
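For the curious, the weighting scheme above amounts to a simple weighted sum. Here's a minimal sketch in Python; the component scores are invented for illustration, and only the weights come from the list above (using the national-university version, where retention counts 20%):

```python
# Hypothetical sketch of a weighted ranking score using the U.S. News
# criteria listed above. Only the weights come from the post; the
# component scores and the 0-100 scale are made up for illustration.

weights = {
    "peer_assessment": 0.25,
    "retention": 0.20,
    "faculty_resources": 0.20,
    "student_selectivity": 0.15,
    "financial_resources": 0.10,
    "graduation_rate_performance": 0.05,
    "alumni_giving_rate": 0.05,
}

def overall_score(component_scores):
    """Weighted sum of per-criterion scores (each on a 0-100 scale)."""
    return sum(weights[k] * component_scores[k] for k in weights)

# Invented scores for a fictional school that rates 80 on everything.
# Since the weights sum to 1, its overall score is also 80.
scores = {k: 80.0 for k in weights}
print(overall_score(scores))
```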
I began my path of higher education at Berkeley as an Electrical Engineering and Computer Science student. Now that I'm a stat graduate student, it's hard to remember sitting in all of those (boring) engineering classes.
If I learned anything, though, it was from the painful computer science projects. No matter how big the project, I would start by breaking it into lots of mini-tasks and work my way up to the final solution. I think this has helped me a lot, not only in grad school, but in solving problems in my life. Hence, my first attempt at continuous data collection starts at a very basic level -- my pedometer.
While doing research on the process of rebuilding New Orleans after Hurricane Katrina and the U.S. Army Corps of Engineers, I've run across a close and knowledgeable watcher of the New Orleans rebuild: Robert Bea. I don't know much about him except that he seems like a very nice man. I found this on his Berkeley homepage:
The world needs engineers who....
- whose truth cannot be bought,
- whose word is their bond,
- who put character and honesty above wealth,
- who do not hesitate to take chances,
- who will not lose their identity in a crowd,
- who will be as honest in small things as in great things,
- who will make no compromise with wrong,
- whose ambitions are not confined to their own selfish desires,
- who will not say they do it "because everybody else does it,"
- who are true to their friends through good report and evil report, in adversity as well as in prosperity,
- who do not believe that shrewdness and cunning are the best qualities for winning success,
- who are not ashamed to stand for the truth when it is unpopular, and
- who have integrity and wisdom in addition to knowledge.
Please help me to be this kind of engineer.
This can certainly be applied to statisticians as well. Please help me be that kind of statistician.
UPDATE: Just did some back and forth email with Professor Bea. He IS a nice man.
UPDATE: I found the essay! Programmers Need To Learn Statistics Or I Will Kill Them All by Mr. Zed Shaw
There was this online essay that I read by a guy in the computer science/electrical engineering field who totally loves statistics. He read textbooks and truly spoke like someone who respects data. I thought I had bookmarked it, but now I have no clue where the heck it is. Argh :(. If anyone knows who I'm talking about, please tell me!
He worked at a company where everyone thought they "knew" statistics. Automated reports would give them numbers, and they'd trust them fully. That was statistics to the computer engineers: crunch some numbers and see what the software spits out. As a result, these engineer-types really pissed off the author of the article.
In her TED talk, Emily Oster challenges our conception of AIDS and suggests other covariates that we need to look at (e.g. export volumes of coffee). Until we get out of the mindset that poverty and health care are the only causes/predictors of AIDS, we won't be able to find the best way to fight the disease. Another great use of data.
I do have one small itch to scratch, though. Emily had a line plot showing export volumes and another line, on the same grid, showing HIV infections, both over time. It reminds me of the plots Al Gore uses with carbon dioxide levels and temperature. Anyways, using the plot, Emily suggests a very tight relationship between export volumes and HIV infections. Isn't export volume pretty tightly tied to poverty? I don't know. She's the economist, so she would know (A LOT) better than me. I guess I just wish she had talked a little bit about the new and different data that compels us to change our conceptions.
As Jon Udell has mentioned, there's a ton of data online, but we often can't find it; it's hidden in the deep, dark basement of some website. He has proposed that people bookmark public datasets on del.icio.us under the tag "publicdata". I think this is a great idea. In turn, you can subscribe to the feed at the URL http://del.icio.us/tag/publicdata.
I've actually been doing this for a while, but I had just been tagging with "data". So I'm going to join the party and start tagging with publicdata, and I hope others will too. Until sites like Many Eyes and Swivel get more wind beneath their wings, I think it's necessary.
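If you wanted to do something programmatic with the tag feed, a sketch like this could work. To be clear about the assumptions: the sample XML below is made up, and del.icio.us tag feeds may use a different RSS flavor than the plain RSS 2.0 shape parsed here, so check the real feed before relying on these field names.

```python
# A minimal sketch of pulling bookmarked datasets out of a tag feed.
# In practice you'd fetch the feed URL (e.g. with urllib) instead of
# using the hard-coded sample string below.

import xml.etree.ElementTree as ET

def parse_bookmarks(rss_text):
    """Return (title, link) pairs from an RSS 2.0-style feed string."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]

# Made-up sample feed, standing in for the real publicdata tag feed:
sample = """<rss version="2.0"><channel>
  <item><title>CDC WONDER</title><link>http://wonder.cdc.gov</link></item>
  <item><title>Census data</title><link>http://www.census.gov</link></item>
</channel></rss>"""

for title, link in parse_bookmarks(sample):
    print(title, "->", link)
```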
First off, on the initial pass of my parsing script, I accidentally cut the month range short, so I didn't get any data for December from 1980 to 2005. It should be noted that these plots don't show this missing data. There are no axes or labels either. Sorry, I got a little lazy, but that's not the point anyways.
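For what it's worth, the missing-December bug is the classic off-by-one with an exclusive upper bound. My actual script looked different, but the mistake boils down to something like this in Python:

```python
# Python's range() excludes its upper bound, so looping months with
# range(1, 12) silently stops at November. This is just an illustration
# of the kind of off-by-one that cost me December -- not my real script.

buggy_months = list(range(1, 12))   # 1..11 -- December (12) is dropped
fixed_months = list(range(1, 13))   # 1..12 -- the full year

assert 12 not in buggy_months
assert 12 in fixed_months
assert len(fixed_months) == 12      # sanity check before crunching data
```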
It's easy to see how Statistics got this bad rap, because it's so easy to lie with data, charts, and graphs. Sometimes it's on purpose -- someone might try to present "good" results that actually suck. Sometimes it's accidental -- someone might have misread or not read the documentation that came with the data. In the case of Swivel's most recently featured graph, it was the latter. A case of mistaken identity, so to speak.
The data about doping tests in sports came from here. The graph on Swivel would have you believe that the data represent the number of doping cases found in each sport; however, according to the USADA report, the data are actually the number of tests the association conducted in and out of competition during the first quarter of this year. The report contains no data on the USADA's findings.
What We Learn
What can we learn from this? It's great to visualize data, but you have to be careful. Read the documentation. Find out what the data is about, because without context, the visualization or any findings are practically useless. Statistics isn't about lying. In fact, it's the exact opposite. Statistics came about, and exists today, to reveal the truth.
The Times reported a "Sharp Rise Seen in Applications for Citizenship" today, and of course there was a graphic to complement the article, showing the rise in applications over the years as well as a by-country breakdown for 2006.
Graphics in The Times always cite the source, which in this case was the Department of Homeland Security. I thought, "Do they have some kind of source who they actually call to get this data?" I feel pretty dumb for thinking that now. In fact, I always see that source on the graphics, and I had just assumed there was some connection between The Times and the source.
So lazy me finally decided to look into things, and you know what, the Department of Homeland Security has a whole section on their website for Immigration Statistics. There are freely available spreadsheets, reports, publications, and even a little something on data standards and definitions, prepared by none other than the Office of Immigration Statistics. Very pleased.
It's kind of sad that this is just now news to me, but better now than never, eh?
Swarm Theory, by Peter Miller, talks about how some animals, as individuals, aren't smart, but as a group or a swarm, they can do amazing things. The above is a flock of starlings that can change shapes even though no single bird is the leader.
Can we apply swarm theory to social data analysis? As individuals, we might not be able to hold onto or understand a dataset, but as a group, we can come at a dataset from different perspectives, look at very small parts, and then as an end result -- extract real, worthwhile meaning.
That's how swarm intelligence works: simple creatures following simple rules, each one acting on local information. No ant sees the big picture. No ant tells any other ant what to do. Some ant species may go about this with more sophistication than others. (Temnothorax albipennis, for example, can rate the quality of a potential nest site using multiple criteria.) But the bottom line, says Iain Couzin, a biologist at Oxford and Princeton Universities, is that no leadership is required. "Even complex behavior may be coordinated by relatively simple interactions," he says.
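Just to make the "simple local rules" idea concrete, here's a toy simulation of my own (not a model from the article): each agent repeatedly adopts the majority opinion among itself and its two neighbors on a ring. No agent sees the big picture, yet the group settles into a coordinated pattern.

```python
# Toy illustration of leaderless coordination: each agent acts only on
# local information (itself and its two neighbors on a ring), yet noisy
# opinions smooth out into coherent blocks. Purely illustrative.

def step(opinions):
    """One round: each agent takes the local majority of its 3-agent neighborhood."""
    n = len(opinions)
    return [
        1 if opinions[(i - 1) % n] + opinions[i] + opinions[(i + 1) % n] >= 2 else 0
        for i in range(n)
    ]

opinions = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]  # noisy starting opinions
for _ in range(10):
    opinions = step(opinions)

print(opinions)  # -> [1, 1, 1, 0, 0, 0, 0, 1, 1, 1], a fixed point
```

The point isn't the particular rule; it's that the final pattern is a stable state no individual agent computed or could even see.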
It reminds me of that common saying, or maybe it's a quote, about how if you put a bunch of monkeys in a room with typewriters, they'll eventually type out the works of Shakespeare through the magic of probability. While the whole monkey thing is a bit far-fetched, swarm theory is certainly worth my attention.
I was flipping through the channels the other night and came across a televised CitiStat meeting for June 1. A bit of a coincidence, since I happened to be looking at the CitiStat website earlier that day. What's CitiStat, you ask? Well, it's like a spin-off of CompStat, a program in NYC and LA that holds police officials accountable for their actions by looking at data -- number of homicides, where they happened, what's being done, etc. CitiStat, in Buffalo, is the same thing, but for the Police, Fire Department, and whatever else they can think of, and it's seemingly not quite as reputable.
Anyways, they were talking to some city official about fire department employees who were IOD, um, that's injured on duty (but I must've heard "IOD" like a billion times). There was some discrepancy over the definition of IOD, and as a result, the data was worthless. The police commissioner spoke as well, with his own IOD numbers. After that there was a lot of arguing, and as a result, a meeting was agreed upon. Well, not really. They agreed that they would schedule some meeting, but it's been a year of "What is an IOD?" Pretty sure that won't be settled for a while.
They were also able to agree that the number of IODs was somewhere between 50 and 200. Yay.
So despite the fact that the CitiStat program is two years old, there's still a lot to be done. Officials aren't used to recording and looking at data, and it's clear few had any notion that data could be useful. However, I am glad they're making the effort -- even if all of the data is stored in a bunch of inconsistent Excel spreadsheets :P.