• [Image: Stata logo] For those interested in Stata or already using it, the first Stata Users Group meeting on the West Coast is coming up on October 25-26. It’s $150 for both days, and students get a good discount at only $50. I’m an R user myself, but to each his own.

    Stata Users Group meetings started in Britain in 1995 and have spread to Italy, Sweden, Germany, The Netherlands, Spain, Australia, and the East Coast. Talks are intended to be accessible to a general audience with mixed levels of expertise in Stata and statistics. Stata developers will also attend, both to present new Stata features and to take notes during the popular “Wishes and grumbles” session. We hope you will consider joining the meeting as a presenter or an attendee.

  • When I tell people that I’m a graduate student in Statistics, there are two responses that I get more than any others. The most popular of the two usually goes something like this.

    Oh man, I hated statistics in college. The professor totally sucked and I never knew what was going on. All I remember is mean and some… curve thing? I don’t know. What’s standard deviation anyways?

    I threw that standard deviation bit in for effect. No one actually asks about it, and I’m pretty sure most people don’t even remember ever hearing about it. It’s that whole selective memory thing — blocking out the bad and remembering only the happies.

    So anyways, every time someone tells me they absolutely hated statistics in college, I die a little inside and start bawling like a two-year-old who’s lost her bottle. No, no, I’m kidding, but the first thing I think is, “Gee, thanks for letting me know that! Like I really wanted to know that you hate what I study. You know what? I think I hate you a little bit now.” I’m exaggerating a tad, but it’s slightly frustrating after hearing it so many times.

    But why do so many people hate statistics?

  • [Image: Border-Crossing Deaths] With a stricter border patrol, more Mexican illegal immigrants are taking dangerous routes to get into the United States. As a result, treks through the dehydrating Arizona desert have caused a significant number of deaths. Most likely there are more deaths than this graph indicates, because the data covers only deaths reported by the Border Patrol. There could very well be cases the Border Patrol did not handle or know about.

    This graph was straightforward, mainly a waiting game for data, first from the Government Accountability Office and then from the Border Safety Initiative. Take a look at the GAO report done last year, which reports a doubling of border-crossing deaths from 1995 to 2005. It’s a little odd, though, that they use numbers from two different sources, so take it with a grain of salt.

  • I actually did this graphic some time last month for the Week in Review. Slipped through the cracks somehow. It was a slightly different experience doing a graphic for this desk, because, well, I guess they don’t request graphics very often.

    As an aside, I just realized that old Times links are behind that silly TimesSelect thing, which kind of sucks. I hear TimesSelect is going to be free sometime in the near future though. Good.

  • In a previous life, I thought anything published in an academic journal was legit, but as a stat student, the story is quite the opposite. Whenever I hear results or see data from some study, I become an instant skeptic.

    Were there really that many deaths from 1998 to 2007? Did housing prices really increase that much over the past decade? Do that many people really support that presidential candidate?

  • The greatest value of a picture is when it forces us to notice what we never expected to see.

    — John W. Tukey. Exploratory Data Analysis. 1977.

    Love it. Great words from the father of exploratory data analysis. Have an excellent weekend.

  • [Image: Housing Burden] ArcGIS can do a lot for you in terms of speeding up the mapping process, which is great, but here’s my dilemma: do I really want to put in all the time to figure out how to use the software?

    I think the basics are good enough for me, and anything further than that, I’ll let a mapping expert take over. However, I know that spatial analysis is something I’m going to pursue, so… I’m really back and forth.

    On the one hand, ArcGIS has a lot of functions, but on the other hand, it’s not especially easy to use all those functions. For example, I was doing a join between two data tables, but it wasn’t working at first because the column on one table didn’t have leading zeros (e.g. 1 instead of 01). By “not working” I don’t mean that columns weren’t joining. I mean that I couldn’t select this column and that column to join by, so I couldn’t even get to the step where I knew I had to change something. It’s little things like that that bug me and make me think that ArcGIS is inflexible.
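
    The leading-zero mismatch above is the kind of thing that is easier to fix before the data ever reaches the GIS software. Here is a minimal sketch in plain Python, not ArcGIS code: the function names, the `fips` column, and the `pct_burdened` values are all made up for illustration. It pads both keys to the same width and then joins on them.

```python
def pad_key(value, width=2):
    """Return a join key as a zero-padded string, e.g. 1 -> '01'."""
    return str(value).strip().zfill(width)


def join_on_padded_key(left_rows, right_rows, key, width=2):
    """Inner-join two lists of dicts on `key` after padding both sides."""
    # Index the right-hand table by its normalized key.
    right_by_key = {pad_key(row[key], width): row for row in right_rows}
    joined = []
    for row in left_rows:
        normalized = pad_key(row[key], width)
        match = right_by_key.get(normalized)
        if match is not None:
            merged = {**match, **row}   # left-hand values win on conflicts
            merged[key] = normalized    # keep the padded form of the key
            joined.append(merged)
    return joined


# Hypothetical tables: one stores the code as text, the other as a number.
states = [
    {"fips": "01", "name": "Alabama"},
    {"fips": "06", "name": "California"},
]
burden = [
    {"fips": 1, "pct_burdened": 0.25},  # placeholder value, not real data
    {"fips": 6, "pct_burdened": 0.40},  # placeholder value, not real data
]

joined = join_on_padded_key(states, burden, key="fips")
```

    Normalizing both sides to strings means the join no longer depends on how each table happened to store the code.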

    Plus, it sure does like to crash.

    I don’t know.

    I probably just need more experience. How about this. I’ll just learn what I have to, but I’m not going to go out of my way to become an ArcMap expert. Yeah, that sounds OK to me.

    And on that note, here’s the map I made. Color scale was the main thing I had to fuss with. Too many shades of gray lead to a muddled graphic in the paper even if it looks fine on screen. The map shows the percentage of people who spend 30% or more of their household income on housing. Of course, California leads the way.
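
    For the record, the number behind each state’s shade is a simple share. A rough sketch of the calculation, assuming one (income, housing cost) pair per household; the function name and sample figures are hypothetical, not the data used for the map:

```python
def housing_burden_share(households, threshold=0.30):
    """Share of households spending at least `threshold` of income on housing.

    `households` is a list of (annual_income, annual_housing_cost) pairs.
    Households with no positive income are left out of the calculation.
    """
    counted = [(income, cost) for income, cost in households if income > 0]
    if not counted:
        return 0.0
    burdened = sum(1 for income, cost in counted if cost / income >= threshold)
    return burdened / len(counted)


# Made-up sample: two of the four households are above the 30% line.
sample = [
    (50000, 20000),  # 40% of income
    (60000, 12000),  # 20% of income
    (40000, 14000),  # 35% of income
    (80000, 16000),  # 20% of income
]
share = housing_burden_share(sample)
```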

  • [Image: John Snow Cholera Map] If you’ve read any books on visualization, without a doubt, you’ve seen John Snow’s now-famous cholera map. In 1854, people were dying in large numbers and high frequency, but nobody knew what was going on. John Snow solved the mystery with his map.

    It’s crazy to imagine a time when people didn’t think to map data, especially now that mapping data has become second nature for some. Steven Johnson, author of The Ghost Map, goes into depth on the cholera outbreak in London in his book and in his TED talk earlier this year.

    I’d embed it, but I can’t find the link anywhere on the TED page. They probably had to make it less obvious after Hans Rosling’s talk spread at the speed of Cholera in London in 1854. London hasn’t had another outbreak since Snow’s simple (for this day and age) but effective visualization.

    UPDATE: Here’s Steven Johnson’s TED talk

  • The above picture isn’t totally related, but I just had to put it up. It’s so amusing. A family of five plus groceries on one motorcycle! I think there’s room for one more on the handlebars.

    So, in an effort to make the above picture relevant…

    If I’ve learned anything during my internship, it’s how to display as much information as possible in a small amount of space. Two things have helped me in trying to achieve New York Times graphics department worthiness:

    • Decide what data / information is important
    • K.I.S.S. — Keep it simple, stupid. (The Office, Thursdays on NBC)

    Decide What Data is Important

    When you get a large data set, your first impulse might be to show all of it. For some cases, like exploratory data analysis (EDA), this is what you want. However, when you’re trying to show off results or display some kind of idea, then you might not need to point to all 100,000 values in your data set. Instead, evaluate all the data you have and then ask yourself what interesting thing in the data you’re trying to show.

    Keep it Simple

    Once you’ve established what the point is, make sure your graphic draws attention to that point. Don’t clutter with giant labels or overly bright colors that overpower your graphic’s main idea. For example, if you look at a bar graph, I don’t think the labels should be the first thing you notice. Rather, you should notice the bars, the real meat of the graphic, first and then recognize the labels second.

    Oh, and don’t forget about white space.

    Super busy graphics are just plain hard to read. Let the data breathe.

    I guess my main point is that you can try to display as much information as possible in a small amount of space, but if you’re not careful and put in too much, your motorcycle will tip over. See what I did there with the whole motorcycle idea? You know, full circle. Circle of life. Hakuna matata. Oh forget it.

  • Two more graphics — one ran on Sunday with a story investigating lifeguard competence and the other went yesterday with a story on religious books (or lack thereof) in prison libraries. Probably the most challenging part of both graphics was figuring out what to show; there wasn’t exactly a ton of data to choose from.

    Less than Satisfactory Lifeguarding

    I knew this was running on Sunday, but when I checked online, I didn’t see it. I was a little disappointed, because it kind of sucks to make a graphic and then find out it was killed. Luckily, that hasn’t happened to me yet. Knock on wood. My lifeguard graphic wasn’t on the Web, but it was in the paper.

    Lifeguards and Drownings at Beaches and Pools

    The graphic started as just small squares, but the results looked like they were missing something. It just looked like 32 tiny, shaded squares. They needed more context, so I highlighted incidents in which there were some serious lifeguard screw-ups. I think the excerpts make the graph a lot more human. What do you think?


  • Chris Jordan’s series, Running the Numbers: An American Portrait, just opened this weekend in Los Angeles at the Paul Kopeikin Gallery. Chris depicts large numbers in a way that we can see, because oftentimes, big numbers are hard to imagine. For example, he recreates Georges Seurat’s famous painting, A Sunday Afternoon on the Island of La Grande Jatte, in the form of 106,000 aluminum cans, the number used in the US every thirty seconds. Other pieces show the number of plastic bags used every five seconds (60,000) and the number of brown paper supermarket bags used every hour (1.14 million).

    If you’re in the area, it should definitely be worth going to. I wish I could. As Chris notes, it’s one of those series that you have to see in person to get the full effect. The sheer size of each piece allows you to feel the largeness of it all.

  • The Wall Street Journal put up a nice little graphic showing the evolution of the iPod along with Apple’s stock price. Semi-informative, I guess. Probably more of a fun graphic than anything else. I think it’s slightly misleading, suggesting the iPod was the only reason Apple’s stock changed. Let’s not forget about the iBook, iMac, MacBook, etc. releases. Nevertheless, it’s cool to see Apple’s sexy design over the years.

    [link via Core77]

  • [Image: Iraq Senate Voting] For what seems like forever, Democrats have been trying to get Republicans to agree to some kind of timeline to pull troops out of Iraq.

    On the surface, the graphic seems pretty straightforward, but the research took me forever. I had to look through past Times articles to find suitable lead ups to the actual bill being proposed. We were looking for something specific like another version of the proposed bill. In retrospect, I’m not quite sure why it took so long. Maybe because it took me a while to pin down just exactly what direction I wanted to take it. Anyways, once I got the background info, it was just a short time of the boss whizzing through Illustrator hot keys and tada, we had our graphic.

  • As the second day of the New York taxi strike begins over GPS and credit card technology, I’m left wondering what taxi drivers are making such a big fuss over. First, drivers are complaining that GPS is an invasion of privacy, and second, they argue that credit card transactions will cause a decrease in profits due to credit card fees.

    Starting with the credit card transactions, I’m about 80% sure that drivers don’t have any actual data to back up their claims that they’re going to start making less money. Non-strikers say that the credit card capability will not only help business (by bringing in those with corporate credit cards), but also increase tips. This information comes from cabs that are already equipped with the proper gizmos.

    What are taxi drivers trying to hide? What is this invasion of privacy talk? These drivers are working for a large company. I repeat, they’re working. I don’t demand a private office when I’m at work, and I don’t see much reason drivers should care a whole lot. If someone is slacking, taking shady routes, or just plain doing something they’re not supposed to do, then they should be held accountable. Unless I’m mistaken, I don’t recall a whole lot of whining when San Francisco cabs had similar equipment installed.

    So stop the fuss, and just modernize up to the proper century, New York cab drivers. I’m sure Stamen Design and Cabspotting* would greatly appreciate it.

    *I am not associated with either.

  • Order From Randomness Data Browser

    Order From Randomness has an extensive data collection featuring 360 variables describing all 50 states. The indicators are placed in 25 groups including birth rates, death rates, disease, environment, energy, nutrition, and education.

    Most of the data seems to range somewhere between 1999 and 2005, and I believe there are four variables that run to 2007. There’s also a simple data browser featuring a distribution curve and some summary statistics. Generally, students seem to like the extensive set of variables, says one of my professors.

  • With more Many Eyes fun, Aron Pilhofer put in part 2 of his original post. I was pleased to see the first post get 56 comments, but I think part 2 might have gotten lost due to the high post frequency, with the U.S. Open fully on. Still worth a look though.

  • [Image: Ribbon Graph] Karl Broman has an amusing list of the top ten worst graphs found in academic papers.

    One of them, very sadly, was actually from the Journal of the American Statistical Association, a very prominent statistical journal. It just goes to show that some have an eye for data, and others might have an eye for visualization, but one doesn’t necessarily lead to the other. Don’t forget to read the discussion on why the graphs are um, not so good, so that we can all learn from the mistakes of those before us.

    My personal favorite is the 3-d ribbon graph, because it’s just so ugly. Why would anyone use that? Too many shades of gray mixing, too many lines crossing, too many dimensions. Brain overload.

    I guess the graph was made in 1994, so I could cut the authors some slack….

    No, they’re just bad. I was making way better graphs in Excel by that time for my seventh grade science fair project: What Cereal do Red Flour Beetles (Tribolium castaneum) Prefer?

    Look what you’ve done, Microsoft Excel. Apologize for what you’ve done this very minute.

    Oh, and they preferred Cheerios and stayed away from the Grape Nuts.

  • Five of my graphics ran in the paper today in a special Labor Day weekend segment, What to Expect When You’re Electing. For the past few days, I and those I talked to have been referring to them as the Labor Day graphics, so I was surprised to see them go today. Nice Sunday treat.

    [Image: Gallup Poll] The first graphic changed form a few times. It went from a bubble chart to a stacked bar and then to the pies. An editor quickly pointed out that the bubble chart indicated that the percentages were separate, but they should be represented as parts of a whole. Good point, so I toyed around with a stacked bar chart, but it just didn’t look right, given the allotted space. Hence, the pie charts. I’m not a big pie chart fan, but this one seems to work for me.



    [Image: What They Raised] This graphic shows how much money the candidates have raised, have spent, and have on hand. Its stacked bar chart base was fairly straightforward. However, it’s the styling and organization that took the most time, as is often the case. I’ve come to learn that it’s very easy to make a graph, but it’s the styling and organization that really make a graphic worthy of being in the paper.



    [Image: Early Contest Calendar] Other than the fact that the calendar is changing from day to day and the whole primary versus caucus stuff is kind of confusing, this graphic was pretty straightforward. I put in shades of gray to make things more readable.



    [Image: Candidates’ Internet Market Share] I thought this presidential Web site data from Hitwise was pretty interesting. Based on estimates, we can see what presidential Web sites are getting the most traffic. The tricky part was getting the wording right for the headline and lead-in so that readers would know what the percentages meant.



    [Image: Mega Primary Voting] Pledged Delegates looks very straightforward, but it actually took the most time out of all five graphics. The construction was simple, but finding the correct numbers took time. Schedules are changing, the definition of a pledged delegate is different by state, and the whole nomination process is fuzzy. Nevertheless, towards the end of Friday, some somewhat reliable numbers came in.

    That’s all. It was fun putting this group of graphics together. I got to learn about the nomination process and most importantly, learned more about style and organization. Good stuff.

    As I sat at my desk this week, working on these things (and one other coming soon), I thought to myself, “I can’t believe I’m getting paid to do this. This is too entertaining.” You know, this whole internship has never really felt like work, which I think is a good sign that I’m headed in the right direction towards data visualization.

  • On their new exploration section, Twitter Blocks is available for viewing and use. The viz is in Flash and is supposed to allow you to explore your neighbors as well as your neighbors’ neighbors. I think the higher up the blocks are, the more recent they are. It’s kind of hard to say. Other than that, I’m actually not really sure what I’m looking at. I thought it might be because I’m not following that many people, but I viewed the blocks for the public timeline and still had trouble deciphering them. Maybe others will have better luck.

    Update: Michal posted on the feedback they’ve been getting on Twitter Blocks that’s certainly worth reading:

    So we get this a lot: “Beautiful! But useless!”. We’ve heard it in response to most projects we’ve done over the past few years (one exception has been Oakland Crimespotting, whose stock yokel response is: “no way am I moving to Oakland!”).

    This kinda surprises me. I think their other projects are pretty useful and informative.

  • I’ve been back and forth on whether or not I wanted to post about this. Two reasons: I feel blasphemous feeling this way; and I’m not sure if I’m working for or against my hopes for data awareness. I also think I might be getting some mild form of carpal tunnel. Ow.

    I’m a graduate student in Statistics, and I don’t like Swivel. Why? How is that even possible? All of my work revolves around data, I blog about flowing data, and I read about data. So why can’t I force myself to enjoy the “tasty data treats for data geeks” offered by Swivel?