  • Most underrated films

    May 6, 2014  |  Data Sources

    Rotten Tomatoes film ratings

    Ben Moore was curious about overrated and underrated films.

    "Overrated" and "underrated" are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly-lauded by the Hollywood press but proved disappointing for the real audience would be "overrated" and vice versa.

    Through the Rotten Tomatoes API, he found data to make such a comparison. Then he plotted one score against the other, along with a quick calculation of the difference between the percentage of official critics who liked a film and the percentage of the Rotten Tomatoes audience who did. The most underrated: Facing the Giants, Diary of a Mad Black Woman, and Grandma's Boy. The most overrated: Spy Kids, 3 Backyards, and Stuart Little 2.

    The plot would be better without the rainbow color scheme and with a simple reference line through the even-rating diagonal. But this gets bonus points for sharing the code snippet to access the Rotten Tomatoes API in R, which you can generalize.
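
    The difference calculation described above is simple enough to sketch. This is not Moore's R code; it is a minimal Python version that assumes you already have critic and audience percentages in hand. The film titles and numbers are made up for illustration.

```python
# Hypothetical ratings (percent who liked each film); not real Rotten Tomatoes figures.
films = {
    "Film A": {"critics": 90, "audience": 55},
    "Film B": {"critics": 20, "audience": 80},
    "Film C": {"critics": 70, "audience": 72},
}

def rating_gaps(ratings):
    """Return (title, critics minus audience) pairs, most overrated first."""
    gaps = [(title, r["critics"] - r["audience"]) for title, r in ratings.items()]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)

gaps = rating_gaps(films)
most_overrated = gaps[0]    # largest positive gap: critics liked it more
most_underrated = gaps[-1]  # largest negative gap: audience liked it more
print(most_overrated, most_underrated)
```

    A gap near zero puts a film on the even-rating diagonal, which is why a reference line there would help when reading the plot.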

  • Create a barebones R package from scratch

    May 6, 2014  |  Coding

    While we're on an R kick, Hilary Parker described how to create an R package from scratch, not just to share code with others but to save yourself some time on future projects. It's not as hard as it seems.

    This tutorial is not about making a beautiful, perfect R package. This tutorial is about creating a bare-minimum R package so that you don’t have to keep thinking to yourself, "I really should just make an R package with these functions so I don't have to keep copy/pasting them like a goddamn luddite." Seriously, it doesn't have to be about sharing your code (although that is an added benefit!). It is about saving yourself time. (n.b. this is my attitude about all reproducibility.)

    I need to do this. I've been meaning to wrap everything up for a while now, but it seemed like such a chore. Sometimes I even go back to my own tutorials for copy and paste action. Now I know better. And that's half the battle.

  • R for cats and cat lovers

    May 6, 2014  |  Coding

    Programmer cat

    Following the lead of JavaScript for Cats by Maxwell Ogden, Scott Chamberlain and Carson Sievert wrote R for Cats. It's a playful introduction to R intended for those who have little to no programming experience.

    The bulk of it so far is a primer on data structures, and there's a little bit on functions and some dos and don'ts. It's stuff you should know before you get into more advanced tutorials.

    Mainly though: ooo look, kitty.

    Once you're done with that (it only takes about 30 minutes), there are lots of other resources for getting started with R.

  • Hip hop vocabulary compared between artists

    May 5, 2014  |  Statistics

    hip hop vocab

    Matt Daniels compared rappers' vocabularies to find out who knows the most words.

    Literary elites love to rep Shakespeare's vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words and arguably had the largest vocabulary, ever.

    I decided to compare this data point against the most famous artists in hip hop. I used each artist's first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake.

    As two points of reference, Daniels also counted the number of unique words in the first 5,000 used words from seven of Shakespeare's works and the number of uniques from the first 35,000 words of Herman Melville's Moby-Dick.

    I'm not sure how much stock I would put into these literary comparisons though, because this is purely a keyword count. So "pimps", "pimp", "pimping", and "pimpin" count as four words in a vocabulary, and I have a hunch that variants of a single word are more common in rap lyrics than in Shakespeare and Melville. Again, I'm guessing here.

    That said, although there could be similar issues within the rapper comparisons, I bet the counts are more comparable.
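
    To make the keyword-count concern concrete, here is a rough Python sketch. It is not Daniels' method, just an assumption about how a unique-word count over the first N words might look. The `crude_stem` helper is a made-up, deliberately naive suffix stripper, not a real stemmer.

```python
import re

def unique_words(text, limit=35000):
    """Count distinct lowercase tokens among the first `limit` words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words[:limit]))

def crude_stem(word):
    """Very rough suffix stripping, for illustration only."""
    for suffix in ("ing", "in", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

sample = "pimps pimp pimping pimpin"
raw_count = unique_words(sample)                                  # variants counted separately
stemmed_count = len({crude_stem(w) for w in sample.split()})      # variants collapsed
print(raw_count, stemmed_count)
```

    The gap between the two counts is the part I'd want to see before comparing rappers to Shakespeare.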

  • Your mobility at various times during the day

    May 2, 2014  |  Mapping

    Isoscope

    Isoscope, a class project by Flavio Gortana, Sebastian Kaim and Martin von Lupin, is an interactive that lets you explore mobility around the world.

    We drive to the closest supermarket, take the bike to the gym or walk to the cafe next door for a nice chat among friends. Getting around — thus mobility — is an essential part of our being. We were especially intrigued by those situations when our mobility is compromised such as in traffic jams or during tough driving conditions. How do those restrictions impact our journeys through the city and who is affected most? Obviously, a car can hardly bypass a traffic jam, whereas a bike is more flexible to continue its journey. Let alone the pedestrian who can stroll wherever he wants to. Isoscope tries to answer the questions above by comparing different means of transport and their sensitivity for disturbances.

    Similar in flavor to the commute maps before it, Isoscope is a bit different in that it focuses on specific time frames, such as Fridays at 8am. Using data from the HERE API, a travel polygon is estimated for each hour of the selected day. Your initial result is an abstract blot overlaid on a map, but you can then use the menu to change days and highlight hours.

  • The size of Game of Thrones dragons compared

    May 1, 2014  |  Infographics

    Release the dragons!

    Because Game of Thrones. Max Fleishman and Fernando Alfonso III for The Daily Dot compared the size of dragons in various shows and movies, from Mushu to Toothless to Smaug to Balerion. The tiny black dot in the bottom-left corner is a person.

    See also the size comparison of science fiction starships, Pixar characters, and everything else.

  • Views of white Americans

    May 1, 2014  |  Statistical Visualization

    Views of White Americans by Amanda Cox at NYT

    In light of the Donald Sterling brouhaha, Amanda Cox for The Upshot put up some charts showing why you shouldn't be surprised that people still say racist things, based on data from the General Social Survey dating back to 1972. Mmhm.

  • Hiding a pregnancy from advertisers

    May 1, 2014  |  Statistics

    You probably remember how Target used purchase histories to predict pregnancies among their customer base (although, don't forget the false positives). Janet Vertesi, an assistant professor of sociology at Princeton University, made sure that sort of data didn't exist during her nine months.

    First, Vertesi made sure there were absolutely no mentions of her pregnancy on social media, which is one of the biggest ways marketers collect information. She called and emailed family directly to tell them the good news, while also asking them not to put anything on Facebook. She even unfriended her uncle after he sent a congratulatory Facebook message.

    She also made sure to only use cash when buying anything related to her pregnancy, so no information could be shared through her credit cards or store-loyalty cards. For items she did want to buy online, Vertesi created an Amazon account linked to an email address on a personal server, had all packages delivered to a local locker and made sure only to use Amazon gift cards she bought with cash.

    The best part was that her modified activity—like purchasing $500 worth of Amazon gift cards in cash from the local Rite Aid—set off other (in real life) triggers.

  • Interactive visualization used as music video

    April 30, 2014  |  Data Art

    Music visualization from George and Jonathan

    George & Jonathan used an interactive audio visualization for their recent album George & Jonathan III. This is a fun one. You can rotate the camera as you like, as the full album plays and notes are represented with dashes and dots.

  • Combatting the Obsession with New Tools

    April 29, 2014  |  Software

    Michal Migurski thinks about finding the right job for the tool rather than the other way around:

    Near the second half of most nerd debates, your likelihood of hearing the phrase "pick the right tool for the job" approaches 100% (cf. frameworks, rails, more rails, node, drupal, jquery, rails again). "Right tool for the job" is a conversation killer, because no shit. You shouldn't be using the wrong tool. And yet, working in code is working in language (naming things is the second hard problem) so it's equally in-bounds to debate the choice of job for the tool. "Right tool" assumes that the Job is a constant and the Tool is a variable, but this is an arbitrary choice and notably contradicted by our own research into the motivations of idealistic geeks.

    Along the same lines, Frank Chimero on not trying any new tools for the year and how each represents someone's perspective:

    Everything that's made has a bias, but simple implements—a hammer, a lever, a text editor—assume little and ask less. The tool doesn't force the hand. But digital tools for information work are spookier. The tools can force the mind, since they have an ideological perspective baked into them. To best use the tool, you must think like the people who made it. This situation, at its best, is called learning. But more often than not, with my tools, it feels like the tail wagging the dog.

    These approaches apply well to analysis and visualization. In the early going especially, there tends to be an obsession with what tools to use. Which is best? Which is fastest? Which can handle the most data? Which makes everything beautiful? And yeah, it's good to give these some thought in the beginning, but don't get stuck asking so many questions or pondering so many scenarios that you never settle down and do actual work.

    There's always going to be a new application that promises to help you do something with your data. Work on this stuff long enough and you'll find that you probably won't need that new thing.

  • Learn regular expressions with RegExr

    April 29, 2014  |  Online Applications

    RegExr

    Learning regular expressions tends to involve a lot of trial and error and can be confusing for newcomers. RegExr is an online tool that lets you learn more interactively. Add a body of text in one area and type various regular expressions in another. Matches are highlighted and errors are noted on the fly, which is kind of perfect. Even if you aren't new to regular expressions, this is worth bookmarking for later.
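
    If you want the same try-a-pattern, see-what-matches loop outside the browser, a small Python sketch gets you part of the way. This has nothing to do with how RegExr works internally; `find_matches` is a made-up helper built on Python's standard `re` module, and the sample text is just a couple of dates.

```python
import re

def find_matches(pattern, text):
    """Return (start, end, matched_text) for every match, or an error message."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Report a bad pattern instead of crashing, like an on-the-fly error note.
        return f"invalid pattern: {exc}"
    return [(m.start(), m.end(), m.group()) for m in compiled.finditer(text)]

text = "Posted April 29, 2014 and May 6, 2014"
dates = find_matches(r"[A-Z][a-z]+ \d{1,2}, \d{4}", text)  # month, day, year
broken = find_matches(r"(\d+", text)                        # unbalanced parenthesis
print(dates)
print(broken)
```

    The start and end offsets are what a tool would use to highlight each match in the text.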

  • Lawmaking through the House and Senate visualized

    April 28, 2014  |  Network Visualization

    Legislative Explorer

    Researchers at the University of Washington's Center for American Politics and Public Policy built the Legislative Explorer to show the lawmaking process in action. The visualization allows you to watch over 250,000 bills and resolutions introduced from 1973 to present.

    The left half represents the U.S. Senate, with senators sorted by party (blue=Democrat) and a proxy for ideology (top=liberal). The House is displayed on the right. Moving in from the borders, the standing committees of the Senate and House are represented, followed by the Senate and House floors. A bill approved by both chambers then moves upward to the President’s desk and into law, while an adopted resolution (that does not require the president's signature) moves downward.

    Each dot represents a bill, so you can see them move through the levels. Use the drop-down menus at the top to focus on a Congress, a person, party, topic, and several other categorizations. Or use the search to focus on specific bills. Finally, when you do press play, be sure to keep an eye on the counters on the bottom.

  • Detecting and Plotting Sequence Changes

    Change detection for a time series can be tricky, but guess what: there's an R package for that. Once you have the results, you can show them in a custom plot.

  • Audio visualizer made with matrix of fire

    April 25, 2014  |  Data Art

    The Pyro Board is a matrix of 2,500 flames that have controllable intensity, which can be used as an audio visualizer. Yeah, really. Just watch the video below.

    [via Colossal]

  • Detailed map of baseball fandom

    April 24, 2014  |  Mapping

    Baseball fandom in SoCal

    For the past couple of sports seasons, Facebook mapped the most liked team by county. They did it for football (NFL), the NCAA basketball tournament, and baseball (MLB). Although generalized, the maps provide a view of sports fandom and people clusters across the country, and plus, you know, they're fun.

    The Upshot used the same like data, provided by Facebook, and mapped it at the ZIP code level. Then they took it a step further and looked closer at regional rivalries, such as Cubs and White Sox, Yankees and Red Sox, and Dodgers and Angels. Be sure to scroll down to Mets versus Phillies. They incorporated a tidbit of Josh Katz's dialect map.

    The Upshot is off to an impressive start. It's almost as if The New York Times people have been doing this for a while. [via @KevinQ]

  • PourOver allows filtering of large datasets in your browser

    April 24, 2014  |  Software

    The New York Times released PourOver, a library that lets you do database-like things client-side, so that (1) you, the developer, can worry less about database optimization and server loads and (2) users get a more responsive, faster experience.

    PourOver is built around the ideal of simple queries that can be arbitrarily composed with each other, without having to recalculate their results. You can union, intersect, and difference queries. PourOver will remember how your queries were constructed and can smartly update them when items are added or modified. You also get useful features like collections that buffer their information periodically, views that page and cache, fast sorting, and much, much more.
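
    The composable-query idea is easier to see with a toy example. The sketch below is not PourOver's API (PourOver is a JavaScript library); it is plain Python using built-in sets, with made-up items, to show how query results that are stored once can be combined without re-running the filters.

```python
# Hypothetical items; a "query" here is just the set of matching item ids.
items = [
    {"id": 1, "section": "sports", "year": 2013},
    {"id": 2, "section": "sports", "year": 2014},
    {"id": 3, "section": "politics", "year": 2014},
]

def query(predicate):
    """Evaluate a filter once and keep the matching ids as a set."""
    return {item["id"] for item in items if predicate(item)}

sports = query(lambda item: item["section"] == "sports")
recent = query(lambda item: item["year"] == 2014)

# Combine the cached results; neither filter is evaluated again.
either = sports | recent       # union
both = sports & recent         # intersection
old_sports = sports - recent   # difference
print(either, both, old_sports)
```

    Set operations on cached ids are cheap, which is roughly why this style can feel responsive on the client side.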

    Also: How great is it that The New York Times is now getting into the habit of releasing code?

  • Where people bike and run, worldwide

    April 24, 2014  |  Mapping

    Strava activity maps

    Remember those running maps I made with limited data from RunKeeper? Strava, which also provides an app to track your runs and bike rides, has a much more expansive version of popular paths around the world. Their dataset includes over 77 million rides and 19 million runs, summing to about 220 billion data points. Just pan and zoom to your area of interest, and there you go.

  • Pride and joy of the Yau household

    The Change My Son Brought, Seen Through Personal Data

    I combed through personal data that I've actively and passively collected since early graduate school to see how life is different now with a 6-month-old.

  • The Upshot, a data-centric site from The New York Times launched

    April 22, 2014  |  News

    We heard a little bit about The Upshot last month. Now we get to see it. From editor David Leonhardt on what the site is about:

    One of our highest priorities will be unearthing data sets — and analyzing existing ones — in ways that illuminate and explain the news. Our first day of material, both political and economic, should give you a sense of what we hope to do with data. As with our written articles, we aspire to present our data in the clearest, most engaging way possible. A graphic can often accomplish that goal better than prose. Luckily, we work alongside The Times's graphics department, some of the most talented data-visualization specialists in the country. It's no accident that the same people who created the interactive dialect quiz, the deficit puzzle and the rent-vs-buy calculator will be working on The Upshot.

    Hey, I'm on board with any site where Amanda Cox introduces statistical models.

    FiveThirtyEight is still evolving, and I expect The Upshot to do the same, so it should be fun to see which way each goes (and what other sites come out of it). For now though, I'm just happy that we get to see this statistics-ish thing happen.

  • Music preference by region

    April 22, 2014  |  Mapping

    Music in America

    Movoto mapped music preference for various genres, across the United States.

    We calculated musical taste scores using data from the National Endowment for the Arts, the Bureau of Labor Statistics, and the U.S. Bureau of Economic Analysis (via the Martin Prosperity Institute) and state level music preferences from Wikipedia. The scores include music genre preference survey data and genre performer concentrations by metro, weighted by that metro's influence on the music scene. We took the scores for each metro and used a spatial statistics method called nearest neighbors to create the heatmap.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.