  • Data scientist surpasses statistician on Google Trends

    December 18, 2013  |  Statistics

    Statistician vs Data Scientist on Google Trends

    The relative interest in data scientist surpassed statistician this month. It was also higher in April and September of this year, so it's not new, but it does seem like it's ready to be a consistent thing, at least for a little while. That said, it doesn't seem like statistician is losing interest to data scientist, as the former has been fairly consistent for the past few years, so take that how you want.

  • Easy text classification

    December 10, 2013  |  Statistics

    etcML

    Text can be a great source of data, but it can be a challenge to glean information from an analysis standpoint. etcML can help you with that. Browse Twitter trends, classify your own text with existing machine learning classifiers, or upload your own training data.

    But most importantly, you can use etcML to learn interesting new things about whatever text data you're already working with in your job or research. Say you're a social scientist with written and multiple-choice survey responses — you can quickly see how well participants' written text allows you to guess their multiple-choice response. Or say you're a literary scholar who wants to know what distinguishes an author's early and late periods — you can train a classifier and visualize the most predictive words for each category.

    Saved for later.

  • Prediction of sexual orientation through Facebook friends

    November 13, 2013  |  Statistics

    Carter Jernigan and Behram F.T. Mistree found that the sexual orientation of an individual is strongly correlated with the sexual orientation of the individual's friends on Facebook.

    After analyzing 4,080 Facebook profiles from the MIT network, we determined that the percentage of a given user's friends who self–identify as gay male is strongly correlated with the sexual orientation of that user, and we developed a logistic regression classifier with strong predictive power. Although we studied Facebook friendship ties, network data is pervasive in the broader context of computer–mediated communication, raising significant privacy issues for communication technologies to which there are no neat solutions.

    Gay men had around seven times more gay friends than straight men on average. I imagine similar results for other demographics such as age and race. Their predictions weren't foolproof, though: 17 percent of gay men in their test data were misclassified as straight and 22 percent of straight men were misclassified as gay, which carries plenty of real-world problems.

    In any case, the results aren't surprising. Just filing this under the things that others can know about you without actually knowing you.

  • Up all night to get data, a music video parody

    November 12, 2013  |  Statistics

    Neuroscience students at the University of California, San Diego made a music video parody of Daft Punk's "Get Lucky." It's about gathering data in the lab. Graduate students are such nerds.

  • Global status tracker for open government data

    November 5, 2013  |  Data Sharing

    Open Data Index

    The Open Knowledge Foundation launched the Open Data Index, so you can see what data countries provide to their citizens.

    An increasing number of governments have committed to open up data, but how much key information is actually being released? Is the available data legally and technically usable so that citizens, civil society and businesses can realise the full benefits of the information? Which countries are the most advanced and which are lagging in relation to open data? The Open Data Index has been developed to help answer such questions by collecting and presenting information on the state of open data around the world - to ignite discussions between citizens and governments.

    Based on community editor contributions, the index assesses the availability of datasets such as transportation timetables, election results, and legislation, and provides a single-number score. The higher the score is, the more data a government makes available to the public. Of the 70 participating countries, the UK leads the way, followed by the United States and Denmark.

  • Cancer data for the U.S. released

    November 1, 2013  |  Data Sources

    The Centers for Disease Control and Prevention released their most recent cancer data a few days ago. It's the numbers for 2010, which feels dated. However, the annual data goes back to 1999, across demographics and states, which makes this data worth a look. You can download the delimited files here.

    A browser accompanies the release, as shown below. It's really just that though, leaving analysis up to you, and it's rough around the edges.

    Cancer statistics

    So if you're looking for a weekend project, this is a good place to go. I'd probably start with the age breakdowns and work from there.

  • U.S. Open Data Institute

    October 29, 2013  |  Data Sharing

    With a $250,000 grant from the Knight Foundation, Waldo Jaquith pushes forward with the U.S. Open Data Institute, an effort to link government data sources and organizations over the next year.

    I'm convinced that we already have many of the right people, organizations and businesses working on open data in the United States. They just don't know about each other. (The organization certainly won't duplicate any of the efforts of the folks in this space.) And we have nearly all of the necessary software, but so much of it is only known within its narrow domain, despite its broad applicability. The institute will connect all of these entities, promote the work of those who are leading the way and provide supportive, nonjudgmental assistance to those who need help. We don't have all the answers, but we know the folks who do. We want to amplify their message and connect them to new collaborators and clients.

    This could be fun.

  • Degrees of separation between athletes from different sports

    October 23, 2013  |  Statistics

    You've probably heard of the six degrees of Kevin Bacon. The idea is that you can name any actor and trace back to Kevin Bacon through actors who have worked together. Ben Blatt for Slate applied this idea to sports and put together an interactive that finds the number of degrees between athletes. The fun part is that you can enter two athletes from different professional sports: basketball, football, and baseball.

    What's even more remarkable is that it's possible to connect players who didn't even play the same sport. Cross-sport athletes like Deion Sanders and Bo Jackson are exceedingly rare, and some combinations of sports are hardly seen at all. Of these 18 athletes, all but one—Bud Grant—played baseball as one of his two pro careers, proving either that the stars of the diamond are athletic enough to master other sports or that anyone athletic enough to play basketball or football can also handle baseball. Hockey is the opposite, as there has never been a pro hockey player who also played top-level basketball, football, or baseball. As a result, hockey is a closed system. But once you get off the ice, it's possible to link every pro baseball, basketball, and football star.

    I like how it only takes 18 players (well, actually probably fewer) to pull double-time to make this possible. To link Yao Ming (basketball) and Joe Montana (football), it only took six hops, with Mark Hendrickson as a link between basketball and baseball and Deion Sanders as the link between baseball and football.

    Surprising? Kind of, but then again, in 2011, almost all pairs of people on Facebook could be linked with just six hops, too. The barebones interactive is still a lot of fun to play with though if you follow sports.
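
    The hop counting behind an interactive like this is a shortest-path search. Here's a minimal sketch using breadth-first search, assuming a graph where two athletes are linked if they were teammates. The player labels and links are made up for illustration, and this isn't Blatt's actual code or data.

```python
from collections import deque


def degrees_of_separation(teammates, start, goal):
    """Return the fewest teammate hops from start to goal, or None if unlinked."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        player, hops = queue.popleft()
        for other in teammates.get(player, ()):
            if other == goal:
                return hops + 1
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return None


# Hypothetical toy graph: labels are placeholders, not real rosters.
teammates = {
    "hoops_star": ["two_sport_a"],
    "two_sport_a": ["hoops_star", "slugger"],
    "slugger": ["two_sport_a", "two_sport_b"],
    "two_sport_b": ["slugger", "quarterback"],
    "quarterback": ["two_sport_b"],
    "goalie": [],  # no cross-sport links, so hockey stays a closed system
}

print(degrees_of_separation(teammates, "hoops_star", "quarterback"))
print(degrees_of_separation(teammates, "hoops_star", "goalie"))
```

    In this toy graph, the two cross-sport players are the only bridges between sports, which is the same role the post describes for the handful of real two-sport athletes. The unlinked goalie comes back as None.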

  • Government data shutdown

    October 2, 2013  |  Data Sources

    Census shutdown

    When you go to the United States Census site, Data.gov, or similar government-run sites, you see this. "Due to the lapse in government funding, census.gov sites, services, and all online survey collection requests will be unavailable until further notice." Now it's personal.

  • Consequences of big data exclusions

    October 2, 2013  |  Statistics

    Big data, in all its glory, promises insights into the soul of humankind. There's a hefty restriction though. Data only tells you about the population and actions of individuals it represents, which inevitably excludes part of the population. Jonas Lerman considers two hypothetical people. The first one:

    The first is a thirty-year-old white-collar resident of Manhattan. She participates in modern life in all the ways typical of her demographic: smartphone, Google, Gmail, Netflix, Spotify, Amazon. She uses Facebook, with its default privacy settings, to keep in touch with friends. She dates through the website OkCupid. She travels frequently, tweeting and posting geotagged photos to Flickr and Instagram. Her wallet holds a debit card, credit cards, and a MetroCard for the subway and bus system. On her keychain are plastic barcoded cards for the “customer rewards” programs of her grocery and drugstore. In her car, a GPS sits on the dash, and an E‑ZPass transponder (for bridge, tunnel, and highway tolls) hangs from the windshield.

    That's a lot of data. The second person:

    He lives two hours southwest of Manhattan, in Camden, New Jersey, America’s poorest city. He is underemployed, working part-time at a restaurant, paid under the table in cash. He has no cell phone, no computer, no cable. He rarely travels and has no passport, car, or GPS. He uses the Internet, but only at the local library on public terminals. When he rides the bus, he pays the fare in cash.

    The second person has fewer data flows.

    These days, big data exclusion almost sounds like a good thing if you're intent on avoiding all marketing-related data collection, but when policy-making, fund allocation, etc. come into play, it's possible the excluded aren't counted. That's not to say people should hurriedly sign up for Facebook and opt in to every tracking study. It's the opposite. Those in charge of the data and those who decide based on what they see in the data are responsible for knowing the background of their source.

  • Tracking criminal movements and predicting hot spots

    September 23, 2013  |  Statistics

    Finding crime hot spots

    In the latest SIAM Journal on Applied Mathematics, Chaturapruek, et al. describe modeling criminal movements based on where potential criminals live and areas of interest.

    Data available on distance between criminals' homes and their targets shows that burglars are willing to travel longer distances for high-value targets, and tend to employ different means of transportation to make these long trips. Of course, this tendency differs among types of criminals. Professionals and older criminals may travel further than younger amateurs. A group of professional burglars planning to rob a bank, for instance, would reasonably be expected to follow a Lévy flight.

    "There is actually a relationship between how far these criminals are willing to travel for a target and the ability for a hotspot to form," explain Kolokolnikov and McCalla.

    I hear the RV and the Pontiac Aztek are the preferred modes of transportation among high school chemistry teachers turned meth cooks.

    Full paper here, if you're into that.

  • A visual explanation of Simpson’s Paradox

    September 19, 2013  |  Statistics

    Simpson paradox

    When you look for overall trends, you often poke around the data in aggregate, but when you zoom out too far, you could miss details or within-category variation. Sometimes when you zoom in, you see the complete opposite of the trend you saw overall. This is known as Simpson's Paradox. Lewis Lehe and Victor Powell explain in a series of small, interactive charts.

    Why does this matter?

    Simpson's paradox usually fools us on tests of performance. In a famous example, researchers concluded that a newer treatment for kidney stones was more effective than traditional surgery, but it was later revealed that the newer treatment was more often being used on small kidney stones. More recently, on elementary school tests, minority students in Texas outperform their peers in Wisconsin, but Texas has so many minority students that Wisconsin beats it in state rankings. It would be a shame if Simpson's paradox led doctors to prescribe ineffective treatments or Texas schools to waste money copying Wisconsin.

    The takeaway lesson: Remember to look at the details. [Thanks, Victor]
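
    To see the reversal in numbers, here's a small sketch of the kidney stone example. The counts are the commonly cited figures from the 1986 study behind that example (Charig et al.). They aren't in the explainer excerpt above, so treat them as an outside assumption.

```python
# (successes, patients) by stone size; commonly cited counts, not from the post.
outcomes = {
    "traditional surgery": {"small": (81, 87), "large": (192, 263)},
    "newer treatment": {"small": (234, 270), "large": (55, 80)},
}


def success_rate(successes, patients):
    """Share of patients treated successfully within one group."""
    return successes / patients


def overall_rate(groups):
    """Success rate after pooling every stone size together."""
    successes = sum(s for s, _ in groups.values())
    patients = sum(n for _, n in groups.values())
    return successes / patients


for treatment, groups in outcomes.items():
    by_size = ", ".join(
        f"{size} stones {success_rate(*counts):.0%}"
        for size, counts in groups.items()
    )
    print(f"{treatment}: {by_size}, overall {overall_rate(groups):.0%}")
```

    Surgery wins within both the small and the large stone groups, yet the newer treatment wins overall, because most of its patients had the easier small stones.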

  • Dialect quiz shows where others talk like you do

    September 18, 2013  |  Statistics

    Dialect quiz and survey

    North Carolina State statistics graduate student Joshua Katz already mapped dialect across the United States, and now there's a fun addition in quiz form. Answer the 25-question survey (or the more detailed 140-question version if you dare), and you get a map of language similarity. More specifically, the resulting map shows the probability that someone in that area understands what you're saying.

    My results were dead on.

  • UK Census at risk

    September 5, 2013  |  Statistics

    There is a possibility that the UK Census will be scrapped for cheaper options next year.

    The census faces its biggest shake-up in its 200-year history under Office for National Statistics proposals.

    An online survey could replace the study - carried out every 10 years - or information could instead be collated using data already held by government.

    The plans will be fleshed out and put out to consultation this month before Parliament makes a decision in 2014.

    This is both surprising and expected. On the one hand, a decennial Census provides a granular view of a nation that is hard to match. However, on the other, it can be expensive to count everyone, and there are more data sources now than there were 200 years ago that can be drawn upon. The catch is that a lot of data sources use the Census as a baseline or a seed to make estimates.

    The main issue seems to be cost, which is estimated to be £482 million over a ten-year period and comes out to about $1.10 per person per year. In contrast, the 2010 United States Census cost $13 billion, which comes out to about $4.20 per person per year. So it'll be interesting to see what the ONS decides, as I'm sure it's going to get US officials thinking about prices, too, especially since the US Census cost almost four times more per capita.
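
    For the curious, the per-person arithmetic can be sketched like this. The population counts and the exchange rate are my assumptions, not figures from the ONS or the Census Bureau cost estimates, so the results only land near the numbers above.

```python
def cost_per_person_per_year(total_cost, population, years=10):
    """Spread a census budget over the population and the years between counts."""
    return total_cost / population / years


# Assumptions, not from the post: population counts and exchange rate.
UK_POPULATION = 63_200_000   # roughly the 2011 UK Census count
US_POPULATION = 308_700_000  # roughly the 2010 US Census count
USD_PER_GBP = 1.55           # assumed rate for late 2013

uk = cost_per_person_per_year(482e6 * USD_PER_GBP, UK_POPULATION)
us = cost_per_person_per_year(13e9, US_POPULATION)

print(f"UK: ${uk:.2f}  US: ${us:.2f}  ratio: {us / uk:.1f}x")
```

    The US figure comes out close to $4.20 regardless. The UK figure and the ratio shift with whatever exchange rate and population you plug in.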

    In any case, hopefully the UK Census sticks around. A complete cut of the program can't possibly be beneficial at this point.

  • Data in the service of humanity

    September 2, 2013  |  Statistics

    For this rainy Labor Day, here's an uplifting talk by DataKind founder Jake Porway. He talks data and how it can make a worthwhile difference in areas that could use a change.

  • The inventor of modern probability

    August 27, 2013  |  Statistics

    Andrei Kolmogorov is a name unfamiliar to most, but his work had lasting impact. Slava Gerovitch profiled the mathematician, describing the change in thought towards probability theory, which was once more of a joke than a serious approach to evaluate the world. I especially liked the bit about Kolmogorov's appreciation for the arts.

    Music and literature were deeply important to Kolmogorov, who believed he could analyze them probabilistically to gain insight into the inner workings of the human mind. He was a cultural elitist who believed in a hierarchy of artistic values. At the pinnacle were the writings of Goethe, Pushkin, and Thomas Mann, alongside the compositions of Bach, Vivaldi, Mozart, and Beethoven—works whose enduring value resembled eternal mathematical truths. Kolmogorov stressed that every true work of art was a unique creation, something unlikely by definition, something outside the realm of simple statistical regularity. "Is it possible to include [Tolstoy's War and Peace] in a reasonable way into the set of 'all possible novels' and further to postulate the existence of a certain probability distribution in this set?” he asked, sarcastically, in a 1965 article.

  • A master’s degree in statistics is worthwhile

    August 26, 2013  |  Statistics

    Statistician (and brand new PhD student) Jerzy Wieczorek explains the usefulness of a master's degree in statistics.

    There's a huge difference between undergraduate Stats 101 (apply a few standard procedures to nice clean datasets) and real data analysis work (figure out how to clean the data and modify your procedures to the messy context in front of you). So a masters-level mathematical/theoretical stats course, where you learn to prove which estimators have desirable properties or to derive tests that are appropriate in a given situation, is invaluable when you run into non-standard problems. The masters degree will also expose you to many techniques that you probably didn't cover as an undergrad: designing good experiments, computer-intensive methods like the bootstrap, special-use techniques like time series or spatial statistics, other inference philosophies like Bayesian statistics, etc.

    Yep.

    Of course Jerzy and I are slightly biased. Saying a master's degree in statistics isn't worthwhile is like saying we wasted our time, but if you really want to learn data, whether it's for analysis, visualization, journalism, or whatever, statistics helps you get there.

    And whereas the PhD route takes a certain type of person, most master's degrees take only two years to finish, and your analysis skills end up far beyond those of an undergrad. Graduate statistics is also way more interesting, because you focus more on practical usage and less on hypothesis tests.

  • Pickle Index for population estimation

    August 21, 2013  |  Statistics

    Pickle index

    As China moves forward with a plan to move 250 million people to cities, officials developed a need to keep track of how many people are still in rural areas. The problem was that local data from the provinces is unreliable, so instead they use what they call a "pickle index" which banks on the correlation between the amount of pickles eaten by rural residents and population.

    According to the South China Morning Post, the country’s National Development and Reform Commission has found that sales of zha cai, a pickled mustard tuber, provide a better guide to population flows than often unreliable provincial statistical data. As an unnamed planner explained to the Economic Observer:

    "Under normal circumstances, urban consumption levels of convenience foods such as instant noodles and pickled mustard is essentially constant. Therefore, we can assume that volume changes are mainly caused by a city’s floating population."

    Maybe we can work out a hot dog index for the US. [via Waxy]

  • Why everyone is more popular than you

    August 15, 2013  |  Statistics

    Mathematician Hannah Fry is back with another video. She explains why it seems like everyone in your network — on Twitter, Facebook, and in real life — is more popular than you and how we can use this idea to predict the spread of diseases. Fry's understated presentation style totally enhances the interesting subject matter.
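
    The effect is easy to check on a toy network. Here's a minimal sketch that compares how many friends people have with how many friends their friends have. The network is made up for illustration and isn't from Fry's video.

```python
from statistics import mean

# Hypothetical friendship network: one well-connected hub and a few others.
friends = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}


def friend_count(person):
    return len(friends[person])


# Average number of friends per person.
avg_friends = mean(friend_count(p) for p in friends)

# For each person, the average friend count among their friends,
# then averaged over everyone.
avg_friends_of_friends = mean(
    mean(friend_count(f) for f in friends[p]) for p in friends
)

# People whose friends are, on average, more popular than they are.
less_popular = [
    p for p in friends
    if friend_count(p) < mean(friend_count(f) for f in friends[p])
]

print(avg_friends, avg_friends_of_friends, less_popular)
```

    Everyone except the hub has friends who are more popular on average, because the hub shows up in everybody's friend list. That's also why picking friends of random people is a cheap way to find the well-connected ones.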

  • Google search suggestions by country

    August 8, 2013  |  Statistics

    Search suggestions by country

    Google search suggestions have transformed into a never-ending source of entertainment and a candid peek into what people look for in the world. We've seen insecurities change with age and stereotypes of states in the US. Noah Veltman banked on the locality of suggestions for a country-specific view of the world. He shows suggestions for the same query for the United States, Canada, the United Kingdom, Australia, and New Zealand.

    For example, a search for "why is America" in each country depicts stereotypes and national curiosities about why America is so fat, rich, and better than Canada. Scroll down and you see suggestions for "how to", "why is there", and "why does everyone", which interestingly show many of the same wonderings.

    Now if you'll excuse me, I have to go eat bacon and swim in my pool of gold coins while I browse through my vastly superior Netflix selection.

Copyright © 2007-2014 FlowingData. All rights reserved. Hosted by Linode.