Online dating site OkCupid continues their run of amusing yet thorough analyses of their users. This time: the real stuff white people like. Well actually, the stuff that all races like:
We selected 526,000 OkCupid users at random and divided them into groups by their (self-stated) race. We then took all these people's profile essays (280 million words in total!) and isolated the words and phrases that made each racial group's essays statistically distinct from the others'.
Top phrase for white males? Tom Clancy. White female? The Red Sox. Black males? Soul food. Black females? Soul food. Asian males? Taiwan. Asian females? Coz. Yeah, I don't know what that is either.
There are a bunch of college ratings out there to help students decide what college to apply to (and give something for alumni to gloat about). The tough part is that there doesn't seem to be any agreement on what makes a good college. Alex Richards and Ron Coddington describe the discrepancies.
If you've ever created an interactive graphic or anything else that requires that you feed in data, you will love this barebones data conversion tool by Shan Carter. Copy and paste data from Excel, which I feel like I've done a billion times, and then take your pick from Actionscript, JSON, XML, and Ruby. Simple, but a potential time saver. [via]
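For a sense of what this kind of conversion involves, here is a minimal Python sketch. It is not Shan Carter's tool or its code; it only assumes that data copied from a spreadsheet arrives as tab-separated text with a header row. The function name `pasted_table_to_json` and the sample data are made up for illustration.

```python
import csv
import io
import json


def pasted_table_to_json(pasted: str) -> str:
    """Convert tab-separated text with a header row into a JSON array of objects.

    Assumption: cells are separated by tabs and rows by newlines,
    and the first row holds the column names.
    """
    reader = csv.DictReader(io.StringIO(pasted), delimiter="\t")
    # Every value stays a string; no type guessing is attempted here.
    return json.dumps(list(reader), indent=2)


# Made-up sample standing in for a spreadsheet paste.
sample = "name\tvalue\nalpha\t1\nbeta\t2\n"
print(pasted_table_to_json(sample))
```

Numbers come out as strings in this sketch, so a real converter would need a pass to detect numeric columns.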
You can get pretty far with data graphics with just limited statistical knowledge, but if you want to take your skills, resume, and portfolio to the next level, you should learn standard data practices. Of all places, the UK Parliament has some short and free guides to help you with basic statistical concepts. They provide 13 notes, each only two or three pages long, that can help you with stuff like how to adjust for inflation, confidence intervals and statistical significance, or basic graph suggestions [pdf]. I like.
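As a taste of the inflation adjustment mentioned above, here is a minimal Python sketch of the usual price-index approach: scale the amount by the ratio of the current index to the index from the earlier period. It is not taken from the Parliament notes. The function name `adjust_for_inflation` and the index values are hypothetical, so plug in a real price index series before trusting any number.

```python
def adjust_for_inflation(amount: float, index_then: float, index_now: float) -> float:
    """Express an amount from an earlier period in current prices.

    Uses the ratio of two price index values: amount * index_now / index_then.
    """
    if index_then <= 0:
        raise ValueError("index_then must be positive")
    return amount * index_now / index_then


# Hypothetical index values, made up for illustration only.
print(adjust_for_inflation(100.0, index_then=80.0, index_now=100.0))  # 125.0
```

With these made-up values, prices rose 25 percent between the two periods, so 100 back then buys what 125 buys now.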
Private browsing. All the modern browsers have it. Turn it on, and the browser won't keep your history during the session. Sometimes it's used to pay bank bills on a public computer. Sometimes it's used for other stuff. In an opt-in study looking at a week in the life of a browser, Mozilla looked at how people use private browsing.
Again, it's worth noting that people opted in to this study (about 4,000 of them), and Mozilla only recorded when users started and stopped private browsing. Nothing in between.
That said, they came up with two basic findings. The first is when people typically use private browsing (above).
They saw usage spikes during the lunch hours and again just before the work day ended. Another spike comes after the dinner hours, and a final one in the late hours of the night.
On Friday, Michael D. Smith, dean of the Harvard faculty of arts and sciences, issued a letter to the faculty confirming the inquiry and saying the eight instances of scientific misconduct involved problems of “data acquisition, data analysis, data retention, and the reporting of research methodologies and results.” No further details were given.
This is why we don't just accept any old data and why we care about the methodology behind the numbers. Stuff like this always reminds me of an exam question that asked us to investigate the data from an article in a prominent scientific journal. The analysis was all wrong.
Sometimes data is wrong out of ignorance. Other times it's wrong because people make stuff up. I can understand the former, but why you would ever do the latter is beyond me.
Update: More details on what happened from research assistants' point of view on the Chronicle. [thx, Winawer]
Private-sector weather is an industry worth more than $1.5 billion, and it's largely because of the government's open weather data. You can find out what the weather is just about anywhere with a few clicks of the mouse. It wasn't always like that though. Clay Johnson, former director of Sunlight Labs, describes the history of open weather data, starting with Thomas Jefferson in the late 1700s.
My wife is an ER doc, so I hear about this sort of stuff all the time. Hospitals are going all-digital, and the exchange of data from doctor to doctor, from hospital to hospital, from patient to doctor, and doctor to patient is only going to get easier.
This expedited exchange of information will bring advantages such as fewer prescription errors and easier hospital transfers. Through sensors and mobile devices, health practitioners will also be able to provide better care to those with chronic health conditions. This illustration from Chris Luongo explains a bit more.
Naturally, with all these benefits come plenty of challenges. Data privacy is huge here. Can you imagine if your medical charts ended up in some random hacker's hands and were then sold to the highest bidder? At least we might get more useful spam. I want big discounts on misspelled drugs that I actually need.
Seriously though. Data is blowing up, and there's going to be monster demand for data scientists in the next ten years. See that wagon? Better jump on it while there's still room.
[via Smarter Planet]
I should just automatically bring the OkTrends feed into FlowingData. In their never-ending quest to understand humankind, the group from online dating site OkCupid analyzes 11.4 million opinions on what makes a "great" photo - as in makes people want to date you. Some of the findings include: photos from Panasonic Micro 4/3s were best received, "photo attractiveness" decreased with age, and the flash adds seven years.
There's one finding that's got everyone buzzing though. iPhone users have more sexual partners. See the graph above and below for the numbers.
People do everything they can in their OkCupid profiles to make themselves seem awesome, and surely many of our users genuinely are. But it's very hard for the casual browser to tell truth from fiction. With our behind-the-scenes perspective, we're able to shed some light on some typical claims and the likely realities behind them.
Among the findings:
- People exaggerate their height by about two inches.
- If someone says they make $100k per year, they probably mean $80k.
- The more attractive a picture, the older it is.
- Most self-identified bisexuals (80%) only like one gender.
Yeah, you read that right. Tardiness makes the world go 'round:
One day in 1939, Berkeley doctoral candidate George Dantzig arrived late for a statistics class taught by Jerzy Neyman. He copied down the two problems on the blackboard and turned them in a few days later, apologizing for the delay — he’d found them unusually difficult. Distracted, Neyman told him to leave his homework on the desk.
On a Sunday morning six weeks later, Neyman banged on Dantzig’s door. The problems that Dantzig had assumed were homework were actually unproved statistical theorems that Neyman had been discussing with the class — and Dantzig had proved both of them. Both were eventually published, with Dantzig as coauthor.
Other benefits include more hours of sleep, exercise while power-walking to your destination, and all-around warm, fuzzy feelings knowing that you live by nobody's schedule. You might also supposedly inspire films like Good Will Hunting.
internet.artizans reflects on the usefulness of open data:
I'm inspired by the idea that nuggets of opened data could seed guerilla public services, plugging gaps left by government, but i don't see any of that in the data.gov.uk apps list. The reasons aren't technical but psychosocial - the people and communities who could use this data to help tackle their own disadvantage and marginalisation don't have the self-confident sense of entitlement that makes for successful civic hacktivism.
The groups that really need it also often don't have the tech or know-how to make use of data, or even to collect useful data, to make a case for anything. People like us, the data- and tech-savvy, can help.
With regard to the UK's recent boom in open data, Simon Rogers of the Guardian ponders data's role in journalism and the opportunities this newfound information could bring:
The impact on journalism is expected to be great. The Chicago-based web developer and founder of the neighbourhood news site EveryBlock, Adrian Holovaty, says it's going to be challenging but exciting for journalists. "As more governments open their data, journalists lose privileged status as gatekeepers of information – but the need for their work as curators and explainers increases. The more data that's available in the world, the more essential it is for somebody to make sense of it."
This need not only creates a fresh brand of news, but also a new type of journalist:
I once prided myself on my lack of maths knowledge. Now I find myself editing a datajournalism site, the Guardian's datablog: a site where we use Google Spreadsheets to post key datasets. We make the data properly accessible, then encourage our users to take the numbers, produce graphics and applications and help us look for stories.
Priding yourself on not knowing how to deal with data is a little weird, but okay.
In any case, people always ask me how to get into information design, infographics, visualization, etc. Journalism is one way in, and there's a lot of opportunity there if you've got the skills.
BP processes about 1.5 million barrels of crude oil per day, across six refineries in the United States. In total, 150 refineries in the United States process just under 18 million barrels per day, so BP processes about 8.5 percent of it. However, as reported by the Center for Public Integrity, 97 percent of the most dangerous violations found by OSHA were on BP properties.
Maybe there's something to this whole data science thing after all. Mike Loukides describes data science and where it's headed on O'Reilly Radar. It's a good read, but statisticians get clumped into suits crunching numbers like actuarial drones:
Using data effectively requires something different from traditional statistics, where actuaries in business suits perform arcane but fairly well-defined kinds of analysis. What differentiates data science from statistics is that data science is a holistic approach. We're increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
What is data science? It's what well-rounded statisticians do.
Health and Human Services (HHS) is about to announce the launch of their Community Health Data Initiative over in DC right now. The point is to make health data more usable for consumers and communities.
Today groups will be presenting how they've made use of the data in the past few weeks from about 9:30 to 10:30 - as in right now. I've embedded the live webcast below.
They're just going through the formalities of thank yous and intros right now, but the good stuff should start soon.
Men's Health takes a look at America's most sugary drinks and their junk food equivalents. A Peppermint White Chocolate Mocha with whipped cream (venti size) from Starbucks has the same amount of sugar as 8½ scoops of Edy’s Slow Churned Rich and Creamy Coffee Ice Cream. Calorie-wise, the picture might look a little different. Still though, that's a lot of sugar.
Be careful what you drink, boys and girls.
[via Boing Boing]
When you ride your bicycle around, I bet you always wish for two things. First: "I wish this was electric so that I didn't have to pedal so much." Second: "I wish I could use my bicycle as a data collection device." Well guess what. Your dreams have come true. The Copenhagen Wheel, conceived by the MIT SENSEable City Lab, will do just that. With everything rolled up into one hub, a quick and simple installation turns your plain old bicycle into an electric data collection device.