Kaggle: data with destiny
Asher Moses in Silicon Valley
The world - and fortunes - turn so fast. Ten years ago Facebook didn't exist. Fifteen years ago, neither did Google.
Richard Wong, partner with big US venture capital firm Accel, says of the "modern day gold rush" in Silicon Valley: "What's fascinating about the Valley is every 10 years it changes...
"So, every 5-10 years it's pretty exciting that there's another generation of transformative fundamental technology companies that have been created here."
Australian entrepreneur Anthony Goldbloom is "making data science a sport". Photo: Jeremy Howard
Anthony Goldbloom, the Australian founder of one of the world's hottest tech startups in the field of "big data", is aiming to be the next big thing.
But he says Australia feels like a "backwater by comparison to Silicon Valley".
Goldbloom has toured the conference circuit promoting Kaggle to adoring geeks.
After hitting his head against a brick wall trying to get people back home excited about his idea, the founder of Kaggle moved to San Francisco last year, where he was swamped by hungry investors. He eventually raised $11 million from Silicon Valley heavyweights, including one of the founders of PayPal.
He says the biggest issue holding Australia back is not one of policy, but culture.
"In Australia you say 'hey I've got this really big idea, I think there's a chance this could become a billion dollar company', and it's like 'what? really? yeah..' it's just like inconceivable because it just doesn't happen," said Goldbloom, 28.
Kaggle allows those with problems to tap a liquid market of talent.
"Whereas over here ... because this place is surrounded by other success stories of companies going from zero to billion dollar companies over the course of a few years, you find that the enthusiasm around you and the optimism is infectious and it really lifts you up.
"As opposed to having every conversation be like 'are you serious, I can't quite get what you're doing, can you kind of explain it again?', you're coming into an environment where it's 'oh I love what you're doing, how can we work together, I've got this great recruit, oh you should meet this investor' - it's just a very different environment."
Goldbloom, who coded his site in a Bondi Beach apartment in Sydney, moved to the US last year and is well on his way to realising his vision.
According to Accel, his industry, "big data", which helps corporations take advantage of the swarms of data we all generate, sits alongside smartphones and the social web as one of three drivers of the current tech boom.
Kaggle essentially allows those with data-related problems to tap into a pool of over 33,000 PhD-level scientists and statisticians who compete to find the most accurate solutions. Winners earn cash prizes of anywhere from $5000 to $3 million.
The site attracted such a large investment so early because the solutions it unearths are potentially worth mountains of cash to companies, enabling them to make more efficient and effective decisions.
The competitive nature of Kaggle and the fact that it draws in experts from a variety of disciplines make it a powerful tool for solving problems that have stumped science agencies like NASA for years. In one case, a British glaciologist came up with the most effective algorithm for mapping dark matter in the universe.
A health care provider used Kaggle to develop a formula to help predict which patients would go to hospital in the next year, while a bank used Kaggle to predict which customers would default on loans.
"If you can get a more accurate model, you're shrinking the size of their bad debts, so for a big bank like Citi for instance I think the savings they could make in using a company like Kaggle ... would be in the billions, literally," said Goldbloom.
Analyst firm IDC says the world's data stores will reach 2.7 zettabytes this year, up 48 per cent on last year. Big data is essentially using computers to find relationships in that data that humans wouldn't normally intuit on their own.
For instance, a used car dealer ran a competition on Kaggle to develop an algorithm, built on historical data including sales and inventory information, to help him avoid buying lemons at car auctions.
The results showed, for instance, that orange cars were generally more reliable, and that colour was a very significant predictor of the reliability of a used car.
"The intuition here is that if you are the first buyer of an orange car, orange is an unusual colour you're probably going to be someone who really cares about the car and so you looked after it better than somebody who bought a silver car," said Goldbloom.
"The data doesn't lie - the data unearthed that correlation. It was something that they had not taken into account before when purchasing vehicles."
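To illustrate the kind of correlation Goldbloom describes, here is a minimal sketch in Python that compares the share of "lemons" by colour. The records below are invented for illustration and are not the dealer's data; the function name `lemon_rate_by_colour` is also hypothetical. A real competition entry would use a far richer model than a simple grouped rate.

```python
from collections import defaultdict

# Hypothetical auction records: (colour, was_lemon).
# These values are made up for illustration only.
records = [
    ("orange", False), ("orange", False), ("orange", False), ("orange", True),
    ("silver", True), ("silver", False), ("silver", True), ("silver", False),
    ("white", False), ("white", True), ("white", False), ("white", False),
]


def lemon_rate_by_colour(rows):
    """Return the fraction of cars flagged as lemons for each colour."""
    counts = defaultdict(lambda: [0, 0])  # colour -> [lemons, total]
    for colour, was_lemon in rows:
        counts[colour][0] += int(was_lemon)
        counts[colour][1] += 1
    return {colour: lemons / total for colour, (lemons, total) in counts.items()}


rates = lemon_rate_by_colour(records)
for colour, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{colour}: {rate:.0%} lemons")
```

A gap between colours in a table like this is only a correlation; it says nothing on its own about why the gap exists, which is where the intuition Goldbloom offers comes in.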
Kaggle is also being used for social good. One of its earliest wins was using genetic markers to predict the progression of HIV viral load - which can help inform doctors as to how quickly a patient may deteriorate.
"Four years' worth of academic research matched in a week and a half and way outdone in three months," said Goldbloom. "That was a pretty astonishing result at the time. Since then it's happened every time we've run a competition."
In one of its most recent competitions Kaggle users were given tens of thousands of high school essays marked by two teachers. They were told to come up with an algorithm that would essentially allow students to feed in their essays and have them automatically marked by machine.
"It turns out that teachers can be pretty inconsistent in the grades they give, such that the discrepancy between the winning algorithms and the discrepancy between the two teachers were about the same," said Goldbloom.
"So the best algorithms were about as reliable as teachers at scoring essays - pretty amazing!"
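The comparison Goldbloom describes can be sketched roughly as follows: measure how far apart the two teachers' marks are, then measure how far the algorithm's marks are from each teacher. The scores below are invented, and mean absolute difference is used here only as a simple stand-in; the article does not say which measure of discrepancy the competition used.

```python
def mean_abs_diff(scores_a, scores_b):
    """Average absolute gap between two lists of marks for the same essays."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be the same length")
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)


# Hypothetical marks for six essays (made up for illustration).
teacher_1 = [4, 3, 5, 2, 4, 3]
teacher_2 = [3, 3, 4, 2, 5, 4]
algorithm = [4, 3, 4, 2, 4, 4]

# How much the two human markers disagree with each other.
human_gap = mean_abs_diff(teacher_1, teacher_2)

# How much the algorithm disagrees with the humans, averaged over both teachers.
algo_gap = (mean_abs_diff(algorithm, teacher_1)
            + mean_abs_diff(algorithm, teacher_2)) / 2

print(f"teacher vs teacher: {human_gap:.2f}")
print(f"algorithm vs teachers: {algo_gap:.2f}")
```

If the algorithm's gap is no larger than the gap between the two teachers, then by this measure it is about as consistent as a human marker.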
Goldbloom, a former RBA and Treasury economist, said the algorithms took into account everything from the obvious, such as the number of spelling mistakes and grammatical errors, to the more difficult, such as correlating concepts in the question with concepts in the answers and looking for a logical structure and flow.
A key benefit of Kaggle is that it exposes problems to a global pool of talent from different disciplines, and that was demonstrated here.
The winning essay grading formula came from a partnership of a hedge fund trader in London, a programmer at the US National Weather Service in Washington DC and a computer science student based in Frankfurt, Germany.
The William and Flora Hewlett Foundation, which ran the $100,000 competition, said the push for computer marking was driven by the fact that, particularly in the US, students are only assigned a few writing assignments per semester because teachers are grappling with ballooning classroom sizes. Tests are often multiple choice.
"Our core interest is better understanding how technology can help teachers assign and grade more writing assignments," said the foundation's education program director Barbara Chow.
"Research suggests that the best way to improve writing is to write more frequently; revising based upon feedback."
While the automated scoring algorithms have proven their ability to mark large-scale standardised tests, it is less clear how useful they will be in the classroom. They also have difficulty determining whether statements of fact are correct.
Chow said these issues would be addressed in future competitions, but the NSW Teachers Federation is sceptical.
"A teacher usually marks essays for more than a mark ... it would seem to me to be a difficult exercise to exercise that professional judgment about the student and mark it in a way that would provide each child individually with commentary around where their strengths lie and where they could overcome some of the weaknesses," said senior vice-president Joan Lemaire.
Jaison Morgan of consultancy firm The Common Pool, which is working on this project with the Hewlett Foundation, said classroom trials were in the works to see if computer marking could aid teachers, not replace them.
"We want to see if a classroom application would support teachers (just as a calculator supports math instruction)," he said.
He said the goal with the classroom trials would be to see if a computer algorithm could give students feedback on their essays as they write.
"Computers can strip out a lot of the biases and preferences that affect human markers," said Morgan.
"To do human marking properly or strip out biases you need five teachers marking each essay."
The Board of Studies said it did not have any plans to introduce automated marking for extended response questions in the HSC. In a classroom setting the decision was a matter for individual schools.