Big data: revolution by numbers
Sport
Although cricket lovers have a centuries-old obsession with statistics (see Wisden), the sport that has probably been most transformed by data is baseball. The patron saint of the field is Bill James, who has been collecting and publishing baseball statistics since 1977. He coined the term sabermetrics, which he defined as “the search for objective knowledge about baseball”. The first person to use sabermetrics as a decision-making tool seems to have been Paul DePodesta, who worked as Billy Beane’s assistant at the Oakland Athletics. He figured prominently in Michael Lewis’s bestselling book, Moneyball, and the character Peter Brand in the film is partly based on him.
Finance and banking
The stock market has always been a numbers game, but over the last few decades the financial services industry has been comprehensively – and pathologically – mathematised. Casino banking became the preserve of quants – mathematics and physics graduates and sundry geeks who dreamed up the incomprehensible derivative products (such as CDOs – collateralised debt obligations) that eventually led to the banking meltdown. In many ways, the most interesting figures to emerge from the catastrophe are people such as Greg Lippmann, who spotted the pattern in the madness and bet against it.
Science
“In God we trust, all others bring data,” reads a plaque in Nasa’s Johnson Space Centre. Science is increasingly data-intensive. The data recorded by each of the big experiments at the Large Hadron Collider (LHC) at Cern in Geneva is enough to fill around 100,000 DVDs every year. Or take the Sloan Digital Sky Survey, which is measuring 500 distinct attributes for each of 100m galaxies, 100m stars and 1m quasars. The result: three terabytes of data, where a terabyte is 1,000 gigabytes. Analysing that volume of data is beyond the capacity of humans, so it has to be done by computers. The same goes for genomics, environmental science and other fields. Some researchers see this as a portent of a radical transformation of scientific research. Whereas once we dreamed up theories and then looked for data to corroborate or refute them, we will increasingly use computer analysis to spot patterns and connections that may have theoretical significance.
Business
The internet is the greatest generator of data about human behaviour that we’ve ever had, for the simple reason that 2bn people use it and everything they do on the net is logged and therefore available for analysis. Business advantage – or even survival – can depend on how good an organisation’s analytics are. The days when one measured “hits” have long gone. (Hits, says analytics evangelist Avinash Kaushik, is short for “how idiots track success”.) No serious company now puts up a website without embedding tools in its code for measuring how visitors use every aspect and feature of the site.
Journalism
Journalists have always needed to be able to handle data: Florence Nightingale’s exposé of the scandalous conditions faced by British soldiers was published in 1858. What’s changed, says the Guardian’s data editor, Simon Rogers, is that nowadays data is published as spreadsheets and database files, amenable to computerised analysis, rather than on paper. There are lots of important stories buried in the torrents of data that governments are obliged to publish, but journalists need the computational tools required to find them.
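To make the idea concrete, here is a toy sketch of the kind of computational sifting involved – a few lines of standard-library Python that scan a spreadsheet-style dataset for the sharpest spending cut. The department names and figures are entirely invented for illustration; a real published dataset would be far larger, which is precisely why the analysis has to be done by machine.

```python
import csv
import io

# Hypothetical snippet of a government spending spreadsheet,
# as it might be released in CSV form (all figures invented).
RAW = """department,2010_spend_m,2011_spend_m
Health,101000,104500
Education,57000,56200
Transport,22000,19400
Defence,39000,38500
"""

def biggest_fall(raw_csv):
    """Return the row with the largest year-on-year drop in spending."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    for row in rows:
        # Compute the change in millions; negative means a cut.
        row["change_m"] = float(row["2011_spend_m"]) - float(row["2010_spend_m"])
    # The minimum change is the deepest cut.
    return min(rows, key=lambda r: r["change_m"])

worst = biggest_fall(RAW)
print(worst["department"], worst["change_m"])  # prints: Transport -2600.0
```

With four rows the answer is obvious by eye; with four million rows – the scale at which government data is now published – a script like this is the only practical way to find the story.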
Medicine
Last week’s news that Cambridge researchers stopped an MRSA outbreak affecting 12 babies in the Rosie Hospital by rapidly sequencing the bacterium’s genome illustrates how medicine has become a data-intensive field. Even a few years ago, achieving this would have required a roomful of computers and upwards of a week.