
I like many of the points you and the authors of the research paper make, Lewis. Just augmenting the argument with some additional thoughts and observations.

Mark Lowenthal focuses on knowledge creation and how little most analysts can mine from Big Data using the equivalent of the picks and shovels most commonly used today. But high-volume mining and ore processing via machine learning is becoming possible. A handful of companies are figuring out how to do it.

Are we underspending on more immediately useful kinds of intelligence collection and analysis? That's not a new argument; it has been made since the first Keyhole satellite. Of course we're underspending on HUMINT and on the manpower, time and effort it takes to create the Big Picture. No argument there. And we're "overspending" (at least from a short-term POV) on what's new and shiny. Over the long term, though, the "overspent" investment holds promise.

The history of PHOTINT provides a model for what's emerging in Big Data. PHOTINT has paid massive dividends ever since the Eisenhower administration and the Cuban missile crisis, but it took decades of investment to bring that set of technologies to fruition. And now, large-scale analytics (if done right) does provide a valid new window on the world.

Big Data, as you point out, also demands new means of data management. PHOTINT introduced the data handling problem that has become a crisis today for those collecting massive amounts of video, not to mention all the other less structured data. That problem, first evident with PHOTINT, has only become more acute with each new data source and scalable digital collection method.

The techniques behind Big Data are one way to tackle that data handling problem. Yes, those techniques are most associated with knowledge discovery using distributed computing, scalable nonrelational data stores, and analytics methods, but they're also about scalable data handling on the cheap using commodity clusters. Hadoop is basically a cheap, scalable data storage medium. (FYI: we did a full analysis of the technology behind those methods back in 2010 here: — a layperson-accessible primer with some thoughts on the business implications. I don't think the authors listed here consider how "cloud" data management differs from what's gone before.)
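To make the commodity-cluster idea concrete, here is a toy, single-process sketch of the map/reduce pattern that Hadoop popularized. This is an illustration only, not Hadoop itself: real deployments shard both the data and these two phases across many cheap machines, and the record set here is invented for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Emit (key, value) pairs -- here, one pair per word.
    # On a real cluster, each mapper runs near its shard of the data.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Group by key and aggregate the values.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical records standing in for a much larger corpus.
records = ["big data big picture", "data handling at scale"]
counts = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The point of the pattern is that neither phase needs expensive shared storage: mappers and reducers can run independently on commodity boxes, which is why the economics of data handling changed.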

Analytics itself, interestingly enough, also helps with scaling data management and integration. If you look at recent "smart machine" team claims (such as ), these are about machine-assisted pattern mining that hadn't been visible before. You can persist, categorize and reuse those discovered patterns. Machine-assisted ontology building is another way to think about it.

That reusability will provide a new lens into what had been largely inscrutable because it hadn't been properly linked. How many times have you discovered people, places and things out on the web, in social media, via a search engine, or through a personalized set of news feeds? Social provides a layer of human-assisted integration. Entity and relationship extraction at scale, with the help of machine learning, will create another level of visibility by helping us improve the integration of large datasets (by accelerating the development of the semantic web, something else we wrote about a while back at ).
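A toy sketch of what "entity and relationship extraction" means in practice. Everything here is a deliberately naive stand-in: production systems use trained ML models rather than a capitalization regex, and the documents are invented for the example. The persisted graph is the part that matters — it is the reusable, discovered pattern described above.

```python
import re
from collections import defaultdict

def extract_entities(sentence):
    # Naive stand-in for a machine-learned entity recognizer:
    # treat capitalized tokens as candidate entities.
    return re.findall(r"\b[A-Z][a-z]+\b", sentence)

def build_graph(sentences):
    # Persist discovered relationships (here, simple co-occurrence
    # within a sentence) so they can be reused later to link records
    # across datasets.
    graph = defaultdict(set)
    for s in sentences:
        ents = extract_entities(s)
        for a in ents:
            for b in ents:
                if a != b:
                    graph[a].add(b)
    return graph

# Hypothetical snippets standing in for web-scale text.
docs = ["Boyd met Crawford in Boston.", "Crawford wrote about Hadoop."]
g = build_graph(docs)
print(sorted(g["Crawford"]))  # ['Boston', 'Boyd', 'Hadoop']
```

Once extracted, those entity links become a queryable layer over previously unlinked text — the machine-assisted integration the paragraph above describes.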

Boyd and Crawford ponder the data oligarchs and the data wealth gap that currently exists. That problem will worsen. Most people don't even realize yet that they need to stake a claim to their own personal data at a minimum. What many companies (and some governments) are doing amounts to claim jumping. During the 2010s, we're in for another gold rush of data miners staking claims, and only a fraction of the mining companies will have the knowledge, the robotic and financial wherewithal, and the motivation to profit.

We're in for another Gilded Age.