A Longtime Tool of the Community

October 1, 2013
By Lewis Shepherd

Point-Counterpoint: Is Big Data the Way Ahead for Intelligence?

What do modern intelligence agencies run on? They are internal combustion engines burning pipelines of data, and the more fuel they burn the better their mileage. Analysts and decision makers are the drivers of these vast engines; but to keep them from hoofing it, we need big data.

The intelligence community necessarily has been a pioneer in big data since inception, as both were conceived during the decade after World War II. The intelligence community and big data science always have been intertwined because of their shared goal: producing and refining information describing the world around us, for important and utilitarian purposes.

Let’s stipulate that today’s big-data mantra is overhyped. Too many technology vendors are busily rebranding storage or analytics as “big data systems” under the gun from their marketing departments. That caricature rightly is derided by both information technology cognoscenti and non-techie analysts.

I personally understand the disdain for machines, as I had the archetypal humanities background and was once a leather-elbow-patched tweed-jacketed Kremlinologist, reading newspapers and human intelligence (HUMINT) for my data. I stared into space a lot, pondering the Chernenko-Gorbachev transition. Yet as Silicon Valley’s information revolution transformed modern business, media, and social behavior across the globe, I learned to keep up—and so has the intelligence community.

Twitter may be new, but the intelligence community is no Johnny-come-lately in big data. U.S. government funding of computing research in the 1940s and 1950s stretched from World War II’s radar/countermeasures battles to the elemental electronic intelligence (ELINT) and signals intelligence (SIGINT) research at Stanford and MIT, leading to the U-2 and OXCART (ELINT/image intelligence platforms) and the Sunnyvale roots of the National Reconnaissance Office.

In all this effort to analyze massive observational traces and electronic signatures, big data was the goal and the bounty.

War planning and peacetime collection were built on collection of ever more massive amounts of data from technical platforms. These told the United States what the Soviets could and could not do—and therefore where we should and should not fly, or aim, or collect. And all along, the development of analog and then digital computers to answer those questions, from Vannevar Bush through George Bush, was fortified by massive government investment in big data technology for military and intelligence applications.

In today’s parlance, big data typically encompasses just three linked computerized tasks: storing collected data, such as with Amazon’s cloud; finding and retrieving relevant data, as with Bing or Google; and analyzing connections or patterns among the relevant data using powerful web-analytic tools.

The benefit of the intelligence community’s early adoption of big data was not just to cryptology, although decrypting enemy secrets would have been impossible without it. More broadly, computational big data horsepower was in use constantly during the Cold War and after, producing intelligence that guided U.S. defense policy and treaty negotiations or verification. Individual analysts formulated requirements for tasked big-data collection with the same intent as when they tasked HUMINT collection: to fill gaps in our knowledge of hidden or emerging patterns of adversary activities.

That’s the sense-making pattern that leads from data to information, to intelligence and knowledge. Humans are good at it, one by one. Murray Feshbach, a little-known U.S. Census Bureau demographic researcher, made astonishing contributions to the intelligence community’s understanding of the crumbling Soviet economy and its sociopolitical implications by studying reams of infant-mortality statistics and noticing patterns of missing data. Humans can provide that insight brilliantly, but at the speed of hand-eye coordination.

Machines make a passable rote attempt, but at blistering speed, and they do not balk at repetitive mind-numbing data volume. Amid the data, patterns emerge. Today’s Feshbachs want an Excel spreadsheet or Hadoop table at hand so they are not limited to the data they can carry reasonably in their mind’s eye.

To cite a recent joint research paper from Microsoft Research and MIT, “Big data is notable not because of its size but because of its relationality to other data. Due to efforts to mine and aggregate data, big data is fundamentally networked. Its value comes from the patterns that can be derived by making connections between pieces of data, about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.” That reads like a subset of core requirements for intelligence community analysis, whether social or military, tactical or strategic.

The synergy of human and machine for knowledge work is much like modern agricultural advances—why would a farmer today want to trudge behind an ox-pulled plow? There is no zero-sum choice to be made between technology and analysts, and the relationship between chief information officers and managers of analysts needs to be nurtured, not cleaved.

What is the return for big-data spending? Outside the intelligence community, I challenge humanities researchers to go a day without a search engine. The intelligence community’s record is just as clear. Intelligence, surveillance, reconnaissance, targeting and warning are better because of big data; data-enabled machine translation of foreign sources opens the world; correlation of anomalies amid large-scale financial data pinpoints otherwise unseen hands behind global events. In retrospect, the Iraq weapons of mass destruction conclusion was a result of remarkably-small-data manipulation.

Humans will never lose their edge in analyses requiring creativity, smart hunches and understanding of unique individuals or groups. If that is all we need to understand the 21st century, then put down your smartphone. But as long as humans learn by observation, and by counting or categorizing those observations, I say crank the machines for all their robotic worth.

Lewis Shepherd is the director and general manager of the Microsoft Institute. For another perspective on this question, see "Another Overhyped Fad" by Mark M. Lowenthal.


Share Your Thoughts:

I like many of the points you and the authors of the research paper make, Lewis. Just augmenting the argument with some additional thoughts and observations.

Mark Lowenthal focuses on knowledge creation and what little most can mine from Big Data using the equivalent of picks and shovels most commonly used today. But high-volume mining and ore processing via machine learning is becoming possible. A handful of companies are figuring out how to do it.

Are we underspending on more immediately useful kinds of intelligence collection and analysis? That's not a new argument; the same argument has existed since the first Keyhole satellite. Of course we're underspending on HUMINT and on the manpower, time and effort it takes to create the Big Picture. No argument there. And we're "overspending" (at least from a short-term POV) on what's new and shiny. Over the long term, the "overspent" investment holds promise.

The history of PHOTINT provides a model for what's emerging in Big Data. PHOTINT has paid massive dividends ever since the Eisenhower administration and the Cuban missile crisis, but it took decades of investment to bring that set of technologies to fruition. And now, large-scale analytics (if done right) does provide a valid new window on the world.

Big Data, as you point out, also implies new data management means. PHOTINT introduced the data handling problem that's become a crisis today for those who are collecting massive amounts of video, not to mention all the other less structured data. The data handling problem first evident with PHOTINT has just become more critical with all the additional new data sources and scalable digital collection means.

The techniques behind Big Data are one way to tackle that data handling problem. Yes, those are most associated with knowledge discovery using distributed computing and scalable nonrelational data stores and analytics methods, but they're also about scalable data handling on the cheap using commodity clusters. Hadoop is basically a cheap, scalable data storage medium. (FYI--We did a full analysis of the technology behind those methods back in 2010 here if you want a layperson-accessible primer and some thoughts on the business implications: http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml . I don't think the authors listed here ponder how "cloud" data management differs from what's gone before.)

Analytics itself, interestingly enough, also helps with scaling data management and integration. If you look at recent "smart machine" team claims (such as http://www.irishtimes.com/business/sectors/technology/finding-the-hidden...), these are about machine-assisted mining of patterns that hadn't been visible before. You can persist, categorize and reuse those discovered patterns. Machine-assisted ontology building is another way to think about it.

That reusability will provide a new lens into what had been largely inscrutable because it hadn't been properly linked. How many times have you discovered people, places and things out on the web, in social media, via a search engine, or through a personalized set of news feeds? Social provides a layer of human-assisted integration. Entity and relationship extraction at scale with the help of machine learning will create another level of visibility by helping us improve the integration of large datasets (by accelerating the development of the semantic web, something else we wrote about a while back at http://www.pwc.com/us/en/technology-forecast/spring2009/index.jhtml).

Boyd and Crawford ponder the data oligarchs and the data wealth gap that currently exists. That problem will worsen. Most people don't even realize yet that they need to stake a claim to their own personal data at a minimum. What many companies (and some governments) are doing amounts to claim jumping. During the 2010s, we're in for another gold rush of data miners staking claims, and only a fraction of the mining companies will have the knowledge, the robotic and financial wherewithal, and the motivation to profit.

We're in for another Gilded Age.

Alan, excellent points. As I read down your comment, after each paragraph I placed a mental "checkmark" of agreement. I particularly appreciate the pointers to the other PWC work and will check it out, especially the semantic-web piece, a particular interest of mine as you know and an area of research with great potential to span the divide between the technologist and the wetware analyst seeking sense out of volume :)

Thanks for this, Lewis. After reading through Mr. Lowenthal's article, your writing clearly makes the stronger case. The three linked tasks you lay out that comprise big data allow the reader to better define what we're talking about here, rather than relying on Mr. Lowenthal's ill-informed assumptions that lead him to label big data as an "overhyped fad." I'm thinking his view is reflective of many others in the "old school" who fear that the rise of using technology to understand problems will cause people to disengage their brains.

These old schoolers believe in gaining in-depth knowledge and expertise on problems through first-hand experience and in-depth study. This certainly brings value to understanding a problem and educating policy makers. And sometimes those proficient in the "new school" become so wowed by big data that they neglect the importance these other perspectives bring to understanding an issue. But the old schoolers risk undermining the value of their own perspective by writing off big data methods, as Mr. Lowenthal does.

Given that, we cannot sacrifice old school intelligence for the new school, or vice versa, though Mr. Lowenthal's article appears to aim for this. We need to continue to strive for balance, and your article effectively highlights this point.

Jesse, I appreciate your comments. You're an astute reader, and I think you have each of us pegged :)

First, thanks to Bob @ analystone.com for the heads up & links on this worthy endeavor. I'm right in the middle of this topic, and have been for most of my career, so thought I would share a few brief thoughts:

Agree that big data is over-hyped. My perspective is surprisingly close actually to Gartner's hype cycle on this one--as well as the forecasts & views relative to combining BD & HUMINT, which like authors we've been working on for a long time--almost 2 decades for us, 3 if counting everything that should be. Our perspective is from a truly independent effort, self-funded to growth phase--well aligned with org mission & customers, not to be confused necessarily with guilds, agencies, units or even mgmt teams--important distinction these days especially given LT fiscal trajectories around the world, and the related influence of service sector on same, which increasingly seem to have unhealthy influence on sustainable tech & economics.

Regarding definition of BD--another broken record usually found in close proximity to over-hyped tech or methods/services--avoidance of definition & therefore accountability. As with the case of others, the science does have great potential, is quite real, but doesn't exist in some black hole--little inconvenient realities are ever-present like economics, rule of law, governance, conflicts, human/org, & physics.

However, in the case of big data a clean science based definition is doable provided that it's limited to scale. Google provides a case even though their work also includes other algo's. We had a good presentation from a Google engineer/VC on this specific point at the SFI symposium a couple of weeks ago where he shared some actual examples that had been tested out--advantage they have most others don't. For some specific use cases with highly specific algo's, relative to quite imperfect info typical of the web, higher scale can/does equate to higher accuracy.

If we were to limit the definition to 'big' data, with primarily one V rather than the many--perhaps Volume + Velocity for most purposes, the science is pretty clear--there are highly specific functions within data physics that are limited to scale which cannot be achieved otherwise, esp conjoined with velocity within time window needed. With some HPC queries still reported to take up to a year--although likely in part for other reasons like data quality/skill, this is not a trivial issue.

However, and this is a very important point, even with vastly improved algo's, scale alone is limited for most use cases/missions, and at least in the data I have consumed, big alone isn't nearly sufficient for most critical purposes. During roughly the past 15 years of R&D for example, we've seen BD algo efficiency improve radically from 30ish% efficiency to 70ish%, while highly structured, high quality data returns are almost perfect--99%+, which for some purposes is still not good enough.

The evidence is clear -- while BD should be leveraged, the priority for most critical operations & decisions should be on smart data at the confluence of humans & machines. By smart data we mean continuously adaptive, & tailored to the specific needs of each entity within org parameters (governance, regulatory, policy, mission, time).

Thanks for the discussion.

Mark Montgomery
Founder & CEO
