Flowers from the Flood


Can Big Data Save the World by Drowning It?

by Matt Getty

Sight for the blind. Limitless clean energy. A crime-free world. The cure for cancer, diabetes and any other disease you can name. Sound pretty good? What about skies teeming with unmanned drones, implanted RFIDs tracking your every move, 24-7 advertising beamed directly to your optic nerve?

Behold the twin visions of the future promised and threatened by the latest high-tech catchphrase, "big data," a sprawling, information-powered trend touching everything from your most recent Facebook status update to NASA's Jet Propulsion Laboratory.

"Some people think we're watching the planet grow a nervous system," says Rick Smolan '72, whose recent book, The Human Face of Big Data, assembled more than 100 writers and photographers to detail how the ability to capture and analyze vast amounts of information is changing the world. "All of our smartphones, Google searches and credit-card purchases are generating a constant stream of digital exhaust. Now there are ways to measure this exhaust and analyze it in real time, and that's creating some pretty powerful tools."

Many of those tools are already at work. Pick up the newspaper—or rather click on a link generated by an algorithm analyzing your last 1,000 Google searches—and you can read about how Target mines data to advertise to pregnant customers before some of them even know they're expecting. Tap a personalized in-app ad on your iPad, and you can watch a Netflix series crafted from data on what millions of viewers watch and when they fast-forward, pause, rewind or stop watching altogether.

As Smolan quickly discovered, however, the big-data mine runs much deeper than advertising and entertainment. "At the beginning of this project, I thought the whole thing was just about finding a better way to sell people J.Crew sweaters," he says. "But the story is vastly more interesting than that. A lot of what's happening has the power to truly change how we live. And in most cases, there's a lot of upside."

A taste of that upside includes:

  • using spectral analysis of satellite images to find mosquito larvae in Uganda and more efficiently fight malaria;
  • eyeglasses powered by parallel-processing computers reproducing the human retina's complex code to enable the blind to see;
  • a service that helps police triangulate the exact location of a crime in seconds from the sound of a single gunshot;
  • and a project harnessing mountains of genetic information to create human-designed organisms that could replace fossil fuels.

Channeling the flood

So the obvious question: How is this all possible? Though there's a slew of complex math and computer science involved, the short answer is information—lots and lots and lots of information.

In 2010, former Google CEO Eric Schmidt told attendees at the Techonomy conference in Lake Tahoe that humanity had generated five exabytes of data from the dawn of time to 2003 (each exabyte equals one quintillion bytes), whereas now we create that much information every two days. Some have quibbled with Schmidt's numbers, but there's no denying that we're living in the time of the flood.
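To get a feel for that scale, here is a rough back-of-the-envelope sketch. It simply takes Schmidt's quoted figure at face value (a figure that, as noted, some dispute) and converts "five exabytes every two days" into a per-second rate. The variable names are illustrative only.

```python
# One exabyte is one quintillion (10**18) bytes.
EXABYTE = 10**18

# Schmidt's quoted figure, taken at face value: five exabytes every two days.
bytes_per_two_days = 5 * EXABYTE
seconds_in_two_days = 2 * 24 * 60 * 60

# Convert to a per-second rate, expressed in terabytes (10**12 bytes).
bytes_per_second = bytes_per_two_days / seconds_in_two_days
terabytes_per_second = bytes_per_second / 10**12

print(f"{terabytes_per_second:.1f} TB per second")
```

Under that assumption, the flood works out to roughly 29 terabytes of new data every second.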

"We have the ability to collect more data than ever," says Assistant Professor of Computer Science John MacCormick, whose recent book, Nine Algorithms That Changed the Future, explains how big data's complex tools work. "But we can also analyze the data now. You need both for this to be a meaningful trend."

Without that ability to analyze, you get something like Jorge Luis Borges' "The Library of Babel," a fictional ever-growing collection of books holding all of the information in the universe and ultimately rendered useless because there's simply too much to explore. Everything becomes nothing. All the information all the time becomes information overload.

Enter big data, the means to turn that overload into a working load, the means to channel the flood, water the digital jungle and make it flower into something useful.

"I remember when I was in graduate school in the 1970s working as a research assistant in Harvard's Joint Center for Urban Studies, and I was running a regression on 100,000 observations in a study of urban loan applications," says Associate Professor of International Business & Management Stephen Erfle, who introduces students to data-driven decision making in his Managerial Economics course. "It took Harvard's mainframe computer system half of the weekend to run. Now, you could do that in minutes."

Even the building blocks of life, codes so immense and complicated they once looked impenetrable, are becoming increasingly easy to catalog. "When I started at Dickinson, that was the year after the Drosophila melanogaster [fruit fly] genome sequence was published," says Kirsten Guss, John R. & Inge Paul Stafford Chair in Bioinformatics. "Now the genomes of 12 different Drosophila species have been published, as well as honey bees and flower beetles. That this information is so accessible, literally a click away, that's amazing to me."

Pattern power

The secret is pattern recognition, a relatively simple task that most children learn to perform by age 5 but that computers have taken to new heights thanks to advances in processing power. Driving much of big data, pattern-recognition algorithms fuel everything from IBM's Jeopardy!-winning robot, Watson, and your phone's voice-to-text function to DNA sequencing and three-dimensional brain imaging.

"The simplest way to understand pattern recognition is to look at the example of the nearest-neighbor algorithm," says MacCormick. "Take handwriting recognition. You might think that to develop the program you'd sit down and think very carefully about what an 'a' looks like, what a 'b' looks like, and then you'd program details of that into a computer. In fact, that's not how it's done at all."

Instead, he explains, you feed the program hundreds of thousands of samples of handwriting that has already been labeled. (If you've ever had to identify those squiggly CAPTCHA letters to recover your e-mail password, you've contributed to a similar set of labeled samples for computers converting ancient books to text.) Then the algorithm compares each new handwritten character to the labeled handwritten letters and tags the new character with the letter it most resembles—its nearest neighbor.
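The idea MacCormick describes can be sketched in a few lines of Python. This is a toy illustration, not any real handwriting system: the "characters" are made-up 3-by-3 bitmaps, the three labeled samples stand in for the hundreds of thousands a real program would use, and the function name and straight-line (Euclidean) distance are assumptions chosen for simplicity.

```python
from math import dist

# Hypothetical labeled samples: each "character" is a flattened 3x3 bitmap
# (1 = ink, 0 = blank) paired with the letter a person said it shows.
LABELED = [
    ((1, 1, 1, 0, 1, 0, 0, 1, 0), "T"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 1), "L"),
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "O"),
]

def nearest_neighbor(sample, labeled=LABELED):
    """Return the label of the labeled sample closest to `sample`."""
    _, label = min(labeled, key=lambda pair: dist(pair[0], sample))
    return label

# A "T" with the bottom of its stem smudged away.
smudged_t = (1, 1, 1, 0, 1, 0, 0, 0, 0)
print(nearest_neighbor(smudged_t))  # prints "T"
```

Nothing in the code says what a "T" looks like. The program only measures which labeled example the new character most resembles.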

Size matters

The programs can get much more complex, determining the most popular of several nearest neighbors or factoring in the whole word's, phrase's or sentence's nearest neighbor, but as graceful as the algorithm is, the critical component is the sheer size of the sample. "This is only possible with vast amounts of data," says MacCormick. "There's cleverness in how the algorithm is designed, but it's not explicitly coded to recognize each individual letter. It learns that from this huge volume of data."
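"Determining the most popular of several nearest neighbors" can be sketched the same way: instead of trusting the single closest sample, the program polls the k closest and takes a majority vote. The data points, the deliberately mislabeled sample and the function name below are all invented for illustration.

```python
from collections import Counter
from math import dist

def k_nearest_vote(sample, labeled, k=3):
    """Label `sample` by majority vote among its k closest labeled samples."""
    nearest = sorted(labeled, key=lambda pair: dist(pair[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-number descriptions of handwritten letters.
LABELED_POINTS = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
    ((0.3, 0.3), "b"),  # a mislabeled sample sitting among the a's
]

new_point = (0.28, 0.28)
print(k_nearest_vote(new_point, LABELED_POINTS, k=1))  # prints "b"
print(k_nearest_vote(new_point, LABELED_POINTS, k=3))  # prints "a"
```

With a single neighbor, the one bad label wins. With three, the surrounding samples outvote it, which is one reason more data tends to make these programs more reliable.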

The impact reaches far beyond handwriting. That constant stream of bits we all generate—Smolan's "digital exhaust"—is also reshaping age-old notions like the importance of place.

New York's Justice Mapping Center, which is profiled in The Human Face of Big Data, uses geographic information systems (GIS) to visualize crime data in a way that could change how governments evaluate the cost of crime.

Rather than focusing on the cost of incarceration in the prisons themselves or where crimes are committed, the system uses a wealth of information to show the cost for each city block where an incarcerated criminal lives, highlighting the places where educational investments can have the greatest impact.

"GIS has been around for 50 years, so I have to laugh every time somebody calls it a hot new trend," says Dickinson's GIS specialist, James Ciarrocca, who helps students from a variety of courses visualize geographic data. "But the explosion of location-enabled devices has greatly increased our ability to answer questions with GIS."

Companies considering a new store location have long used surveys to take a snapshot of location-based demographics. Now, thanks to nearly universal GPS devices, information on the potential customer base, shopping habits, traffic patterns and more is widely available. Similarly, governments have used GIS for decades to track the spread of contagious diseases. Big data just provides new and surprisingly more effective tracking tools. As Ciarrocca points out, "Google has documented how they can use the locations of people searching for symptoms to track the spread of the flu faster than the CDC."

Bigger reward, bigger risk

Relying so heavily on this flood of information, however, is not without its pitfalls. As Erfle puts it, "You can answer bigger questions, but that means you can make bigger mistakes." That's why he teaches students not only to use data but to question it. As they learned regression analysis this spring, students like Xiang Yao '15—who helped menudrive.com develop an algorithm to make personalized recommendations based on hundreds of thousands of online food orders—also learned the importance of ensuring that results pass the sniff test.

"Before you get to big data, you have to understand data," Erfle explains. "You need to be able to look at the final result, kick the tires and say, 'Does this make sense, or are there some other variables we should consider?' "

Similarly, as amazed as Guss is at the power big data lends to biological research, she still sees value in small data. Aspects of bioinformatics such as missing heritability (traits for which numerous genes play small roles) and mirtrons (which silence gene expression) still demand investigation on an organism-by-organism basis.

"For me, big data is like Walmart," she explains. "Almost everything is there, and you can go into one place and get it all, but if you want to get something specialized, you still have to go to the small mom-and-pop store."

Guss is confident that computers will help unlock more biological mysteries in the coming years, but no matter how big the data get, she doubts we'll ever have all the information. "I remember when the human genome was sequenced, and a friend said, 'So are you done now?' " she recalls. "But the answer, of course, is no. Then, for instance, we were surprised to find that only 1.5 percent of the genome codes for proteins. … Now we had a new question—what is the function of the other 98.5 percent of the genome? Every answer opens another question."

Quantify everything

Yet even outside of the realm of science, big data is answering questions once thought to lie beyond the reach of numbers. Amazon and Netflix harness algorithms to weave together nebulous threads of personal literary and cinematic taste. Companies like Next Big Sound use social-media data to identify bands on the rise long before they hit the charts, essentially calculating buzz—that indefinite mystical property long sought by every record label's A&R department.

"Things previously thought of as unquantifiable are now quantifiable," says Assistant Professor of Sociology Erik Love, whose upcoming research project on Islamophobia will explore the 133,000 gigabytes of Twitter updates the Library of Congress is currently cataloging. "Now you can measure qualitative things like taste, culture, class, race. Social scientists have done this for years, but now you can do it on a much larger scale."

With computer science and cultural analysis colliding, Love thinks big data promises big opportunities for students. "There are a lot of new jobs out there for those who can sift through all these data," he says. "And it's not just about computer programming. It's about asking the right questions. That's exactly what people who have a social-science background at a place like Dickinson are equipped to do. Our students get that interdisciplinary background and learn how to ask big questions. That's becoming more and more valuable."

Who owns our data?

Just as it promises big opportunities, leveraging all of this information also raises some big questions. Concepts like constant surveillance, real-time mapping and online click-tracking can worry the least paranoid among us, as the recent news about the National Security Agency's phone and Internet data-mining program clearly shows. Big data's striking resemblance to Big Brother prompts some to wonder if attitudes toward freely sharing so much information might change.

"I think as this continues to develop, you're going to see more people question how much locational data they're willing to give away," says Ciarrocca. "Right now, I don't think people think a lot about things like how their car's GPS tracks where they're going all the time. As we see more and more of this, we'll need to have conversations about how much privacy we're willing to give up in exchange for convenience."

Then there's the question of who deserves to benefit from all these data. With private companies like Facebook, Amazon and Google owning the biggest information treasure troves, how much are we willing to rely on a simple "Don't be evil" edict to steer big data toward common good? Researchers like Love are thankful they'll soon be able to take advantage of Twitter's generous tweet donation to the Library of Congress, but it's troubling to think of what would happen to this kind of research otherwise.

"I actually think that if Twitter could go back now [they made the donation in 2010], they probably wouldn't make that donation," says Love. "It's just an incredibly valuable dataset, but it's theirs. They own it. They get to decide whether they want to share it. Traditionally, sociologists relied on public information like census reports, but now if we want these richer data sources, we're at the mercy of private companies. Figuring out to what extent Facebook or Amazon has an obligation to share these data or protect them—I think you'll see a growing ethical and legal debate about this in the coming years."

But if we are the ones generating all these data, Smolan wonders, shouldn't we benefit from them? In one extreme case from The Human Face of Big Data, he highlights Hugo Campos' petition to win access to the information his pacemaker relays to doctors.

"They're telling him he can't have it; it's proprietary information, but he's like, 'Wait a minute—that's my heart!' " Smolan explains. "I hope the book can help start this conversation about who owns our data. Even if there are no privacy concerns because it's all aggregated, we still need to ask, 'If someone's going to profit from our data, shouldn't we get a piece of that?' "

The future looks big

Regardless of the answers, we'll need to start asking these questions soon, because big data shows no signs of getting any smaller. With wearable computers like Google Glass already on the market and futurists envisioning a workable computer-brain interface within the next few decades, Schmidt's five exabytes every two days could start to look tiny pretty soon.

"We're exactly where the Internet was in 1993," Smolan says. "This will all look primitive and crude in just a few years. You can stick your head in the sand, but there's no going back. Any tool can be used for good or evil, so what we really need to do is figure out what we want big data to do for us. How do we use it for good? That's the conversation we need to have, and we need to be having it now."

Published July 24, 2013