A person walks into a brewpub with no idea of what they want to drink. What’s the quickest possible way for a bartender to figure out what beer they’ll like?
University of Wisconsin-Madison computer engineer Robert Nowak and one of his former graduate students, Kevin Jamieson, began pondering that question four years ago as a thought experiment. To figure out the answer, they turned to what they knew best: digital tools and processes from the corner of data science fashionably known as “big data.”
What they created was “beer space,” a veritable universe of beers that spans eight dimensions. That may sound like an alcoholic science-fiction fever dream, but download the iPhone app BeerMapper that Nowak and Jamieson created, and that dream comes to life.
The app displays a simplified version of beer space in (a mere) two dimensions that users can interact with. Thousands of datapoints, representing various beer varieties, pepper the screen. They create what looks like a star system of beers, mapped out based on qualities like hoppiness, color and texture. Many of the beers within similar families — stouts, IPAs or doppelbocks — end up clustered together based on their shared traits.
Essentially, it’s a map — just one that explores flavor rather than geographic coordinates.
“It’s a little bit kind of like a Google Earth for beer,” said Nowak.
To create the stupefying array, the two researchers mined a database of reviews left on RateBeer, a website where beer enthusiasts write about brews they’ve recently sampled. By using sophisticated analytical tools, they came up with groupings of words that most readily described different kinds of beers, and after some refining, used those word-clusters to map out the relationships between varieties of beer with a set of multi-dimensional coordinates.
The two then created an algorithm that would enable a computer to navigate through beer space to predict the kinds of beer a user would like. BeerMapper prompts the drinker to sample two different kinds of beers off of a list, and choose which one she likes more. Based on a few of those tests, the software can map out which beers the person will likely take to.
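The details of Nowak and Jamieson’s actual algorithm aren’t spelled out here, but the core idea — locating a drinker’s taste from a handful of “which of these two?” answers — can be sketched with a toy pairwise-comparison learner. Everything below (the four beers, the two flavor axes, the perceptron-style update) is a made-up illustration, not the BeerMapper algorithm itself:

```python
import numpy as np

# Toy "beer space": each beer is a point in a 2-D flavor space
# (the axes might be hoppiness and maltiness -- hypothetical features).
beers = {
    "ipa":    np.array([0.9, 0.2]),
    "stout":  np.array([0.1, 0.9]),
    "lager":  np.array([0.3, 0.3]),
    "porter": np.array([0.2, 0.8]),
}

def learn_preference(comparisons, n_epochs=50, lr=0.1):
    """Perceptron-style estimate of a taste vector w from pairwise
    answers: (winner, loser) means the drinker implies w . winner > w . loser."""
    w = np.zeros(2)
    for _ in range(n_epochs):
        for winner, loser in comparisons:
            diff = beers[winner] - beers[loser]
            if w @ diff <= 0:        # misranked pair -> nudge w toward the winner
                w += lr * diff
    return w

# The drinker preferred the stout over the IPA, and the porter over the lager.
w = learn_preference([("stout", "ipa"), ("porter", "lager")])

# Rank every beer by predicted preference.
ranking = sorted(beers, key=lambda b: w @ beers[b], reverse=True)
print(ranking)  # malty, dark beers rank first
```

Each answered comparison constrains the taste vector a little more, which is why a few well-chosen taste tests are enough to rank an entire beer list.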
Think of “big data,” and a small, quirky project like BeerMapper likely isn’t what comes to mind. From the U.S. government’s use of phone records to surveil potential domestic terrorists, to the growing ways that companies like Google use troves of internet user data to refine their interfaces and deploy targeted advertising, other examples loom large when it comes to the weird, nebulous, and buzzword-y phenomenon.
But BeerMapper is still a pertinent example of big data at work in the Madison context. Even as it has taken off globally, the big data phenomenon has also begun to take hold in and around the city, from institutions like state and city government to the health care industry.
Madison, of course, is not Silicon Valley — it is a moderately sized city in the Midwest with a tech sector that’s relatively young and still very much growing.
Still, it is a place where people are doing interesting and innovative things in the realm of big data, in ways that are even more consequential than figuring out what beer is best to drink.
What is big data?
The idea of “big data” has loomed large within computer science and the technology sector for decades. More recently, it’s begun trending publicly. Yet it’s difficult to say what big data exactly is in the first place. Even those who work with it or study it seem to have different ideas about what it means.
“It has a very fuzzy definition,” said Jignesh Patel, a professor at UW-Madison who specializes in the subfields of computer science associated with big data.
At big data’s core is the idea that in the global scheme of things, humans are now collecting data in very, very large quantities. IBM estimates that collectively, humans are amassing new data of various kinds — from mouse clicks on websites to entries in police records — at a rate of 2.5 quintillion bytes per day. That rapid clip means that 90 percent of all the data that exists today has been collected in the last two years alone.
On top of that, those vast quantities of data have become easier to store and process. Technologies like disk drives and processors have become cheaper and better. Distributed methods of storage, in which information is broken up into chunks across multiple devices, have made storing big files easier. On-site storage has become less necessary, given the advent of “the cloud.” As a result, it’s now increasingly common for companies and institutions to passively collect and file away data.
But, according to Patel, there’s more to big data than there being, well, big quantities of data.
“There’s this notion that it’s really all about these really large amounts of data — but it’s not all about that,” he said. “It’s the ability to derive big actionable value from data you can collect.”
In other words, said Patel, it’s the tools and trends that have developed alongside the increasing clip of data collection that truly define big data — things like the innovative algorithms and machine learning behind Nowak and Jamieson’s BeerMapper app, for example. On top of that, there are technical tricks and resources that have made working with big datasets more manageable, like distributed processing — the use of multiple computer processors at once to take on resource-intensive tasks — and the spread of relatively accessible big data software.
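For a flavor of what distributed processing means in practice, here is a minimal, hypothetical sketch in Python: a large dataset is split into chunks that are farmed out to several worker processes, each crunching its piece independently. Real big data systems coordinate many machines rather than processes on one machine, but the divide-and-combine pattern is the same:

```python
from concurrent.futures import ProcessPoolExecutor

# A toy stand-in for a big dataset.
data = list(range(1_000_000))

def chunk_sum(chunk):
    """The per-worker task: here, just summing one slice of the data."""
    return sum(chunk)

def parallel_sum(data, n_workers=4):
    """Split the data into chunks and process them in parallel workers."""
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers - 1)]
    chunks.append(data[(n_workers - 1) * size:])  # remainder goes to the last worker
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # Combine the partial results from every worker.
        return sum(pool.map(chunk_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(data) == sum(data))  # prints True: same answer, split across processes
```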
According to some of the people who work in the realm of big data in Madison, it’s the use of those big resources to answer questions — not just the amount of data itself — that’s at the crux of what big data is all about.
“It doesn’t matter if you collect a miniscule amount of data, or a treasure trove of data,” said Greg Tracy, a Madison-area engineer and entrepreneur. “I think it’s critically important that you’re focused on an outcome that you’re after.”
A breath of relief
Tracy said that for him, that desired outcome is simple: Make the lives of people who live with asthma or chronic obstructive pulmonary disease a little bit better, one data point at a time.
Tracy is the co-founder and chief technology officer of Propeller Health, a burgeoning health care technology company located in a small brick office building on West Main Street. There, a team of about 50 employees provides clients with data-driven reports about their respiratory problems, which they do by analyzing data collected from sensors attached to patients’ inhalers.
When patients use their inhaler, the Propeller Health-engineered sensors collect data about when the attack took place. That gets transmitted to a phone or Bluetooth device, which collects location data, and then sends the information back to Propeller’s databases. The company’s engineers then merge that data with weather and air quality data, and after analyzing the resulting cocktail of information, they’re able to report trends back to the patient.
“We’re doing this data collection to try to unearth insights for them, and to help them better understand what might be causing them to flare up when they’re not doing well,” said Tracy.
The company also uses its system to help patients keep abreast of potential attacks.
“We might know that 90 percent of your events are occurring when air quality is bad,” said Tracy. “Once we know that about you, you wake up the next day, and we know the air quality is bad where you are, we can notify you and let you know, ‘Be careful — air quality’s bad. Don’t forget your inhaler.’”
Propeller sends alerts to health care providers, providing doctors with specific information about when and where attacks are happening — information that’s otherwise tricky for practitioners to pin down, given that asthma attacks are episodic by nature.
All things considered, Propeller Health is a striking example of a big data company. Its sensors make it part of the trend known as “the Internet of Things” — the proliferation of “smart objects” ranging from Fitbits to coffeemakers that are able to collect data and transmit it via the internet. Many experts see the spread of IoT — and the subsequent spread of new physical data collection points — as a prominent factor in the increased clip of data collection.
On top of that, Propeller is representative of what could be described as one of the big data niches that Madison occupies: the intersection of big data and medicine. Madison has become something of an enclave of biotechnology and health IT companies.
Epic Systems, the Verona-based health care IT giant, by definition serves as a massive big data platform, hosting millions of electronic health records. On top of that, company representatives say it’s also working on a number of big data initiatives. For example, it’s merging patient data with things like socioeconomic information and insurance claims data to generate a more holistic understanding of health care in action, and to better assess and predict health risks for patients.
There are also smaller health IT companies like Redox, which is working in an area that many trade experts say is the biggest issue in the realm of big data and medicine: interoperability. Essentially, the company is creating an interface for accessing electronic health record data from different record-keeping systems — data that would otherwise be incompatible.
Besides health tech companies, examples of companies both big and small utilizing big data abound. The startup Export Abroad analyzes data to help clients that want to sell products or services overseas identify markets that could be lucrative. American Family Insurance, one of the biggest employers in the Dane County area, created a research team to work on big data initiatives.
Then there are companies that have established their identity through work with big data. Take MIOsoft, which has quietly become an industry leader within the field of data quality.
Data sets that companies and institutions want to work with are often messy, disorganized, or seemingly incompatible. MIOsoft has made a business out of cleaning and organizing those data sets into something that’s friendly and useable, to make it possible for clients to parse signal from noise.
Maurice Cheeks is a city alderman who also serves as MIOsoft’s vice president of business development. He said that in a world that hinges on data, there’s a growing demand for MIOsoft’s services.
“In today’s age, people are recognizing that data are an asset,” said Cheeks. “In the same way that people, property, inventory are all assets to a company, the data that they hold about their customers, about their suppliers, about their energy usage — that’s an asset to their company.”
Then there’s the work in big data that exists outside of the private sector.
The Center for Predictive Computational Phenotyping, housed within UW-Madison’s Medical Sciences Center, was created in 2015 as part of a National Institutes of Health initiative to improve medicine through the use of big data. Currently, a group of about 40 biostatisticians and researchers are working within the center to take large data sets and information on raw, observable traits — in other words, phenotypes — and use them to better understand and predict health outcomes.
Much of the center’s activity is happening in health clinics across south-central Wisconsin. One project involves merging data on things like demographics, health imaging and genetic information to figure out a new standard for breast cancer screening.
“As you look at more and more data sources, you get a better idea of who’s more likely to get breast cancer,” said Mark Craven, the director of the center.
The center has also been working with the Marshfield Clinic to use electronic health records to predict disease. That’s where Robert Nowak — the same Robert Nowak who co-created BeerMapper — comes in.
Nowak is seen as something of an innovator in machine learning, the growing computer science subfield that’s recognized as one of the core elements of big data analysis. The field centers on teaching computers to learn on their own, enabling them to independently improve their algorithms as they parse complex sets of data.
Nowak has developed a platform called NEXT that refines the science of active learning — in other words, when humans give machines input to help them improve their performance.
“Much like people in a classroom, computers learn best when they’re given active feedback,” said Nowak.
When it comes to the Marshfield Clinic project, Nowak has used his active learning systems to improve the models that are being used to diagnose disease. Teams of doctors will sit down, pore over an individual health record, and then based on the data declare whether or not that person has, or will likely develop, a certain condition — say, for example, cataracts.
Nowak’s computerized algorithm then makes predictions of its own based on what it learns from the human doctors’ diagnoses. As it goes, it recommends new records that are tricky to diagnose for doctors to look at so that it can improve its algorithm.
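The Marshfield system itself isn’t described in enough detail here to reproduce, but the loop Nowak describes — the model makes predictions, asks the human experts about the cases it finds hardest, and retrains — is classic active learning via uncertainty sampling. Below is a minimal sketch under stated assumptions: a synthetic one-feature “record” stands in for real EHR data, and the true label stands in for the doctors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "health records": one feature per record, with the true
# diagnosis positive whenever the feature exceeds zero. A stand-in
# for real EHR data, which is vastly richer.
X = rng.uniform(-1, 1, size=(200, 1))
y = (X[:, 0] > 0).astype(int)

def fit_logistic(Xl, yl, steps=500, lr=0.5):
    """Plain gradient-descent logistic regression (one weight + bias)."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(w * Xl[:, 0] + b)))
        grad = p - yl
        w -= lr * np.mean(grad * Xl[:, 0])
        b -= lr * np.mean(grad)
    return w, b

# Start with just two labeled records, one of each diagnosis.
labeled = [int(np.argmin(X[:, 0])), int(np.argmax(X[:, 0]))]

for _ in range(10):  # ten rounds of "ask the doctor"
    w, b = fit_logistic(X[labeled], y[labeled])
    p = 1 / (1 + np.exp(-(w * X[:, 0] + b)))
    uncertainty = np.abs(p - 0.5)          # 0 means the model is on the fence
    uncertainty[labeled] = np.inf          # don't re-ask answered records
    labeled.append(int(np.argmin(uncertainty)))  # query the hardest case

# Retrain on everything labeled so far and measure overall accuracy.
w, b = fit_logistic(X[labeled], y[labeled])
p = 1 / (1 + np.exp(-(w * X[:, 0] + b)))
accuracy = float(np.mean((p > 0.5) == y))
print(f"accuracy after only {len(labeled)} labels: {accuracy:.2f}")
```

Each query is spent on the record the current model finds hardest — the same economy Nowak describes, with scarce expert time going where it teaches the model the most.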
The results so far from the Marshfield Clinic EHR project: a model that can make diagnostic predictions for about 3,500 diseases with a failure rate of about 10 percent.
Soon, said Craven, the plan is to deploy tools like the EHR prediction system in actual clinics around the area.
“This is certainly part of our vision — developing new methodologies, evaluating them with real data, and then trying to translate things into actual practice,” he said.
Already, a “decision support tool” that utilizes the center’s advances in breast cancer screening has been deployed for a trial run at a number of area clinics.
It should perhaps come as no surprise that the NIH selected UW-Madison to host such a center. The university has played a considerable role in the big data revolution.
In the 1970s, the work of early computer scientists like David DeWitt helped advance understanding of database systems and what’s known as “parallel processing,” laying a foundation for the big data science that’s possible today. The computer science department at the university is seen as a major reason why Microsoft and Google have established satellite campuses in the city.
There are now all kinds of research projects across disciplines at the UW that increasingly rely on big data. One agronomist has been advancing what we know about corn phenotypes — genetically expressed traits of corn, like kernel size or hardiness — using massive data sets on corn growth collected using time-lapse imaging. Then there’s the IceCube, an enormous observatory at the South Pole that collects data in its hunt for cosmic neutrinos.
From prescriptions to policing
Outside of the university, big data is also being put to use in the public sector. State government institutions are taking on problems with big data: for example, the state Department of Safety and Professional Services recently hired MIOsoft to help overhaul the state’s prescription drug monitoring network.
The current network launched in 2013 to help doctors, pharmacists and law enforcement monitor opiate prescriptions to prevent drug abuse in the state. Currently, it spits out raw information upon request. However, the DSPS said that its planned big data-ified network will feature in-depth analysis, data visualization, alerts about overdoses from law enforcement agencies and predictive models showing who may be at risk of abusing the drugs they’re being prescribed.
In city government, the Madison Police Department has embraced big data science as well. It uses data to identify trends across the city, like when and where certain kinds of crime tend to take place, in an effort to more effectively utilize police resources.
“We’re looking at a certain area where cars are being broken into, where there are burglaries, and we can say, ‘This is a place where we can centralize our resources,’” said Police Captain Jim Wheeler.
But while examples of innovative and interesting applications of big data in Madison abound, there are some issues that people working in the field have to grapple with.
Craven said that certainly in the context of medicine, security and privacy are a huge concern. Even though his center is subject to strict HIPAA regulations that require a high degree of security, there are still concerns about the implications of pooling together personal information — even information that’s been anonymized.
While “there are no easy answers,” Craven said, there’s a value to giving up personal medical information.
“There are tradeoffs between privacy and utility. We can perhaps discover more and do better medicine if we didn’t have to worry about privacy. But of course, privacy issues are valid and important,” he said.
The application of data in law enforcement is also at times a controversial one. One analyst with the department, Tom Scholten, said there are things to be mindful of when it comes to policing and data science.
“There’s that fine balance with, ‘We're going to send police resources there’ — but then maybe we’re targeting a community,” he said.
There are other big data challenges that are Madison-specific. In some ways, the city has failed to capitalize on the “big data revolution”: Its open data and civic hacking initiative, which in some cases relied on methods and tools employed by big data analysts, has floundered, despite a string of initiatives to build a stronger relationship between hackers and municipal government.
That said, there’s also optimism about where big data is heading in Madison.
“I think Madison is very well poised to become a leader in big data,” said Nowak. “You hear about the Bay Area and Boston and New York a lot. But we’re not tackling the same kinds of big data problems that Google and Facebook are, either. I think we have our niches — our areas of expertise.”
Certainly, in the beer-mapping department, the city has nothing to worry about.