Theory in the Age of Big Data

Ankit Patel & Korok Ray


Theory — or models of the world resting on a few principles expressed primarily, but not exclusively, through mathematics — has never been less fashionable in the sciences. Theory provides researchers and industry leaders alike with insights that can be applied beyond specific datasets, as well as the explanatory power needed to understand why certain patterns emerge. Albert Einstein's theory of general relativity, to take just one example, continues to provide scientists with a deep, universal framework for understanding gravitational phenomena, even as they collect more precise astronomical data.

Following World War II, scientists established much of the foundational theory that undergirded the great advances in physics, economics, biology, chemistry, and computation of the 20th century. But as the digital revolution vastly reduced the cost of computing and data analysis, researchers and industry leaders alike began to emphasize empirical, predictive models that could deliver short-term insights over more generalizable theoretical models that can help us make sense of our complex world.

In the medical field, for instance, researchers are increasingly using machine-learning algorithms to predict patient outcomes based on large datasets. But these predictions often lack the explanatory power that comes from understanding the biological mechanisms underlying these outcomes. Similarly, financial markets now rely heavily on data-driven models that can identify short-term patterns. But these models sometimes fail to account for systemic risks that require deeper theoretical frameworks to recognize.

Data-driven approaches provide researchers and industry leaders with valuable tools, to be sure, but they also tend to steer our attention toward outcomes without fully addressing the principles governing those outcomes. As we increasingly focus on solving immediate problems using data, certain foundational, visionary pursuits have become less prominent. We once dreamed of putting a man on the moon and building flying cars; nowadays, we argue over politics on social media and in the metaverse.

And yet, all is not lost. A few recent examples have shown the enduring power of theory. Bitcoin, for one, has used fundamental insights in cryptography, distributed computing, and economics to create a new industry. Likewise, the modern semiconductor industry blends theories of electrical engineering with the practice of computer-hardware architecture to great effect. We should take inspiration from these examples.

To drive the next wave of innovation, academic researchers and leaders in the technology sector must learn to reason about the world using simple, transparent principles. As a society, we must once again embrace theory in a way that accounts for its relationship to Big Data. Moving forward, the challenge will be to rethink how theory-driven and data-driven approaches can work together, combining data's predictive accuracy with theory's capacity for explanation and long-term insight to tackle the complexity of our world.

HOW DID WE GET HERE?

The Allied victory in World War II reinvigorated the American spirit, making the 1950s the zenith of national ambition for technology and society. A raft of new theories across multiple disciplines emerged alongside the boldness to set lofty goals like landing a man on the moon. Much of modern economics rests on game theory, the achievement of mathematicians who, during the war, sought to formally model strategic interactions between rational agents. Similarly, computer science was born from the fields of philosophy, logic, and mathematics.

The digital revolution eroded theory's standing in the sciences. Suddenly, a wide array of social- and natural-science data were available to researchers on any laptop in the world. The growth of the internet further expanded the availability of data, while huge advances in microprocessor power made large-scale data analysis cheap and easy. As computing power became increasingly accessible, science's objective shifted from theorizing to measurement. The academic community switched from theory to data analysis en masse, moving from trend to trend in 10-to-15-year cycles. The first cycle focused on summary statistics and variance analysis, the second on linear regression. The third prizes machine learning. When problems arose within a discipline, scholars rarely returned to their underlying theories for revision. Instead, they simply fed more data into the computer, hoping that measurement error and omitted variables were to blame.

As the academic profession migrated away from theory, students gravitated toward data work. The increasing ubiquity of computers throughout American society exposed students to computation earlier in life than ever before. By the time they arrived in college and in graduate school, they had already attained basic facility with data manipulation and analysis. Why bother with mathematics when some simple experiments and linear regressions can provide tables of results for swift publication?

On the publication side, it became easy for academic journals to accept papers establishing some small experimental or empirical fact about the world. Because editors and referees judge academic research paper by paper, no one evaluates whether the body of empirical and experimental work as a whole truly advances human knowledge. Data analysis has therefore run amok. Teams of researchers make ever more incremental advances, mine the same core datasets, and ask ever smaller, less meaningful questions. Does rain or sunshine affect the mood of traders picking stocks? Does the size of a CEO's signature on an annual statement prove that he is narcissistic and therefore more likely to overinvest? While these questions may seem absurd, the sad truth is that they come from actual papers published in top social-science research journals.

This preoccupation with increasingly narrow questions of causality comes at a high price: It generally requires the researcher to restrict his domain to behaviors that are easily observable and measurable. Since the large, complex mathematical theories developed after World War II were largely untestable, empirical researchers abandoned them. Where theorists once asked the biggest questions of the day, increasingly narrow empirical research now dominates academic scholarship. Experimental physicists and empirical economists alike mostly cite other data-driven work. The growth of Big Data in concert with machine learning has led us to artificial intelligence, the ultimate black box. No researcher can fully explain what exactly AI is doing under the hood.

One might suppose advances in computation would have led researchers to seek to verify the deep theories developed after the war, but this hasn't happened. In technical terms, many of those complex models are endogenous, with multiple variables determined simultaneously in equilibrium. This makes it challenging for empirical researchers to identify what is happening (to discover, for instance, whether increasing the minimum wage will increase unemployment, as Economics 101 suggests) and thus explains their turn to causality. But causal inference requires precise conditions, and often those conditions hold not across the economy as a whole, but only in a few specific settings. Ditching theory does not entail freedom from methodological problems.

THEORY'S FAILURES

The growth of a narrow empiricism isn't the only problem: The second trend harming theory's status has been the shrinking of the theory community, both within and outside of the academy. The number of theorists in the sciences has fallen, and those who remain often refuse to collaborate with their empirical and experimental colleagues. This tribalism has led theorists to write ever more intricate, self-referential mathematical models with tenuous bases in reality and little hope for empirical validation.

Much theory has become experimentally untestable and therefore unfalsifiable. Large portions of game theory, for instance, resist empirical testing. String theory is perhaps the most extreme example of a self-referential world that can never be fully verified or tested. Speculative reaches of particle physics, multiverse theory, and the theory of the firm in economics can never be confirmed or rejected. Rigor becomes rigor mortis as analysis gets bogged down in highly abstract mathematical definitions, claims, and proofs. Unilluminating symbol manipulation buries intuition. The reader feels like he is being scammed by theorists who hide their work behind layers of abstraction designed to give the impression of complexity and sophistication. Why bother engaging with complex theory, this reader might ask, when journal editors could instead accept the high volume of empirical work knocking at their door?

As the complexity of theory grows, its readership falls, even among theorists. And why not? Research presentations ask the audience to take the work at face value, since it's virtually impossible for outside readers to verify all the computations and analyses. Tribalism leads the theory community to act as a club that bars anyone who doesn't adopt its arcane language and methods from joining.

Making matters worse, theory currently trails technology by a long shot. Too often, mathematicians, physicists, and economists provide ex post facto rationalizations of technologies that have already found success in industry. These theories don't predict anything new, but rather affirm conventional wisdom. What is the value of this?

Thus, two negative trends exacerbate each other: The theory tribe shrinks year by year, losing relevance to reality, while the empirical-data community grows, asking ever smaller questions with little to no conceptual guidance. This leaves both academics and technologists in the dark about what problems to solve and how to approach them. It also fosters a certain randomness in scientific fields, a sense of being blown by the winds of the moment.

This sense has leaked into wider society. For instance, while economists have established sound theories of markets and how they function, technology companies have largely unmoored themselves from those theories. Similarly, computer science rests on a sturdy foundation of algorithms and data structures, yet trillion-dollar technology companies perform simple A/B tests to make their most significant decisions while the theory community obsessively debates computational complexity.

THE PROBLEM WITH IMPOSSIBILITY THEOREMS

Another problem plagues theory: that of misinterpretation. This is often the fault of theorists themselves, who craft and sell theorems in a way that can obscure important details from consumers.

Impossibility theorems — or their close cousins, hardness theorems — furnish a prime example. Misinterpretations of impossibility theorems have done great societal harm, mainly by discouraging research in directions that eventually proved to be feasible and indeed fruitful.

Operations research and computer science, to take two examples, have been enormously successful in the face of impossibility and hardness theorems claiming that certain kinds of problems are, in the worst case, very hard to solve. More engineering-minded researchers and enterprising companies using state-of-the-art algorithms have demonstrated unequivocally that hardness theorems need not squelch research and innovation. Today, organizations like the NFL, FedEx, and Amazon use such algorithms to optimize the logistics of our world. In the field of machine learning, "no-free-lunch" theorems suggest that no single algorithm can perform well on all problems, but this is no reason to give up: We just need to build diverse kinds of inductive biases into models. The machine-learning community has spent two decades doing just that, and has assembled a large, open-source reference library of models.
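
To see why worst-case hardness is not the whole story, consider a toy illustration in Python (ours, not any company's routing system): the traveling-salesman problem is NP-hard in the worst case, yet even the naive greedy heuristic sketched below produces a serviceable route through hundreds of stops in a fraction of a second, and industrial solvers do far better.

    import math
    import random

    def nearest_neighbor_tour(points):
        """Greedy heuristic: from each stop, visit the closest unvisited stop."""
        unvisited = set(range(1, len(points)))
        tour = [0]
        while unvisited:
            here = points[tour[-1]]
            nearest = min(unvisited, key=lambda i: math.dist(here, points[i]))
            tour.append(nearest)
            unvisited.remove(nearest)
        return tour

    def tour_length(points, tour):
        """Total length of the closed route."""
        return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
                   for i in range(len(tour)))

    random.seed(0)
    stops = [(random.random(), random.random()) for _ in range(200)]
    print(round(tour_length(stops, nearest_neighbor_tour(stops)), 2))

The route is not provably optimal, and no theorem is violated; it simply turns out that "hard in the worst case" rarely means "hard in the cases we care about."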

Another example appears in quantum physics and computation. Quantum entanglement (when a group of particles interacts in a way that blurs their independence from one another, even over large distances) is non-local, meaning objects are not necessarily influenced solely by their immediate surroundings. This non-locality does not permit instantaneous communication, yet it sits in tension with Einstein's principle of locality in a subtle but "legal" manner. Modern-day quantum-computing efforts try to leverage this non-locality and other uniquely quantum properties to solve problems that are classically hard but, in quantum terms, "easy."

In mathematics, Kurt Gödel showed that in any consistent axiomatic system rich enough to express arithmetic, there exist propositions that are either true or false but that can be neither proven nor disproven within the system. The power and influence of this result on mathematicians, scientists, and society at large cannot be overstated. It turns out that this lack of decidability hinges on axioms that give us infinities, and on technicalities about circular definitions, best illustrated by the paradox of the barber who cuts the hair of everyone in town who does not cut his own hair, and of no one else. If the barber cuts his own hair, he violates his rule, which permits him to cut only the hair of those who do not cut their own. But if he does not cut his own hair, he becomes one of those whose hair he must cut. Either way, a contradiction follows.
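
For readers who want the technicality spelled out, the paradox fits in one line of notation (ours, not drawn from any formal source): write S(x, y) for "x cuts y's hair." The barber's rule, and the contradiction it forces once we ask about the barber himself, reads:

    \exists\, b \;\forall x \,\bigl( S(b,x) \leftrightarrow \neg S(x,x) \bigr)
    \;\;\Longrightarrow\;\;
    S(b,b) \leftrightarrow \neg S(b,b) \quad \text{(setting } x = b\text{)},

and no assignment of true or false can satisfy the right-hand side, so no such barber can exist once circular definitions of this kind are permitted.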

To dispose of such problems, we can simply adopt another axiomatic system, one that includes an axiom of constructibility (which basically ensures that all sets in the mathematical universe are well defined in terms of already existing, "simpler" sets), thus excluding the barber and other such self-referential definitions from our mathematical universe. In such finitist or constructivist systems, many propositions that are undecidable in the standard framework become decidable.

Similar technicalities underlie the continuum hypothesis, the Banach-Tarski paradox, and other surprising results about the foundations of mathematics. For general scientists, intellectuals, and even laypeople, it's important to know these little "details" when interpreting results about the supposed limitations of mathematics, especially given that almost all "everyday" mathematical propositions of interest involve neither self-referential definitions nor unfathomably large infinities.

In social-choice theory, the Gibbard-Satterthwaite (GS) theorem provides an example where innocent-sounding constraints on a "good" measure of social welfare make it impossible to aggregate social preferences. The GS theorem shows that if a social-choice function with at least three possible outcomes is truthfully implementable, it must be dictatorial. In other words, the social-choice function must grant one person the right to act as a dictator if it hopes to be strategy proof, or immune to strategic manipulation. This landmark result, discovered independently by Allan Gibbard and Mark Satterthwaite, clarified the conditions for a strategy-proof social-choice function. The problem is that the theorem holds social-choice functions to quite a high standard. In practice, today's technology sector routinely uses social-choice functions that are not dictatorial and therefore, under the GS theorem, cannot be strategy proof.

Think, for instance, of Doodle, the popular scheduling app that lets multiple users schedule a meeting using a simplified approval-voting mechanism. Each user selects the meeting times he can make, and the software picks the time that has no conflicts. A user could simply indicate he is only available for his most preferred time, and not for any other slot. By not representing his true preferences, he manipulates the outcome in his favor. The GS theorem tells us that Doodle is not strategy proof because it's not dictatorial; that is, any single user can manipulate the outcome by misrepresenting his preferences. But so what? Doodle is a simple mechanism that's widely used in practice despite its theoretical shortcomings. The academic economics community shied away from an analysis of second-best social-choice mechanisms like Doodle partly because of the strong result of the GS theorem.
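
To make the manipulation concrete, here is a toy version of such a scheduler in Python (our illustration under simplified assumptions, not Doodle's actual algorithm): each user approves the slots he can attend, and the slot with the most approvals wins, with ties broken by list order.

    SLOTS = ["Mon 9am", "Tue 9am", "Wed 9am"]

    def winner(ballots):
        """Return the slot with the most approvals (ties go to the earlier slot)."""
        tally = {slot: sum(slot in ballot for ballot in ballots) for slot in SLOTS}
        return max(SLOTS, key=lambda slot: tally[slot])

    others = [{"Mon 9am", "Wed 9am"},   # user A's availability
              {"Mon 9am", "Tue 9am"},   # user B's availability
              {"Tue 9am"}]              # user D's availability

    # User C can attend Monday or Tuesday but prefers Tuesday.
    print(winner(others + [{"Mon 9am", "Tue 9am"}]))  # truthful C  -> Mon 9am
    print(winner(others + [{"Tue 9am"}]))             # strategic C -> Tue 9am

By understating his availability, user C flips the outcome to his preferred Tuesday slot. This is precisely the manipulability the GS theorem guarantees for any non-dictatorial rule, and precisely the imperfection that millions of users happily tolerate.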

There are many other impossibility theorems that have discouraged valuable research directions for years, sometimes decades: the halting problem in computability theory; the no-free-lunch theorem in machine learning, search, and optimization; Marvin Minsky and Seymour Papert's takedown of the perceptron with the XOR function; John von Neumann's no-hidden-variables theorem in quantum mechanics; and the independence of the continuum hypothesis. What was the common thread? In all these cases, the devil was in the details of the theorem, leading to costly misinterpretations. Years later, when more practical engineering types took over, they ignored these theorems and pursued these research directions anyway. The result of defying the impossible was the creation of productive fields and useful technologies that benefit society.

In the future, scholars should move beyond general impossibility theorems and instead consider how a theory might apply to the real world. This will require academics to develop more realistic constraints and pursue their research within those constraints. It will also require greater transparency — in contrast to opaque impossibility theorems that incur huge opportunity costs.

IS THERE HOPE FOR THEORY?

Theory can succeed under certain conditions, its recent failures notwithstanding. It can illuminate perplexing and mysterious behavior and observations. It can explain a wide variety of empirical facts by means of simple principles, axioms, and abstractions, as has happened with classical physics, statistical mechanics, probabilistic inference, and price theory. It can fuse together a medley of otherwise disparate facts. When done well, theory can be elegant, beautiful, and unifying.

Theory's virtues aren't just aesthetic; they enable the design of powerful new technologies and therapies such as spectroscopy, atomic clocks, the atomic bomb, telescopes, microscopes, lasers, gene sequencing and editing, drug design, auctions, and cryptocurrencies. At its best, theory aims to teach not just future theorists, but scientists generally. Many graduate students who don't pursue careers in research can nevertheless learn key theoretical concepts that will aid them in their professional lives.

When theory plays this beneficial role, it typically requires a forced separation from the academic norms of the time. At the University of Chicago during the second half of the 20th century, the economics department created and promoted price theory. The theory itself, and the empirical tests that accompanied it, were notable for their simplicity. Famous economists like Jacob Viner, Milton Friedman, Gary Becker, George Stigler, and Kevin Murphy broke with tradition in mainstream academic research and followed their scientific principles toward truth. They believed a parsimonious model, like the elemental supply-and-demand graph in economics, could explain the vast majority of economic life, and they then tested their model against data. This combination of theory and data was unique and out of sync with the rest of the profession. While those Chicago economists eventually received the highest awards in their field, they bore years of ridicule for breaking with orthodoxy.

Bitcoin emerged in similar fashion. At the turn of the century, a group of computer scientists, cryptographers, and mathematicians laid the infrastructure for the first decentralized digital currency. This innovation did not emerge out of formal channels: There was no venture-capital financing, no research or government contracts, no traditional measures of academic success. These "cypherpunks" sought a technological solution to the most significant economic object ever (money) based on principles of individual liberty, privacy, and sovereignty. While economists have articulated the benefits of sound money for years, these scientists actually built a currency with a fixed, limited supply. They were not just theoretical cryptographers; they were applied cryptographers who used their understanding of the core theory of public-key cryptography to pursue a solution that could be implemented in the real world.

A SHIFT IN MINDSET

There is no reason why Bitcoin couldn't have emerged out of the academy. It was, after all, circulated through a white paper, similar to research papers that scholars and scientists write. That white paper included references to academic papers and rested on several well-established bodies of knowledge. The problem was not the idea of Bitcoin, but the academy.

Today's academy is organized around discrete, well-defined silos of knowledge that traditionally haven't interacted with one another. This structure simply isn't conducive to creating the next Bitcoin. Future innovation will most likely occur on the boundaries between disciplines, not within silos. At the same time, professors are required to teach students, conduct research, and publish papers, regardless of whether the papers are truly applicable. Their research languishes in niche topics within their own disciplines. Instead, they should be incentivized to work with others across various disciplines.

Contemporary universities still largely pursue knowledge of the natural, physical, and social aspects of our world for its own sake. This is important work, but they shouldn't stop there: Scientists should approach their considerable knowledge about reality with an engineering mindset that seeks to design and implement solutions to our most pressing problems.

In the long term, this shift would help close the gap between the academy and industry. Academic curricula are currently out of step with the needs of the market, a fact that puts pressure on students to search for jobs and start companies at the expense of their course work. Closing this gap through more applied scholarship would allow students to spend their time in college building solutions to real problems.

This transformation has already begun in some disciplines, like economics. One of the most successful applied areas of economics is market design, which has fully adopted an engineering mindset and delivered three Nobel Prizes since 2000. Market-design scholars came from engineering fields and adapted game theory to build better markets: They devised more effective ways to match kidney donors to recipients, students to schools, and medical residents to hospitals. They also designed many of the most widely used auction methods, such as Google's ad auction and the communications-spectrum auctions employed by the government.
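
To give a flavor of that engineering mindset, here is a bare-bones sketch of deferred acceptance, the Gale-Shapley algorithm behind the medical-residency match and many school-choice systems, written in simplified Python (real deployments add quotas, couples, and tie-breaking rules; the names are invented for illustration).

    def deferred_acceptance(applicant_prefs, program_prefs):
        """applicant_prefs: {applicant: [programs, best first]};
           program_prefs:   {program: [applicants, best first]}."""
        rank = {p: {a: i for i, a in enumerate(prefs)}
                for p, prefs in program_prefs.items()}
        held = {}                          # program -> applicant tentatively held
        unmatched = list(applicant_prefs)  # applicants still proposing
        next_pick = {a: 0 for a in applicant_prefs}
        while unmatched:
            a = unmatched.pop()
            p = applicant_prefs[a][next_pick[a]]   # a's best program not yet tried
            next_pick[a] += 1
            current = held.get(p)
            if current is None:
                held[p] = a                        # program holds its first offer
            elif rank[p][a] < rank[p][current]:
                held[p] = a                        # program trades up
                unmatched.append(current)          # displaced applicant tries again
            else:
                unmatched.append(a)                # rejected; propose to next choice
        return {a: p for p, a in held.items()}

    residents = {"Ann": ["City", "State"], "Bob": ["City", "State"]}
    hospitals = {"City": ["Bob", "Ann"], "State": ["Ann", "Bob"]}
    print(deferred_acceptance(residents, hospitals))  # {'Bob': 'City', 'Ann': 'State'}

The resulting assignment is stable: no resident and hospital would both prefer each other to the partners they ended up with.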

There is nothing stopping the rest of the economics profession, or even the rest of the academic community, from adopting an engineering mindset rather than endlessly debating whether, say, climate change exists. Why not proactively propose solutions for our climate problems instead?

Aligning the academy with industry could also assuage public distress about escalating tuition and student debt. Once faculty orient their research toward developing solutions, so too will their students, and, in the long run, the companies that employ them.

If research creates technologies that ultimately benefit students, their future employers, and society at large, students will be less likely to resent their professors for spending time on research rather than teaching. Such a recalibration could naturally close the skills gap that America now faces. Universities will no longer need to focus explicitly on STEM skills, but rather on providing technological solutions that happen to draw heavily from the STEM fields. Professors would work closely with companies and other scholars in different disciplines to solve stated problems.

A CALL TO ACTION

Changing culture takes hard work; changing academic culture is no different. There's no silver bullet here, but we can provide some practical recommendations.

The first step is to acknowledge the problem. We need widespread admission that the academy suffers from severe anti-theory bias and irrelevance. This is evident when we see enormous amounts of empirical and experimental work yielding little impact alongside a small amount of self-referential and irrelevant theoretical work. We need to reunite theory and application. This begins with calling out problems with the current state of academia.

Next, we need more contact between the academy and the marketplace. Universities remain reluctant to encourage faculty engagement with industry. "Consulting" is still a pejorative term, and non-engineering faculty, especially those in the arts and sciences, relish their intellectual purity. Yet more contact with industry will expose faculty to real problems they won't discover on their own, prompting more creative pursuit of novel solutions. Of course, faculty are intellectuals first and foremost, and should distinguish themselves from their industrial counterparts by offering theoretical frameworks that the fast-paced, operational world of the market doesn't produce. But academics can use their strengths in ways that benefit both firms and consumers.

Third, faculty members need to acknowledge the importance of second-best solutions. The theory community often obsesses about the best solutions, such as optimal auctions and theoretical security guarantees. These results are intellectually interesting, but they would never be adopted in practice, making them irrelevant. Second-best solutions, while theoretically suboptimal, are far more applicable to real-world scenarios and can fuel major upgrades in many industries.

Fourth, journals and funding agencies should require (and provide funding for) scientists to develop computationally verifiable proofs using proof assistants such as Lean alongside traditional informal natural-language proofs. This would enable "frictionless reproducibility" (a term coined by Stanford's David Donoho) and distributed social collaboration in mathematics, a possibility that Fields medalist Terence Tao has recently explored with encouraging results.
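
To give a sense of what machine-checked mathematics looks like, here is a minimal sketch in Lean 4 (a toy example of ours, not drawn from Tao's projects): once the file compiles, the statements below have been verified mechanically, and collaborators can build on them without re-checking the details by hand.

    -- Toy statements checked by the Lean kernel rather than by a human referee.
    -- `decide` evaluates the arithmetic; the second proof reuses a library lemma.
    theorem pythagorean_triple : 3 ^ 2 + 4 ^ 2 = 5 ^ 2 := by decide

    theorem add_commutes (m n : Nat) : m + n = n + m := Nat.add_comm m n

Research-level formalizations run to thousands of lines, but the principle is the same: the machine, not social trust, certifies correctness.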

Fifth, in order to improve the transparency and accessibility of theoretical work, researchers must provide accessible and intuitive introductions to their otherwise highly esoteric and abstract work. On YouTube, creators like 3Blue1Brown have developed beautiful visual explanations of mathematical concepts, sparking a revolution in accessible mathematical content. But the skills to create such videos do not emerge on their own. Funding agencies must facilitate the training of mathematical creators whose primary job is to amplify the accessibility and translatability of theory.

We in the scientific community must develop technological solutions that are more robust than theoretical paradigms yet more systematic than today's ad hoc practical innovations. We must also encourage our academic colleagues to aim higher. It's easier to retreat to esoteric journals, pages of equations, and massive data tables than it is to deliver real value for society. Tenure affords a unique opportunity for risk-taking, but too many have used its protection to indulge intellectual sloth. Academics need to realize that running on the treadmill of incremental research papers and government grants will not advance their fields.

Ancient Greek and Roman philosophers mixed theory and practice thousands of years ago, when the universe of knowledge was much smaller than it is today. Maybe it's time for us to go back to that future.

Ankit Patel is an assistant professor in the department of neuroscience at the Baylor College of Medicine, and in the department of electrical and computer engineering at Rice University.

Korok Ray is an associate professor at Texas A&M University's Mays Business School.

