Dickens, Austen and Twain, Through a Digital Lens
ANY list of the leading novelists of the 19th century, writing in English, would almost surely include Charles Dickens, Thomas Hardy, Herman Melville, Nathaniel Hawthorne and Mark Twain.
But they do not appear at the top of a list of the most influential writers of their time. Instead, a recent study has found, Jane Austen, author of “Pride and Prejudice,” and Sir Walter Scott, the creator of “Ivanhoe,” had the greatest effect on other authors, in terms of writing style and themes.
These two were “the literary equivalent of Homo erectus, or, if you prefer, Adam and Eve,” Matthew L. Jockers wrote in research published last year. He based his conclusion on an analysis of 3,592 works published from 1780 to 1900. It was a lot of digging, and a computer did it.
The study, which involved statistical parsing and aggregation of thousands of novels, made other striking observations. For example, Austen’s works cluster tightly together in style and theme, while those of George Eliot (a k a Mary Ann Evans) range more broadly, and more closely resemble the patterns of male writers. Using similar criteria, Harriet Beecher Stowe was 20 years ahead of her time, said Mr. Jockers, whose research will soon be published in a book, “Macroanalysis: Digital Methods and Literary History” (University of Illinois Press).
These findings are hardly the last word. At this stage, this kind of digital analysis is mostly an intriguing sign that Big Data technology is steadily pushing beyond the Internet industry and scientific research into seemingly foreign fields like the social sciences and the humanities. The new tools of discovery provide a fresh look at culture, much as the microscope gave us a closer look at the subtleties of life and the telescope opened the way to faraway galaxies.
“Traditionally, literary history was done by studying a relative handful of texts,” says Mr. Jockers, an assistant professor of English and a researcher at the Center for Digital Research in the Humanities at the University of Nebraska. “What this technology does is let you see the big picture, the context in which a writer worked, on a scale we’ve never seen before.”
Mr. Jockers, 48, personifies the digital advance in the humanities. He received a Ph.D. in English literature from Southern Illinois University, but was also fascinated by computing and became a self-taught programmer. Before he moved to the University of Nebraska last year, he spent more than a decade at Stanford, where he was a founder of the Stanford Literary Lab, which is dedicated to the digital exploration of books.
Today, Mr. Jockers describes the tools of his trade in terms familiar to an Internet software engineer: algorithms that use machine learning and network analysis techniques. His mathematical models are tailored to identify word patterns and thematic elements in written text. The number and strength of links among novels determine influence, much the way Google ranks Web sites.
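The Google comparison can be made concrete with a small sketch. What follows is not Mr. Jockers's model, whose details are not given here. It is a generic PageRank-style ranking over a weighted graph, in which an edge from one book to another stands for a measured resemblance. The function name `rank_by_links`, the placeholder book labels and the edge weights are all invented for illustration.

```python
def rank_by_links(links, damping=0.85, iterations=50):
    """PageRank-style scores for a weighted, directed graph.

    `links` maps (source, target) -> positive weight. Here an edge from a
    later book to an earlier one means "resembles", so score flows toward
    the books that others echo. This is an illustrative assumption, not
    the study's actual method.
    """
    nodes = sorted({name for edge in links for name in edge})
    out_weight = {name: 0.0 for name in nodes}
    for (src, _dst), weight in links.items():
        out_weight[src] += weight

    n = len(nodes)
    score = {name: 1.0 / n for name in nodes}
    for _ in range(iterations):
        # Books with no outgoing links spread their score evenly.
        dangling = sum(score[name] for name in nodes if out_weight[name] == 0)
        new = {name: (1 - damping) / n + damping * dangling / n for name in nodes}
        for (src, dst), weight in links.items():
            new[dst] += damping * score[src] * weight / out_weight[src]
        score = new
    return score


# Hypothetical similarity links; labels and weights are made up.
links = {
    ("later_1", "early"): 0.9,
    ("later_2", "early"): 0.8,
    ("later_2", "later_1"): 0.3,
    ("later_3", "later_1"): 0.5,
}
scores = rank_by_links(links)
most_influential = max(scores, key=scores.get)
print(most_influential, round(scores[most_influential], 3))
```

In this toy graph the book that the most later works link to, and link to most strongly, ends up with the highest score, which is the sense in which links "determine influence."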
It is this ability to collect, measure and analyze data for meaningful insights that is the promise of Big Data technology. In the humanities and social sciences, the flood of new data comes from many sources including books scanned into digital form, Web sites, blog posts and social network communications.
Data-centric specialties are growing fast, giving rise to a new vocabulary. In political science, this quantitative analysis is called political methodology. In history, there is cliometrics, which applies econometrics to history. In literature, stylometry is the study of an author’s writing style, and these days it leans heavily on computing and statistical analysis. Culturomics is the umbrella term used to describe rigorous quantitative inquiries in the social sciences and humanities.
“Some call it computer science and some call it statistics, but the essence is that these algorithmic methods are increasingly part of every discipline now,” says Gary King, director of the Institute for Quantitative Social Science at Harvard.
Cultural data analysts often adapt biological analogies to describe their work. Mr. Jockers, for example, called his research presentation “Computing and Visualizing the 19th-Century Literary Genome.”
Such biological metaphors seem apt, because much of the research is a quantitative examination of words. Just as genes are the fundamental building blocks of biology, words are the raw material of ideas.
“What is critical and distinctive to human evolution is ideas, and how they evolve,” says Jean-Baptiste Michel, a postdoctoral fellow at Harvard.
Mr. Michel and another researcher, Erez Lieberman Aiden, led a project to mine the virtual book depository known as Google Books and to track the use of words over time, compare related words and even graph them.
Google cooperated and built the software for making graphs open to the public. The initial version of Google’s cultural exploration site began at the end of 2010, based on more than five million books, dating from 1500. By now, Google has scanned 20 million books, and the site is used 50 times a minute. For example, type in “women” in comparison to “men,” and you see that for centuries the number of references to men dwarfed those for women. The crossover came in 1985, with women ahead ever since.
In work published in Science magazine in 2011, Mr. Michel and the research team tapped the Google Books data to find how quickly the past fades from books. For instance, references to “1880,” which peaked in that year, fell to half by 1912, a lag of 32 years. By contrast, “1973” declined to half its peak by 1983, only 10 years later. “We are forgetting our past faster with each passing year,” the authors wrote.
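A comparison like this one comes down to two steps: divide each word's yearly count by the total number of words printed that year, then find the year from which one curve stays above the other. The sketch below shows that idea only. It does not use Google's software or data, the helper names `relative_frequency` and `first_lasting_crossover` are hypothetical, and the counts are invented toy numbers with years labeled 1 to 5.

```python
def relative_frequency(word_counts, total_counts):
    """Share of all words in each year that are the given word."""
    return {
        year: word_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total
    }


def first_lasting_crossover(series_a, series_b):
    """First year from which series_a stays above series_b, else None."""
    crossover = None
    for year in sorted(set(series_a) & set(series_b)):
        if series_a[year] > series_b[year]:
            if crossover is None:
                crossover = year
        else:
            # A brief lead that is later lost does not count.
            crossover = None
    return crossover


# Invented toy counts; "years" 1..5 are labels, not real dates.
totals = {1: 1000, 2: 1000, 3: 2000, 4: 2000, 5: 2000}
word_a = {1: 10, 2: 30, 3: 40, 4: 90, 5: 100}
word_b = {1: 40, 2: 20, 3: 60, 4: 80, 5: 90}

freq_a = relative_frequency(word_a, totals)
freq_b = relative_frequency(word_b, totals)
crossover_year = first_lasting_crossover(freq_a, freq_b)
print(crossover_year)
```

Normalizing by the yearly total matters because far more books survive from recent decades than from earlier centuries, so raw counts alone would rise for almost any word.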
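The lag quoted here is simple arithmetic: 1912 minus 1880 is 32 years, and 1983 minus 1973 is 10. Given a year-by-year frequency series, the measurement can be sketched as below. It is a minimal illustration under stated assumptions, not the research team's code. The function name `years_to_half_peak` is hypothetical and the series is made up.

```python
def years_to_half_peak(series):
    """Return (peak_year, lag), where lag is the number of years after the
    peak until the value first drops to half the peak or lower.
    lag is None if that never happens within the series."""
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return peak_year, year - peak_year
    return peak_year, None


# Invented toy series; "years" 1..6 are labels, not real dates.
mentions = {1: 2.0, 2: 8.0, 3: 6.0, 4: 5.0, 5: 3.9, 6: 3.0}
peak_year, lag = years_to_half_peak(mentions)
print(peak_year, lag)

# The article's own figures, restated as subtraction.
lag_1880 = 1912 - 1880
lag_1973 = 1983 - 1973
```

A shorter lag for a later year is what the authors describe as the past fading faster.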
JON KLEINBERG, a computer scientist at Cornell, and a group of researchers approached collective memory from a very different perspective.
A version of this article appeared in print on January 27, 2013, on page BU3 of the New York edition with the headline: Dickens, Austen and Twain, Viewed in a Digital Lens.