1
THE BESTSELLER-OMETER, OR, HOW TEXT MINING MIGHT CHANGE PUBLISHING
Back in the spring of 2010, Stieg Larsson’s agent was having a good day. On June 13, The Girl Who Kicked the Hornets’ Nest—third in the series from a previously unknown author—debuted at number one in hardback in the New York Times. You can imagine the lists would have been a pleasing sight over morning coffee. Hornets’ Nest straight in at the top, Dragon Tattoo at number one in two paperback formats, and The Girl Who Played with Fire a roundly satisfying number two. This had been going on for forty-nine weeks in the U.S., and for three solid years in Europe. It would have been hard not to be smug.
The following month Amazon would announce Larsson was the first author ever to sell a million copies on the Kindle, and over the next two years sales in all editions would top seventy-five million. Not bad for an unknown political activist–turned-novelist from a little Scandinavian country, especially one who had chosen a rather uncharming title in Swedish and had written some brutal scenes of rape and torture. Men Who Hate Women—or The Girl with the Dragon Tattoo as it was renamed in English—was the sensation book of the year in more than thirty countries.
The press didn’t understand the success. Major newspapers commissioned opinion pieces on what on earth was going on in the book world. Why this book? Why the frenzy? What was the secret? Who could have known?
Answers were lackluster. Reviewers scratched their heads about it. They found fault with the novel’s structure, style, plotting, and character. They groaned over the translations. They complained about the stupidity of the reading public. But still copies sold as fast as they were printed—whether you were in the UK, the U.S., Japan, or Germany; whether you were male, female, old, young, black, white, straight, or gay. Whoever you were, practically anywhere, you knew people who were reading those books.
That doesn’t happen very often in the book world. The industry might enjoy a phenomenon breakout like Larsson once a year, if that. E. L. James has been the biggest breakout since, with Fifty Shades of Grey, and unlike Larsson she was available for a big publicity tour. Larsson had died before publication. The level of sales his trilogy achieved without even the backing of its author was supposedly just unfathomable. Freakish. Unpredictable.
Let’s consider some numbers. A company in Delaware called Bowker is the global leader in bibliographic information and the exclusive provider of International Standard Book Numbers (ISBNs) for books in the U.S. Their annual report states that approximately fifty to fifty-five thousand new works of fiction are published every year. Given the increasing number of self-published ebooks that carry no ISBN, this is a conservative number. In the U.S., about two hundred to two hundred twenty novels make the New York Times bestseller lists every year. Even with conservative numbers, that’s less than half a percent of works of fiction published. Of that half a percent, fewer still stay on the lists week after week to become what the industry calls a “double-digit” book. Only handfuls of authors manage those ten or more weeks on the list, and of those maybe just three or four will sell a million copies of a single title in the U.S. in one year. Why those books?
Traditionally, it is believed that there are certain skills a novelist needs to master in order to win readers: a sense of plot, compelling characters, more than basic competence with grammar. Writers with big fan bases have mastered more: an eye for the human condition, the twists and turns of plausibility, that rare but appropriate use of the semicolon. These are good writers, and with time and dedication almost all genuinely good writers will find their audience. But when it comes to the kind of success involved in hundreds of thousands of people reading the same book at the same time—this thriller and not that thriller, this potential Pulitzer and not that potential Pulitzer—well, unless Oprah is involved, that signals the presence of a fine stardust that’s apparently just too difficult to detect. The sudden and seemingly blessed success of books like the Dragon Tattoo Trilogy, Fifty Shades of Grey, The Help, Gone Girl, and The Da Vinci Code is considered very lucky, but as random as winning the lottery.
The word “bestseller,” by the way, has always been a book world term, and as a word it is relatively young. It first entered the dictionary in the late nineteenth century, about the time of the first list of books ranked by consumer sales. While it should be a neutral term, it has developed some connotations that are likely misleading. The literary magazine The Bookman started to print “Sales of Books during the Month” in 1891 in London and in 1895 in New York after the International Copyright Act of 1891 slowed down the distribution of cheap pirated copies of British novels. Until then, no sales statistics had really been possible. From the beginning, the lists—which were printed in each major city and typically reported the top six sellers of the month—were about two things that were new to the book world. The bestseller lists were about sales as the only criterion for inclusion, and a proxy recommendation system for what to read next. These recommendations were based not on the choices of a select few reviewers or publishers, but on the choices of everyday fellow readers. The reader’s choice was and still is the only vote. The term “bestseller,” then, should carry no intrinsic comment on quality or type of book, and is not a synonym for either “genre” or “popular fiction.” While the word has often been used pejoratively by some members of the literary establishment, who have felt that the collective taste of the reading market signals bad literature, the data itself suggests a less subjective and more balanced truth. Bestsellers include Pulitzer Prize winners and Great American Novels as well as books by famous mass-market writers. The list can house Toni Morrison and Margaret Atwood alongside Michael Connelly and Debbie Macomber. This is why the bestseller list is such a rich cultural construct and so dynamic to study.
Obviously there’s a lot of value in writing one of those books. There’s a lot of value in finding those books as an agent or editor. There’s a lot of value for retailers, too—the top few titles alone are why some retailers are able to stay in business and keep selling books at all.
Of course, we are talking for now of value in monetary terms. Imagine a seven- or even eight-figure advance for finally getting onto the page that book you are always telling your friends is inside you. Not many authors command that kind of clout in one territory, but they are certainly around. And you can glamorize the impoverished artist with his pen and notebook as much as you like, but wouldn’t it be nice to think of the story you just made up as appearing on bedside tables, beside bathtubs, and on commuter iPads and Kindles in different languages all over the world?
The key sellers of a given year bring the glamor and the drama. They represent the houses in the Hamptons, the fancy cars and diamond tiaras of the literary domain. Hit the lists and stay there for a while and you will be revered, respected, loathed, and condemned. You might be asked to judge a prize or review other books. Maybe your movie rights will be optioned. People will be talking.
Wouldn’t it be fun if success weren’t so random?
White Swans
The bold claim of this book is that the novels that hit the New York Times bestseller lists are not random, and the market is not in fact as unknowable as others suggest. Regardless of genre, bestsellers share an uncanny number of latent features that give us new insights into what we read and why. What’s more, algorithms allow us to discover new and even as yet unpublished books with similar hallmarks of bestselling DNA.
There is a commonly repeated “truth” in publishing that success is all about an established name, marketing dollars, or expensive publicity campaigns. Sure, these things have an impact, but our research challenges the idea that it’s all about hype, in a way that should appeal to those writers who toil over their craft. Five years of study suggests that bestselling is largely dependent upon having just the right words in just the right order, and the most interesting story about the NYT list is about nothing more or less than the author’s manuscript, black ink on white paper, unadorned.
Using a computer model that can read, recognize, and sift through thousands of features in thousands of books, we discovered that there are fascinating patterns inherent to the books that are most likely to succeed in the market, and they have their own story to tell about readers and reading. In this book we will describe how and why we built such a model and how it discovered that eighty to ninety percent of the time the bestsellers in our research corpus were easy to spot. Eighty percent of New York Times bestsellers of the past thirty years were identified by our machines as likely to chart. What’s more, every book was treated as if it were a fresh, unseen manuscript and then marked not just with a binary classification of “likely to chart” or “likely not to,” but also with a score indicating its likelihood of being a bestseller. These scores are fascinating in their own right, but as we show how they are made we will also share our explanation for why that book on your bedside table is so hard to put down.
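The authors do not publish their model, but the two outputs described above—a binary call of “likely to chart” or “likely not to,” plus a likelihood score—can be illustrated with a standard logistic squashing of a raw model score. The function, threshold, and score values here are our own, purely for illustration:

```python
import math

def classify(raw_score, threshold=0.5):
    """Turn a hypothetical model's raw score into the two outputs
    described in the text: a bestselling-likelihood percentage
    and a binary classification."""
    likelihood = 1 / (1 + math.exp(-raw_score))  # logistic squashing to 0..1
    label = "likely to chart" if likelihood >= threshold else "likely not to"
    return round(100 * likelihood, 1), label
```

A raw score of zero sits exactly at the fifty-fifty boundary, which is the kind of ambivalent verdict the model handed down for some books; strongly positive scores translate into the high-nineties percentages quoted later in the chapter.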
Consider some of these percentages. The computer model’s certainty about the success of Dan Brown’s latest novel, Inferno, was 95.7 percent. For Michael Connelly’s The Lincoln Lawyer it was 99.2 percent. Both were number one in hardback on the NYT list, which for a long time has been one of the most prestigious positions to occupy in the book world. These are veteran authors, of course, already established. But the model is unaware of an author’s name and reputation and can just as confidently score an unknown writer. The score for The Friday Night Knitting Club, the first novel by Kate Jacobs, was 98.9 percent. Luckiest Girl Alive, a very different debut novel by Jessica Knoll, had a bestselling success score of 99.9 percent based purely on the text of the manuscript. Both Jacobs and Knoll stayed on the list for many weeks. The Martian (before Matt Damon’s interest in playing the protagonist) got 93.4 percent. There are examples from all genres: The First Phone Call from Heaven, a spiritual tale by Mitch Albom, 99.2 percent; The Art of Fielding, a literary debut by Chad Harbach, 93.3 percent; and Bared to You, an erotic romance by Sylvia Day, 91.2 percent.
These figures, which provide a measure of bestselling potential, have made some people excited, others angry, and more than a few suspicious. In some ways that is fair enough: the scores are disruptive, mind-bending. To some industry veterans, they are absurd. But they also could just change publishing, and they will most certainly change the way that you think about what’s inside the next bestseller you read.
We should make it clear that none of the books we reference were acquired based on our model’s figures, and figures, beyond the ones you’ll read about here, have never been formally shared with any agent or publishing house. We should also be clear that these figures are specific to the closed world of our research corpus, a corpus we designed to look like what you’d see if you walked into a Barnes & Noble with a wide selection to choose from. Agents and editors do a good job of putting books in front of consumers—it’s not as though we are short of things to read. And some individuals in publishing have a particular reputation for the Midas touch. But remember that the bestseller rate in the industry as it stands is less than one-half of one percent. That’s a lot of gambling before a big win. Note, too, that year after year, the lists comprise the names of the same long-standing mega-authors. Stephen King is sixty-eight. James Patterson is sixty-eight. Danielle Steel is sixty-eight. As much as fans are still thrilled by another new novel from one of these veteran writers, it is telling that the publishing world has not discovered the next generation of authors who will similarly enjoy thirty to forty years of constant bestselling. Nor did the industry find, despite the thousands of manuscripts both rejected and published annually, a runaway bestseller for 2014 (Dragon Tattoo, Fifty Shades, and Gone Girl had been the standout hits of previous years), and neither did it publish a manuscript to impress the Pulitzer Prize committee in 2012. Why?
Well, it is a universal wisdom that bestsellers are freaks. They are the happy outliers. The anomalies of the market. Black swans. If that is the truth, then once you find a bestselling writer, why put your money anywhere else? Why put your millions on a new twenty-year-old writer instead of Stephen King? How could you possibly know if a new literary author is worth the sort of investment worthy of a future big-prize winner?
Book publishing is, aptly, full of the language of gambling. Acquisitions meetings often revolve around passionate arguments about choosing whether or not to “back a debut author.” The excitement of a bidding war across different publishing houses might have you go “all in” and spend almost your entire season’s budget on one book. The process is fun, and guesses are certainly educated, but it’s a casino. Before finding a home at Bloomsbury, J. K. Rowling’s Harry Potter was turned down by twelve publishers, and Rowling was told “not to quit her day job.” The Harry Potter brand is now worth an estimated $15 billion. John Grisham was rejected by at least sixteen different publishers. Since then, Grisham has written the biggest seller of the year more than a dozen times.1 James Patterson was repeatedly rejected as he tried to get published. In 2010, he sold more than 3.5 million copies of his three titles that year. Kathryn Stockett was turned down by sixty agents before she found someone willing to represent The Help. That novel went on to spend one hundred weeks on the NYT bestseller list. There are, no doubt, many similar writers whose work currently sits discarded on the so-called slush piles of new manuscripts in offices all over New York and London.
Anyone connected even tangentially to the world of readers and writers knows a friend of a friend who got up for months at 4 A.M. to write her novel before work, who felt inspired by a killer story, who knew the muses were around, and who, having sent manuscripts all over Manhattan, gleeful and expectant, received nothing more than standard rejection slips.
Those friends of friends might be in good company. One editor who read the manuscript of The Spy Who Came in from the Cold told John le Carré that he had no future as a writer. William Golding’s Lord of the Flies was rejected twenty-one times. After writing the now iconic On the Road, Jack Kerouac received a letter from an agent stating, “I don’t dig this one at all.” Ursula Le Guin was rejected on grounds of being “unreadable.” That unreadable novel went on to win two major awards. Even George Orwell’s novella Animal Farm was deemed unpublishable, and that by none other than T. S. Eliot. The great poet thought one of the most canonized political allegories of all time was “not convincing.”
To publish or not to publish is a tough question. Big success prediction in the realms of storytelling can involve trying to estimate the sensibilities and inner selves of hundreds of thousands of different people. It is no easy job, and often the rationale behind decisions seems perfectly understandable. The U.S. editors who rejected The Girl with the Dragon Tattoo, for instance (and we have asked some of them), thought that American readers would be bored by all the Swedish politics in the novel. They thought Lisbeth Salander was a bit moody and aggressive for a female lead. They believed the mainstream would respond badly to a book with horrific scenes of anal rape and the avenging Lisbeth with her tattooing needles. That seems a quite reasonable reaction.
It’s no surprise, then, that editors, when perfectly honest, sometimes claim that big success prediction ranges somewhere between a wet finger held up into the air, and the mysterious crystal ball that the highest paid agents and publishers seem to conceal under their desks. Unless the author is already a big name, a James Patterson or a Nora Roberts, it’s a crapshoot. Sometimes, circumstances help—now and again your author is a Hollywood diva and her subject is her sex life—but even when it seems like a sure bet, we have seen some of the vast print runs that follow big advances end up in the pulping machine. The public is fickle.
Naturally, every book agent and publisher does what he or she can to understand commercial books, whether that’s on the mass-market scale of a veteran franchise author like Patricia Cornwell, or the less hyperbolic but nonetheless satisfying numbers involved with the most popular literary writers. There is a famous anecdote about a now ex-CEO of one of the major New York publishing houses who, when asked to predict a title for a definite megahit, replied “Lincoln’s Doctor’s Dog.” The combination of a beloved president, our obsession and paranoia concerning health, and America’s favorite pet could never fail.
It was a wry comment, of course, but it turns out that not one but two books were subsequently published with exactly that title. Both were flops. The literary professor and author John Sutherland, who has written two studies of bestselling books, concluded one by saying, “As a rule of thumb what defines the bestseller is bestselling. Nothing else.” He added more definitively that “to look for significant patterns, trends, or symmetries [in hit books] is, if not pointless, baffling.” And his judgment seemed prudent, fair, and final. That is until machines started reading and discovering the secret sauce of hitting the NYT list.
For the Love of Books
Let’s go back to those oft-rejected but now well-known writers. Our model’s prediction on J. K. Rowling was 95 percent. On John Grisham it was 94 percent. On Patterson it was 99.9 percent. History has been a satisfying precision check. The model was, however, wrong about Kathryn Stockett’s novel The Help. The Help was one of the roughly 15 percent of books that confounded our machine. The machine only gave Stockett’s novel a fifty-fifty likelihood of being a bestseller. Upcoming chapters will get into the intrigue and complexities of the machine imitating the editor. Let it suffice here to say that the model looks deeply, and it told us in the case of Stockett’s book that style on the whole was good for an American readership, that the themes were generally good, but that the use of emotional language and verbs specifically was not consistent with novels that most reliably hit the lists. This is the book that, when it was published, drew much reviewing attention because its white author had written so much of the prose in the imitated dialect of black characters. Opinions were divided about the efficacy of that narratorial choice: the model’s fifty-fifty score agreed entirely with the split verdict of critics from the New York Times to Goodreads.
So why develop a computer model, you might ask, to do the work good editors are already doing? Perhaps Rowling would have been published sooner with the model’s help. Perhaps Grisham would have won a much higher first advance for A Time to Kill. But ultimately, these authors found their fame. Editors were unsure about The Help; so was the model. What’s the gain?
Well, our desire to work on discovering the elements of success is about more than mercenary advantage. Yes, it is surely intriguing that a computer model picked J. K. Rowling, or Liane Moriarty (99.6 percent), or Jonathan Franzen (98.5 percent). Public conversation about human and machine crossover does, we think, matter, especially as far as creativity is concerned. But working on finding viable new manuscripts in a threatened industry is also, if we may, about keeping that industry not just running but diverse. Our work is, of course, about an interest in identifying and explaining latent patterns in our culture. But in more practical terms, we are interested in the potential to launch new authors, about encouraging publishers to use more of their Patterson/King/Steel budget on the young writers who may one day replace them. We care about giving writers of all levels of experience more information and assistance with their craft. We care about bringing people who don’t have the right contacts in New York to a readership. Given that the model does not care if you have published before, if you have an MFA, if you are male or female, Hispanic or Asian, if you are beautiful and twenty-five or less so and seventy, then our work is also about widening access, potentially, to the career of writing. Perhaps one day your friend of a friend gets an 80 percent score that earns him an advance, and he can finally quit his job and stop waking to write at 4 A.M.
Writing about books that feature on the most public and revered of lists—the New York Times weekly bestseller list—is also an unashamed cry to readers, be they scholars or hobbyists, to join a thoughtful conversation about novels that masses of people read.2 Bestsellers are a class of books that are more often dismissed as objects for amusement than studied as works of literary art, or, at the very least, as works of considerable craftsmanship. Yet too much is missed about contemporary culture and the history of reading if we ignore them. Beyond value in terms of millions of dollars, the value of those writers on the bestseller lists is that these books make us read. They make us imagine, feel, discuss, think, and empathize. They let us fantasize, spy, escape. The New York Times novelists form the core of literary discussion and debate around the country, in bars, on the train, and at the dinner table. We look to them to see where culture is going. We look to them for understanding of our world. We look to them to help develop our tastes and opinions and to practice our expression of them. If we can bring readers some new insight into their beloved pastime, then we will only be pleased.
Perhaps by now you can tell you’re in the hands of two writers who are passionate enough about the importance of books and reading that they have spent a combined fifty years studying and teaching narrative, and another several years buying and selling books for the biggest players in the industry. We have coached and defended the right to love and hate different novels, or even the same one. We have pitched for publication of stories in many genres. We have, sometimes covertly, helped our best students and our wannabe author friends write letters to their parents, spouses, and future editors to explain why they just had to give up the sensible life, forget the medical degree, and follow that hallucinatory, ecstatic, and sometimes depressive drug that is a life with stories and words. We have, it is safe to say, totally “bought in” to the emancipatory and educational power of reading and writing fiction. First and foremost we are readers and then writers. Given this devotion to books, it is natural to wonder what in the world made us turn to computers?
Two Backgrounds
There is probably no one more surprised by “the bestseller-ometer,” as our model has been dubbed, than the two of us. To be honest, the research began with little more than a gut-level urge. It took four years of daily collaboration, and it brought results that neither of us expected despite two different backgrounds—Jodie’s in publishing and contemporary fiction and Matt’s in literature and the burgeoning academic field known as the digital humanities.
It all began when Jodie left her role as an acquisitions editor for Penguin Books to pursue a PhD in English at Stanford. Her time in publishing had left her with a lingering question, never adequately answered. What makes novels best-sell? The latent, associated questions were equally interesting: What makes readers read? What is reading fiction in contemporary culture for?
During her early training at Penguin, Jodie had worked on the sales team. Sometimes over lunch, she would walk to the nearest big book retailer to make sure that the marketing budget spent on store positioning was being honored. It is common—and this is no industry secret—for publishers to pay an agreed figure for their top books to appear in conspicuous places in the store. Some retailers will accept money for positioning a book on the first row of the first table, for example, or for having a book’s full front cover face you on the shelf. These strategic placements are said to help sales. At the time, The Da Vinci Code was enjoying its seemingly endless run on the bestseller charts. Week after week, every lunchtime visit would confirm with a huge blue “number one” sign that Dan Brown’s novel was eating the world.
After months of this, what became obvious was that however much publishers were spending on positioning books or marketing of “Dan Brown copycats,” The Da Vinci Code was in a league of its own. Its phenomenal success was about something beyond the reach of sales and marketing. No marketing spend can explain that long-lasting impact on global imagination, not to mention the eighty million copies sold. Such success couldn’t all be hype. There had to be something beyond the marketing, something about those particular words on those particular pages.
Admittedly, it would be foolish to claim that marketing and publicity have no effect on sales. Of course they do. It is surely no coincidence that the five biggest publishing companies account for approximately 80 percent of bestsellers: their marketing budgets can, of course, go further. But it would also be foolhardy to claim that the effect of marketing dollars in the book world is at all consistent—there are too many examples of huge spends that lead nowhere, or of self-published, word-of-mouth runaway successes. Fifty Shades of Grey was first published only as an ebook and print-on-demand paperback by a house with no marketing dollars at all. William P. Young’s The Shack, first published on credit-card loans with only the marketing of a $300 website, has now sold over ten million copies. Other very different bestsellers that have risen to success and critical acclaim through nontraditional channels are Mark Z. Danielewski’s experimental online novel House of Leaves and Chris Ware’s originally self-published Jimmy Corrigan: The Smartest Kid on Earth, which is one of the most celebrated in a recent surge of graphic novels. There are many such examples, enough, in fact, to indicate that “marketing” is at best a safe guess and not a real answer to the question of why some novels are read by millions of people and others barely sell a handful.
When Jodie took her research question to Matt, who at the time was a lecturer at Stanford and cofounder of the Stanford Literary Lab, a better answer began to emerge. In 2008, Matt had just completed his part in a controversial computational study of authorial style in the scriptural text of the Book of Mormon. The computer’s analysis of writing style in the book suggested that theories of multiple authorship were probably true, and the study presented evidence that supported one particular theory about the book’s origins, a theory that has been rejected by the church as spurious. The results of the analysis were ultimately inconclusive, but the response to the article, including an interesting rebuttal from a team of Mormon scholars at Brigham Young University, showed how revolutionary computational analysis of text can be.
This work on authorship attribution and “stylometrics” convinced Matt that computers can help us see things in text that we would otherwise skip over. With more research, he found that a computer program could correctly guess, 82 percent of the time, whether an author was male or female just by having it examine uses of simple words like “the” and “of.” Matt was not the first to discover that male and female authors have different habits of style, but his work had focused specifically on the nineteenth-century novel. He then found that using just the single word “the” his machine could identify, with a reasonable degree of certainty, whether one of these same authors was an American writer or one from England.
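Matt’s published studies do not include code, but the underlying stylometric idea—profiling a text by the relative frequency of function words like “the” and “of,” then assigning it to whichever group of authors it most resembles—can be sketched in a few lines. The function names, the feature set, and the centroid values in the usage below are our own, purely for illustration:

```python
import re

def function_word_profile(text, features=("the", "of")):
    """Relative frequency of each function word, per 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    return {w: 1000 * tokens.count(w) / total for w in features}

def nearest_profile(sample, centroids):
    """Assign the sample to the label whose average profile is
    closest (simple Euclidean distance over the shared features)."""
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5
    return min(centroids, key=lambda label: dist(sample, centroids[label]))
```

Real stylometric work uses many more features and a trained classifier rather than hand-set centroids, but the principle is the same: habits of style show up in the frequencies of the least noticeable words.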
Jodie’s reaction to that was more or less “so what?” It was an impressive idea to think that a computer knew a Brit from a Yank, but this was an artificial problem that didn’t need to be solved in the first place. She wanted to see the machines solve a real literary problem before she was convinced. Matt had a similarly underwhelmed response to Jodie’s passion for contemporary bestsellers. He thought they were fun to read and then forget about. He wanted to be convinced that they contained gold that was worth mining.
That was several years ago. Since then, we’ve teamed up to explore the hypothesis that bestsellers have a distinct set of subtle signals, a latent bestseller code. Instead of trying to guess what book might sell, our idea was to begin by trusting what the readers had already, perhaps unconsciously, figured out. The bestseller list, while ostensibly a jumble of very different books, represents a weekly list of favorite signals, curated according to the collective vote. Could it be that that collective vote had something to teach? Could our machines detect a signal in the noise? Did these attention-grabbing novels, be they so-called highbrow university curriculum novels or page-turning beach reads, have telling things in common?
If the answer was yes, then we might learn something about how success works. We might even prove a long-held industry theory wrong and make bestselling predictable.
And so we began teaching our computer how to read.
Machines Reading
Computers, of course, cannot really read, at least not in the sense that you are reading this page. But computers can read books in the sense that computers do most everything; they “read” (that is, they accept input) and then parse the input into units that we human beings think have meaning: things such as letters, commas, words, sentences, chapters, and so on. There is a certain mimicry of human reading there, and the more sophisticated the training the more sophisticated the mimicry. The difference between the human reader and the machine reader is that the human understands that the content being read has meaning. Ironically, though, the kind of reading that computers can do gets us closer to the details of a novel than even some of the most practiced literary critics. That’s because computers are experts in pattern recognition, and computers can study patterns at a scale and level of granularity that no human could ever manage.
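The “reading” described above—accepting raw text as input and parsing it into units such as words and sentences—can be sketched very simply. This is our own toy illustration, not the authors’ pipeline; production systems use trained tokenizers that handle abbreviations, quotation marks, and much else:

```python
import re

def parse(text):
    """Parse raw text into the units a machine reader works with:
    sentences, then word and punctuation tokens within each."""
    # Naive sentence split: break after terminal punctuation
    # followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Tokenize each sentence into words and punctuation marks.
    return [re.findall(r"[A-Za-z']+|[.,;!?]", s) for s in sentences]
```

Everything the model later counts—adjectives, sentence lengths, topic words—is built on top of units like these.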
Consider the question that started our research. Can bestsellers be predicted? To be able to predict things, you must be able to detect repeated patterns. Unless you are a fortune-teller, then prediction is all about established patterns. Typically, finding meaningful patterns in words is the job of a literary critic or scholar. Joseph Campbell, the great mythographer, spent a lifetime reading stories written by people all around the globe, and he specifically trained his eye to identify the similarities in these stories. He was a master of pattern recognition. But even with his level of commitment, there’s a limit to the number of stories and books that any one reader can examine, and, at the same time, there is a limit on how closely a reader can examine any single book. So there are matters of scale in both directions: one eye must be in the microscope and the other in the macroscope.
Christopher Booker is another scholar whose tenacity we admire. Booker spent thirty years reading hundreds of books in order to put forward his theory that all works of literature, and in fact all stories, fall into seven basic plot types. Perhaps he read a thousand books in those thirty years. Perhaps he was able to retain more of their content than most of us ever could. But a cluster of computers, once trained, can read thousands of novels across thousands of data points in about a day, and they have an uncanny ability to reveal things that we human beings either take for granted or totally ignore.
Here’s one example. As readers, especially as trained close readers, we might be very consciously aware of the adjectives that a particular writer is using, but we’d probably not be aware of the noun-to-adjective ratio, that is, how frequently the author uses an adjective to modify a noun. That’s precisely the kind of thing that a computer can notice very easily, and it matters because it tells us something about description and style. The computer can also scrutinize and compare the ratio found in one book to the ratio observed in thousands of other books. If the machine discovers that the ratio is a bit higher or lower in bestsellers, then that feature has some significance.
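A hypothetical sketch of that noun-to-adjective measurement, assuming the words have already been labeled with parts of speech by some upstream tagging tool. The example sentence, its tags, and the NN/JJ labeling convention are ours for illustration:

```python
def noun_adjective_ratio(tagged_tokens):
    """Compute nouns-per-adjective from (word, part-of-speech) pairs.

    Assumes tagging has already been done upstream; tags here follow
    the common convention where NN* marks nouns and JJ* adjectives.
    """
    nouns = sum(1 for _, tag in tagged_tokens if tag.startswith('NN'))
    adjectives = sum(1 for _, tag in tagged_tokens if tag.startswith('JJ'))
    return nouns / adjectives if adjectives else float('inf')

# A hand-tagged toy sentence: "The old house had a red door."
tagged = [('The', 'DT'), ('old', 'JJ'), ('house', 'NN'), ('had', 'VBD'),
          ('a', 'DT'), ('red', 'JJ'), ('door', 'NN')]
print(noun_adjective_ratio(tagged))  # 2 nouns / 2 adjectives = 1.0
```

Run over a whole novel, a single number like this becomes one small coordinate in a book’s stylistic profile.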
Here is an experiment to try when picking your next book to read. Instead of taking a friend’s recommendation or picking up a book by an author or in a genre you already know, try reading one entire week’s NYT list in succession. Do it with your book club or your English class. If you read with good attention, you’ll become a bit like our machines and start seeing unexpected patterns between literary and mass-market authors, between books “for men” and books “for women,” between the Pattersons and the Pulitzers, and so on. Some patterns will surprise you. You’ll wonder, for example, why heroines are so often twenty-eight years old. Does it matter? You’ll ask yourself whether these authors consciously keep putting their first love scenes exactly halfway through: at page 200 of a 400-page novel, at page 110 of a 220-page one. If they do, why? You’ll argue with your friends about whether endings without satisfying closure can or should make or break an otherwise very pleasing novel. You might even want to make the claim that bestsellers in all categories have so many latent things in common that they are practically a specialized genre in themselves.
What’s interesting is how deeply readers respond to these things without really thinking about it. Scholars who specialize in the emerging field of “literary neuroscience” have been using MRI scans to map the brain while people read. This research is all about noticing what people notice. While this cognitive psychology angle on how readers read comes from a very different perspective than ours, both approaches recognize that all literary response is precisely about which words go in which order in which sentences, and about whatever that combination triggers.
So, using computer reading techniques isn’t antitraditional or counter to our usual literary critical methods. In fact, these methods of zooming in on features for extraction and analysis are very much in the service of traditional approaches, and they provide the possibility of gaining insights that were, quite simply, impossible before.
The precise ways that computers can be taught to read and extract information from text are manifold.3 The programs, algorithms, and codes we wrote for this study were designed to process books and extract detailed information about each book’s unique style, as well as its themes, its emotional highs and lows, its characters, and its settings, along with all sorts of seemingly mundane linguistic data that does not easily translate into concepts such as style and plot. Getting at the larger elements of fiction that are typically discussed in writing classes and books on how to write novels (theme, plot, style, and so on) involves using hundreds of points of data. To grapple with style, for example, we measured hundreds of variables: how many times an author uses words like “a,” “the,” “in,” “she”; how often an author uses periods and exclamation marks; how many adverbs a writer employs and the precise nature of those adverbs. These little details tell a reader a lot. Consider the importance of pronouns to the effect of Charlotte Brontë’s very famous line from Jane Eyre: “Reader, I married him.” The computer notices this “him,” and how often we hear about him, and how close he is in linguistic proximity to the all-important narratorial “I.” It notices when “I” and “him” appear almost side by side in more and more sentences, with less and less description in between. Of course, that is just what the reader is watching too. Isn’t the entire point of so many stories to get that “I” and that “him” closely aligned, separated only by an all-important verb like “married”? So often, this is entirely why we keep turning the pages.
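Counting those small details is mechanically straightforward. Here is a minimal, invented illustration of a style profile built from function-word rates and punctuation counts; the word list and the rate-per-thousand-words framing are ours for illustration, not the study’s actual feature set:

```python
import re
from collections import Counter

# A tiny, invented sample of the kinds of function words one might track.
FUNCTION_WORDS = {'a', 'the', 'in', 'she', 'he', 'and', 'of'}

def style_profile(text):
    """Measure a few small stylistic signals: function-word rates
    (per thousand words) and raw punctuation counts."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter(words)
    per_1000 = {w: 1000 * counts[w] / len(words) for w in FUNCTION_WORDS}
    return {
        'function_words_per_1000': per_1000,
        'exclamation_marks': text.count('!'),
        'question_marks': text.count('?'),
    }

profile = style_profile("She walked in the rain. The rain! The rain!")
print(profile['exclamation_marks'])  # 2
```

Repeated across thousands of books, profiles like this are what make an author’s otherwise invisible habits comparable.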
Question marks and exclamation marks are very telling too. You might remember being in high school and being told to keep exclamation marks to a minimum. If every sentence screams with excitement (Oh my God!), and every exchange is a command (Freeze!), or a yelp (Ah!), or the discovery of some spooky thing that goes bump in the night (Thump!), then you risk giving your reader a cardiac arrest. A profusion of exclamation points tells us something about likely content, about the level of melodrama, and about the proficiency of our author with her pen. Similarly, question marks often indicate the use of dialogue, and endless pages of description without dialogue can slow down the pace and the reader’s interest. These subtle habits of individual style are discussed in chapter 4.
We started with more than 20,000 extracted features—of which exclamation points and the word “him” were just two—and we studied them all. Some were markers of style, others offered clues about the plot and setting, and some told us what the books were about. Not all of these features proved to be useful in determining the difference between a novel that had captured a huge number of readers and one that had, despite its unique brilliance, tanked. It turns out, for example, that an author’s use of numbers—911, 1984, 867-5309, $1,000,000—has no relationship to sales. Similarly, while we spent a lot of time teaching our machines how to detect that The Devil Wears Prada is set in New York City while Gone Girl begins in New York but ends up in Missouri, it turns out that (with a few exceptions) the geopolitical setting of a book is not all that important in terms of whether or not it sells well. There were just as many non-bestselling books set in New York as there were bestsellers. The megahit books that are set there—Sylvia Day’s Bared to You, Tom Wolfe’s The Bonfire of the Vanities, James Patterson’s The Quickie, and Jonathan Safran Foer’s Extremely Loud and Incredibly Close, to name a few—show a deeper intended or accidental understanding of the minutiae of bestselling DNA than just being set in New York.
In the end, we winnowed 20,000 features down to about 2,800 that were useful in differentiating between stories that everyone seems to want to read and those novels that were more likely to remain, well, niche. After teaching our machines how to read books and extract all of these features, we analyzed the feature set using another batch of computer programs that are designed to discover and learn the latent patterns. Aptly enough, this analysis phase of our study employs something called “machine learning.” In text mining, we frequently wish to sort or classify documents according to their similarity. Say, for example, that we want to differentiate between emails that are spam and emails that are legitimate correspondence. Spammy emails tend to have a lot of things in common (misspelled words, a high incidence of the word “Viagra,” and so on), so we can write programs that measure how likely a given email message is to be a spammy one. The work we are doing in classifying novels is quite similar to the work that your email filter does. Suppose we want to predict whether a new book that we have never seen before is likely to be a bestseller. If we already have a whole lot of books that best-sold (not spam) and another bunch of books that did not sell well (spam), then we can feed all these books to our computer and train it to recognize these two classes by their distinct feature profiles. This is precisely what we did. In fact, we did it three different ways, and when we averaged the results we found that 80 percent of the time our machine could guess which books in our corpus were bestsellers and which ones were not.4
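To picture the training step, here is a deliberately simplified stand-in: a nearest-centroid classifier over invented two-number “books.” The study’s actual machine learning was far more elaborate; this only shows the shape of the idea, with feature values we made up:

```python
def centroid(vectors):
    """Average a list of equal-length feature vectors into one profile."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(bestsellers, others):
    """'Training' here is just remembering each class's average profile."""
    return centroid(bestsellers), centroid(others)

def predict(model, book_vector):
    """Assign a new book to whichever class profile it sits closer to."""
    best_profile, other_profile = model
    return ('bestseller'
            if distance(book_vector, best_profile)
            < distance(book_vector, other_profile)
            else 'non-bestseller')

# Invented two-feature "books": [nouns per adjective, exclamations per 1,000 words]
bestsellers = [[2.1, 0.5], [2.3, 0.4], [2.0, 0.6]]
others = [[1.2, 2.0], [1.0, 1.8], [1.4, 2.2]]
model = train(bestsellers, others)
print(predict(model, [2.2, 0.5]))  # bestseller
```

A real classifier weighs thousands of features instead of two, but the logic is the same: learn each class’s characteristic profile, then ask which profile a new book more closely resembles.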
That average of 80 percent means that if you randomly selected 50 recent bestsellers and 50 recent non-bestsellers, our machine could correctly identify 40 of the bestsellers as bestsellers, and 40 of the non-bestsellers as non-bestsellers. Of course, this also means that our machine would think that ten of the bestsellers really should not have been bestsellers and that ten of the non-bestsellers should have sold well. When we conducted a series of tests just like this, our machine was very certain that Pride and Prejudice and Zombies, for example, was not bestseller material—and yet it was a bestseller; our machine got this one wrong. Then again, Pride and Prejudice and Zombies is a book that sold at a time when any reference to Austen assured attention (and it likely still does), and when the movie theaters were full of zombie films. The context of the title, then, likely had an out-of-proportion impact on its sales.
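The arithmetic in that paragraph can be spelled out as a sketch (ours, for illustration):

```python
def accuracy(right_best, wrong_best, right_non, wrong_non):
    """Combine right and wrong guesses on each side into one accuracy figure."""
    total = right_best + wrong_best + right_non + wrong_non
    return (right_best + right_non) / total

# 50 bestsellers and 50 non-bestsellers: 40 right on each side, 10 wrong on each.
print(accuracy(40, 10, 40, 10))  # 0.8, i.e., 80 percent
```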
Naturally, there were also non-bestselling books that our machine begged us to read, but that’s another story.
The Contract
When the two of us discuss new novels, we tend to talk about the relationship between writer and reader in terms of an unwritten contract to fulfill, a contract whose details are hazy but that nevertheless points to the aesthetic, emotional, intellectual, and even ethical reasons behind the choice to read. We thought a lot about all these expectations of a writer as we trained our model in detecting theme, plot, style, and character.
The tacit contract has many implied clauses. If you’re a thriller author, for example, you better have a dead body or two, and you better have mastered the heart-racing scene. If you are writing romance, your stories better end, but not start, with a unified, happy couple. Whoever you are, with the rare exception of a new literary wunderkind who is sometimes acceptable at double length, you have got about 350 pages to take us somewhere and bring us back. These are some of the big expectations, and you’ve seen the vitriol or heartache of the Goodreads reviews when writers don’t fulfill them.
With this in mind, dear Reader, we will make our own contract with you very clear. To wit, here are a few clauses.
1. The One
One of the phenomena of our culture, and of course the book world, is an obsession with ranked lists. This goes far beyond the bestseller list itself. Just this year newspapers and major retailers have run features titled everything from “The Most Beautiful Settings from Your Favorite Novels,” to “The Ten Most Influential Books of All Time,” to “Find Your Book Boyfriend.” Goodreads users have collectively curated lists of books on all sorts of topics: best books set in space, best Japanese editions, heroes that matter, the best of the tearjerkers. There are thousands of lists, and there is a certain glee to deciding the rankings, to arguing with them, and, of course, to debating the merits of Mr. Darcy versus Christian Grey as a potential date.
Don’t think we could resist playing the list-making game. We know that all book people are asked to recommend a favorite novel. When that question comes, responding “I haven’t got one” is a death knell of an answer, both to small talk and to your street cred as a professional reader. It’s a four-word way to turn out the light in someone’s eyes. So we have played this precarious game because we live in a world of the one. One matters. Number one on the NYT list means something different from number ten. Perhaps because of the overwhelming possibilities of choice in the contemporary world, we seem to have a psychological and cultural need for a winner, a king, a god. Pick one.
By the end of this book, we will give you our list and our winner—the model’s pick for the paradigmatic bestseller of the last thirty years.
2. Blind Faith
The next promise of The Bestseller Code is that there has been no editorializing of this choice at all. We agreed from the beginning that we would seek not to choose but to explain the choice. In fact, while we knew other works by the writer, neither of us had read “the one” before the computer picked it for us. Of course we pulled it off the shelf instantly, read it in tandem, and laughed together at the unexpected irony. We recommend you don’t jump straight to it—every chapter explains a piece of the puzzle—but then we know the temptation of reading the first and last page of a book.
3. No Magic Tea
We are not going to claim that reading this book for the first or even second time is going to turn you into a bestselling fiction writer. This is not a prescriptive “how to” book and comes attached to no guarantee. You will definitely find many tips that neither of us would overlook if we were going to attempt to write a bestselling novel, and it’s unlikely either of us would ever submit a novel to an agent anymore without doing some computational analysis of the final draft. But part of the beauty of this story is the twist it gives to the old axiom that great writing is a skill that can’t be taught. We are more concerned with twists than teaching.
Almost all the writing guides we know—and we have most enjoyed the ones by blockbuster authors like Dean Koontz and Stephen King—offer wisdom on aspects of prose such as style, character, and plot. We will do the same. We hope it will even take you deeper into the DNA of bestselling than any human eye could manage, and it will lay some of that ineffable je ne sais quoi of talented writers bare. But it will not give you a formula to apply. This book will tell you a lot about blockbuster DNA, but you won’t be able to copy it any more than you can slice off Adam Johnson’s fingerprints and type with them on your own hands.
Our belief, while it may be irritating and old-fashioned, is still that if you want to be a bestselling writer then first you have to learn and really appreciate fiction with as many tools as you can. If we can help you with that process, and you become a bestselling novelist, we would love to hear about it. We’d buy your book and no doubt we’d mine it. But please don’t complain that you looked for an easy formula to get a million-dollar contract, and we didn’t give it to you. Anyone who offers you that is the same person who will offer you overnight weight loss if only you buy their magic tea.
4. The Black Box
This is not a book about algorithms. We will share the key features we extracted, we’ll give you the method in broad strokes, but this is not where to go for machine learning or document retrieval or natural language processing. There are many good books on those subjects, but ours is a book about books, mostly bestselling ones.5 We hope we will make you think again about yourself as a reader or writer, about the purposes of fiction, about writers you think you adore or detest, and even about the relationship between humans and machines. We’ll give you lots of results and interpretation about where the machine succeeded in finding bestsellers, where it failed, and what it taught us, but our focus is on Gone Girl and The Goldfinch, not latent Dirichlet allocation and named entity recognition. These sometimes esoteric methods inform the work we have done here, and the work could not have been done without these tools, but they are only the tools by which the story is wrought: the painter does not paint the brush.
Copyright © 2016 by Jodie Archer and Matthew L. Jockers