Shaping the Technology of the Future: Predictive Coding in Discovery Case Law and Regulatory Disclosure Requirements

BY Christina T. Nasuti


[A]cquiring preemptive knowledge about emerging technologies is the best way to ensure that we have a say in the making of our future.

Catarina Mota, TEDGlobal Fellow[1]



The twentieth and twenty-first centuries brought about a technological revolution for businesses and everyday life. The legal profession was not left unchanged, as innovations from typewriters to computers to the Internet transformed legal practice. [2] The corresponding creation and accumulation of electronic data continues to fundamentally impact legal practice, [3] particularly by accounting for massive increases in digital matter available for legal review. [4] Computer-assisted review technologies, and predictive coding in particular, permit the legal field to foray into the digital mass that now defines the professional world. [5] It is not surprising to see that these technologies provide important future venues for dealing with the electronic realities. What is significant, however, is the rapidity with which these technologies could be incorporated into the current judicial and regulatory regimes. Because the legal field already is feeling the impact of these technologies on its traditional review and conflict-resolution mechanisms, this Comment focuses on how judicial and regulatory regimes should react to the changing landscape.

The framework in which these emerging technologies develop will guide how they evolve and whether they can be sufficiently implemented in the legal context. As a result, the legal profession must advocate for courts and regulatory bodies to promptly develop and update guidelines intended to address predictive coding and other computer-assisted review technologies. While recent cases have provided some transparency into the judicial perspective on predictive coding and its role in litigation, agencies have generally been less transparent in their implementation or acceptance of predictive coding in the regulatory context. This Comment argues that, to effectuate a meaningful evolution and implementation of predictive coding, courts and government agencies should not experiment in vacuums independent of each other. Rather, both courts and agencies should demonstrate the utmost transparency when utilizing and addressing this new technology. Increased transparency will allow each body to learn from the other’s experiences and eventually enable both to put predictive coding technology to its highest and best use in appropriate contexts, while still maintaining a healthy skepticism regarding the different uses and challenges posed in each venue. The need for transparency and consideration of other contexts is especially important when considered in light of normative concerns. Namely, the juxtaposition of these forums raises an important question regarding whether courts and agencies should consider each other and work in tandem—otherwise, predictive coding technologies could hypothetically support an agency determination of wrongdoing but remain unacceptable for use in a judicial context.

This Comment addresses the origins of, current status of, and future possibilities for predictive coding. To that end, Part I addresses the rise of electronic data and introduces the reader to predictive coding as well as other terminology and concepts surrounding computer-assisted review. Part II surveys the general status of predictive coding and the opportunities it presents to the legal profession. The next two parts, Parts III and IV, address how predictive coding is applied and addressed in cases by the judiciary and by regulatory agencies. Part V provides a relatively brief overview of miscellaneous issues to consider as judges, regulators, and reformers develop rules, regulations, and recommendations addressing the legal use of predictive coding. Finally, Part VI compares and distinguishes the judicial and regulatory treatments of predictive coding and offers recommendations for each in order to assure best practices.

I. The Rise of Electronic Data and an Introduction to Predictive Coding

A. The Electronic Revolution

The recent data revolution resulted in a global shift from hard-copy files and communications to electronic versions. [6] This shift was inspired by the newfound ability to create and maintain electronic records and communications—a change so fundamentally revolutionary that it is sometimes compared to the fifteenth century introduction of the printing press. [7] Companies and individual users are not alone in this electronic shift. The government is a substantial force in the move, as the executive branch has mandated that agencies embrace the digital reality. [8] As a result of the nearly universal move towards electronically stored information (“ESI”), [9] the sheer amount of data now available compounds generic concerns regarding information and data production and thus poses greater concerns for litigation in the digital age. [10] To visualize the vast amount of data facing modern litigators and regulators, consider this fact: as of “2011, the digital universe [had] expanded to over 1800 exabytes, enough data to fill 57.5 billion 32GB Apple iPads.” [11]

As an unsurprising result, electronic discovery (“e-discovery”) represents a crucial but overwhelming part of litigation budgets. [12] For numerical orientation, e-discovery costs surrounding discovery and document production independently comprised an astounding “$2.8 billion in 2009,” [13] with continued projected increases as “[t]he amount of electronically stored information in the United States doubles every 18–24 months, and 90 percent of U.S. corporations are currently engaged in some kind of litigation.” [14] In a case that eventually permitted the use of computer-assisted review, the original electronic records would have required “10 man-years of billable time” simply to adequately locate relevant documents. [15]

These astounding costs represent the irreconcilable chasm between traditional data review and the vast digital prowess of the modern era. In response, human ingenuity and the marketplace developed a solution: computer-assisted review and predictive coding technologies. [16] These advances promise fiscal benefits and increased efficiency. [17] In the aforementioned scenario, utilization of coding technologies would only require “two weeks to cull the relevant documents” at a fraction of the cost for traditional methods. [18] As a result, humans matched the electronic revolution with potential electronic solutions—computer-assisted review technologies and predictive coding—that hold the potential to revitalize legal review despite massive increases in ESI.

B. Concepts and Terminology Surrounding Predictive Coding

Despite its rapid emergence and strong potential, predictive coding brings with it a host of confusion and concerns for modern attorneys—ranging from gaining the technological expertise necessary to understand its promises to learning how to apply the new technology. Even more simply, however, one must learn to speak the language of these new technologies. This Comment provides a brief overview of some of the essential terms necessary to understand the scholarship and debates surrounding predictive coding. However, parts of this terminology lack uniformity, and new technological advances may change the terminology. As a result, interested readers must dedicate constant self-study to the rapid revisions. [19]

As an initial matter, technology-assisted review (“TAR”) and computer-assisted review (“CAR”) are broad terms that encompass a number of technologies, including predictive coding as well as less advanced but similar technologies, such as keyword searching. [20] However, because there is no single lexicon governing computer-assisted review technologies, the terminological distinctions between TAR, CAR, and predictive coding technologies are not followed in much of the literature discussing and comparing predictive coding, technology-assisted review, and computer-assisted review. [21] The common example of a square and rectangle can be applied to clarify the distinction here. While predictive coding, the so-called square, is a type of computer- or technology-assisted review technology, CAR and TAR, the so-called rectangles, are broader terms that include other less automated technologies as well. This Comment seeks to define these terms for the reader as well as to use their most precise forms in order to alleviate confusion.

At its most technical, predictive coding is “[a]n industry-specific term generally used to describe a Technology [or Computer]-Assisted Review process involving the use of a Machine Learning Algorithm to distinguish Relevant from Non-Relevant Documents, based on Subject Matter Expert(s)’ Coding of a Training Set of Documents.” [22] In plain English, predictive coding matches human judgment and hands-on training with computer learning and iterative skill to teach software to quickly and accurately search and categorize documents, much like human-only review. [23] Pandora Internet radio provides a helpful analogy for this process: the computer program learns from users’ positive or negative feedback—the equivalent of a Pandora “thumbs up” or “thumbs down” of a song—to predict future outputs, such as a desired song to play next, and the users’ preferences. [24] For predictive coding, the initial learning process occurs as humans code a primary seed set of documents to teach the program what constitutes relevancy and privilege for the overall document set; the training is then repeated using sets of documents until the machine reaches a pre-determined accuracy in self-categorizing documents. [25]
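The seed-set and retraining cycle described above can be illustrated with a minimal sketch. It is not any vendor's algorithm: the word-weight model, the function names, the sample documents, and the 0.9 accuracy target are all assumptions chosen for illustration, and the sketch handles relevance only, not privilege.

```python
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(coded_docs):
    """Build word weights from (text, is_relevant) pairs coded by a human reviewer."""
    weights = Counter()
    for text, is_relevant in coded_docs:
        for word in set(tokenize(text)):
            weights[word] += 1 if is_relevant else -1
    return weights

def predict(weights, text):
    """Toy rule (an assumption): a positive total weight means 'relevant'."""
    return sum(weights[word] for word in set(tokenize(text))) > 0

def accuracy(weights, coded_docs):
    correct = sum(predict(weights, text) == label for text, label in coded_docs)
    return correct / len(coded_docs)

def iterative_training(seed_set, review_batches, control_set, target):
    """Retrain on a growing human-coded set until the target accuracy is reached."""
    coded = list(seed_set)
    weights = train(coded)
    rounds = 1
    for batch in review_batches:
        if accuracy(weights, control_set) >= target:
            break
        coded.extend(batch)          # humans code another batch of documents
        weights = train(coded)       # the program learns from the larger set
        rounds += 1
    return weights, rounds, accuracy(weights, control_set)

# Hypothetical documents, coded True (relevant) or False (not relevant).
seed_set = [("promotion denied to female manager", True),
            ("lunch menu for friday", False)]
review_batches = [[("salary review and promotion policy", True),
                   ("parking garage closed friday", False)]]
control_set = [("promotion policy memo", True),
               ("friday parking notice", False),
               ("salary gap complaint", True)]

weights, rounds, final_accuracy = iterative_training(
    seed_set, review_batches, control_set, target=0.9)
print(rounds, final_accuracy)
```

In this toy run the seed set alone misclassifies one control document, so a second human-coded batch is added before the target is met. Real systems use far richer statistical models, but the loop of coding, training, and checking has the same shape.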

A number of other terms are especially helpful for understanding how predictive coding is described, how it works, and how to grasp the legal significance and potential for its technological components. [26] First, the repeated interactive process [27] between the machine software and the human teachers is known as active learning. [28] This process employs a training set of documents coded by humans to teach predictive coding programs how to evaluate the relevance of future data. [29] Predictive coding [30] and similar technologies are able to learn from their human teachers because of their ability to “emulate human judgment”—a characteristic of their artificial intelligence. [31] A machine’s ability to use its artificial intelligence to properly engage in the learned coding process, known as machine learning, [32] is measured by its accuracy. [33] Accuracy levels are determined by the program’s responsiveness. [34] Responsiveness, a measure of how well the program returns relevant [35] documents, can be determined by the particular informational or legal need at hand and is often based on the proportion [36] of relevant documents returned versus non-relevant documents. [37] In order to measure accuracy, users employ a control set of random documents to assess the sufficiency of the system’s coding abilities at that point. [38] Scholars often measure this machine-learning automated process against pre-determined accuracy levels, to ensure quality control, [39] as well as against the results that human reviewers would achieve under manual review. [40] These comparisons allow scholars not only to assess whether technology is sufficiently advanced to achieve an accuracy level sufficient for the micro-level coding tasks, but also to discuss on a macro policy level whether this technology is sufficiently advanced to complement, or even replace, traditional human review. 
While a number of more technical aspects and terms surround predictive coding and the mathematical decisions governing acceptable quality control levels, this explanation provides the rudimentary lexicon necessary for a discussion of predictive coding in the judicial and regulatory spheres.
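As a rough illustration of the control-set measurement described above, the following sketch compares human and machine coding on a random sample. The function names and sample data are hypothetical. The first figure, often called precision, is the proportion of returned documents that are relevant; the second, recall, is a standard companion measure that is an addition here rather than a term used in the discussion above.

```python
import random

def draw_control_set(document_ids, size, seed=0):
    """Pick a random control set; the fixed seed only makes this sketch repeatable."""
    return random.Random(seed).sample(document_ids, size)

def precision_and_recall(coded_pairs):
    """coded_pairs holds (human_says_relevant, machine_says_relevant) booleans."""
    true_pos = sum(1 for human, machine in coded_pairs if human and machine)
    false_pos = sum(1 for human, machine in coded_pairs if not human and machine)
    false_neg = sum(1 for human, machine in coded_pairs if human and not machine)
    returned = true_pos + false_pos      # everything the program returned
    relevant = true_pos + false_neg      # everything a human marked relevant
    precision = true_pos / returned if returned else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall

# Hypothetical collection of 1,000 documents and a 10-document control set.
control_ids = draw_control_set(list(range(1000)), 10)

# Hypothetical coding results for those ten documents.
pairs = [(True, True), (True, True), (True, True), (True, False), (True, False),
         (False, True), (False, False), (False, False), (False, False), (False, False)]
precision, recall = precision_and_recall(pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the program returned four documents, three of them relevant, and missed two relevant documents. Whether such figures are acceptable depends on the pre-determined accuracy level chosen for the matter.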

II. Predictive Coding and the Legal World

To elaborate on predictive coding’s role in the legal world, this Part proceeds with five main themes: (1) predictive coding’s legal applications; (2) solutions to problems posed by ESI and discussion by predictive coding’s advocates; (3) attorneys’ roles in relation to predictive coding; (4) predictive coding’s main legal advantages; and (5) predictive coding’s main disadvantages. Finally, it provides an intermediate conclusion to resolve these disparate pieces of a complicated technology in the broad legal realm before discussing specific case applications.

Predictive coding has numerous applications beyond the legal world. However, within the legal world, it is often referred to in connection with e-discovery, which is defined as “[t]he process of identifying, preserving, collecting, processing, searching, reviewing, and producing Electronically Stored Information [ESI] that may be [r]elevant to a civil, criminal, or regulatory matter.” [41] The Federal Rules of Civil Procedure automatically deem ESI to be available as potential evidence in lawsuits, [42] consequently rendering almost any medium as fair game for discovery. [43] Because the discovery process was created around traditional review mediums and processes, conventional methods unsatisfactorily address the electronic world. [44] Predictive coding provides a solution to the problems created when incorporating ESI into the legal arena: “(1) volume and duplicability, (2) persistence, and (3) dispersion.” [45] For example, the multitude of emails produced in an investigation is compounded when the emails are produced in duplicates, as a new copy of a document is included for each sender or recipient. This unnecessarily requires evaluating attorneys to classify the same document multiple times. With problems like these, which can also include assessing incoming data, shifting production costs, or searching one’s own documents for responsive materials, [46] ESI’s sheer financial bulk—document review costs comprise “nearly 75 percent of the eDiscovery budget” [47]—renders predictive coding a necessary advancement in the modern legal profession. [48]
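The duplication problem described above can be made concrete with a short sketch. Grouping emails by a hash of their content is a common technique assumed here for illustration; it is not described in the sources discussed above as part of predictive coding itself, and the field names and sample emails are hypothetical.

```python
import hashlib

def content_fingerprint(subject, body):
    """Hash the normalized text so identical copies share one fingerprint."""
    normalized = " ".join(f"{subject} {body}".lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def group_duplicates(emails):
    """Map each distinct email to the custodians who hold a copy of it."""
    groups = {}
    for email in emails:
        key = content_fingerprint(email["subject"], email["body"])
        groups.setdefault(key, []).append(email["custodian"])
    return groups

# Hypothetical production: the same message collected from sender and recipient.
emails = [
    {"custodian": "sender", "subject": "Promotion list", "body": "See attached."},
    {"custodian": "recipient", "subject": "Promotion list", "body": "See  attached."},
    {"custodian": "recipient", "subject": "Lunch", "body": "Friday at noon?"},
]
groups = group_duplicates(emails)
print(len(emails), "emails collected,", len(groups), "distinct documents to review")
```

With duplicates grouped, a reviewing attorney would classify each distinct document once rather than once per custodian.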

Advocates of predictive coding champion its centrality to revitalizing an efficient modern legal practice. They often concede that the technology is not fully equivalent to human review, [49] instead arguing that predictive coding works best in mundane contexts, characterized by the easy relevancy or privilege determinations on a large number of documents, whereas humans are better at making close calls. [50] Despite this concession, a number of arguments support predictive coding’s efficacy in the legal profession.

Numerous factors—such as readability, the area(s) of law at issue in a particular piece of litigation, and the professional judgments made by the attorneys who effectively teach the program a particular issue [51]—impact how well predictive coding works on different documents and implicate its relative success in certain areas of law. [52] The initial “teachers” are often “attorneys with knowledge about the responsiveness of those documents.” [53] While the coding program provides the ultimate review for responsiveness, it learns from the attorneys who make the initial structural responsiveness determinations and classifications. [54] However, the mechanical nature of the predictive coding programming still leaves room for human judgment. After the automated processes are designed in compliance with the particular request, “attorneys must then decide which documents to produce.” [55] Attorneys’ professional judgment is then necessary to determine what to do with the mechanized data and the corresponding documents: (1) manual review of sufficiently responsive documents; [56] (2) a combination of production, culling, and review depending on relevancy benchmarks; [57] or (3) production and culling based solely on relevancy benchmarks. [58] More lawyers will soon face these choices as predictive coding gains increased traction among judges, agencies, and the profession.
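The second of the three options above, a combination of production, culling, and review keyed to relevancy benchmarks, can be sketched as a simple routing rule. The score scale, the two cut-off values, and the document identifiers are hypothetical. In practice the benchmarks would be set by the attorneys, the parties' agreement, or the court.

```python
def route_document(score, produce_at=0.80, cull_below=0.20):
    """Route a document by its machine-assigned relevance score (0.0 to 1.0).

    The cut-offs are illustrative assumptions, not recommended values.
    """
    if score >= produce_at:
        return "produce"
    if score < cull_below:
        return "cull"
    return "manual review"   # close calls go to human reviewers

# Hypothetical relevance scores produced by a trained program.
scores = {"DOC-001": 0.93, "DOC-002": 0.55, "DOC-003": 0.07}
decisions = {doc_id: route_document(score) for doc_id, score in scores.items()}
print(decisions)
```

Setting both cut-offs to the same value would approximate the third option, in which production and culling rest solely on the benchmark and no documents are routed to manual review.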

For instance, pre-existing legal doctrines may push towards wide-scale implementation of predictive coding protocols due to the fiscal benefits available from implementation. Under the doctrine of proportionality, parties are excused from retrieving and sharing ESI should it not be cost-effective. [59] Because predictive coding significantly reduces the cost of e-discovery, [60] it renders ESI more accessible to litigants. The cost savings provide positive and negative impacts for a variety of parties: (1) it may cause businesses to release unfavorable data in response to disclosure or discovery requests, but that same data may help challengers who would otherwise be stymied by informational inequities; (2) it may allow businesses to comply with stringent regulatory requirements at lower costs to the bottom-line; and (3) it may simply revolutionize all parties’ access to the amount of information available—either by promoting negotiation or by rendering more cases more suitable for trial due to the influx of evidence. [61] Intuitively, predictive coding helps producing parties satisfy the opposing sides’ requests at a decreased cost. [62] These savings are particularly helpful when agencies request documents, as companies, and possibly individuals, wish to fully comply with the requests at a minimal fiscal cost. [63] Scholars also recognize, however, that automated learning and coding technologies “can be equally valuable for analyzing incoming document productions” because they allow for an incoming review triage: the protocols “rank documents by degree of responsiveness so attorneys can home in on the most important documents quickly.” [64] Consequently, the concept of proportionality does not just change whether predictive coding can be mandated; [65] rather, it also impacts who has access to the fruits born from this technology.

Those convinced by the advantages of predictive coding must face another large hurdle: determining the accuracy and effectiveness levels that predictive coding must achieve before it gains wider acceptance. The legal field promotes cautious but vigorous advocacy for clients. [66] Before fully adopting these relatively new methodologies, the legal field must be certain that predictive coding represents an improvement on or, at the very least, a supplement to current practices. A number of studies support the assertion that predictive coding methodologies are at least equal to, if not more accurate than, traditional human review. [67] Should these studies prove an acceptable foundation for the profession’s ethical standards, predictive coding may gain a substantial foothold in the future of the legal profession.

Overall, given the symbolic Mount Everest posed by the ever-growing amount of ESI in all legal spheres, predictive coding provides the metaphorical climbing poles and oxygen allowing attorneys to trek to the top “to assess cases faster and more efficiently[,] mak[ing] case preparation easier, more comprehensive, and less expensive.” [68] Whether it is employed in the courtroom [69] or when interacting with administrative agencies, [70] predictive coding is a new path for traditional data management and dispute resolution. What is not an option, at this point, is simply ignoring the data mountain in the room, as ESI grows exponentially and automated technologies increase in significance for all attorneys. [71]

Proponents of court-based acceptance of predictive coding point to traditional principles, such as the Federal Rules of Civil Procedure’s emphasis on “balanc[ing] costs and completeness” in discovery, in support of their claims. [72] Predictive coding promises a sharp increase in these existing values as it promotes both efficiency and fiscal responsibility. [73] As such, it promises a level of continuity by pursuing traditional judicial ideals. Advocates include Judge Andrew Peck, [74] who has asserted that, in his “opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ determination of cases in our e-discovery world.” [75]

At the same time, however, proponents and critics alike recognize the fundamental choices inherent in the move from manual, human-based review to artificial intelligence-based review through the use of automated predictive coding processes. [76] It is in the necessary re-education and re-orientation process that the link to this Comment’s thesis lies: because predictive coding complements existing judicial and litigation values, including efficiency, and because it will require an intellectual overhaul, it is very important to study current case law trends in addition to administrative and agency responses. Studying these trends will determine if the future of predictive coding supports existing legal values and, if it does not, will guide what changes can be made during this re-shaping of the legal consciousness to ensure that the new framework supports healthy development in accordance with those generally accepted principles.

III. Predictive Coding and Recent Cases

Recent cases demonstrate judicial experimentation as courts gain a tentative footing in how to address and incorporate predictive coding in existing discovery and litigation norms. [77] This experimentation with predictive coding provides more than simply useful data points that demonstrate the significance of ESI. Rather, these cases show that predictive coding can and does work for modern legal issues and that courts are responding to these issues through traditional norms intertwined with an embrace of revolutionary technology. [78] To that end, this Part discusses key recent cases not only to show that predictive coding has been used and discussed by the judiciary, but also to demonstrate with meaningful examples that case law is creating an initial framework through which to evaluate and respond to predictive coding. Finally, this Part uses the review of these cases to get a sense of how that framework compares to similar responses from regulatory agencies in this time of legal and technological flux.

A. Da Silva Moore v. Publicis Groupe

The Southern District of New York’s 2012 decision in Da Silva Moore v. Publicis Groupe [79] arguably represents predictive coding’s best-known foray into case law. The gender discrimination case gave rise to a class action suit, [80] which provided data ripe for coding technologies. The plaintiffs’ case rested on allegations that there was “ ‘systemic, company-wide gender discrimination against female PR employees.’ ” [81] Because this employment complaint was based not on the actions of a select group of individuals, but rather on a comprehensive practice, a number of electronic documents and data (including hiring and promotion practices along gender lines, email communications between employees and supervisors, communications regarding promotions, and any company policies) comprised the relevant discovery material. [82] At the initial stages of this case, the parties sought to address the “ ‘electronic discovery protocol’ . . . [for] approximately three million electronic documents . . . .” [83]

Judge Peck, who served as the magistrate judge in Da Silva Moore and whose scholarship addresses emerging litigation technology, recognized the potential that predictive coding and related technologies possess for enhancing the efficiency of review and production processes during discovery. [84] His discussion highlighted that the technologies allow companies to fully respond to the plaintiff’s legal requests while engaging in a timely and cost-feasible retrieval process. [85] In this early-stage decision, Judge Peck also acknowledged the numerous legal challenges inherent in introducing a new technology to court practice, including the role Daubert v. Merrell Dow Pharmaceuticals [86] would play in the admissibility of evidence derived from this new, expert-driven technology. [87] Finally, he advocated a uniform manner in addressing how parties should signal their intent to use predictive coding in a particular case: “The best approach . . . is to follow the Sedona Cooperation Proclamation model . . . [and] [a]dvise opposing counsel that you plan to use computer-assisted coding and seek agreement; if you cannot, consider whether to abandon predictive coding for that case or go to the court for advance approval.” [88]

By layering his academic suggestions into a judicial opinion, Judge Peck provided a logical bridge between a previously hypothetical and purely academic discussion—predictive coding’s role, if any, in the courtroom—and the day-to-day efficiency and evidentiary challenges faced by modern judges. He demonstrated more than just the promise of predictive coding, which his academic research had already elucidated. Rather, through this opinion, Judge Peck provided a very real and practical example of coding’s applications [89] and moved predictive coding from academic intrigue into the ever-growing array of litigators’ tools.

Da Silva Moore “recognize[d] that computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” [90] When accepting predictive coding on the facts of the case, Judge Peck outlined five main factors in support of his decision:

(1) the parties’ agreement, (2) the vast amount of ESI to be reviewed (over three million documents), (3) the superiority of computer-assisted review to the available alternatives (i.e., linear manual review or keyword searches), (4) the need for cost effectiveness and proportionality under [Federal] Rule [of Civil Procedure] 26(b)(2)(C), and (5) the transparent process proposed by [the defendant]. [91]

He also recognized Da Silva Moore’s significance and the “lessons [it held] for the future.” [92] They included the importance of “cooperation among counsel,” the need to train the automated programs and to validate the results through quality control processes, the importance of triaging document review based on relevancy in order to minimize costs, and the need for active participation in “court hearings” by “the parties’ e[-]discovery vendors.” [93] All of these individual “lessons” illustrate the legal implications courts need to address in order to embrace the ready and practical opportunities presented by predictive coding. [94]

Interestingly, even Judge Peck, a strong advocate of incorporating coding technologies, recognized in initial discovery conferences that predictive coding, at least in its current form, is not an automatic answer for electronic discovery in all cases. [95] Judge Peck’s approach, in the parties’ initial conference, signaled a need to weigh the benefits and costs of technology, particularly in its untested state, and to determine what forms should be mixed and matched to create the most appropriate option in a particular case. [96] The issues raised in this case exemplify areas of the law that are and will continue to be subject to tension when incorporating predictive coding technologies into “typical” litigation. For instance, even though predictive coding makes vast amounts of data relatively more accessible, it does not justify carte blanche judicial access to all document caches. In Da Silva Moore, the parties had to agree on what data sources would be searched and subjected to the automated protocol. [97] Additionally, even though these parties shared a general agreement about predictive coding and the necessary confidence levels, [98] they disagreed on next steps and, in particular, how the trained system should be used, where the cut-off for manual review should lie, and how that should be balanced with the potential for responsive, but unproduced, documents. [99] This dilemma presented the question of whether “document production is [truly] complete and correct as of the time it was made.” [100] In a concrete application of his academic scholarship, Judge Peck also resolved the evidentiary rules and Daubert issues presented by these concerns and determined that the rules would not apply to the actual search methodology but would remain relevant if particular pieces of evidence are presented at trial. [101]

This initial acceptance signals an expedition among the judicial community into a new world of predictive coding. [102] Previously, judges were certainly aware of its existence, but had not provided guidance on its admissibility or potential value in the judicial system. [103] For now, the intellectual revolution remains in its infancy, and judges, proponents, and critics of coding alike must recognize the active role they play in shaping the future of automated technologies in the legal system.

B. Cases Following in Da Silva Moore’s Footsteps

A number of cases followed the predictive coding trail begun in Da Silva Moore. One such case, decided mere months later, is Global Aerospace Inc. v. Landow Aviation, L.P. [104] Extending Da Silva Moore’s initial permit for predictive coding, Global Aerospace answered the next logical question: If one party is opposed to the use of predictive coding, can and, arguably, should a court mandate its use anyway? [105] As a matter of positive law, Global Aerospace mandated “the use of predictive coding for purposes of the processing and production of electronically stored information,” while reserving the opposing party’s right to later object to the coding methodology itself. [106] However, the normative question—should a court force this new and unfamiliar methodology on parties who may, either for personal, strategic, or intellectual reasons, prefer manual review—remains unanswered. Judges, and perhaps legislatures, must address this normative issue because predictive coding is now a potential piece of litigation and discovery. [107]

Only a few months later, In re Actos (Pioglitazone) Products Liability Litigation [108] arrived. It too built upon the shift towards predictive coding technologies. First, Da Silva Moore established that a party could use predictive coding. [109] Then, Global Aerospace showed that a court could require a dissenting party to engage in predictive coding-based discovery. [110] Finally, Actos demonstrated how predictive coding worked in discovery. [111] Actos presented foundational and technical practices helpful to parties seeking to implement predictive coding in a case management order concerning ESI production. [112] These foundational practices included presenting the sources and the likely custodians that would have been helpful to achieving a comprehensive data foundation as part of the case management process. [113]

Most significantly, however, Actos provided a summary of its “search methodology proof of concept to evaluate the potential utility of advanced analytics.” [114] This methodology, essentially laying out how these parties would employ predictive coding for electronic discovery, both presents an example of the parties’ initial agreement on the details and provides a salient picture of the technicalities that need to be resolved by courts: whether there is, or should be, an official methodology adopted by the judiciary or agencies. Literature spawned in the aftermath of these opinions urged judges to develop unified and defined methodologies for automated technologies. [115] In this way, like the previous cases, Actos plays a dual role as an example of and as a spur to future judicial action. Each role occupies a rung in the logical ladder pushing predictive coding from academic hypotheticals to industry possibility to legal reality to, ideally, a fully and adequately integrated part of the discovery and litigation schemes.

As a final point, the line between governmental uses or responses to predictive coding and its status in case law is not as clear as it may seem. For instance, National Day Laborer Organizing Network v. U.S. Immigration and Customs Enforcement Agency [116] involved a Freedom of Information Act (“FOIA”) request for disclosure of information from the United States Immigration and Customs Enforcement Agency. [117] Under FOIA requests, courts determine whether a search is adequate by evaluating “the methods used to carry out the search.” [118] Consequently, the court’s willingness to endorse automated technologies represents an important step towards the normalization of predictive coding in FOIA cases. [119] The court’s decision presents, like before, the normative question of whether these technologies should be endorsed in this situation. Additionally, if the parties meet the normative requirements, the decision requires the secondary evaluation of predictive coding’s methodological features to ensure that they are sufficient to meet the FOIA search burden faced by the agency. As such, this judicial observation about machine learning creates a signal for those who interface with government agencies under FOIA, suggesting that the future will likely involve automated systems as a natural feature of agencies’ data retrieval. [120]

By showing how agencies may need to use predictive coding on the production-side, rather than receipt-side, of ESI production and discovery, National Day Laborer illustrates the overlapping nature of predictive coding for government agencies, which are impacted both in the courtroom and in their regulatory duties. The overlap points to a logical necessity: courts and agencies should consider each other when creating the schemes in which to envelop predictive coding. Their uses and concerns are not incompatible, and each will be better served when making the difficult choices necessary to develop this technology properly by learning from the other’s mistakes and successes. This would allow for the creation of a framework best poised for the development of a legal system that maximizes predictive coding’s potential while avoiding its dangers, including overreliance on technology and frameworks that misunderstand how the technology fits with existing legal obligations.

C. Progressive Casualty Insurance v. Delaney: Enforcing Judicial Predictive Coding Norms

In the years since the initial predictive coding cases, courts have issued decisions focusing on the pragmatic concerns that continue to illuminate how they have addressed predictive coding in discovery. One such example is Progressive Casualty Insurance v. Delaney, [121] which, similarly to National Day Laborer, involved private parties in a suit involving a government agency. [122] In this case involving underlying causes of action based on banks taken over by the FDIC, [123] the parties agreed to use keyword searching, a computer-assisted review technology, to cull the 1.8 million potentially responsive documents to a supposedly manageable mass of 565,000 documents to review for responsiveness. [124] Progressive, the reviewing party, facing efficiency and fiscal concerns, decided to further review the documents using predictive coding. [125] Significantly, it made this determination “without seeking leave of the court to amend the [parties’ previously agreed upon] ESI Order,” [126] or even informing the opposing party of its intention. [127] In addition to making these determinations without consulting the court or opposing party, Progressive also violated the agreed-upon ESI Order protocol by failing to produce the responsive documents “on a rolling basis.” [128]

In response to this behavior and corresponding motions by the opposing party, the court discussed predictive coding in the aggregate, [129] describing it as an “accurate means of producing responsive ESI in discovery,” particularly as compared to “ineffective tools” like “manual human review[] or keyword searches . . . .” [130] In this discussion, the court highlighted that it is an adherent to the potential for predictive coding in certain ESI cases, even noting that if parties “agree[] at the onset of this case to a predictive coding-based ESI protocol, [it] would not hesitate to approve a transparent, mutually agreed upon ESI protocol.” [131] This juxtaposition with the court’s hypothetical support for the use of predictive coding highlights the significance of its disapproval of what happened in this case. Here, the parties only agreed to search term or manual review for document retrieval. [132] In emphasizing the problem of unilateral action in this case, the court pointed to literature showing the judicial trend towards a need for “unprecedented . . . transparency and cooperation among counsel” when using computer-assisted review since judges “typically . . . give deference to a producing party’s choice of search methodology and procedures in complying with discovery requests.” [133] The problem in this case was not that predictive coding had been used; rather, the problem was that Progressive used the technology without adhering to the necessary cooperative and transparency requirements “for a predictive coding protocol to be accepted by the court . . . as a reasonable method to search for and produce responsive ESI.” [134] In response to Progressive’s violation of these new predictive coding norms, the court required it to turn over the entirety of the 565,000 documents originally located through keyword searching, subject only to privilege restrictions. [135] This harsh result illustrates how seriously the court took the need for above-board behavior when employing coding technologies in discovery.

Significantly, the court showed a remarkable acceptance of predictive coding as something that could be appropriate and even viewed positively, as compared to other review mechanisms, in ESI cases. However, this acceptance came with strict methodological strings attached, emphasizing a focus on cooperation and transparency. Interestingly, although courts and agency responses have developed separately, these same strands appear in regulatory agencies’ limited publications. [136] Because similar themes emerge in each context, increased cooperation between courts and agencies regarding the specifics—such as how much transparency is needed (e.g., should a party only show its methodology or should it show every document classified as responsive or non-responsive?); how much agreement is required versus if a court or agency could unilaterally order its use; what confidence intervals are appropriate in which contexts; and more—will allow each to develop a strong protocol. These strong protocols will enforce and regulate the use of predictive coding to ensure that it is a positive development and strictly regulated by the enforcing body to avoid any exploitation, such as what happened in Progressive. [137]

IV. Predictive Coding and Regulatory Agencies

While the jurisprudence addressing predictive coding remains in its infancy, its relative youth is offset by increased usefulness for scholars and attorneys due to its fairly broad accessibility. What judges say in these cases is not a secret. Future judges and lawyers, who will argue based on and for these new judicial rules, can see what has and has not been applied and, at the same time, what works and what does not work in the predictive coding jurisprudence. This transparency allows the profession to understand where points of contention will arise, what policy choices must be made, and what solutions are available to resolve these questions.

Unfortunately, government agencies lack full-scale transparency in their interactions with predictive coding—including both their independent use of the technology and their receptiveness to data retrieved with automated methods. This Comment argues that increasing transparency regarding agencies’ procedures, uses, and concerns would allow for an information exchange to create comprehensive understandings of the potential for, and challenges surrounding, predictive coding in agencies and regulatory law. More importantly, greater transparency would also allow for a more holistic comparison between regulatory uses for predictive coding and its uses in the judicial system. Significantly, this sort of meaningful comparison would create a feasible environment for both courts and agencies to adapt their policies to best address the issues posed by automated technologies and ESI in the legal world. [138]

Much like discovery and litigation, ESI’s exponential growth creates a mass of documents subject to potential regulatory inquiries. [139] Similarly, predictive coding’s promises of lower cost and increased efficiency make it a revolutionary option to facilitate companies’ and individuals’ responses to regulatory inquiries. [140] The future benefits, which are even greater in regulatory contexts, may create incentives for cooperative methodologies and emphasize a shared end-goal between companies and agencies: the opportunity to manually review fewer documents, resulting in monetary and manpower savings. [141] Moreover, despite a lack of comprehensive and transparent information, evidence indicates that at least some agencies, particularly the Department of Justice, the Securities and Exchange Commission, and the Federal Trade Commission, are allowing predictive coding to be used to respond to inquiries, although sometimes merely “on a ‘case-by-case’ basis.” [142]

A number of factors support a deliberative approach when considering the use of predictive coding in regulatory contexts. For example, one observer highlighted that “[w]ading through [a] virtual avalanche of data can be intimidating in civil litigation, but effectively sorting through ESI in a government investigation is even more daunting, where one potentially exculpatory document may change the nature of a case.” [143] Given that regulatory agencies share many of the same potential benefits and concerns surrounding predictive coding, it is helpful to see how agencies address automated methodologies to provide a meaningful contrast, and possibly comparison, with courts as both entities must reformulate their existing frameworks to adapt to machine-learning technology’s growing influence. To that end, this Comment explores how three agencies—the Department of Justice (“DOJ”), the Securities and Exchange Commission (“SEC”), and the Federal Trade Commission (“FTC”)—are responding to, regulating, and incorporating predictive coding technologies.

A. The Department of Justice

The Department of Justice, established in 1870, serves under the Attorney General as “the central agency for [the] enforcement of federal laws.” [144] Because the Department is quite large, it is subdivided into smaller divisions that have their own particularized missions. [145] One example is the Antitrust Division, which “promote[s] economic competition through enforcing and providing guidance on antitrust laws and principles.” [146] This Comment focuses on the Department’s approach to predictive coding technologies through the lens of the actions and perspectives taken by the Antitrust Division.

The Department recognizes that electronic discovery provides new challenges for companies, which are now required to search and produce more documents. [147] In response, it is among the agencies that have already permitted predictive coding to be used to satisfy required document production in some situations. [148] The Antitrust Division is an example of a division that needs access to large amounts of documents and financial information in order to determine whether any criminal wrongdoing has occurred. However, before disclosing relevant information, companies may need to first peruse broad caches of data to gain a sense of what is relevant. Illustratively, the Antitrust Division permitted the use of predictive coding to determine relevance in the “proposed merger of Anheuser-Busch InBev NV and Mexico’s Grupo Modelo SAB.” [149]

In doing so, the Department recognized the mutual benefits available when properly employing coding technologies. For example, these technologies allow the Department to “reduce the document review and production burden on parties while still providing the [Department] with the documents it needs to fairly and fully analyze transactions and conduct.” [150] Predictive coding technologies also allow for a prioritized allocation of resources so that “only the ‘really’ relevant documents [are] produced.” [151] These benefits mirror the fiscal and efficiency gains that the same technology promised in litigation and discovery.

However, predictive coding is not a perfect solution to the imbalances between companies or individuals and government agencies. Many of the same concerns raised in the case law also occur with regulatory inquiries: in order for predictive coding to work within the existing dynamic, parties must employ “a high degree of cooperation and transparency about the implementation and structure of the predictive coding process.” [152] Notably, these transparency requirements apply only to the producing party, not to the agency. The Department’s concern mirrors Judge Peck’s discussion with the parties in the discovery conferences in Da Silva Moore, where he emphasized collaboration’s centrality to the meaningful application of predictive coding. [153] Questions also arise in the regulatory context as to who is qualified to determine a document’s relevance and whether intentional or unintentional bias could taint the retrieval process. [154] Because of these concerns and the lack of sufficient data, the Department currently requires “written modification” to the informational request when using the artificial intelligence potential for predictive coding and emphasizes the need for “cooperation, transparency, time, and hard work,” along with review of the training set and the qualitative samples and the creation of a mechanism for supplementary document retrieval. [155] These concerns show the likely direction for any predictive coding agreements permitted by the Department. [156]

Significantly, despite the potential issues, the Department does not proscribe potential use of predictive coding in these inquests. Its willingness to at least consider using coding in particular cases bolsters the need for open informational exchanges about predictive coding so that all agencies can put it to its highest use. Given the Department’s emphasis on transparency to be sure that responding parties adequately employ coding technologies, [157] it only makes sense that the Department’s similar transparency about its evaluations would allow for the development of a broad-based, efficient framework.

B. The Securities and Exchange Commission

The Securities and Exchange Commission has a comprehensive mandate “to protect investors, maintain fair, orderly, and efficient markets, and facilitate capital formation.” [158] In order to effectuate its duties, the SEC requires extensive disclosures from companies to create an informational balance for investors. [159] To produce required information to the SEC, companies may have to sort through extensive caches of information and communications. As such, predictive coding may provide a more efficient retrieval system, allowing these companies to fully comply with the Commission’s requirements at the lowest possible cost.

Similar to the Department of Justice and Federal Trade Commission, the Securities and Exchange Commission does not have a blanket ban on predictive coding for data production. Rather, it requires specific disclosure by the requesting company and SEC approval before the company can use the technology to satisfy its regulatory guidelines. [160] What sets the Commission apart, however, and represents the next logical progression in regulatory treatment of predictive coding is the fact that it is beginning to use predictive coding software within its own systems and review processes in addition to merely accepting information selected through automated systems. [161] The Commission provides a perfect example of how agencies hold the potential to shape the way private parties use predictive coding: by accepting the fruits of these technologies when the technologies are properly employed per publicly accessible guidelines, which still require Commission approval. However, the fact that a body as concerned with data and accurate production as the SEC is signaling acceptance of coding technologies provides an important foundation for private parties, as they now have an external standard by which to measure their interest in using coding in private cases. Additionally, this example could provide a yardstick by which courts could measure their own standards. This potential is strengthened by the fact that regulatory bodies, like the Commission, will learn how to address coding technologies through repeated interaction with them on the other side of the table as well as through their actual use. [162]

Significantly, this potential for insider familiarity with the technology and knowledge about its strengths, weaknesses, and possibilities could allow the Commission, along with other regulatory bodies, to use that familiarity to create effective guidelines for disclosures. [163] The benefit of actual knowledge is without parallel and illustrates the next logical step in regulatory bodies’ acceptance of coding technologies. [164] Moreover, this knowledge can be put to even greater use if the employing agencies are willing to be transparent and share their experiences, successes, and trials not only with other agencies, but also with the courts. This transparency would allow effective rules to develop across the spectrum and also permit software providers to adopt more effective methodologies. [165]

C. The Federal Trade Commission

The Federal Trade Commission is tasked with “prevent[ing] business practices that are anticompetitive or deceptive or unfair to consumers; . . . enhanc[ing] informed consumer choice and public understanding of the competitive process; and . . . accomplish[ing] this without unduly burdening legitimate business activity.” [166] Similar to the Department of Justice’s Antitrust Division, the Commission’s mandate necessitates perusal of large amounts of data to determine if any inappropriate behavior has occurred. [167] As such, companies targeted by the Commission for disclosure are subject to the same burdens as those faced under obligations to the other agencies and during litigation.

In a move similar to those of the Department of Justice and the Securities and Exchange Commission, the Federal Trade Commission shows preliminary acceptance of predictive coding through its response letters that encourage the use of coding in disclosures procured by informational subpoenas. [168] Interestingly, however, the Commission moved beyond merely permitting predictive coding (which raises the tensions and concerns previously discussed [169]) and instead moved towards requiring companies to, at the very least, consider predictive coding when attempting to decrease their disclosure responsibilities. [170] For example, in an April 11, 2012, response to a motion to quash or limit civil investigative demand, the FTC made a passing reference, when critiquing the petitioner’s cost analysis, endorsing the affordability of predictive coding technologies in these contexts. [171] The letter critiqued the cost estimate, stating that it failed to “account for factors that may reduce the cost and time of production . . . [because] Petitioners have not sufficiently addressed the availability of e-discovery technology, such as advanced analytical tools and predictive coding, to enable fast and efficient search, retrieval, and production of electronically stored information . . . .” [172]

Lest this appear to be an isolated incident, in another response letter, the Commission addressed the correlation between ESI and an increased burden, asserting that parties need to include “affirmative suggestions [that] could include . . . predictive coding.” [173] Taken in conjunction with each other, responses like these illustrate the Commission’s relative acceptance of coding technologies. This step has logical parallels to the progression described in the case law. Automated technologies’ mandatory acceptance, however, raises the same needs for transparency and interactive learning with other entities, given predictive coding’s growing support in the regulatory community.

Even more importantly, the Commission has begun the process of moving predictive coding decisions from informal determinations into the actual regulatory structure. [174] This shift represents a significant step in the progressive ladder towards full-scale acceptance and implementation of predictive coding. In early 2012, the Commission proposed rule changes to 16 C.F.R. Parts 2, Nonadjudicative Procedures, and 4, Miscellaneous Rules. [175] In explaining these proposed changes, the Commission addressed the “Need for Reform of the Commission’s Investigatory Process” and cited three key reasons in support of the use of automated technologies: (1) “information is no longer accurately measured in pages, but instead in megabytes,” (2) “ESI[] is widely dispersed throughout organizations,” and (3) “because ESI is broadly dispersed and not always consistently organized . . . , searches, identification, and collection all require special skills and, if done properly, may utilize one or more search tools such as . . . predictive coding, and other advanced analytics.” [176] Notably, only the explanatory section in the Federal Register specifically mentions predictive coding. [177] However, the specific reference in the Federal Register section still provides helpful insights into how the agency views predictive coding and similar technology. Additionally, and most importantly, it signals that at least one regulatory body is willing to move beyond the intermediate stages of agency responses to meaningful rule-making that seriously considers predictive coding technologies.

V. An Overview of Industry Insights and Miscellaneous Factors to Consider When Developing the Rules and Regulations Surrounding Predictive Coding

Much of the review and analysis in this Comment, and much of the available academic literature, [178] presupposes that private entities, or the attorneys representing them, will be the ones actually employing predictive coding software in response to disclosure inquiries, whether in litigation or in regulatory inquests. Interestingly, the SEC’s use of coding technologies, when considered in conjunction with National Day Laborer, which also touches on this issue to an extent, [179] poses an additional normative question: whether government bodies should also be allowed to use predictive coding in response to document requests. Government bodies are held to different standards than private parties, [180] and thus something that may be appropriate for a corporation or individual to employ may not be sufficient to meet certain higher burdens held by agencies. On the other hand, because the government has similar interests in producing relevant documents at low cost to personnel and budgets, [181] predictive coding provides parallel possibilities for increased speed, quality of review, and fiscal efficiency. However, the ways in which states and other branches of government do or do not use predictive coding will likely be influenced by the examples set by the courts and regulatory protocols. As such, increased transparency and awareness of the ramifications of these decisions are crucial for the proper development of predictive coding in a number of venues.

Another serious question is whether these technologies should be permitted in criminal cases. [182] Given that the Department of Justice is already accepting data produced through automated review, [183] this concern is far more than academic. It also brings together regulatory and judicial concerns—hypothetically, predictive coding technologies could be sufficient when an agency determines wrongdoing has occurred and pursues regulatory action against an entity but the court may not deem it acceptable for use in a judicial context. Additionally, the higher moral and punitive stakes in criminal cases particularly implicate the margin of error from automated retrieval. If the one document to prove innocence is missed because it is within the margin of error, is that acceptable in a criminal case? Should it be? Should an attorney be able to make that call for a client or should explicit understanding and consent be required from defendants? These are issues that connect regulatory and judicial action and raise salient issues that touch more than procedural concerns and may call for public commentary in addition to legal recommendations.
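The margin-of-error concern can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that the producing party draws a simple random sample from the documents the software set aside as non-responsive and that reviewers find nothing relevant in that sample. The function name and the document counts are hypothetical and are not drawn from any case or agency protocol discussed in this Comment.

```python
def max_undetected_rate(sample_size: int, confidence: float = 0.95) -> float:
    """Upper bound on the miss rate that is still consistent with finding
    zero relevant documents in a simple random sample of the discarded set."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A miss rate p survives a clean sample with probability (1 - p) ** n.
    # Setting that equal to 1 - confidence and solving for p gives the bound.
    return 1.0 - (1.0 - confidence) ** (1.0 / sample_size)


# Hypothetical figures, chosen only for illustration.
discarded_documents = 500_000
sample_size = 1_500

rate = max_undetected_rate(sample_size)
possible_misses = rate * discarded_documents

print(f"Upper bound on miss rate: {rate:.4%}")
print(f"Relevant documents that could remain unseen: {possible_misses:,.0f}")
```

Under these assumptions, even a sample with no relevant documents cannot rule out a meaningful number of missed documents in the discarded set, which is why the acceptable level of error may deserve separate consideration in criminal matters.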

VI. Comparisons, Differences, and Recommendations for Courts and Agencies Addressing the Use of Predictive Coding

A review of the seminal cases and limited information available about regulatory bodies’ behavior reveals a number of parallel trends. Most fundamentally, predictive coding solves a common problem: ESI results in massive caches of data that are fundamentally incompatible with traditional human review but can be more readily and affordably accessed with technology. Moreover, among both courts and regulatory bodies, there is clear support for heightened transparency and cooperation between the parties who seek to employ predictive coding. Furthermore, shared areas of tension also arise. For example, can an unwilling party be mandated to engage in, or at the least consider engaging in, predictive coding? Is it better for decisions to be made by individuals or bodies with personal experience working with predictive coding? The latter question is particularly poignant given that while agencies may gain actual experience using, rather than regulating, coding, it is less likely that judges will do so. Additionally, actual use will help to enforce accurate expectations of what the technology is capable of but may also create other biases, including those in favor of vendors or in favor of decreased transparency by parties, who may fear that transparency could implicate privilege or attorney-work product concerns. [184] It also creates an imbalance between judges and agencies, as only agencies really have the power to gain this hands-on experience, since courts act as mediators while agencies can be parties to disputes involving actual data. Moreover, it raises the question of whether there should be differences between litigation and regulatory use. Finally, both contexts place significant emphasis on the need for proper training in order to create unbiased results.

There are also a number of differences between litigation and regulatory contexts. First, thus far, litigation has involved private parties and the potential for information and monetary imbalances, [185] whereas regulatory bodies can require private parties to produce information regardless of the individuals’ finances and comfort. [186] This dynamic relegates the greater incentives to private parties, rather than government agencies, for incorporation of this new technology. [187] However, other factors make agencies seem to be the more attractive candidates for incorporating coding technologies. For example, agencies simply have better opportunities to gain experience around predictive coding: they deal with more eligible data caches than courts, they can use coding technology on their own initiatives, they are less bound by rigid rules of evidence and procedure, and the parties they work with have a high self-interest in promoting coding. [188] The interaction with agencies is distinguishable in that the agencies themselves must peruse the produced data, whereas judges only act as umpires between the parties exchanging the data. [189] Judges, on the other hand, have strong interests in justice and efficiency, but mere context may render them less experienced or willing to make a determinative endorsement of particular technologies. [190] Additionally, courts are bound by the rules of evidence and procedure. [191] Judicial decisions that integrate predictive coding into a traditional and foreseeable structure have important ramifications both for the technology and how it develops—for example, what limits on its use will stunt growth in some areas and foster growth in others? A significant positive, additionally, is the great transparency judicial decisions have in this area. [192]

As such, courts and agencies should share their strengths with the other and complement the other’s weaknesses. Both groups should also consider how and where to accept sufficient quality control levels—at what point is the program sufficiently trained and reliable? As previously discussed, this may vary significantly depending on the legal context. This means that while agencies and courts can learn from each other’s experiments, they also should be careful to acknowledge the differences between them and how these differences may make certain rules either inapplicable or materially harmful to a particular context.
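One way to frame the quality control question is as a statistical estimate of recall, meaning the share of truly responsive documents that the software actually retrieved. The following is a minimal sketch under stated assumptions: reviewers have manually coded a random validation sample, and the Wilson score interval (a standard statistical method, swapped in here for illustration) is used to express uncertainty. The sample counts and the 75% floor are hypothetical and do not reflect any standard adopted by a court or agency.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 is roughly 95% confidence."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Hypothetical validation sample: reviewers identified 400 responsive
# documents by hand, and the software had flagged 320 of them.
responsive_in_sample = 400
retrieved_by_software = 320

low, high = wilson_interval(retrieved_by_software, responsive_in_sample)
recall_floor = 0.75  # assumed threshold, for illustration only

print(f"Estimated recall: {retrieved_by_software / responsive_in_sample:.1%}")
print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
print("Meets assumed floor" if low >= recall_floor else "Falls short of assumed floor")
```

Whether the point estimate or the lower bound of the interval should be measured against a required floor is itself a policy choice, and the appropriate floor may differ between litigation and regulatory contexts.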


Predictive coding is still in its trial stages—in all likelihood, it will not work in all contexts, and it will need guidelines in order to achieve its highest potential in appropriate cases. Because it is still developing, many questions remain as to what predictive coding can do and what it should do. The debate over how to answer these questions will only strengthen understandings of its possibilities and highlight areas of concern. Raising questions is a positive thing: by identifying early on the areas where tension and issues may arise, courts and agencies can create rules in light of these issues and play a preventative role. They can also highlight the issue areas that they may be most concerned about—such as transparency, companies’ and attorneys’ desire to protect confidential, non-responsive documents, and cooperation. When determining both what areas to focus their energies on and how to achieve these goals, courts and agencies should also strongly consider research by other groups, such as the Sedona Conference’s Sedona Principles Addressing Electronic Document Production,[193] as the informational exchange will be best served through this sort of transparent exchange.[194] By showing each other what predictive coding has achieved and in what contexts it has done so, both courts and regulatory bodies can use this knowledge to promote rules that identify and create the best guidelines that allow predictive coding to grow without stifling justice, impartiality, or efficiency. In order to maintain existing legal principles, or to make a conscious choice to overhaul existing norms, all bodies should be hyper-vigilant when addressing new technology to be sure that all regulatory and guiding choices prospectively promote desired future reforms, rather than resulting in forced, piecemeal changes. 
Predictive coding holds great promise for the future of the legal field, but it is up to the community as a whole to ensure a healthy growth environment so it can achieve its full potential while not alienating existing moral and procedural standards.

Christina T. Nasuti**


* © 2014 Christina T. Nasuti.

** Special thanks go to my loving family and fiancé, who have supported me through every stage of this project. I would also especially like to thank Professor Dana Remus, who introduced me to predictive coding and was the inspiration for this Comment’s topic. Finally, thank you to the North Carolina Law Review Board and Staff, and Jennifer Little in particular, for your thorough edits and help along the way. I am sincerely grateful to you all.

93 N.C. L. Rev. 222 (2014)
