Two-Filter Document Culling

Large document review projects can maximize efficiency by employing a two-filter method to cull documents from costly manual review. This method helps reduce costs and maximize recall. I introduced this method, and the diagram shown here illustrating it, at the conclusion of my blog series, Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part Three. I use the two-filter method in most large projects as part of my overall multimodal, bottom line driven, AI-enhanced (i.e., predictive coding) method of review. I have described this multimodal method many times here, and you will find summaries of it elsewhere, including my CAR page, Legal Search Science, and the work in progress, the EDBP, which outlines best practices for lawyers doing e-discovery.

My two-filter method of course employs deduplication and DeNISTing in the First Filter. (I always do full horizontal deduplication across all custodians.) Deduplication and DeNISTing are, however, mere technical, non-legal filters. They are already well-established industry standards, so I see no need to discuss them further in this article.

Some think those two technical methods are the be-all and end-all of ESI culling, but, as this two-part blog will explain, they are just the beginning. The other methods require legal judgment, so you cannot simply hire a vendor to perform them, as you can with deduplication and DeNISTing. This is why I am taking pains to explain two-filter document culling, so that other legal teams can use it to reduce wasted review expenses.

This blog is the first time I have gone into the two-filter culling component in any depth. This method has been proven effective in attaining high recall at low cost in at least one open scientific experiment, although I cannot go into the details. You will just have to trust me on that. Insiders know anyway. For the rest, just look around and see that I have no products to sell here, and that I accept no ads. This is all part of an old lawyer’s payback to a profession that has been very good to him over the years.

My thirty-five years of experience in law have shown me that the most reliable way for the magic of justice to happen is by finding the key documents. You find the truth, the whole truth, and nothing but the truth when you find the key documents and use them to keep the witnesses honest. Deciding cases on the basis of the facts is how our system of justice tries to decide all cases on the merits, in an impartial and fair manner. In today’s information-flooded world, that can only happen if we use technology to find relevant evidence quickly and inexpensively. The days of finding the truth by simple witness interviews are long gone. Thus I share my search and review methods as a kind of payback, and to pay it forward. For now, as I have for the past eight years, I will try to make the explanations accessible to beginners and eLeet alike.

We need cases to be decided on the merits, on the facts. Hopefully my writing and rants will help make that happen in some small way, and will help stem the tide of over-settlement, where many cases are decided on the basis of settlement value, not merits. Too many frivolous cases are filed that drown out the few with great merit. Judges are overwhelmed and often do not have the time needed to get down to the truth and render judgments that advance the cause of justice.

Most of the time the judges, and the juries they assemble, are never even given the chance to do their job. The cases all settle out instead. As a result, only one percent of federal civil cases actually go to trial. This is a big loss for society, and for the so-called “trial lawyers” in our profession, a group I once prided myself on being a part of. Now I just focus on getting the facts from computers, to help keep the witnesses honest, and cases decided on the true facts, the evidence. That is where all the real action is nowadays anyway.

By the way, I expect to get another chance to prove the value of the methods I share here in the 2015 TREC experiment on recall. We will see, again, how they stack up against other approaches. This time I may even have one or two people assist me, instead of doing it alone as I did before. The Army of One approach (see Army of One: Multimodal Single-SME Approach To Machine Learning), which I have also described here many times, although effective, is very hard and time-consuming. My preference now is a small team approach, kind of like a nerdy SWAT team, or Seal Team Six approach, but without guns and killing people and stuff. I swear! Really.

I do try to cooperate whenever possible. I preach it and I try hard to walk my talk. I have always endorsed Richard Braman’s Cooperation Proclamation, unlike some. You know who you are.

Some Software is Far Better than Others

One word of warning: although this method is software agnostic, in order to implement the two-filter method your document review software must have certain basic capabilities. That includes effective, easy bulk coding features for the First Filter, the multimodal broad-based culling. Some of the multiple methods do not require software features, just attorney judgment, such as excluding custodians, but others do require software features, like domain searches or similarity searches. If your software does not have the features that will be discussed here for the First Filter, then you probably should switch right away, but, for most, that will not be a problem. The multimodal culling methods used in the First Filter are, for the most part, pretty basic.

Some of the software features needed to implement the Second Filter are, however, more advanced. The Second Filter works best when using predictive coding and probability ranking. You review the various strata of the ranked documents. The Second Filter can still be used with other, less advanced multimodal methods, such as keywords. Moreover, even when you use bona fide active machine learning software features, you continue to use a smattering of other multimodal search methods in the Second Filter. But now you do so not to cull, but to help find relevant and highly relevant documents to improve training. I do not rely on probability searches alone, although sometimes in the Second Filter I rely almost entirely on predictive coding based searches to continue the training.

If you are using software without AI-enhanced active learning features, then you are forced to use only other multimodal methods in the Second Filter, such as keywords. A warning: true active learning features are missing from most review software, or are very weak. That is true even of software that claims to have predictive coding features, but really just has dressed-up passive learning, i.e., concept searches with latent semantic indexing. You handicap yourself, and your client, by continuing to use such less expensive programs. Good software, like everything else, does not come cheap, but should pay for itself many times over if used correctly. The same comment goes for lawyers too.

First Filter – Keyword Collection Culling

Some first stage filtering takes place as part of the ESI collection process. The documents filtered out are preserved, but not collected or ingested into the review database. The most popular collection filter as of 2015 is still keyword, even though this is very risky in some cases and inappropriate in many. Typically such keyword filtering is driven by vendor costs, to avoid processing and hosting charges.

Some types of collection filtering are appropriate and necessary, for instance, in the case of custodian filters, where you broadly preserve the ESI of many custodians, just in case, but only collect and review a few of them. It is, however, often inappropriate to use keywords to filter the collection of ESI from admittedly key custodians. This is a situation where an attorney determines that a custodian’s data needs to be reviewed for relevant evidence, but does not want to incur the expense of having all of their ESI ingested into the review database. For that reason they decide to review only data that contains certain keywords.

I am not a fan of keyword-filtered collections. The obvious danger of keyword filtering is that important documents may not contain the keywords. Since those documents will not even be placed in the review platform, you will never know that the relevant ESI was missed. You have no chance of finding it.

See, e.g., William Webber’s analysis of the Biomet case, where this kind of keyword filtering was used before predictive coding began: What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13). Webber shows that in Biomet this method filtered out over 40% of the relevant documents in the First Filter. That doomed the Second Filter predictive coding review to a maximum possible recall of 60%, even if it had been perfect, meaning it would otherwise have attained 100% recall, which never happens. The Biomet case very clearly shows the dangers of over-reliance on keyword filtering.

Nevertheless, sometimes keyword collection may work, and may be appropriate. In some simple disputes, and with some data collections, obvious keywords may work just fine to unlock the truth. For instance, sometimes the use of names is an effective method to identify all, or almost all, documents that may be relevant. This is especially true in smaller and simpler cases. This method can, for instance, often work in employment cases, especially where unusual names are involved. It becomes an even more effective method when the keywords have been tested. I just love it, for instance, when the plaintiff’s name is something like the famous Mister Mxyzptlk.

In some cases keyword collections may be as risky as in the complex Biomet case, but may still be necessary because of the proportionality constraints of the case. The law does not require unreasonably excessive search and review, and what is reasonable in a particular case depends on the facts of the case, including its value. See my many writings on proportionality, including my law review article Predictive Coding and Proportionality: A Marriage Made In Heaven, 26 Regent U. Law Review 1 (2013-2014). Sometimes you have to try for rough justice with the facts that you can afford to find given the budgetary constraints of the case.

The danger of missing evidence is magnified when the keywords are selected on the basis of educated guesses or just limited research. This technique, if you can call it that, is, sadly, still the dominant method used by lawyers today to come up with keywords. I have long thought it is equivalent to a child’s game of Go Fish. If keywords are dreamed up like that, as mere educated guesses, then keyword filtering is a high-risk method of culling out irrelevant data. There is a significant danger that it will exclude many important documents that do not happen to contain the selected keywords. No matter how good your predictive coding may be after that, you will never find these key documents.

If the keywords are not based on mere guessing, but are instead tested, then keyword filtering becomes a real technique that is less risky for culling. But how do you test possible keywords without first collecting and ingesting all of the documents to determine which are effective? It is the old cart-before-the-horse problem.

One partial answer is that you could ask the witnesses, and do some partial reviews before collection. Testing and witness interviews are required by Judge Andrew Peck’s famous wake-up-call case, William A. Gross Constr. Assocs., Inc. v. Am. Mfrs. Mut. Ins. Co., 256 F.R.D. 134, 134, 136 (S.D.N.Y. 2009). I recommend that opinion often, as many attorneys still need to wake up about how to do e-discovery. They need to add ESI use, storage, and keyword questions to their usual new-case witness interviews.

Interviews do help, but there is nothing better than actual hands-on reading and testing of the documents. This is what I like to call getting your hands dirty in the digital mud of the actual ESI collected. Only then will you know for sure the best way to mass-filter out documents. For that reason my strong preference in all cases of significant size is to collect in bulk, and not filter out by keywords. Once you have the documents in the database, you can then effectively screen them out by using parametric Boolean keyword techniques. Ask your particular vendor about the various ways to do that.

By the way, parametric is just a reference to the various parameters of a computer file that all good software allows you to search. You could search the text and all metadata fields, the entire document. Or you could limit your search to particular metadata fields, such as the date, the prepared-by field, or the to and from fields of an email. Everyone knows what Boolean means, but you may not know all of the many variations that your particular software offers to create highly customized searches. While predictive coding is beyond the grasp of most vendors and case managers, the intricacies of keyword search are not. They can be a good source of information on keyword methods.
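To make the idea concrete, here is a minimal sketch in Python of what a parametric Boolean screen does conceptually. The document fields, dates, and keywords are all hypothetical; any real review platform exposes this through its own query builder or syntax rather than code.

```python
from datetime import date

# Hypothetical documents with a few common metadata parameters (fields).
documents = [
    {"id": 1, "from": "ceo@acme.com", "to": "cfo@acme.com",
     "sent": date(2012, 3, 15), "text": "Q1 revenue projections attached"},
    {"id": 2, "from": "newsletter@vendor.com", "to": "all@acme.com",
     "sent": date(2013, 7, 2), "text": "Weekly industry newsletter"},
]

def parametric_boolean_hit(doc):
    """Boolean keyword logic limited to specific parameters, not just full text."""
    in_date_range = date(2011, 1, 1) <= doc["sent"] <= date(2012, 12, 31)
    keyword_hit = ("revenue" in doc["text"].lower()
                   or "projection" in doc["text"].lower())
    from_company = doc["from"].endswith("@acme.com")
    return in_date_range and keyword_hit and from_company

print([d["id"] for d in documents if parametric_boolean_hit(d)])  # -> [1]
```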

First Filter – Date Range and Custodian Culling

Even when you collect in bulk, and do not keyword filter before you put custodian ESI into the review database, in most cases you should filter for date range and custodian. It is often possible for an attorney to know, for instance, that no emails before or after a certain date could possibly be relevant. That is often not a highly speculative guessing game. It is reasonable to filter on this timeline basis before the ESI goes into the database. Whenever possible, try to get agreement on date range screening from the requesting party. You may have to widen it a little, but it is worth the effort to establish a line of communication and begin a cooperative dialogue.

The second thing to talk about is which custodians you are going to include in the database. You may put 50 custodians on hold, and actually collect the ESI of 25, but that does not mean you have to load all 25 into the database for review. Here your interviews and knowledge of the case should allow you to know who the key, key custodians are. You rank them by your evaluation of the likely importance of the data they hold to the facts disputed in the case. Maybe, for instance, in your evaluation you only need to review the mailboxes of 10 of the 25 collected.

Again, disclose and try to work that out. The requesting party can reserve the right to ask for more; that is fine. They rarely do after production has been made, especially if you were careful and picked the right 10 to start with, and if you were careful during review to drop and add custodians based on what you see. If you are using predictive coding in the Second Filter stage, the addition or deletion of data mid-course is still possible with most software. It should be robust enough to handle such mid-course corrections. It may just slow down the ranking for a few iterations, that’s all.

First Filter – Other Multimodal Culling

There are many other bulk coding techniques that can be used in the First Filter stage. This is not intended to be an exhaustive list. As with all complex tasks in the law, simple black-letter rules are for amateurs. The law, which mirrors the real world, does not work like that. The same holds true for legal search. There may be many Gilbert’s-style search books and articles, but they are just 1L-type guides. For true legal search professionals they are mere starting points. Use my culling advice here in the same manner. Use your own judgment to mix and match the right kind of culling tools for the particular case and data encountered. Every project is slightly different, even in the world of repeat litigation, like the employment law disputes where I currently spend much of my time.

Legal search is at its core a heuristic activity, but one that should be informed by science and technology. The knowledge triangle is a key concept for today’s effective e-Discovery Team. Although e-Discovery Teams should be led by attorneys skilled in evidence discovery, they should include scientists and engineers in some way. Effective team leaders should be able to understand and communicate with technology experts and information scientists. That does not mean all e-discovery lawyers need to become engineers and scientists too. That effort would likely diminish your legal skills, given the time demands involved. It just means you should know enough to work with these experts. That includes the ability to see through the vendor sales propaganda, and to incorporate the knowledge of the bona fide experts into your legal work.

One culling method that many overlook is file size. Some collections have thousands of very small files, just a few bytes each, that are nothing but backgrounds, tiny images, or just plain empty space. They are too small to have any relevant information. Still, you need to be cautious and look out for very small emails, for instance, ones that just say “yes.” Depending on context, such a message could be relevant and important. But for most other types of very small files there is little risk. You can go ahead and bulk code them irrelevant and filter them out.

Even more subtle is filtering out files based on their being very large. Sort your files by size, and then look at both ends, small and big. They may reveal certain files and file types that could not possibly be relevant. There is one more characteristic of big files that you should consider. Many of them have millions of lines of text. Big files are confusing to machine learning when, as is typical, only a few lines of the text are relevant and the rest are just noise. That is another reason to filter them out, perhaps not to exclude them entirely, but to divert them for special treatment and review outside of predictive coding. In other projects, where you have many large files like that and you need the help of AI ranking, you may want to hold them in reserve. You may only want to throw them into the ranking mix after your AI algorithms have acquired a pretty good idea of what you are looking for. A maturely trained system is better able to handle big noisy files.
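Here is a minimal sketch of this kind of size-based First Filter triage, covering both the tiny-file and the big-file ends. The byte thresholds, file types, and disposition labels are invented for illustration; in practice you would sample each group before bulk coding anything.

```python
def size_cull(doc):
    """Return a First Filter disposition based on file size in bytes."""
    if doc["size_bytes"] < 1024:
        # Tiny files are usually backgrounds, icons, or empty shells, but a
        # very short email ("yes") can still matter, so divert those to review.
        return "review" if doc["type"] == "email" else "bulk_irrelevant"
    if doc["size_bytes"] > 25 * 1024 * 1024:
        # Very large, noisy files can confuse machine learning; hold them
        # out for special treatment rather than discarding them outright.
        return "divert_large"
    return "second_filter"

sample = [
    {"id": "a", "type": "gif", "size_bytes": 200},
    {"id": "b", "type": "email", "size_bytes": 300},
    {"id": "c", "type": "spreadsheet", "size_bytes": 40 * 1024 * 1024},
    {"id": "d", "type": "email", "size_bytes": 8000},
]
print({d["id"]: size_cull(d) for d in sample})
# {'a': 'bulk_irrelevant', 'b': 'review', 'c': 'divert_large', 'd': 'second_filter'}
```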

File type is a well-known and often highly effective method to exclude large numbers of files of the same type after looking at only a few of them. For instance, there may be automatically generated database files, all of the same type. You look at a few to verify these databases could not possibly be relevant to your case, and then you bulk code them all irrelevant. There are many types of files like that in some data sets. The First Filter is all about being a smart gatekeeper.

File type is also used to eliminate, or at least divert, non-text files, such as audio files or most graphics. Since most Second Filter culling is going to be based on text analytics of some kind, there is no point in sending anything other than files with text into that filter. In some cases, and some datasets, this may mean bulk coding them all irrelevant. This might happen, for instance, where you know that no music or other audio files, including voice messages, could possibly be relevant. We also see this commonly where we know that photographs and other images could not possibly be relevant. Exclude them from the review database.

You must, however, be careful with all such gatekeeper activities, and never do bulk coding without some judgmental sampling first. Large unknown data collections can always contain a few unexpected surprises, no matter how many document reviews you have done before. Be cautious. Look before you leap. Skim a few documents of each ESI file type you are about to bulk code as irrelevant.

This directive applies to all First Filter activities. Never do it blind, on logic or principle alone. Get your hands in the digital mud. Do not over-delegate all of the dirty work to others. Do not rely too much on your contract review lawyers and vendors, especially when it comes to search. Look at the documents yourself and do not just rely on high-level summaries. Every real trial lawyer knows the importance of that. The devil is always in the details. This is especially true when you are doing judgmental search. The client wants your judgment, not that of a less qualified associate, paralegal, or minimum-wage contract review lawyer. Good lawyers remain hands-on, to some extent. They know the details, but are also comfortable with appropriate delegation to trained team members.

There is a constant danger of too much delegation in big data review. The lawyer signing the Rule 26(g) statement has a legal and ethical duty to closely supervise document review done in response to a request for production. That means you cannot just hire a vendor to do that, although you can hire outside counsel with special expertise in the field.

Some non-text file types will need to be diverted for different treatment than the rest of your text-based dataset. For instance, some of the best review software allows you to keyword search audio files. The search is based on phonetics and waveforms. At least one company I know has had that feature since 2007. In some cases you will have to carefully review the image files, or at least certain kinds of them. Sorting based on file size and custodian can often speed up that exercise.

Remember, the goal is always efficiency with caution, but not over-caution. The more experienced you get, the better you become at evaluating risks and knowing where you can safely take chances to bulk code, and where you cannot. Another thing to remember is that many image files have text in them too, such as in the metadata, or in ASCII transmissions. They are usually not important and do not provide good training for second-stage predictive coding.

Text can also be hidden in dead TIFF files, if they have not been OCRed. Scanned-document TIFFs, for instance, may very well be relevant and deserve special treatment, including full manual review, but they may not show up in your review tool as text, because they have never been through OCR text recognition.

Concept searches have only rarely been of great value to me, but they should still be tried out. Some software has better capabilities with concepts and latent semantic indexing than others. You may find them to be a helpful way to find groupings of obviously irrelevant, or obviously relevant, documents. If nothing else, you can always learn something about your dataset from these kinds of searches.

Similarity searches of all kinds are among my favorites. If you find some file groups that cannot be relevant, find more like them; they are probably bulk irrelevant (or relevant) too. A similarity search, such as “find every document that is 80% or more the same as this one,” is often a good way to enlarge your carve-outs and thus safely improve your efficiency.
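One common way to implement that kind of “more like this” search, sketched below with invented text, is TF-IDF vectors and cosine similarity. Commercial review platforms have their own near-duplicate and similarity engines, so treat this only as an illustration of the concept, not as anyone’s actual product.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A known irrelevant exemplar and a small hypothetical collection.
exemplar = "Reminder: the parking garage will be closed for maintenance this weekend."
collection = [
    "Reminder: the parking garage will be closed for repairs this weekend.",
    "Attached are the draft license terms for the Acme distribution deal.",
    "Parking garage closure reminder: maintenance scheduled this weekend.",
]

vectorizer = TfidfVectorizer().fit([exemplar] + collection)
vectors = vectorizer.transform([exemplar] + collection)

# Compare every document to the exemplar; anything at or above the chosen
# similarity threshold is a candidate for the same bulk coding decision.
scores = cosine_similarity(vectors[0], vectors[1:]).ravel()
for text, score in zip(collection, scores):
    action = "bulk code with exemplar" if score >= 0.8 else "keep for other review"
    print(f"{score:.2f}  {action}  {text[:40]}")
```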

Another favorite of mine is domain culling of email. It is kind of like a spam filter. That is a great way to catch the junk mail, newsletters, and other purveyors of general mail that cannot possibly be relevant to your case. I have never seen a mail collection that did not have dozens of domains that could be eliminated. You can sometimes cull out as much as 10% of your collection that way, sometimes more when you start diving down into senders with otherwise safe domains. A good example of this is the IT department with its constant mass mailings, reminders, and warnings. Many departments are guilty of this, and after examining a few of their messages, it is usually safe to bulk code them all irrelevant.
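A minimal sketch of the domain-culling idea follows. The sender strings and the “safe to cull” domains are hypothetical; the point is simply to tally messages per sender domain, sample a few from the big bulk-mail domains, and then bulk code the rest.

```python
from collections import Counter
from email.utils import parseaddr

# Hypothetical sender strings pulled from an email collection's metadata.
senders = [
    "IT Notices <alerts@corp-it.example.com>",
    "news@industrynewsletter.com",
    "jane.doe@acme.com",
    "alerts@corp-it.example.com",
    "promo@retailer-deals.com",
]

def domain(sender):
    return parseaddr(sender)[1].split("@")[-1].lower()

# Count messages per sender domain to surface bulk-mail candidates.
counts = Counter(domain(s) for s in senders)
for dom, n in counts.most_common():
    print(f"{n:4d}  {dom}")

# After sampling a few messages from each candidate domain, bulk code
# the rest of that domain's mail as irrelevant.
safe_to_cull = {"industrynewsletter.com", "retailer-deals.com"}
culled = [s for s in senders if domain(s) in safe_to_cull]
print("culled:", culled)
```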

Second Filter – Predictive Culling and Coding

The Second Filter begins where the First Filter leaves off. The ESI has already been purged of unwanted custodians, date ranges, spam, and other obviously irrelevant files and file types. Think of the First Filter as a rough, coarse filter, and the Second Filter as fine-grained. The Second Filter requires a much deeper dive into file contents to cull out irrelevance. The most effective way to do that is to use predictive coding, by which I mean active machine learning, supplemented somewhat by a variety of other methods used to find good training documents. That is what I call a multimodal approach that places primary reliance on the Artificial Intelligence at the top of the search pyramid. If you do not have active-machine-learning-type predictive coding with ranking abilities, you can still do fine-grained Second Filter culling, but it will be harder, and probably less effective and more expensive.

[Diagram: Multimodal Search Pyramid]

All kinds of Second Filter search methods should be used to find highly relevant and relevant documents for AI training. Stay away from any process that uses just one search method, even if the one method is predictive ranking. Stay far away if the one method is rolling dice. Relying on random chance alone has been proven to be an inefficient and ineffective way to select training documents. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Parts One, Two, Three and Four. No one should be surprised by that.

The first round of training begins with the documents reviewed and coded relevant incidental to the First Filter coding. You may also want to defer the first round until you have done more active searches for relevant and highly relevant documents from the pool remaining after First Filter culling. In that case you also include irrelevant documents in the first training round, which is also important. Note that even though the first round of training is the only round of training that has a special name – seed set – there is nothing all that important or special about it. All rounds of training are important.

There is so much misunderstanding about that, and seed sets, that I no longer like to even use the term. The only thing special in my mind about the first round of training is that it is often a very large training set. That happens when the First Filter turns up a large amount of relevant files, or they are otherwise known and coded before the Second Filter training begins. The sheer volume of training documents in many first rounds thus makes it special, not the fact that it came first.

No good predictive coding software is going to give special significance to a training document just because it came first in time. The software I use has no trouble at all disregarding any early training if it later finds that it is inconsistent with the total training input. It is, admittedly, somewhat aggravating to have a machine tell you that your earlier coding was wrong. But I would rather have an emotionless machine tell me that, than another gloating attorney (or judge), especially when the computer is correct, which is often (not always) the case.

That is, after all, the whole point of using good software with artificial intelligence. You do that to enhance your own abilities. There is no way I could attain the level of recall I have been able to manage lately in large document review projects by relying on my own limited intelligence alone. That is another one of my search and review secrets. Get help from a higher intelligence, even if you have to create it yourself by following proper training protocols.

Maybe someday the AI will come prepackaged, and not require training, as I imagine in PreSuit. I know it can be done. I can do it with existing commercial software. But judging from the lack of demand I have seen in reaction to my offer of PreSuit as a legal service, the world is not ready to go there yet. I for one do not intend to push for PreSuit, at least not until the privacy aspects of information governance are worked out. See: Should Lawyers Be Big Data Cops?

Information governance in general is something that concerns me, and is another reason I hold back on PreSuit. See: Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance, Part One and Part Two. Also see: e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million, Part Two. I do not want my information governed, even assuming that’s possible. I want it secured, protected, and findable, but only by me, unless I give my express written assent (no contracts of adhesion permitted). By the way, even though I am cautious, I see no problem in requiring that consent as a condition of employment, so long as it is reasonable in scope and limited to business communications only.

I am wary of Big Brother emerging from Big Data. You should be too. I want AIs under our own individual control, where they each have a real big off switch. That is the way it is now with legal search, and I want it to stay that way. I want the AIs to remain under my control, not vice versa. Not only that, like the Europeans, I want a right to be forgotten by AIs and humans alike.

But wait, there’s still more to my vision of a free future, one where the ideals of America triumph. I want AIs smart enough to protect individuals from out-of-control governments, for instance, from any government, including the Obama administration, that ignores the Constitutional prohibition against General Warrants. See: Fourth Amendment to the U.S. Constitution. Now that Judge Facciola has retired, who on the DC bench is brave enough to protect us? See: Judge John Facciola Exposes Justice Department’s Unconstitutional Search and Seizure of Personal Email.

Perhaps quantum entanglement encryption is the ultimate solution? See, e.g.: Entangled Photons on Silicon Chip: Secure Communications & Ultrafast Computers, The Hacker News, 1/27/15. Truth is far stranger than fiction. Quantum physics may seem irrational, but it has been repeatedly proven true. The fact that it may seem irrational for two entangled particles to be correlated instantly over any distance just means that our sense of reason is not keeping up. There may soon be spooky ways for private communications to be forever private.


At the same time that I want unentangled freedom and privacy, I want a government that can protect us from crooks, crazies, foreign governments, and black hats. I just do not want to give up my Constitutional rights to receive that protection. We should not have to trade privacy for security. Once we lay down our Constitutional rights in the name of security, the terrorists have already won. Why do we not have people in the Justice Department clear-headed enough to see that?

Getting back to legal search, and how to find out what you need to know inside the law by using the latest AI-enhanced search methods, there are three kinds of probability-ranked search engines now in use for predictive coding.

Three Kinds of Second Filter Probability-Based Search Engines

After the first round of training, you can begin to harness the AI features in your software. You can begin to use its probability ranking to find relevant documents. There are currently three kinds of ranking search and review strategies in use: uncertainty, high probability, and random. The uncertainty search, sometimes called SAL for Simple Active Learning, looks at middle-ranking documents, where the software is unsure of relevance, typically the 40%-60% range. The high-probability search looks at documents where the AI is most confident that it knows whether they are relevant or irrelevant. You can also use some random searches, if you want, both simple and judgmental; just be careful not to rely too much on chance.
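Here is a minimal sketch of how those strata translate into review batches. The probability scores are invented, and the 40%-60% and 90%-plus bands are the typical ranges mentioned above, not fixed rules.

```python
# Ranked documents with the software's probability-of-relevance scores
# (the scores here are invented for illustration).
ranked = [
    {"id": 1, "prob": 0.97}, {"id": 2, "prob": 0.52},
    {"id": 3, "prob": 0.45}, {"id": 4, "prob": 0.08},
    {"id": 5, "prob": 0.93},
]

# High-probability (CAL-style) batch: documents the AI already ranks as relevant.
high_probability = [d["id"] for d in ranked if d["prob"] >= 0.90]

# Uncertainty (SAL-style) batch: documents the software is unsure about.
uncertainty = [d["id"] for d in ranked if 0.40 <= d["prob"] <= 0.60]

print("high probability:", high_probability)  # [1, 5]
print("uncertainty:", uncertainty)            # [2, 3]
```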

The 2014 Cormack Grossman comparative study of various methods has shown that the high-probability search, which they called CAL, for Continuous Active Learning using high-ranking documents, is very effective. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Also see: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training, Part Two.

My own experience also confirms their experiments. High-probability searches usually involve SME training and review of the upper strata, the documents with a 90% or higher probability of relevance. I will, however, also check out the low strata, but will not spend as much time on that end. I like to use both uncertainty and high-probability searches, but typically with a strong emphasis on the high-probability searches. And again, I supplement these ranking searches with other multimodal methods, especially when I encounter strong, new, or highly relevant types of documents.

Sometimes I will even use a little random sampling, but the mentioned Cormack Grossman study shows that it is not effective, especially on its own. They call such chance-based search Simple Passive Learning, or SPL. Ever since reading the Cormack Grossman study I have cut back on my reliance on random searches. You should too. It was small before; it is even smaller now.

Irrelevant Training Documents Are Important Too

In the Second Filter you are on a search for the gold, the highly relevant, and, to a lesser extent, the strong and merely relevant. As part of this Second Filter search you will naturally come upon many irrelevant documents too. Some of these documents should also be added to the training. In fact, it is not uncommon to have more irrelevant documents in training than relevant, especially with low prevalence collections. If you judge a document, then go ahead and code it and let the computer know your judgment. That is how it learns. There are some documents that you judge that you may not want to train on – such as the very large, or very odd – but they are few and far between.

Of course, if you have culled out a document altogether in the First Filter, you do not need to code it, because these documents will not be part of the documents included in the Second Filter. In other words, they will not be among the documents ranked in predictive coding. They will either be excluded from possible production altogether as irrelevant, or will be diverted to a non-predictive coding track for final determinations. The latter is the case for non-text file types like graphics and audio in cases where they might have relevant information.

How To Do Second Filter Culling Without Predictive Ranking

When you have software with active machine learning features that allow you to do predictive ranking, you find documents for training, and from that point forward you incorporate ranking searches into your review. If you do not have such features, you still sort out documents in the Second Filter for manual review; you just do not use ranking with SAL and CAL to do so. Instead, you rely on keyword selections, enhanced with concept searches and similarity searches.

When you find an effective parametric Boolean keyword combination, through a process of party negotiation, testing, educated guessing, trial and error, and judgmental sampling, you submit the documents containing proven hits to full manual review. Ranking by keywords can also be tried for document batching, but be careful of large files that have many keyword hits simply because of their size, not their relevance. Some software compensates for that, but most does not. So ranking by keywords can be a risky process.

I am not going to go into detail on the old-fashioned ways of batching out documents for manual review. Most e-discovery lawyers already have a good idea of how to do that. So too do most vendors. Just one word of advice. When you start the manual review based on keyword or other non-predictive coding processes, check in daily on the contract reviewers’ work and calculate what kind of precision the various keyword and other assignment folders are creating. If it is terrible, which I would say is less than 50% precision, then I suggest you try to improve the selection matrix. Change the Boolean logic, or the keywords, or something. Do not just keep plodding ahead and wasting client money.
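A minimal sketch of that daily precision check might look like the following. The folder names and counts are hypothetical reviewer tallies; the 50% threshold is the rough floor suggested above.

```python
# Hypothetical daily tallies from the contract review team, per folder.
folders = {
    "keywords_setA":    {"relevant": 450, "reviewed": 800},
    "keywords_setB":    {"relevant": 15,  "reviewed": 750},
    "similarity_batch": {"relevant": 310, "reviewed": 400},
}

for name, counts in folders.items():
    precision = counts["relevant"] / counts["reviewed"]
    verdict = "rework the selection matrix" if precision < 0.50 else "ok"
    print(f"{name:16s} precision={precision:.0%}  {verdict}")
```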

I once took over a review project that was using negotiated, then tested and modified keywords. After two days of manual review we realized that only 2% of the documents selected for review by this method were relevant. After I came in and spent three days with training to add predictive ranking we were able to increase that to 80% precision. If you use these multimodal methods, you can expect similar results.

Basic Idea of Two-Filter Search and Review

Whether you use predictive ranking or not, the basic idea behind the two-filter method is to start with a very large pool of documents, reduce its size with a coarse First Filter, and then reduce it again with a much finer Second Filter. The result should be a much, much smaller pool that is human reviewed, and an even smaller pool that is actually produced or logged. Of course, some of the documents subject to the final human review may be overturned, that is, found to be irrelevant False Positives. That means they will not make it into the very bottom production pool shown in the diagram at right after manual review.

In multimodal projects where predictive coding is used, the precision rates can often be very high. Lately I have been seeing that the second pool of documents, the one subject to manual review, has precision rates of at least 80%, sometimes even as high as 95% near the end of a CAL project. That means the final pool of documents produced is almost as large as the pool after the Second Filter.

Please remember that almost every document that is manually reviewed and coded after the Second Filter gets recycled back into the machine training process. This is known as Continuous Active Learning or CAL, and, in my version of it at least, is multimodal and not limited to high-probability ranking searches alone. See: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training, Part Two. In some projects you may just train for multiple iterations and then stop training and transition to pure manual review, but in most you will want to continue training as you do manual review. Thus you set up a constant CAL feedback loop until you are done, or nearly done, with manual review.

[Diagram: multimodal Continuous Active Learning (CAL) feedback loop]
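For those who think in code, here is a minimal, self-contained sketch of that feedback loop. The “model” is a trivial stand-in (word overlap with documents coded relevant), and the reviewer is simulated; a real project would use your review platform’s ranking engine and actual attorney coding.

```python
# Toy collection: every seventh document mentions the (invented) hot topic.
collection = {i: f"document {i} " + ("kickback" if i % 7 == 0 else "routine")
              for i in range(1, 101)}

def train(judgments):
    """Stand-in model: the set of words seen in documents coded relevant."""
    words = set()
    for doc_id, is_relevant in judgments.items():
        if is_relevant:
            words.update(collection[doc_id].split())
    return words

def rank(model, docs):
    """Rank documents by word overlap with the model, highest first."""
    return sorted(docs, key=lambda d: -len(model & set(collection[d].split())))

def human_review(batch):
    """Simulated attorney coding of a review batch."""
    return {d: "kickback" in collection[d] for d in batch}

judgments = {7: True, 2: False}                  # first-round training
for _ in range(20):                              # CAL rounds
    model = train(judgments)
    batch = [d for d in rank(model, collection) if d not in judgments][:10]
    if not batch:                                # nothing left to review
        break
    judgments.update(human_review(batch))        # coding feeds back into training

print(sum(judgments.values()), "coded relevant of", len(judgments), "reviewed")
```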

As mentioned, active machine learning trains on both relevance and irrelevance, although, in my opinion, the documents found that are Highly Relevant, the hot documents, are the most important of all for training purposes. The idea is to use predictive coding to segregate your data into two separate camps, relevant and irrelevant. You not only separate them, but you also rank them according to probable relevance. The software I use has a percentage system from .01% to 99.9% probable relevance, and vice versa. A near-perfect segregation-ranking project should end up looking like an upside-down champagne glass.

After you have segregated the document collection into two groups, and gone as far as you can, or as far as your budget allows, you then cull out the probable irrelevant. The most logical place for the Second Filter cut-off point in most projects is at 49.9% probable relevance and below. Those are the documents that are more likely than not to be irrelevant. But do not take the 50%-plus dividing line as an absolute rule in every case. There are no hard and fast rules to predictive culling. In some cases you may have to cut off at 90% probable relevance. Much depends on the overall distribution of the rankings and the proportionality constraints of the case. Like I said before, if you are looking for Gilbert’s black-letter law solutions to legal search, you are in the wrong type of law.
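Expressed as a sketch, the cut-off step is nothing more than splitting the ranked collection at the chosen probability line. The scores and the 50% default are illustrative; as noted above, proportionality may push the line higher.

```python
# Ranked documents with invented probable-relevance scores.
ranked_docs = [
    ("doc-001", 0.999), ("doc-002", 0.72), ("doc-003", 0.501),
    ("doc-004", 0.499), ("doc-005", 0.03),
]

CUTOFF = 0.50   # raise this (e.g., to 0.90) when proportionality requires

review_set = [d for d, p in ranked_docs if p >= CUTOFF]   # more likely than not relevant
culled_set = [d for d, p in ranked_docs if p < CUTOFF]    # probable irrelevant

print("to manual review:", review_set)
print("culled:", culled_set)
```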

[Diagram: upside-down champagne glass ranking distribution, divided into produced (top) and culled (bottom) halves]

Almost all of the documents in the production set (the red top half of the diagram) will be reviewed by a lawyer or paralegal. Of course, there are shortcuts to that too, like duplicate and near-duplicate syncing. Some of the low-ranked, probably irrelevant documents will have been reviewed too. That is all part of the CAL process, where both relevant and irrelevant documents are used in training. But only a very low percentage of the probable irrelevant documents needs to be reviewed.

Limiting Final Manual Review

In some cases you can, with client permission (often insistence), dispense with attorney review of all or nearly all of the documents in the upper half. You might, for instance, stop after the manual review has attained a well-defined and stable ranking structure. You might have reviewed only 10% of the probable relevant documents (the top half of the diagram), but decide to produce the other 90% without attorney eyes ever looking at them. There are, of course, obvious privilege and confidentiality problems with such a strategy. Still, in some cases, where appropriate clawback and other confidentiality orders are in place, the client may want to risk disclosure of secrets to save the costs of final manual review.

In such productions there are also dangers of imprecision, where a significant percentage of irrelevant documents is included. This in turn raises concerns that an adversarial view of those other documents could engender other suits, even if there is some agreement for the return of irrelevant documents. Once the bell has been rung, whether privileged or hot, it cannot be un-rung.

Case Example of Production With No Final Manual Review

In spite of the dangers of the unringable bell, the allure of extreme cost savings can be strong for some clients in some cases. For instance, I did one experiment using multimodal CAL with no final review at all, where I still attained fairly high recall, and the cost per document was only seven cents. I did all of the review myself, acting as the sole SME. The visualization of this project would look like the below figure.

[Diagram: two-filter culling with SME-only review and no final manual review]

Note that if the SME review pool were drawn to scale according to the number of documents read, then, in most cases, it would be much smaller than shown. In the review where I brought the cost down to $0.07 per document, I started with a document pool of about 1.7 million, and ended with a production of about 400,000. The SME review pool in the middle was only 3,400 documents.

[Diagram: two-filter culling example with the document counts from this project]

As far as legal search projects go, this collection had an unusually high prevalence, and thus the production of 400,000 documents was very large. Four hundred thousand was the number of documents ranked with a 50% or higher probability of relevance when I stopped the training. I personally reviewed only about 3,400 documents during the SME review, plus another 1,745 in a quality assurance sample taken after I decided to stop training. To be clear, I worked alone, and no one other than me reviewed any documents. This was an Army of One type of project.

Although I personally reviewed only 3,400 documents for training, I actually instructed the machine to train on many more documents than that. I selected them for training without actually reviewing them first, on the basis of ranking and judgmental sampling of the ranked categories. It was somewhat risky, but it did speed up the process considerably, and in the end it worked out very well. I later found out that information scientists often use this technique as well.

My goal in this project was recall, not precision, nor even F1, and I was careful not to overtrain on irrelevance. The requesting party was much more concerned with recall than precision, especially since the relevancy standard here was so loose. (Precision was still important, and was attained too. Indeed, there were no complaints about that.) In situations like that the slight over-inclusion of relevant training documents is not terribly risky, especially if you check out your decisions with careful judgmental sampling, and quasi-random sampling.

I accomplished this review in two weeks, spending 65 hours on the project. Interestingly, my time broke down into 46 hours of actual document review time, plus another 19 hours of analysis. Yes, about one hour of thinking and measuring for every two and a half hours of review. If you want the secret of my success, that is it.

I stopped after 65 hours, and two weeks of calendar time, primarily because I ran out of time. I had a deadline to meet and I met it. I am not sure how much longer I would have had to continue the training before the training fully stabilized in the traditional sense. I doubt it would have been more than another two or three rounds; four or five more rounds at most.

Typically I have the luxury of continuing to train in a large project like this until I no longer find any significant new relevant document types, and do not see any significant changes in document rankings. I did not think at the time that my culling out of irrelevant documents had been ideal, but I was confident it was good, and certainly reasonable. (I had not yet uncovered my ideal upside-down champagne glass visualization.) I saw a slowdown in probability shifts, and thought I was close to the end.

I had completed a total of sixteen rounds of training by that time. I think I could have improved the recall somewhat had I done a few more rounds of training, and spent more time looking at the mid-ranked documents (40%-60% probable relevant). The precision would have improved somewhat too, but I did not have the time. I am also sure I could have improved the identification of privileged documents, as I had only trained for that in the last three rounds. (It would have been a partial waste of time to do that training from the beginning.)

The sampling I did after the decision to stop suggested that I had exceeded my recall goals, but still, the project was much more rushed than I would have liked. I was also comforted by the fact that the elusion sample at the end passed my accept-on-zero-error quality assurance test: I did not find any hot documents in it. For those reasons (plus great weariness with the whole project), I decided not to pull some all-nighters to run a few more rounds of training. Instead, I went ahead and completed my report, added graphics and more analysis, and made my production with a few hours to spare.

A scientist hired after the production did some post hoc testing that confirmed, at an approximate 95% confidence level, a recall achievement of between 83% and 94%. My work also withstood all subsequent challenges. I am not at liberty to disclose further details.

In post hoc analysis I found that the probability distribution was close to the ideal shape that I now know to look for. The below diagram represents an approximate depiction of the ranking distribution of the 1.7 million documents at the end of the project. The 400,000 documents produced (obviously I am rounding off all these numbers) were ranked 50% or higher, and the 1,300,000 not produced were ranked less than 50%. Of the 1,300,000 Negatives, 480,000 documents were ranked with only 1% or less probable relevance. On the other end, the high side, 245,000 documents had a probable relevance ranking of 99% or more. There were another 155,000 documents with a ranking between 99% and 50% probable relevance. Finally, there were 820,000 documents ranked between 49% and 1% probable relevance.

[Diagram: probability ranking distribution of the 1.7 million documents]

The file review speed realized here, about 35,000 files per hour, and the extremely low cost, about $0.07 per document, would not have been possible without the client’s agreement to forgo full document review of the 400,000 documents produced. A group of contract lawyers could have been brought in for second-pass review, but that would have greatly increased the cost, even assuming a billing rate for them of only $50 per hour, which was 1/10th my rate at the time (it is now much higher).

The client here was comfortable with reliance on confidentiality agreements, for reasons that I cannot disclose. In most cases litigants are not, and insist on eyes-on review of every document produced. I well understand this, and in today’s harsh world of hardball litigation it is usually prudent to do so, clawback or no.

Another reason the review was so cheap and fast in this project is that there were very few opposing counsel transactional costs involved, and everyone was hands-off. I just did my thing, on my own, and with no interference. I did not have to talk to anybody; I just read a few guidance memoranda. My task was to find the relevant documents, make the production, and prepare a detailed report – 41 pages, including diagrams – that described my review. Someone else prepared a privilege log for the 2,500 documents withheld on the basis of privilege.

I am proud of what I was able to accomplish with the two-filter multimodal methods, especially as the work was subject to the post-review analysis and recall validation mentioned above. But, as mentioned, I would not want to do it again. Working alone like that was very challenging and demanding. Further, it was only possible at all because I happened to be a subject matter expert in the type of legal dispute involved. There are only a few fields where I am competent to act alone as an SME. Moreover, virtually no legal SMEs are also experienced ESI searchers and software power users. In fact, most legal SMEs are technophobes. I have even had to print out key documents to paper to work with some of them.

Even if I have adequate SME abilities for a legal dispute, I now prefer a small team approach rather than a solo approach. I now prefer to have one or two attorneys assisting me on the document reading, and a couple more assisting me as SMEs. In fact, I can act as the conductor of a predictive coding project where I have very little or no subject matter expertise at all. That is not uncommon. I just work as the software and methodology expert; the Experienced Searcher.

Right now I am working on a project where I do not even speak the language used in most of the documents. I could not read most of them, even if I tried. I just work on procedure and numbers alone, where others get their hands in the digital mud and report to me and the SMEs. I am confident this will work fine. I have good bilingual SMEs and contract reviewers doing most of the hands-on work.

Conclusion

There is much more to efficient, effective review than just using software with predictive coding features. The methodology of how you do the review is critical. The two-filter method described here has been used for years to cull away irrelevant documents before manual review, but it has typically been used with keywords alone. I have tried to show here how this method can be employed as part of a multimodal approach that includes predictive coding in the Second Filter.

Keywords can be an effective method both to cull out presumptively irrelevant files and to cull in presumptively relevant ones, but keywords are only one method among many. In most projects keyword search is not even the most effective method. AI-enhanced review with predictive coding is usually a much more powerful method to cull out the irrelevant and cull in the relevant and highly relevant.

If you are using a one-filter method, where you just do a rough cut and filter out by keywords, date, and custodians, and then manually review the rest, you are reviewing too much. It is especially ineffective when you collect based on keywords. As shown in Biomet, that can doom you to low recall, no matter how good your later predictive coding may be.

If you are using a two-filter method, but are not using predictive coding in the Second Filter, you are still reviewing too much. The two-filter method is far more effective when you use relevance probability ranking to cull out documents from final manual review.

Legal Search Science

Legal Search Science is an interdisciplinary field concerned with the search, review, and classification of large collections of electronic documents to find information for use as evidence in legal proceedings, for compliance to avoid litigation, or for general business intelligence. See PreSuit.com and Computer Assisted Review. Legal Search Science as practiced today uses software with artificial intelligence features to help lawyers find electronic evidence in a systematic, repeatable, and verifiable manner. The hybrid search method of AI-human computer interaction developed in this field will inevitably have a dramatic impact on the future practice of law. Lawyers will never be replaced entirely by robots embodying AI search algorithms, but some lawyers are already using them to significantly enhance their abilities and singlehandedly do the work of dozens, if not hundreds, of lawyers.

My own experience (Ralph Losey) provides an example. I participated in a study in 2013 where I searched and reviewed over 1.6 million documents by myself, with only the assistance of one computer – one robot, so to speak – running AI-enhanced software by Kroll Ontrack. I was able to do so more accurately and faster than large teams of lawyers working without artificial intelligence software. I was even able to work faster and more accurately than all other teams of lawyers and vendors that used AI-enhanced software but did not use the science-based search methods described here. I do not attribute my success to my own intelligence, or any special gifts or talents. I was able to succeed by applying the established scientific methods described here. They allowed me to augment my own small intelligence with that of the machine. If I have any special skills, they are in human-computer interaction and legal search intuition. They are based on my long experience in the law with evidence (over 34 years), and on my experience in the last few years using predictive coding software.

Legal Search Science as I understand it is a combination and subset of three fields of study: Information Science, the legal field of Electronic Discovery, and the engineering field concerned with the design and creation of Search Software. Its primary concern is with information retrieval and the unique problems faced by lawyers in the discovery of relevant evidence.

Most specialists in legal search science use a variety of search methods when searching large datasets. The use of multiple methods of search is referred to here as a multimodal approach. Although many search methods are used at the same time, the primary, or controlling, search method in large projects is typically what is known as supervised or semi-supervised machine learning. Semi-supervised learning is a type of artificial intelligence (AI) that uses an active learning approach. I refer to this as AI-enhanced review or AI-enhanced search. In information science it is often referred to as active machine learning, and in legal circles as Predictive Coding.

For reliable introductory information on Legal Search Science, see the works of attorney Maura Grossman and her information scientist partner, Professor Gordon Cormack, including the Grossman-Cormack Glossary of Technology-Assisted Review.

The Grossman-Cormack Glossary explains that in machine learning:

Supervised Learning Algorithms (e.g., Support Vector Machines, Logistic Regression, Nearest Neighbor, and Bayesian Classifiers) are used to infer Relevance or Non-Relevance of Documents based on the Coding of Documents in a Training Set. In Electronic Discovery generally, Unsupervised Learning Algorithms are used for Clustering, Near-Duplicate Detection, and Concept Search.

Multimodal search uses both machine learning algorithms and unsupervised learning search tools (clustering, near-duplicate detection, and concept search), as well as keyword search and even some limited use of traditional linear search. This is further explained here in the section below entitled, Hybrid Multimodal Bottom Line Driven Review. The hybrid multimodal aspects described represent the consensus view among information search scientists. The bottom line driven aspects represent my legal overlay on the search methods. All of these components together make up what I call Legal Search Science. It represents a synthesis of knowledge and search methods from science, law, and software engineering.

The Glossary’s key definition is of Technology-Assisted Review, their term for AI-enhanced review.

Technology-Assisted Review (TAR): A process for Prioritizing or Coding a Collection of Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection. Some TAR methods use Machine Learning Algorithms to distinguish Relevant from Non-Relevant Documents, based on Training Examples Coded as Relevant or Non-Relevant by the Subject Matter Expert(s), …. TAR processes generally incorporate Statistical Models and/or Sampling techniques to guide the process and to measure overall system effectiveness.

The Grossman-Cormack Glossary makes clear the importance of Subject Matter Experts (SMEs) by building their role as the document trainer into the very definition of TAR. Nevertheless, experts agree that good predictive coding software is able to tolerate some errors made in the training documents. For this reason experiments are being done on ways to minimize the central role of the SMEs, to see if lesser-qualified persons could also be used in document training, at least to some degree. See Webber & Pickens, Assessor Disagreement and Text Classifier Accuracy (SIGIR, 2013); John Tredennick, Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? (2013). These experiments are of special concern to software developers and others who would like to increase the utilization of AI-enhanced software because, at the current time, very few SMEs in the law have the skills or time necessary to conduct AI-enhanced searches. This is one reason that predictive coding is still not widely used, even though it has been proven effective in multiple experiments and adopted by several courts.

Professor Doug Oard

For in-depth information on key experiments already performed in the field of Legal Search Science, see the TREC Legal Track reports, whose home page is maintained by a leader in the field, information scientist Doug Oard. Professor Oard is a co-founder of the TREC Legal Track. Also see the research and reports of Herb Roitblat and the Electronic Discovery Institute, and my papers on TREC (and otherwise, as listed below): Analysis of the Official Report on the 2011 TREC Legal Track – Part One, Part Two and Part Three; and Secrets of Search: Parts One, Two, and Three.

For general legal background on the field of Legal Search Science see the works of the attorney co-founder of TREC Legal Track, Jason R. Baron, including Baron and Freeman's Quick Peek at the Math Behind the Black Box of Predictive Coding.

As explained in Baron and Freeman's Quick Peek at the Math, and my blog introduction thereto, the supervised learning algorithms behind predictive coding utilize a hyper-dimensional space. Each document in the dataset, including its metadata, is mapped as a point in that trans-Cartesian, multi-dimensional space, which is divided by hyperplanes separating relevant from irrelevant. The important document ranking feature of predictive coding is performed by measuring how far from the dividing line a particular document lies. Each time a training session is run the line moves and the ranking fluctuates in accordance with the new information provided. The below diagram attempts to portray this hyperplane division and document placement. The points shown in red designate irrelevant documents and the blue points relevant documents. The dividing line would run through multiple dimensions, not just the usual two of a Cartesian graph. This is depicted in the diagram by folding fields. For more, read the entire Quick Peek article.
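As an illustration of the hyperplane and ranking idea, here is a minimal sketch assuming the scikit-learn library; the tiny document set, labels, and scores are invented for the example, and real review software uses far richer features and its own proprietary algorithms.

```python
# A minimal sketch of hyperplane-based ranking with a linear SVM, assuming scikit-learn;
# the tiny document set, labels, and scores are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = [
    "merger price negotiation",   # coded relevant
    "cafeteria hours update",     # coded irrelevant
    "merger due diligence memo",  # coded relevant
    "softball league schedule",   # coded irrelevant
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

svm = LinearSVC()  # learns a separating hyperplane in the high-dimensional term space
svm.fit(X, labels)

# The signed distance from the hyperplane serves as a relevance ranking score;
# each new training round moves the hyperplane, so the rankings fluctuate.
unranked = ["revised merger valuation model", "parking lot repaving notice"]
scores = svm.decision_function(vectorizer.transform(unranked))
for doc, score in sorted(zip(unranked, scores), key=lambda pair: -pair[1]):
    print(f"{score:+.2f}  {doc}")
```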

For a scientific and statistical view of Legal Search Science that is often at least somewhat intelligible to lawyers and other non-scientists, see the blog of information scientist and consultant William Webber, Evaluating e-Discovery. For writings designed for the general reader on the subject of predictive coding, see the many articles by attorney Karl Schieneman, another pioneer in the field.

AI-Enhanced Search Methods

AI-enhanced search represents an entirely new method of legal search, which requires a completely new approach to large document reviews. Below is the diagram that I created to show the new workflow that I use in a typical predictive coding project.

Predictive Coding standard workflow diagram

 

For a more detailed description of the eight steps see the Electronic Discovery Best Practices page on predictive coding. For another somewhat similar workflow description see the diagram below of Kroll Ontrack, the vendor whose predictive coding software I now frequently use. Their seven-step model is described at pages 3-4 of Kroll Ontrack’s white paper, Technology Assisted Review: Driving Ediscovery Efficiencies in the Era of Big Data (2013).

Kroll Ontrack CAR lifecycle diagram

I have found that proper AI-enhanced review requires the highest skill levels and is, for me at least, the most challenging activity in electronic discovery law. See Electronic Discovery Best Practices for a description of the ten types of legal services involved in e-discovery. I am convinced that predictive coding is the big new tool that we have all been waiting for. When used properly, good AI-enhanced software allows attorneys to find the information they need in vast stores of ESI, and to do so in an effective and affordable manner.

In my experience the best software and training methods use an AI-type active learning process in steps four and five of my chart above and steps 2-5 of Kroll Ontrack's chart. My preferred active learning process in the iterative machine learning steps is threefold:

  1. The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type that your initial judgmental searches in steps two and five have missed. This is machine selected sampling (see the sketch following this list), and, according to a basic text in information retrieval engineering, a process is not a bona fide active learning search without this ability. Manning, Raghavan and Schutze, Introduction to Information Retrieval (Cambridge, 2008) at pg. 309.
  2. Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and avoid premature focus on the relevant documents initially retrieved.
  3. Other relevant documents that a skilled reviewer can find using a variety of search techniques. This is called judgmental sampling. See Baron, Jason, Co-Editor, The Sedona Conference® Commentary on Achieving Quality in the E-Discovery Process (2009). Judgmental sampling can use a variety of search tools, including both the mentioned Supervised and Unsupervised Learning Algorithms, and is further described below.
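Here is the promised sketch of machine selected (uncertainty) sampling, again assuming scikit-learn; the coded examples, document pool, and batch size of two are hypothetical stand-ins for what the review software would manage internally.

```python
# A minimal sketch of machine selected (uncertainty) sampling, assuming scikit-learn;
# the coded examples, document pool, and batch size are hypothetical.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

coded_docs = ["merger term sheet", "weekly cafeteria menu"]
coded_labels = [1, 0]  # 1 = relevant, 0 = irrelevant
pool = [
    "merger press release draft",
    "IT helpdesk ticket",
    "merger board minutes",
    "birthday card circulation",
]

vectorizer = TfidfVectorizer().fit(coded_docs + pool)
model = LogisticRegression().fit(vectorizer.transform(coded_docs), coded_labels)

# Uncertainty sampling: surface the unreviewed documents whose predicted probability
# of relevance is closest to 0.5, i.e., where the classifier is least certain.
probs = model.predict_proba(vectorizer.transform(pool))[:, 1]
uncertainty = np.abs(probs - 0.5)
for idx in np.argsort(uncertainty)[:2]:  # next batch of two for human review
    print(f"p(relevant) = {probs[idx]:.2f}  {pool[idx]}")
```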

The initial seed set generation, step two in my chart, should also use some random samples, plus judgmental multimodal searches. Steps three and six in my chart always use pure random samples and rely on statistical analysis. For background on the three types of sampling see my article, Three-Cylinder Multimodal Approach To Predictive Coding.
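For readers who want a sense of the statistical analysis behind the pure random samples in steps three and six, here is the standard normal-approximation sample size formula as a small worked example. This is general statistics, not my ei-Recall method or any vendor's implementation, and the confidence level, margin of error, and prevalence shown are illustrative assumptions.

```python
# The standard normal-approximation sample size formula, for illustration only;
# confidence level, margin of error, and prevalence are assumptions, and this is
# general statistics rather than any particular recall-calculation method.
import math

def sample_size(z: float = 1.96, margin: float = 0.02, prevalence: float = 0.5) -> int:
    """Documents to sample for a given confidence level (z) and margin of error."""
    return math.ceil((z ** 2) * prevalence * (1 - prevalence) / (margin ** 2))

print(sample_size())             # 95% confidence, +/-2% margin: 2,401 documents
print(sample_size(margin=0.05))  # 95% confidence, +/-5% margin: 385 documents
```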

Judgmental Sampling

After the first round of training, aka the seed set, judgmental sampling continues along with random and machine selected sampling in steps four and five. In judgmental sampling the human reviewer often selects additional documents for review and training that are based on the machine selected or random selected documents presented for review. Sometimes, however, the SME human reviewer follows a new search idea unrelated to the new documents seen. When an experienced searcher sees new documents this often leads to new search ideas.

All kinds of searches can be used for judgmental sampling, which is why I call it a multimodal search. This may include some linear review of selected custodians or selected date ranges, parametric Boolean keyword searches, similarity searches of all kinds, clustering searches, concept searches, as well as several predictive coding probability searches. These document probability searches are based primarily on the unique document ranking capabilities of most AI-enhanced search software. I find the ranking based searches to be extremely helpful to maximize efficiency and effectiveness.

I also often use informal random sampling of returned search sets as part of the judgmental sampling review and evaluation process. This is a process where I browse through search results, both according to ranking and in random views. I will also use various document sorting views to get a better understanding of the documents returned. I will also use different search methods and document views according to the type of data, the custodian, where it was originally stored, or even any sub-search-goal or issue I might be focused on at any one point in time. A good search follows a system and is repeatable, but it is also fluid. It is adaptable to new information uncovered from the documents searched, or from new information received elsewhere about the case, or from new documents added to the search in mid-course. Good search and review software is designed to allow for both flexibility and a systems approach.
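To illustrate the kind of ranking based browsing and informal random sampling described above, here is a small sketch in plain Python; the (document, probability) pairs are made up, and in practice they would come from whatever ranking the review software exposes.

```python
# A sketch of browsing ranked results by probability strata plus an informal random peek;
# the (document, probability) pairs are invented and would come from the review tool.
import random

ranked = [
    ("merger side letter", 0.97),
    ("merger FAQ for staff", 0.81),
    ("board minutes excerpt", 0.62),
    ("quarterly IT report", 0.46),
    ("vacation calendar", 0.08),
    ("supply closet inventory", 0.03),
]

high_band = [doc for doc, p in ranked if p >= 0.80]         # likely relevant: review first
gray_zone = [doc for doc, p in ranked if 0.40 <= p < 0.80]  # uncertain: good training candidates
low_band = [doc for doc, p in ranked if p < 0.40]
spot_check = random.sample(low_band, k=2)                   # informal random sample of the rest

print("High band:", high_band)
print("Gray zone:", gray_zone)
print("Spot check:", spot_check)
```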

All of these methods allow an experienced legal searcher to get a feel for the underlying data. This obviates the need for full linear review of the multimodal search results in the judgmental sampling process. Still, it is sometimes appropriate to read a few hundred documents in a linear fashion, for instance, all email a key witness received on a critical day.

AI-featured multimodal search represents a move from traditional legal art to science. See: Seven Years of the e-Discovery Team Blog, Art to Science. But there is still room for art in the sense of the deeply ingrained skills and intuition that lawyers can only gain from years of experience with legal search. Knowledgeable lawyers have unique insights into the evidence and witnesses involved in a particular case or type of dispute. This background knowledge and experience allows a skilled SME to improvise. The searcher can change directions depending on the documents found, and depending on new documents added to the dataset, or even new issues and a changed scope of relevance. Indeed, the search methods used in multimodal judgmental sampling vary considerably, both on a project by project basis, and over time in the same project, as understanding of the data, targets, and search develops. This is where years of legal search experience can be extremely valuable. All well designed predictive coding software allows for such a flexible approach to empower the attorneys conducting the search.

The CAL Variation

After study of the 2014 experiments performed by Professor Cormack and Maura Grossman, I have added a variation to the predictive coding work flow, which they call CAL, for Continuous Active Learning. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR'14, July 6–11, 2014, at pg. 9. Also see Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training – Parts One, Two, Three and Four. The part that intrigued me about their study was the use of continuous machine training as part of the entire review. This is explained in detail in Part Three of my lengthy blog series on the Cormack Grossman study.

My practical takeaway from their experiments and 2014 SIGIR report is that focusing on high ranking documents is a powerful search method, whereas a random only search is pure folly. The form of CAL that they tested trained using high-probability relevant documents in all but the first training round. (In the first round, the so-called seed set, they trained using documents found by keyword search.) This experiment showed that the method of reviewing the documents with the highest rankings works well, and should be given significant weight in any multimodal approach, especially when the goal is to quickly find as many relevant documents as possible.

The “continuous” aspect of the CAL approach means that you keep doing machine training throughout the review project and batch reviews accordingly. This could become a project management issue. But, if you can pull it off within proportionality and requesting party constraints, it just makes common sense to do so. You might as well get as much help from the machine as possible and keep getting its probability predictions for as long as you are still doing reviews and can make last minute batch assignments accordingly.
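The sketch below shows the basic shape of such a continuous training loop, assuming scikit-learn; the seed set, the batch size, and the stand-in review function are all hypothetical choices of mine, and real CAL implementations inside review platforms will differ.

```python
# A simplified sketch of a continuous active learning (CAL) loop, assuming scikit-learn;
# the seed set, batch size, and stand-in review function are hypothetical choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(seed_docs, seed_labels, unreviewed, review_fn, batch_size=2):
    """Repeatedly train, rank the remaining pool, and review the top-ranked batch."""
    vectorizer = TfidfVectorizer().fit(seed_docs + unreviewed)
    coded_docs, coded_labels = list(seed_docs), list(seed_labels)
    pool = list(unreviewed)
    while pool:
        model = LogisticRegression().fit(vectorizer.transform(coded_docs), coded_labels)
        probs = model.predict_proba(vectorizer.transform(pool))[:, 1]
        # Batch the highest-probability documents for the next round of human review.
        batch = sorted(range(len(pool)), key=lambda i: -probs[i])[:batch_size]
        for i in sorted(batch, reverse=True):
            doc = pool.pop(i)
            coded_docs.append(doc)
            coded_labels.append(review_fn(doc))  # the human SME codes the document
    return coded_docs, coded_labels

# Example run, with a keyword-style seed set and a toy stand-in for the reviewer:
docs, labels = cal_review(
    seed_docs=["merger agreement draft", "holiday party menu"],
    seed_labels=[1, 0],
    unreviewed=["merger closing checklist", "parking notice", "merger escrow terms"],
    review_fn=lambda d: 1 if "merger" in d else 0,
)
print(list(zip(docs, labels)))
```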

I have done several reviews in such a continuous training manner without really thinking about the fact that the machine input was continuous, including my first Enron experiment. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. But the Cormack Grossman study on the continuous active learning approach caused me to rethink the flow chart shown above that I usually use to explain the predictive coding process. The work flows shown before do not use a CAL approach, but rather an approach the Cormack Grossman report calls a simple approach, where you review and train, but then at some point stop training and final review is done. Under the simple approach there is a distinct stop in training after step five, and the review work in step seven is based on the last rankings established in step five.

The continuous work flow is slightly more difficult to show in a diagram, and to implement, but it does make good common sense if you are in a position to pull it off. Below is the revised workflow that illustrates how the training continues throughout the review.

Predictive coding CAL workflow diagram

Machine training is still done in steps four and five, but then continues in steps four, five and seven. There are other ways it could be implemented of course, but this is the CAL approach I would use in a review project where such complex batching and continuous training otherwise makes sense. Of course, it is not necessary in any project where the review in steps four and five effectively finds all of the relevant documents required. This is what happened in my Enron experiment, Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. There was no need to do a proportional final review, step seven, because all the relevant documents had already been reviewed as part of the machine training review in steps four and five. In the Enron experiment I skipped step seven and went right from step six to step eight, production. I have been able to do this in other projects as well.

Multimodal

My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding vendors. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call these processes the Borg approach, after the infamous villains in Star Trek, the Borg, a race of half-human robots that assimilates people into machines. (I further differentiate between three types of Borg in Three-Cylinder Multimodal Approach To Predictive Coding.) Like the Borg, these approaches unnecessarily minimize the role of individuals, the SMEs. They exclude other types of search that could supplement an active learning process. I advocate the use of all types of search, not just predictive coding.

Hybrid Human Computer Information Retrieval

human-and-robots

Further, in contradistinction to Borg approaches, where the machine controls the learning process, I advocate a hybrid approach where Man and Machine work together. In my hybrid search and review projects the expert reviewer remains in control of the process, and their expertise is leveraged for greater accuracy and speed. The human intelligence of the SME is a key part of the search process. In the scholarly literature of information science this hybrid approach is known as Human–computer information retrieval (HCIR). (My thanks to information scientist Jeremy Pickens for pointing out this literature to me.)

The classic text in the area of HCIR, which I endorse, is Information Seeking in Electronic Environments (Cambridge 1995) by Gary Marchionini, Professor and Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill. Professor Marchionini speaks of three types of expertise needed for a successful information seeker:

  1. Domain Expertise. This is equivalent to what we now call SME, subject matter expertise. It refers to a domain of knowledge. In the context of law the domain would refer to the particular type of lawsuit or legal investigation, such as antitrust, patent, ERISA, discrimination, trade-secrets, breach of contract, Qui Tam, etc. The knowledge of the SME on the particular search goal is extrapolated by the software algorithms to guide the search. If the SME also has the next described System Expertise and Information Seeking Expertise, they can run the search project themselves. That is what I like to call the Army of One approach. Otherwise, they will need a chauffeur or surrogate with such expertise, one who is capable of learning enough from the SME to recognize the relevant documents.
  2. System Expertise. This refers to expertise in the technology system used for the search. A system expert in predictive coding would have a deep and detailed knowledge of the software they are using, including the ability to customize the software and use all of its features. In computer circles a person with such skills is often called a power-user. Ideally a power-user would have expertise in several different software systems. They would also be an expert in one or more particular method of search.
  3. Information Seeking Expertise. This is a skill that is often overlooked in legal search. It refers to general cognitive skills related to information seeking. It is based on both experience and innate talents. For instance, “capabilities such as superior memory and visual scanning abilities interact to support broader and more purposive examination of text.” Professor Marchionini goes on to say that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” Some lawyers seem to have a gift for search, which they refine with experience, broaden with knowledge of different tools, and enhance with technologies. Others do not.

Id. at pgs.66-69, with the quotes from pg. 69.

All three of these skills are required for an attorney to attain expertise in legal search today, which is one reason I find this new area of legal practice so challenging. It is difficult, but not impossible like this Penrose triangle.

Predictive coding Penrose triangle graphic

It is not enough to be an SME, or a power-user, or have a special knack for search. You have to be able to do it all, and so does your software. However, studies have shown that of the three skill-sets, System Expertise, which in legal search primarily means mastery of the particular software used, is the least important. Id. at 67. The SMEs are more important, those who have mastered a domain of knowledge. In Professor Marchionini’s words:

Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.

Id. That is one reason that the Grossman Cormack glossary quoted before builds in the role of SMEs as part of their base definition of technology assisted review. Glossary at pg. 21 defining TAR.

According to Marchionini, Information Seeking Expertise, much like Subject Matter Expertise, is also more important than specific software mastery. Id. This may seem counter-intuitive in the age of Google, where an illusion of simplicity is created by typing in words to find websites. But legal search of user-created data is a completely different type of search task than looking for information from popular websites. In the search for evidence in a litigation, or as part of a legal investigation, special expertise in information seeking is critical, including especially knowledge of multiple search techniques and methods. Again quoting Professor Marchionini:

Expert information seekers possess substantial knowledge related to the factors of information seeking, have developed distinct patterns of searching, and use a variety of strategies, tactics and moves.

Id. at 70.

In the field of law this kind of information seeking expertise includes the ability to understand and clarify what the information need is, in other words, to know what you are looking for, and articulate the need into specific search topics. This important step precedes the actual search, but is an integral part of the process. As one of the basic texts on information retrieval written by Gordon Cormack, et al, explains:

Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic …

Büttcher, Clarke & Cormack, Information Retrieval: Implementing and Evaluating Search Engines (MIT Press, 2010) at pg. 5. The importance of pre-search refining of the information need is stressed in the first step of the above diagram of my methods, ESI Discovery Communications. It seems very basic, but is often underappreciated, or overlooked entirely, in the litigation context where information needs are often vague and ill-defined, lost in overly long requests for production and adversarial hostility.

Hybrid Multimodal Bottom Line Driven Review

I have a long descriptive name for what Marchionini calls the variety of strategies, tactics and moves that I have developed for legal search: Hybrid Multimodal AI-Enhanced Review using a Bottom Line Driven Proportional Strategy. See, e.g., Bottom Line Driven Proportional Review (2013). I refer to it as a multimodal method because, although the predictive coding type of searches predominate (shown on the below diagram as AI-enhanced review – AI), I also use the other modes of search, including the mentioned Unsupervised Learning Algorithms (clustering and concept), keyword search, and even some traditional linear review (although usually very limited). As described, I do not rely entirely on random documents, or computer selected documents, for the AI-enhanced searches, but use a three-cylinder approach that includes human judgment sampling and AI document ranking. The various types of legal search methods used in a multimodal process are shown in this search pyramid.

Multimodal Search Pyramid

Most information scientists I have spoken to agree that it makes sense to use multiple methods in legal search and not just rely on any single method, even the best AI method. UCLA Professor Marcia J. Bates first advocated for using multiple search methods back in 1989, an approach she called berrypicking. Bates, Marcia J., The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 on Quora:

An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.

This berrypicking approach, combined with HCIR, is what I have found from practical experience works best with legal search. They are the Hybrid Multimodal aspects of my AI-Enhanced Bottom Line Driven Review method.

Why AI-Enhanced Search and Review Is Important

I focus on this sub-niche area of e-discovery because I am convinced that it is critical to the advancement of the law in the 21st Century. The new search and review methods that I have developed from my studies and experiments in legal search science allow a skilled attorney using readily available predictive coding type software to review at remarkable speed and low cost. Review rates are more than 250-times faster than traditional linear review, and costs are less than a tenth as much. See, e.g., Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron, and the report by the Rand Corporation, Where The Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery.
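To show what a 250-times speed-up can mean in rough numbers, here is a back-of-the-envelope calculation; the review rates and hourly billing figures are illustrative assumptions of mine, not data drawn from the Enron narrative or the Rand report.

```python
# Back-of-the-envelope arithmetic only; the review rates and billing figures below are
# illustrative assumptions, not data from the Enron narrative or the Rand report.
corpus = 1_000_000               # documents to review
linear_rate = 50                 # docs per hour, a common linear-review assumption
ai_rate = linear_rate * 250      # the roughly 250-times speed-up discussed above

linear_hours = corpus / linear_rate   # 20,000 attorney hours
ai_hours = corpus / ai_rate           # 80 attorney hours

contract_rate = 50    # assumed $/hour for contract reviewers
expert_rate = 500     # assumed $/hour for one expert searcher

# Even paying the expert ten times the hourly rate, the total comes out far lower.
print(f"Linear review: {linear_hours:,.0f} hrs, ${linear_hours * contract_rate:,.0f}")
print(f"AI-enhanced:   {ai_hours:,.0f} hrs, ${ai_hours * expert_rate:,.0f}")
```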

Thanks to the new software and methods, what was considered impossible, even absurd, just a few short years ago, namely one attorney accurately reviewing over a million documents by him or herself in 14 days, is attainable by many experts. I have done it. That is when I came up with the Army of One motto and realized that we were at a John Henry moment in Legal Search. Maura tells me that she once did a seven-million document review by herself. Maura and Gordon were correct to refer to TAR as a disruptive technology in the Preface to their Glossary. Technology that can empower one skilled lawyer to do the work of hundreds of unskilled attorneys is certainly a big deal, one for which we have Legal Search Science to thank.

Ralph and some of his computers at one of his law offices

More Information on Legal Search Science

For further information on Legal Search Science see all of the articles cited above, along with my articles listed below. Most of my articles were written for the general reader; some are highly technical, but still accessible with study. All have been peer-reviewed on my blog by most of the founders of this field, who are regular readers, and by thousands of other readers. Also see the CAR procedures described on Electronic Discovery Best Practices.

I am especially proud of the legal search experiments I have done using AI-enhanced search software provided to me by Kroll Ontrack to review the 699,082 public Enron documents, and my reports on these reviews: Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (Part One). I have been told by scientists in the field that my over 100 hours of search, consisting of two fifty-hour search projects using different methods, is the largest search project by a single reviewer that has ever been undertaken, not only in Legal Search, but in any kind of search. I do not expect this record will last for long, as others begin to understand the importance of Information Science in general, and Legal Search Science in particular. But for now I will enjoy both the record and the lessons learned from the hard work involved. I may also attempt a third search project soon to continue to make contributions to Legal Search Science. Stay tuned. I may extend my record to 150 hours.

April 2014 Slide Presentation by Ralph Losey on Predictive Coding

Articles by Ralph Losey on Legal Search

  1. Two-Filter Document Culling, Part One and Part Two.
  2. Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search, Part One, Part Two and Part Three.
  3. In Legal Search Exact Recall Can Never Be Known.
  4. Visualizing Data in a Predictive Coding Project, Part One, Part Two and Part Three.
  5. Guest Blog: Talking Turkey by Maura Grossman and Gordon Cormack, edited and published by RCL.
  6. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One,  Part Two,  Part Three, and Part Four.
  7. The “If-Only” Vegas Blues: Predictive Coding Rejected in Las Vegas, But Only Because It Was Chosen Too Late, Part One and Part Two.
  8. IT-Lex Discovers a Previously Unknown Predictive Coding Case: “FHFA v. JP Morgan, et al”
  9. Beware of the TAR Pits! Part One and Part Two.
  10. PreSuit: How Corporate Counsel Could Use “Smart Data” to Predict and Prevent Litigation. Also see PreSuit.com.
  11. Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014).
  12. Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.
  13. My Basic Plan for Document Reviews: The “Bottom Line Driven” Approach, PDF version suitable for print, or HTML version that combines the blogs published in four parts.
  14. Relevancy Ranking is the Key Feature of Predictive Coding Software.
  15. Why a Receiving Party Would Want to Use Predictive Coding?
  16. Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way 
  17. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two).
  18. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One).
  19. Introduction to Guest Blog: Quick Peek at the Math Behind the Black Box of Predictive Coding that pertains to the higher-dimensional geometry that makes predictive coding support vector machines possible.
  20. Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents.
  21. Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search.
  22. There Can Be No Justice Without Truth, And No Truth Without Search (statement of my core values as a lawyer explaining why I think predictive coding is important).
  23. Three-Cylinder Multimodal Approach To Predictive Coding.
  24. Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. Video Animation.
  25. Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).
  26. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron in PDF form for easy distribution and the blog introducing this 82-page narrative, with second blog regarding an update.
  27. Journey into the Borg Hive: a Predictive Coding Narrative in science fiction form.
  28. The Many Types of Legal Search Software in the CAR Market Today.
  29. Georgetown Part One: Most Advanced Students of e-Discovery Want a New CAR for Christmas.
  30. Escape From Babel: The Grossman-Cormack Glossary.
  31. NEWS FLASH: Surprise Ruling by Delaware Judge Orders Both Sides To Use Predictive Coding.
  32. Does Your CAR (“Computer Assisted Review”) Have a Full Tank of Gas?  (and you can also click here for the alternate PDF version for easy distribution).
  33. Analysis of the Official Report on the 2011 TREC Legal Track – Part One.
  34. Analysis of the Official Report on the 2011 TREC Legal Track – Part Two.
  35. Analysis of the Official Report on the 2011 TREC Legal Track – Part Three.
  36. An Elusive Dialogue on Legal Search: Part One where the Search Quadrant is Explained.
  37. An Elusive Dialogue on Legal Search: Part Two – Hunger Games and Hybrid Multimodal Quality Controls.
  38. Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022.
  39. Second Ever Order Entered Approving Predictive Coding.
  40. Predictive Coding Based Legal Methods for Search and Review.
  41. New Methods for Legal Search and Review.
  42. Perspective on Legal Search and Document Review.
  43. LegalTech Interview of Dean Gonsowski on Predictive Coding and My Mission to Make Predictive Coding Software More Affordable.
  44. My Impromptu Video Interview at NY LegalTech on Predictive Coding and Some Hopeful Thoughts for the Future.
  45. The Legal Implications of What Science Says About Recall.
  46. Reply to an Information Scientist’s Critique of My “Secrets of Search” Article.
  47. Secrets of Search – Part I.
  48. Secrets of Search – Part II.
  49. Secrets of Search – Part III. (All three parts consolidated into one PDF document.)
  50. Information Scientist William Webber Posts Good Comment on the Secrets of Search Blog.
  51. Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t.
  52. The Information Explosion and a Great Article by Grossman and Cormack on Legal Search.

Please contact me at Ralph.Losey@gmail.com for any private comments, questions.

AI-Enhanced Review

Zero Error Numerics uses computer assisted review (CAR) software with active machine learning algorithms. Active machine learning is a type of artificial intelligence (AI). When used in legal search these AI algorithms significantly improve the search, review, and classification of electronically stored information (ESI). For this reason we prefer to call predictive coding by the name AI-enhanced review or AI-enhanced search. For more background on the science involved see LegalSearchScience.com.

In CARs with AI-enhanced review and search engines, attorneys train a computer to find documents identified by the attorney as a target. The target is typically relevance to a particular lawsuit or legal issue, or some other legal classification, such as privilege. This kind of AI-enhanced review, along with general e-discovery training, is now my primary interest as a lawyer.

Personal Legal Search Background

In 2006 I dropped my civil litigation practice and limited my work to e-discovery. That is also when I started this blog. At that time I could not even imagine specializing more than that. In 2006 I was interested in all aspects of electronic discovery, including computer assisted review. AI-enhanced CARs were still just a dream that I hoped would someday come true.

The use of software in legal practice has always been a compelling interest for me. I have been an avid user of computer software of all kinds since the late 1970s, both legal and entertainment. I even did some game software design and programming work in the early 1980s. My now-grown kids still remember the computer games I made for them.

I carefully followed the legal search and review software scene my whole career, but especially since 2006. It was not until 2011 that I began to be impressed by the new types of predictive coding CAR software entering the market. After I got my hands on the new software, I began to do what had once been unimaginable. I started to limit my legal practice even further. I began to spend more and more of my time on predictive coding types of review work. Since 2012 my work as an e-discovery lawyer and researcher has focused almost exclusively on using predictive coding driven CARs in large document production projects, and on e-discovery training, another passion of mine. In that year one of my cases produced a landmark decision by Judge Andrew Peck that first approved the use of predictive coding, Da Silva Moore. (I do not write about it because it is still ongoing.)

Attorney Maura R. Grossman and I are among the first attorneys in the world to specialize in predictive coding as an e-discovery sub-niche. Maura is a colleague who is both a practicing attorney and an expert in the new field of Legal Search Science. We have frequently presented on CLE panels as technology evangelists for these new methods of legal review. Maura and her partner, Professor Gordon Cormack, who is one of the most esteemed information scientists in the field, wrote the seminal scholarly paper on the subject, and more recently an excellent glossary of terms used in CAR (they prefer to call it TAR). Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology, Vol. XVII, Issue 3, Article 11 (2011); The Grossman-Cormack Glossary of Technology-Assisted Review, with Foreword by John M. Facciola, U.S. Magistrate Judge, 2013 Fed. Cts. L. Rev. 7 (January 2013); Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR'14, July 6–11, 2014.

I recommend your reading of all of their works. I also recommend your study of the LegalSearchScience.com website that I put together, and the many references and citations included at Legal Search Science, including the writings of other pioneers in the field, such as the founders of TREC Legal Track, Jason R. Baron, Doug Oard, and David Lewis, and other key figures in the field, such as information scientists William Webber and EDI's Herb Roitblat. Also see Baron and Grossman, The Sedona Conference® Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (December 2013).

Advanced CARs Require Completely New Driving Methods

CAR or TAR is more than just new software. It entails a whole new legal method, a new approach to large document reviews. Below is the diagram that I created to show the new workflow I use in a typical CAR project. This is the standard version of the workflow. Further below you will find a variation that uses a slightly more complicated process called continuous active learning (CAL).

Predictive Coding standard workflow diagram

For a basic description of the eight steps see the Electronic Discovery Best Practices page on predictive coding.

I have found that driving a CAR properly requires the highest skill levels and is, for me at least, the most challenging activity in electronic discovery. It also shows the promise of being the new tool that we have all been waiting for. When used properly, good predictive coding type software allows attorneys to find the information they need in vast stores of ESI, and to do so in an effective and affordable manner.

In my experience the best software and training methods use what is known as an active learning process in steps four and five in the chart above. My preferred active learning process in the iterative machine learning steps is threefold:

  1. The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type that your initial judgmental searches in steps two and five have missed. This is machine selected sampling, and, according to a basic text in information retrieval engineering, a process is not a bona fide active learning search without this ability. Manning, Raghavan and Schutze, Introduction to Information Retrieval (Cambridge, 2008) at pg. 309.
  2. Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and avoid premature focus on the relevant documents initially retrieved.
  3. Other relevant documents that a skilled reviewer can find using a variety of search techniques. This is called judgmental sampling. After the first round of training, aka the seed set, judgmental sampling by a variety of search methods continues, usually based on the machine selected or random selected documents presented for review, but sometimes the subject matter expert (“SME”) human reviewer follows a new search idea unrelated to the new documents seen. Any kind of search can be used for judgmental sampling, which is why I call it a multimodal search. This may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches.

The initial seed set generation, step two in the chart, should also use some random samples, plus judgmental multimodal searches. Steps three and six in the chart always use pure random samples and rely on statistical analysis. For more on the three types of sampling see my blog, Three-Cylinder Multimodal Approach To Predictive Coding.

My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding vendors. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call these processes the Borg approach, after the infamous villains in Star Trek, the Borg, a race of half-human robots that assimilates people into machines. (I further differentiate between three types of Borg in Three-Cylinder Multimodal Approach To Predictive Coding.) Like the Borg, these approaches unnecessarily minimize the role of individuals, the SMEs. They exclude other types of search that could supplement an active learning process. I advocate the use of all types of search, not just predictive coding.

Professor Cormack and Maura Grossman also performed experiments which, among other things, tested the efficacy of random only based search. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR'14, July 6–11, 2014. They reached the same conclusions that I did, and showed that this random only – Borg approach – is far less effective than even the most simplistic judgmental methods. I reported on this study in full in a series of blogs in the Summer of 2014, Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training; see especially Part One of the series.

The CAL Variation

After study of the 2014 experiments by Professor Cormack and Maura Grossman reported at the SIGIR conference, I created a variation to the predictive coding work flow, which they call CAL, for Continuous Active Learning. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR'14, July 6–11, 2014, at pg. 9. Also see Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training – Parts One, Two, Three and Four. The part that intrigued me about their study was the use of continuous machine training as part of the entire review. This is explained in detail in Part Three of my lengthy blog series on the Cormack Grossman study. I had already known about the ineffectiveness of random only machine training from my own experiments, but had never experimented with the continuous training aspects included in their experiments.

The form of CAL that Cormack and Grossman tested used high-probability relevant documents in all but the first training round. (In the first round, the so-called seed set, they trained using documents found by keyword search.) This experiment showed that the method of reviewing the documents with the highest rankings works well, and should be given significant weight in any multimodal approach, especially when the goal is to quickly find as many relevant documents as possible. This is another take-away from this important experiment.

The “continuous” aspect of the CAL approach means that you keep doing machine training throughout the review project and batch reviews accordingly. This could become a project management issue. But, if you can pull it off within proportionality and requesting party constraints, it just makes common sense to do so. You might as well get as much help from the machine as possible and keep getting its probability predictions for as long as you are still doing reviews and can make last minute batch assignments accordingly.

I have done several reviews in such a continuous training manner without really thinking about the fact that the machine input was continuous, including my first Enron experiment. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. But the Cormack Grossman study on the continuous active learning approach caused me to rethink the standard flow chart shown above that I usually use to explain the predictive coding process. The standard work flow that does not use a CAL approach is referred to in the Cormack Grossman report as the simple approach, where you review and train, but then at some point stop training and final review is done. Under the simple approach there is a distinct stop in training after step five, and the review work in step seven is based on the last rankings established in step five.

The continuous work flow is slightly more difficult to show in a diagram, and to implement, but it does make good common sense if you are in a position to pull it off. Below is the revised workflow that illustrates how the training continues throughout the review.

Predictive coding CAL workflow diagram

Machine training is still done in steps four and five, but then continues in steps four, five and seven. There are other ways it could be implemented of course, but this is the CAL approach I would use in a review project where such complex batching and continuous training otherwise makes sense. Of course, it is not necessary in any project where the review in steps four and five effectively finds all of the relevant documents required. This is what happened in my Enron experiment, Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. There was no need to do a proportional final review, step seven, because all the relevant documents had already been reviewed as part of the machine training review in steps four and five. In the Enron experiment I skipped step seven and went right from step six to step eight, production. I have been able to do this in other projects as well.

Hybrid Human Computer Information Retrieval

human-and-robots

In further contradistinction to the Borg, or random only approaches, where the machine controls the learning process, I advocate a hybrid approach where Man and Machine work together. In my hybrid CARs the expert reviewer remains in control of the process, and their expertise is leveraged for greater accuracy and speed. The human intelligence of the SME is a key part of the search process. In the scholarly literature of information science this hybrid approach is known as Human–computer information retrieval (HCIR).

The classic text in the area of HCIR, which I endorse, is Information Seeking in Electronic Environments (Cambridge 1995) by Gary Marchionini, Professor and Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill. Professor Marchionini speaks of three types of expertise needed for a successful information seeker:

  1. Domain Expertise. This is equivalent to what we now call SME, subject matter expertise. It refers to a domain of knowledge. In the context of law the domain would refer to particular types of lawsuits or legal investigations, such as antitrust, patent, ERISA, discrimination, trade-secrets, breach of contract, Qui Tam, etc. The knowledge of the SME on the particular search goal is extrapolated by the software algorithms to guide the search. If the SME also has System Expertise, and Information Seeking Expertise, they can drive the CAR themselves.   Otherwise, they will need a chauffeur with such expertise, one who is capable of learning enough from the SME to recognize the relevant documents.
  2. System Expertise. This refers to expertise in the technology system used for the search. A system expert in predictive coding would have a deep and detailed knowledge of the software they are using, including the ability to customize the software and use all of its features. In computer circles a person with such skills is often called a power-user. Ideally a power-user would have expertise in several different software systems. They would also be an expert in a particular method of search.
  3. Information Seeking Expertise. This is a skill that is often overlooked in legal search. It refers to a general cognitive skill related to information seeking. It is based on both experience and innate talents. For instance, “capabilities such as superior memory and visual scanning abilities interact to support broader and more purposive examination of text.” Professor Marchionini goes on to say that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” Some lawyers seem to have a gift for search, which they refine with experience, broaden with knowledge of different tools, and enhance with technologies. Others do not.

Id. at pgs.66-69, with the quotes from pg. 69.

All three of these skills are required for an attorney to attain expertise in legal search today, which is one reason I find this new area of legal practice so challenging. It is difficult, but not impossible like this Penrose triangle.

Predictive coding Penrose triangle graphic

It is not enough to be an SME, or a power-user, or have a special knack for search. You have to be able to do it all, and so does your software. However, studies have shown that of the three skill-sets, System Expertise, which in legal search primarily means mastery of the particular software used, is the least important. Id. at 67. The SMEs are more important, those  who have mastered a domain of knowledge. In Professor Marchionini’s words:

Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.

Id. That is one reason that the Grossman Cormack glossary builds in the role of SMEs as part of their base definition of computer assisted review:

A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection.

Glossary at pg. 21 defining TAR.

According to Marchionini, Information Seeking Expertise, much like Subject Matter Expertise, is also more important than specific software mastery. Id. This may seem counterintuitive in the age of Google, where an illusion of simplicity is created by typing in words to find websites. But legal search of user-created data is a completely different type of search task than looking for information from popular websites. In the search for evidence in a litigation, or as part of a legal investigation, special expertise in information seeking is critical, including especially knowledge of multiple search techniques and methods. Again quoting Professor Marchionini:

Expert information seekers possess substantial knowledge related to the factors of information seeking, have developed distinct patterns of searching, and use a variety of strategies, tactics and moves.

Id. at 70.

In the field of law this kind of information seeking expertise includes the ability to understand and clarify what the information need is, in other words, to know what you are looking for, and articulate the need into specific search topics. This important step precedes the actual search, but is an integral part of the process. As one of the basic texts on information retrieval written by Gordon Cormack, et al, explains:

Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic …

Büttcher, Clarke & Cormack, Information Retrieval: Implementing and Evaluating Search Engines (MIT Press, 2010) at pg. 5. The importance of pre-search refining of the information need is stressed in the first step of the above diagram of my methods, ESI Discovery Communications. It seems very basic, but is often underappreciated, or overlooked entirely, in the litigation context where information needs are often vague and ill-defined, lost in overly long requests for production and adversarial hostility.

Hybrid Multimodal Bottom Line Driven Review

I have a long descriptive name for what Marchionini calls the variety of strategies, tactics and moves that I have developed for legal search: Hybrid Multimodal AI-Enhanced Review using a Bottom Line Driven Proportional Strategy. See, e.g., Bottom Line Driven Proportional Review (2013). I refer to it as a multimodal method because, although the predictive coding type of searches predominate (shown on the below diagram as AI-enhanced review – AI), I also use the other modes of search, including Unsupervised Learning Algorithms (explained in LegalSearchScience.com, and often called clustering or near-duplication searches), keyword search, and even some traditional linear review (although usually very limited). As described, I do not rely entirely on random documents, or computer selected documents, for the AI-enhanced searches, but use a three-cylinder approach that includes human judgment sampling and AI document ranking. The various types of legal search methods used in a multimodal process are shown in this search pyramid.

Multimodal Search Pyramid

Most information scientists I have spoken to agree that it makes sense to use multiple methods in legal search and not just rely on any single method. UCLA Professor Marcia J. Bates first advocated for using multiple search methods back in 1989, an approach she called berrypicking. Bates, Marcia J., The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 on Quora:

An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.

This berrypicking approach, combined with HCIR, is what I have found from practical experience works best with legal search. They are the Hybrid Multimodal aspects of my AI-Enhanced Bottom Line Driven Review method.

My Battles in Court Over Predictive Coding

In 2012 my case became the first in the country where the use of predictive coding was approved. See Judge Peck’s landmark decision Da Silva Moore v. Publicis, 11 Civ. 1279, _ FRD _, 2012 WL 607412 (SDNY Feb. 24, 2012). In that case my methods of using Recommind’s Axcelerate software were approved. Later in 2012, in another first, an AAA arbitration approved our use of predictive coding in a large document production. In that case I used Kroll Ontrack’s Inview software over the vigorous objections of the plaintiff, which, after hearings, were all rejected. These and other decisions have helped pave the way for the use of predictive coding search methods in litigation.

Scientific Research

In addition to these activities in court I have focused on scientific research on legal search, especially machine learning. I have, for instance, become one of the primary outside reporters on the legal search experiments conducted by the TREC Legal Track of the National Institute of Standards and Technology. See, e.g., Analysis of the Official Report on the 2011 TREC Legal Track – Part One, Part Two and Part Three; Secrets of Search: Parts One, Two, and Three. Also see Jason Baron, DESI, Sedona and Barcelona.

After the TREC Legal Track closed down in 2011, the only group-participant scientific study to test the efficacy of various predictive coding software and search methods is the one sponsored by Oracle, the Electronic Discovery Institute, and Stanford. This search of a 1,639,311 document database was conducted in early 2013, with the results reported in Monica Bay's article, EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013). Below is the chart published by LTN that summarizes the results.

EDI-Oracle study results chart (published by LTN)

Monica Bay summarizes the findings of the research as follows:

Phase I of the study shows that older lawyers still have e-discovery chops and you don’t want to turn EDD over to robots.

With respect to my dear friend Monica, I must disagree with her conclusion. The age of the lawyers is irrelevant. The best predictive coding trainers do not have to be old; they just have to be SMEs, power users of good software, and have good search skills. In fact, not all SMEs are old, although many may be. It is the expertise and skills that matter, not age per se. It is true, as Monica reports, that the lawyer, a team of one, who did better in this experiment than all of the other much larger participant groups, was chronologically old. But that fact is irrelevant. The skill set and small group size, namely one, is what made the difference. See Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.

Moreover, although Monica is correct to say we do not want to “turn over” review to robots, this assertion misses the point. We certainly do want to turn over review to robot-human teams. We want our predictive coding software, our robots, to hook up with our experienced lawyers. We want our lawyers to enhance their own limited intelligence with artificial intelligence – the Hybrid approach. Robots are the future, but only if and as they work hand-in-hand with our top human trainers. Then they are unbeatable, as the EDI-Oracle study shows.

For the time being the details of the EDI-Oracle scientific study are still closed, and even though Monica Bay was permitted to publicize the results, and make her own summary and conclusions, participants are prohibited from discussion and public disclosures. For this reason I can say no more on this study, and only assert without facts that Monica’s conclusions are in some respects incorrect, that age is not critical, that the hybrid multimodal method is what is important. I hope and expect that someday soon the gag order for participants will be lifted, the full findings of this most interesting scientific experiment will be released, and a free dialogue will commence. Truth only thrives in the open, and science concealed is merely occult.

Why Predictive Coding Driven CARs Are Important

I continue to focus on this sub-niche area of e-discovery because I am convinced that it is critical to the advancement of the law in the 21st Century. Our own intelligence and search skills must be enhanced by the latest AI software. The new search and review methods I have developed allow a skilled attorney using readily available predictive coding type software to review at remarkable rates of speed. The CAR review rates are more than 250-times faster than traditional linear review, at less than a tenth of the cost. See e.g. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron; EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov. 2013).

My Life as a Limo Driver and Trainer

I have spoken on this subject at many CLEs around the country since 2011. I explain the theory and practice of this new breakthrough technology. I also consult on a hands-on basis to help others learn the new methods. As an old software lover who has been doing legal document reviews since 1980, I also continue to like to do these review projects myself. I like to drive the CARs myself, not just teach others how to drive. I enjoy the interaction and enhancements from the hybrid, human-robot approach. Certainly I need and appreciate the artificial intelligence boosts to my own limited capacities.

I also like to serve as a kind of limo driver for trial lawyers from time to time. The top SMEs in the world (I prefer to work with the best) are almost never also software power-users, nor do they have special skills or talents for information seeking outside of depositions. For that reason they need me to drive the CAR for them. To switch to the robot analogy again, I like and can work with the bots; they cannot.

I can only do my job as a limo driver – robot friend in an effective manner if the SME first teaches me enough of their domain to know where I am going; to know what documents would be relevant or hot or not. That is where decades of legal experience handling a variety of cases is quite helpful. It makes it easier to get a download of the SME’s concept of relevance into my head, and then into the machine. Then I can act as a surrogate SME and do the machine training for them in an accurate and consistent manner.


Working as a driver for an SME presents many special communication challenges. I have had to devise a number of techniques to facilitate a new kind of SME surrogate agency process. Of course, it is easier to do the search when you are also the SME. For instance, in one project I reviewed almost two million documents, by myself, in only two-weeks. That’s right. By myself. (There was no redaction or privilege logging, which are tasks that I always delegate anyway.) A quality assurance test at the end of the review based on random sampling showed a very high accuracy rate was attained. There is no question that it met the reasonability standards required by law and rules of procedure.

It was only possible to do a project of this size so quickly because I happened to be an SME on the legal issues under review, and, just as important, I was a power-user of the software, and have, at this point, mastered my own search and review methods. I also like to think I have a certain knack for information seeking.

Thanks to the new software and methods, what was considered impossible, even absurd, just a few short years ago, namely one attorney accurately reviewing two million documents by him or herself in 14-days, is attainable by many experts. My story is not unique. Maura tells me that she once did a seven-million document review by herself. That is why Maura and Gordon were correct to refer to TAR as a disruptive technology in the Preface to their Glossary. Technology that can empower one skilled lawyer to do the work of hundreds of unskilled attorneys is certainly a big deal, one for which we have Legal Search Science to thank.  It is also why I urge you to study this subject more carefully and learn to drive these new CARs yourself. Either that, or hire a limo driver.

Before you begin to actually carry out a predictive coding project, with or without an expert chauffeur to drive your CAR, you need to plan for it. This is critical to the success of the project. Here is a detailed outline of a Form Plan for a Predictive Coding Project that I use as a complete checklist.

My Writings on CAR

A good way to continue your study in this area is to read the articles by Grossman and Cormack, and the more than forty articles on the subject that I have written since mid-2011. They are listed here in rough chronological order, with the most recent on top. Also see the CAR procedures described on Electronic Discovery Best Practices.

I am especially proud of the legal search experiments I have done using AI-enhanced search software provided to me by Kroll Ontrack to review the 699,083 public Enron documents, and my reports on these reviews: Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (Part One). I have been told by scientists that my over 100 hours of search, comprising two fifty-hour search projects using different methods, is the largest search project by a single reviewer that has ever been undertaken, not only in Legal Search, but in any kind of search. I do not expect this record will last for long, as others begin to understand the importance of Information Science in general, and Legal Search Science in particular. But for now I will enjoy both the record and the lessons learned from the hard work involved.

Articles by Ralph Losey on Legal Search

  1. Form Plan of a Predictive Coding Project. Detailed Outline for project planning purposes.
  2. Two-Filter Document Culling, Part One and Part Two.
  3. Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search, Part One, Part Two and Part Three.
  4. In Legal Search Exact Recall Can Never Be Known.
  5. Visualizing Data in a Predictive Coding Project, Part One, Part Two and Part Three.
  6. Guest Blog: Talking Turkey by Maura Grossman and Gordon Cormack, edited and published by RCL.
  7. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One,  Part Two,  Part Three, and Part Four.
  8. The “If-Only” Vegas Blues: Predictive Coding Rejected in Las Vegas, But Only Because It Was Chosen Too Late. Part One and Part Two.
  9. IT-Lex Discovers a Previously Unknown Predictive Coding Case: “FHFA v. JP Morgan, et al”
  10. Beware of the TAR Pits! Part One and Part Two.
  11. PreSuit: How Corporate Counsel Could Use “Smart Data” to Predict and Prevent Litigation. Also see PreSuit.com.
  12. Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014).
  13. Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.
  14. My Basic Plan for Document Reviews: The “Bottom Line Driven” Approach, PDF version suitable for print, or HTML version that combines the blogs published in four parts.
  15. Relevancy Ranking is the Key Feature of Predictive Coding Software.
  16. Why a Receiving Party Would Want to Use Predictive Coding?
  17. Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way 
  18. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two).
  19. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One).
  20. Introduction to Guest Blog: Quick Peek at the Math Behind the Black Box of Predictive Coding that pertains to the higher-dimensional geometry that makes predictive coding support vector machines possible.
  21. Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents.
  22. Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search.
  23. There Can Be No Justice Without Truth, And No Truth Without Search (statement of my core values as a lawyer explaining why I think predictive coding is important).
  24. Three-Cylinder Multimodal Approach To Predictive Coding.
  25. Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. Video Animation.
  26. Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).
  27. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron in PDF form for easy distribution and the blog introducing this 82-page narrative, with second blog regarding an update.
  28. Journey into the Borg Hive: a Predictive Coding Narrative in science fiction form.
  29. The Many Types of Legal Search Software in the CAR Market Today.
  30. Georgetown Part One: Most Advanced Students of e-Discovery Want a New CAR for Christmas.
  31. Escape From Babel: The Grossman-Cormack Glossary.
  32. NEWS FLASH: Surprise Ruling by Delaware Judge Orders Both Sides To Use Predictive Coding.
  33. Does Your CAR (“Computer Assisted Review”) Have a Full Tank of Gas?  (and you can also click here for the alternate PDF version for easy distribution).
  34. Analysis of the Official Report on the 2011 TREC Legal Track – Part One.
  35. Analysis of the Official Report on the 2011 TREC Legal Track – Part Two.
  36. Analysis of the Official Report on the 2011 TREC Legal Track – Part Three.
  37. An Elusive Dialogue on Legal Search: Part One where the Search Quadrant is Explained.
  38. An Elusive Dialogue on Legal Search: Part Two – Hunger Games and Hybrid Multimodal Quality Controls.
  39. Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022.
  40. Second Ever Order Entered Approving Predictive Coding.
  41. Predictive Coding Based Legal Methods for Search and Review.
  42. New Methods for Legal Search and Review.
  43. Perspective on Legal Search and Document Review.
  44. LegalTech Interview of Dean Gonsowski on Predictive Coding and My Mission to Make Predictive Coding Software More Affordable.
  45. My Impromptu Video Interview at NY LegalTech on Predictive Coding and Some Hopeful Thoughts for the Future.
  46. The Legal Implications of What Science Says About Recall.
  47. Reply to an Information Scientist’s Critique of My “Secrets of Search” Article.
  48. Secrets of Search – Part I.
  49. Secrets of Search – Part II.
  50. Secrets of Search – Part III. (All three parts consolidated into one PDF document.)
  51. Information Scientist William Webber Posts Good Comment on the Secrets of Search Blog.
  52. Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t.
  53. The Information Explosion and a Great Article by Grossman and Cormack on Legal Search.

Please contact me at Ralph.Losey@gmail.com if you have any questions.

ei-Recall

The backbone of ZEN document review is a new method for calculating recall in legal search projects using random sampling that we call ei-Recall. This stands for elusion interval recall. We offer this to everyone in the e-discovery community in the hope that it will replace the hodgepodge of methods currently used, most of which are statistically invalid. Our goal is to standardize a new best practice for calculating recall. This lengthy essay will describe the formula in detail, and explain why we think it is the new gold standard. Then we will provide a series of examples of how ei-Recall works.

We have received feedback on these ideas and experiments from the top two scientists in the world with special expertise in this area, William Webber and Gordon Cormack. Our thanks and gratitude to them both, especially to William, who must have reviewed and responded to a dozen earlier drafts of this blog. He not only corrected initial logic flaws, and there were many, but also typos. As usual any errors remaining are purely our own, and these are our opinions, not theirs.

ei-Recall is preferable to all other commonly used methods of recall calculation, including Herb Roitblat’s eRecall, for two reasons. First, ei-Recall includes interval based range values, and, unlike eRecall and other simplistic ratio methods, is not based on point projections. Second, and this is critical, ei-Recall is only calculated at the end of a project, and depends on a known, verified count of True Positives in a production. It is thus unlike eRecall, and all other recall calculation methods that depend on an estimated value for the number of True Positives found.

Yes, this does limit the application of ei-Recall to projects in which great care is taken to bring the precision of the production to near 100%, including second reviews and many quality control cross-checks. But that is in any event already part of the workflow in many Continuous Active Learning (CAL) predictive coding projects today. At least it is in mine, where we take great pains to meet the client’s concern to maintain the confidentiality of their data. See: Step 8 of the EDBP (Electronic Discovery Best Practices), which I call Protections, and which is the step after first pass review by CAR (computer assisted review, multimodal predictive coding).

Advanced Summary of ei-Recall

We begin with a high level summary of this method for my more advanced readers. Do not be concerned if this seems fractured and obtuse at first. It will come into clear 3-D focus later as we describe the process in multiple ways and conclude with examples.

ei-Recall calculates recall range with two fractions. The numerator of both fractions is the actual number of True Positives found in the course of the review project and verified as relevant. The denominator of both fractions is based on a random sample of the documents presumed irrelevant that will not be produced, the Negatives. The percentage of False Negatives found in the sample allows for a calculation of a binomial range of the total number of False Negatives in the Negative set. The denominator of the low end recall range fraction is the high end number of the projected range of False Negatives, plus the number of True Positives. The denominator of the high end recall range fraction is the low end number of the projected range of False Negatives, plus the number of True Positives.

Here is the full algebraic explanation of ei-Recall, starting with the definitions for the symbols in the formula.

  • Rl stands for the low end of recall range.
  • Rh stands for the high end of recall range.
  • TP is the verified total number of relevant documents found in the course of the review project.
  • FNl is the low end of the False Negatives projection range based on the low end of the exact binomial confidence interval.
  • FNh is the high end of the False Negatives projection range based on the high end of the exact binomial confidence interval.

Formula for the low end of the recall range:
Rl = TP / (TP+FNh).

Formula for the high end of the recall range:
Rh = TP / (TP+FNl).

This formula essentially adds the extreme probability ranges to the standard formula for recall, which is: R = TP / (TP+FN).
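For readers who prefer to see the arithmetic spelled out, here is a minimal Python sketch of the calculation. It is only an illustration of the formulas above: the exact (Clopper-Pearson) binomial interval computed with scipy, the function names, and the sample numbers are my own assumptions, since the original method relies on an online binomial calculator and does not prescribe any particular tool.

```python
# A minimal sketch of ei-Recall, assuming a Clopper-Pearson ("exact")
# binomial interval computed with scipy. All sample numbers are hypothetical.
from scipy.stats import beta

def exact_binomial_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval for k hits observed in a sample of n."""
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

def ei_recall(tp, negatives, sample_size, false_negatives_in_sample,
              confidence=0.95):
    """Return (Rl, Rh), the low and high ends of the ei-Recall range.

    tp: verified True Positives produced (or privilege logged).
    negatives: total documents presumed irrelevant (the null set).
    sample_size: size of the random sample drawn from the null set.
    false_negatives_in_sample: relevant documents found in that sample.
    """
    p_low, p_high = exact_binomial_interval(
        false_negatives_in_sample, sample_size, confidence)
    fn_low = p_low * negatives    # FNl: low end of projected False Negatives
    fn_high = p_high * negatives  # FNh: high end of projected False Negatives
    r_low = tp / (tp + fn_high)   # Rl = TP / (TP + FNh)
    r_high = tp / (tp + fn_low)   # Rh = TP / (TP + FNl)
    return r_low, r_high

# Hypothetical example: 9,000 verified relevant documents produced, a null
# set of 91,000 documents, a random sample of 1,534 from the null set, and
# 5 False Negatives found in that sample.
print(ei_recall(tp=9000, negatives=91000, sample_size=1534,
                false_negatives_in_sample=5))
```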


Quest for the Holy Grail of Recall Calculations

I have spent the last few months in intense efforts to bring this project to conclusion. I have also spent more time writing and rewriting this blog than any I have ever written in my eight plus years of blogging. I wanted to find the best possible recall calculation method for e-discovery work. I convinced myself that I needed to find a new method in order to take my work as a legal search and review lawyer to the next level. I was not satisfied with my old ways and methods of quality control of large legal search projects. I was not comfortable with my prevalence based recall calculations. I was not satisfied with anyone else’s recall methods either. I had heard the message of Gordon Cormack and Maura Grossman clearly stated right here in their guest blog of September 7, 2014: Talking Turkey. In their conclusion they stated:

We hope that our studies so far—and our approach, as embodied in our TAR Evaluation Toolkit—will inspire others, as we have been inspired, to seek even more effective and more efficient approaches to TAR, and better methods to validate those approaches through scientific inquiry.

I had already been inspired to find better methods of predictive coding, and have uncovered an efficient approach with my multimodal CAL method. But I was still not satisfied with my recall validation approach; I wanted to find a better method to scientifically validate my review work.

Like almost everyone else in legal search, including Cormack and Grossman, I had earlier rejected the so-called Direct Method of recall calculation. It is unworkable and very costly, especially in low prevalence collections, where it requires sample sizes in the tens of thousands of documents. See e.g. Grossman & Cormack, Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 306-307 (“The Direct Method is statistically sound, but is quite burdensome, especially when richness is low.”)

Like Grossman and Cormack, I did not much like any of the other sampling alternatives either. Their excellent Comments article discusses and rejects Roitblat’s eRecall, and two other methods by Karl Schieneman and Thomas C. Gricks III, which Grossman and Cormack call the Basic Ratio Method and Global Method. Supra at 307-308.

I was on a quest of sorts for the Holy Grail of recall calculations. I knew there had to be a better way. I wanted a method that used sampling with interval ranges as a tool to assure the quality of a legal search project. I wanted a method that created as accurate an estimate as possible. I also wanted a method that relied on simple fraction calculations and did not depend on advanced math to narrow the binomial ranges, such as William Webber’s favorite recall equation: the Beta-binomial Half formula, shown below.

[Webber’s beta-binomial half formula]

Webber, W., Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A, Equation 18, at pg. A:13 (October 2012).

Before settling on my much simpler algebraic formula I experimented with many other methods to calculate recall ranges. Most were much more complex and included two or more samples, not just one. I wanted to try to include a sample that I usually take at the beginning of a project to get a rough idea of prevalence with interval ranges. These were the examples shown by my article, In Legal Search Exact Recall Can Never Be Known, and described in the section, Calculating Recall from Prevalence. I wanted to include the first sample, and prevalence based recall calculations based on that first sample, with a second sample of excluded documents taken at the end of the project. Then I wanted to kind of average them somehow, including the confidence interval ranges. Good idea, but bad science. It does not work, statistically or mathematically, especially in low prevalence.

I found a number of other methods, which, at first, looked like the Holy Grail. But I was wrong. They were made of lead, not gold. Some of the ones that I dreamed up were made of fool’s gold! A couple of the most promising methods I tried and rejected used multiple samples of various strata. That is called stratified random sampling, as compared to simple random sampling.

My questionable, but inspired, research method for this very time consuming development work consisted of background reading, aimless pondering, sleepless nights, intuition, trial and error (appropriate I suppose for a former trial lawyer), and many consults with the top experts in the field (another old trial lawyer trick). I ran through many other alternative formulas. I did the math in several standard review project scenarios, only to see the flaws of these other methods in certain circumstances, primarily low prevalence.

Every experiment I tried with added complexity, and added effort of multiple samples, proved to be fruitless. Indeed, most of this work was an exercise in frustration. (It turns out that noted search expert Bill Dimm is right. There is no free lunch in recall.) My experiments, and especially the expert input I received from Webber and Cormack, all showed that the extra complexities were not worth the extra effort, at least not for purposes of recall estimation. Instead, my work confirmed that the best way to channel additional efforts that might be appropriate in larger cases is simply to increase the sample size. This, and my use of confirmed True Positives, are the only sure-fire methods to improve the reliability of recall range estimates. They are the best ways to lower the size of the interval spread that all probability estimates must include.
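To see why a larger sample is the surest way to shrink the interval spread, consider the short sketch below. It compares the width of the exact binomial interval at several sample sizes while holding the observed False Negative rate roughly constant; the sample sizes, hit counts, and the use of a Clopper-Pearson interval via scipy are all my own illustrative assumptions, not figures from the original.

```python
# Illustrative only: how the exact binomial interval narrows as sample size
# grows, with the observed False Negative rate held at roughly 0.5%.
from scipy.stats import beta

def exact_binomial_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval for k hits observed in a sample of n."""
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

for sample_size, hits in [(400, 2), (1500, 7), (3000, 15), (6000, 30)]:
    low, high = exact_binomial_interval(hits, sample_size)
    print(f"n={sample_size:>5}  observed={hits/sample_size:.3%}  "
          f"interval={low:.3%} to {high:.3%}  spread={high - low:.3%}")
```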

Finding the New Gold Standard

ei-Recall meets all of my goals for recall calculation. It maintains mathematical and statistical integrity by including probable ranges in the estimate. The size of the range depends on the size of the sample. It is simple and easy to use, and easy to understand. It can thus be completely transparent and easy to disclose. It is also relatively inexpensive and you control the costs by controlling the sample size (although I would not recommend a sample size of less than 1,500 in any legal search project of significant size and value).

Finally, by using verified True Positives, and basing the recall range calculation on only one random sample, a sample of the null set, instead of two samples, the chance factor inherent to all random sampling is reduced. I described these chance factors in detail in In Legal Search Exact Recall Can Never Be Known, in the section on Outliers and Luck of Random Draws. The possibility of outlier events remains when using ei-Recall, but is minimized by limiting the sample to the null set and only estimating a projected range of False Negatives. It is true that the prevalence based recall calculations described in In Legal Search Exact Recall Can Never Be Known also use only one random sample, but that is a sample of the entire document collection to estimate a projected range of relevant documents, the True Positives. The number of relevant documents found will (or at least should, in any half-way decent search) be a far larger number than the number of False Negatives. For that reason alone the variability range (interval spread) of the straight elusion recall method should typically be smaller and more reliable.

Focus Your Sampling Efforts on Finding Errors of Omission

The number of documents presumed irrelevant, the Negatives, or null set, will always be smaller than the total document collection, unless of course you found no relevant documents at all! This means you will always be sampling a smaller dataset when doing an elusion sample than when doing a prevalence sample of the entire collection. Therefore, if you are trying to find your mistakes, the False Negatives, look for them where they might lie, in the smaller Negative set, the null set. Do not look for them in the larger complete collection, which includes the documents you are going to produce, the Positive set. Your errors of omission, which is what you are trying to measure, could not possibly be there. So why include that set of documents in the random sample? That is why I reject the idea of taking a sample of the entire collection, including the Positives, at the end of the project.

The Positives, the documents to be produced, have already been verified enough under my two-pass system. They have been touched multiple times by machines and humans. It is highly unlikely there will be False Positives. Even if there are, the requesting party will not complain about that. Their concern should be on completeness, or recall, especially if any precision errors are minor.

There is no reason to include the Positives in a final recall search in any project with verified True Positives. That just unnecessarily increases the total population size and thereby increases the possibility of an inaccurate sample. Estimates made from a sample of 1,500 documents of a collection of 150,000 documents will always be more accurate, more reliable, than estimates made from a sample of 1,500 documents of a collection of 1,500,000. The only exception is when there is an even distribution of target documents making up half of the total collection – 50% prevalence.

Sample size does not scale perfectly, only roughly, and the lower the prevalence, the more inaccurate it becomes. That is why sampling is not a miracle tool in legal search, and recall measures are range estimates, not certainties. In Legal Search Exact Recall Can Never Be Known. Recall measurement, when done right, as it is in ei-Recall, is a powerful quality assurance tool, to be sure, but it is not the end-all of quality control measures. It should be part of a larger tool kit that includes several other quality measures and techniques. The other quality control methods should be employed throughout the review, not just at the end like ei-Recall. Maura Grossman and Gordon Cormack agree with me on this. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ supra at 285. They recommend that validation:

consider all available evidence concerning the effectiveness of the end-to-end review process, including prior scientific evaluation of the TAR method, its proper application by qualified individuals, and proportionate post hoc sampling for confirmation purposes.

Ambiguity in the Scope of the Null Set

There is an open question in my proposal as to exactly how you define the Negatives, the presumed irrelevant documents that you sample. This may be varied somewhat depending on the circumstances of the review project. In my definition above I said the Negatives were the documents presumed to be irrelevant that will not be produced. That was intentionally somewhat ambiguous. I will later state with less ambiguity that Negatives are the documents not produced (or logged for privilege). Still, I think this application should sometimes be varied according to the circumstances.

In some circumstances you could improve the reliability of an elusion search by excluding from the null set all documents coded irrelevant by an attorney, either with or without actual review. The improvement would arise from shrinking the number of documents to be sampled. This would allow you to focus your sample on the documents most likely to have an error.

For example, you could have 50,000 documents out of 900,000 not produced that have actually been read or skimmed by an attorney and coded irrelevant. You could have yet another 150,000 that have not actually been read or skimmed by an attorney, but have been bulk coded irrelevant by an attorney. This would not be uncommon in some projects. So even though you are not producing 900,000 documents, you may have manually coded 200,000 of those, and only 700,000 have been presumed irrelevant on the basis of computer search. Typically in predictive coding driven search that would be because their ranking at the end of the CAL review was too low to warrant further consideration. In a simplistic keyword search they would be documents omitted from attorney review because they did not contain a keyword.

In other circumstances you might want to include the documents attorneys reviewed and coded as irrelevant, for instance, where you were not sure of the accuracy of their coding for one reason or another. Even then you might want to exclude other sets of documents on other grounds. For instance, in predictive coding projects you may want to exclude some bottom strata of the rankings of probable relevance. For example, you could exclude the bottom 25%, or maybe the bottom 10%, or bottom 2%, where it is highly unlikely that any error has been made in predicting the irrelevance of those documents.

In the data visualization diagram I explained in Visualizing Data in a Predictive Coding Project – Part Two, you could exclude some bottom portion of the ranked documents shown in blue. You could, for instance, limit the Negatives searched to those few documents in the 25% to 50% probable relevance range. Of course, whenever you limit the null set, you have to be careful to adjust the projections accordingly. Thus, if you find 1% False Negatives in a sample of a presumably enriched sub-collection of 10,000 out of 100,000 total Negatives, you cannot just project 1% of 100,000 and assume there are a total of 1,000 False Negatives (plus or minus of course). You have to project the 1% onto the size of the sub-collection actually sampled, and so it would be 1% of 10,000, or 100 False Negatives, not 1,000, again subject to the confidence interval range, a range that varies according to your sample size.
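The adjustment described above is easy to get wrong, so here is a small sketch of the arithmetic. The sample size of 1,500, the 15 False Negatives found (1% of the sample), and the interval function are my own illustrative assumptions layered on top of the 10,000-out-of-100,000 example in the text.

```python
# Illustrative arithmetic for projecting False Negatives from a sampled
# sub-collection of the null set (not from the whole null set).
from scipy.stats import beta

def exact_binomial_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval for k hits observed in a sample of n."""
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

total_negatives = 100_000        # whole null set
sub_collection = 10_000          # the enriched stratum actually sampled
sample_size = 1_500              # hypothetical sample drawn from the stratum
false_negatives_found = 15       # 1% of the sample

rate = false_negatives_found / sample_size
low, high = exact_binomial_interval(false_negatives_found, sample_size)

# Correct: project onto the 10,000-document stratum that was sampled.
print("point projection:", round(rate * sub_collection))            # ~100
print("interval:", round(low * sub_collection), "to",
      round(high * sub_collection))

# Wrong: projecting onto all 100,000 Negatives would claim ~1,000 misses.
print("mistaken projection:", round(rate * total_negatives))        # ~1,000
```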

Remember, the idea is to focus your random search to find mistakes on the group of documents that are most likely to have mistakes. There are many possibilities.

In still other scenarios you might want to enlarge the Negatives to include documents that were never included in the review project at all. For instance, if you collected emails from ten custodians, but eliminated three as unlikely to have relevant information as per Step 6 of the EDBP (culling), and only reviewed the email of seven custodians, then you might want to include select documents from the three excluded custodians in the final elusion test.

There are many other variations and issues pertaining to the scope of the Negatives set searched in ei-Recall. There are too many to discuss in this already long article. I just want to point out in this introduction that the makeup and content of the Negatives sampled at the end of the project is not necessarily cut and dry.

Advantage of End Project Sample Reviews

Basing recall calculations on a sample made at the end of a review project is always better than relying on a sample made at the beginning. This is because final relevance standards will have been determined and fully articulated by the end of a project, whereas at the beginning of any review project the initial relevance standards will be tentative. They will typically change in the course of the review. This is known as relevance shift, where the understanding of relevance changes and matures during the course of the project.

This variance of adjudication between samples can be corrected during and at the end of the project by careful re-review and correction of initial sample relevance adjudications. This also requires correction of changes of all codings made during the review in the same way, not just inconsistencies in sample codings.

The time and effort spent to reconcile the adjudications might be better spent on a larger sample size of the final elusion sample. Except for major changes in relevance, where you would anyway have to go back and make corrections as part of quality control, it may not be worth the effort to remediate the first sample, just so you can still use it again at the end of the project with an elusion sample. That is because of the unfortunate statistical fact of life, that the two recall methods cannot be added to one another to create a third, more reliable number. I know. I tried. The two recall calculations are apples and oranges. Although a comparison between the two range values is interesting, they cannot somehow be stacked together to improve the reliability of either or both of them.

Prevalence Samples May Still Help Guide Search, Even Though They Cannot Be Reliably Used to Calculate Recall

I like to make a prevalence sample at the beginning of a project to get a general idea of the number of relevant documents there might be, and I emphasize general and might, in order to help with my search. I used to make recall calculations from that initial sample too, but no longer (except in small cases, under the theory it is better than nothing), because it is simply too unreliable. The prevalence samples can help with search, but not with recall calculations used to test the quality of the search results. For quality testing it is better to sample the null set and calculate recall using the ei-Recall method.

Still, if you are like me, and like to take a sample at the start of a project for search guidance purposes, then you might as well do the math at the end of the project to see what the recall range estimate is using the prevalence method described in In Legal Search Exact Recall Can Never Be Known. It is interesting to compare the two recall ranges, especially if you take the time and trouble to go back and correct the first prevalence sample adjudications to match the calls made in your second null set sample (that can eliminate the problem of concept drift and reviewer inconsistencies). Still, go with the recall range values of ei-Recall, not prevalence. It is more reliable. Moreover, do not waste your time, as I did for weeks, trying to somehow average out the results. I traveled down that road and it is a dead-end.

Claim for ei-Recall

My claim is that ei-Recall is the most accurate recall range estimate method possible that uses only algebraic math within everyone’s grasp. (This statement is not exactly true because binomial confidence interval calculations are not simple algebra, but we avoid these calculations by use of an online calculator. Many are available.) I also claim that ei-Recall is more reliable, and less prone to error in more situations, than a standard prevalence based recall calculation, even if the prevalence recall includes ranges as I did in In Legal Search Exact Recall Can Never Be Known.

I also claim that my range based method of recall calculation is far more accurate and reliable than any simple point based recall calculations that ignore or hide interval ranges, including the popular eRecall. This latter claim is based on what I proved in In Legal Search Exact Recall Can Never Be Known and is not novel. It has long been known and accepted by all experts in random sampling that recall projections that do not include high-low ranges are inexact and often worthless and misleading. And yet attorneys and judges are still relying on point projections of recall to certify the reasonableness of search efforts. The legal profession and our courts need to stop relying on such bogus science and turn instead to ei-Recall.

I am happy to concede that scientists who specialize in this area of knowledge like Dr. Webber and Professor Cormack can make slightly more accurate and robust calculations of binomial recall range estimates by using extremely complex calculations such as Webber’s Beta-binomial formula.

Such alternative black box type approaches are, however, disadvantaged by the additional expense of the expert consultations and testimony needed to implement and explain them. (Besides, at the present time, neither Webber nor Cormack is available for such consultations.) My approach is based on multiplication and division, and simple logic. It is well within the grasp of any attorney or judge (or anyone else) who takes the time to study it. My relatively simple system thus has the advantage of ease of use, ease of understanding, and transparency. These factors are very important in legal search.

Rl = TP / (TP+FNh)          Rh = TP / (TP+FNl)

Although the ei-Recall formula may seem complex at first glance, it is really just ratios and proportions. I reject the argument some make that calculations like this are too complex for the average lawyer. Ratios and proportions are part of the Grade 6 Common Core Curriculum. Reducing word problems to ratios and proportions is part of the Grade 7 Common Core, so too is basic statistics and probability.

Overview of How ei-Recall Works

ei-Recall is designed for use at the end of a search project as a final quality assurance test. A single random sample is taken of the documents that are not marked relevant and so will not be produced or privilege-logged – the Negatives. (As mentioned, the definition and scope of the Negatives can be varied depending on project circumstances.) The sample is taken to estimate the total number of False Negatives, documents falsely presumed irrelevant that are in fact relevant. The estimate projects a range of the probable total number of False Negatives using a binomial interval range in accordance with the sample size. A simplistic and illusory point value projection is not used. The high end of the range of probable False Negatives is shown in the formula as FNh. The low end of the projected range of False Negatives is FNl.

This type of search is generally called an elusion based recall search. As will be discussed here in some detail, well-known software expert and entrepreneur Herb Roitblat, who has a PhD in psychology, advocates for the use of a similar elusion based recall calculation that uses only the point projection of the total False Negatives. He has popularized a name for this method, eRecall, and uses it with his company’s software.

I here offer a more accurate alternative that avoids the statistical fallacies of point projections. Roitblat’s eRecall, and other ratio calculations like it, ignore the high and low interval range inherent in all sampling. My version includes the interval range, and for this reason an “i” is added to the name: ei-Recall.

ei-Recall is more accurate than eRecall, especially when working with low prevalence datasets, and, unlike eRecall, is not misleading because it shows the total range of recall. It is also more accurate because it uses the exact count of the documents verified as relevant at the end of the project, and does not estimate the True Positives value. I offer ei-Recall to the e-discovery community as a statistically valid alternative, and urge its speedy adoption.

Contingency Table Background

A review of some of the basic concepts and terminology used in this article may be helpful before going further. It is also important to remember that ei-Recall is a method for measuring recall, not attaining recall. There is a fundamental difference. Many of my other articles have discussed search and review methods to achieve recall, but this one does not. See e.g.:

  1. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One,  Part Two,  Part Three, and Part Four.
  2. Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014).
  3. Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.
  4. Three-Cylinder Multimodal Approach To Predictive Coding.

This article is focused on the very different topic of measuring recall as one method among many to assure quality in large-scale document reviews.

Everyone should know that in legal search analysis False Negatives are documents that were falsely predicted to be irrelevant, that are in fact relevant. They are mistakes. Conversely, documents predicted irrelevant, that are in fact irrelevant, are called True Negatives. Documents predicted relevant that are in fact relevant are called True Positives. Documents predicted relevant that are in fact irrelevant are called False Positives.

These terms, and the formulas derived from them, are set forth in the Contingency Table, a/k/a Confusion Matrix, a tool widely used in information science. Recall using these terms is the total number of relevant documents found, the True Positives (TP), divided by the sum of that same number and the total number of relevant documents not found, the False Negatives (FN). Recall is the percentage of total target documents found in any search.

CONTINGENCY TABLE

                        Truly Non-Relevant          Truly Relevant
Coded Non-Relevant      True Negatives (“TN”)       False Negatives (“FN”)
Coded Relevant          False Positives (“FP”)      True Positives (“TP”)

The standard formula for Recall using contingency table values is: R = TP / (TP+FN).

The standard formula for Prevalence is: P = (TP + FN) / (TP + TN + FP + FN).

The Grossman-Cormack Glossary of Technology Assisted Review. Also see: LingPipe Toolkit class on PrecisionRecallEvaluation.
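As a quick check on the definitions above, here is a tiny sketch that computes recall and prevalence (plus precision, for completeness) directly from contingency table counts; the counts themselves are made up purely for illustration.

```python
# Recall and prevalence from contingency table counts (hypothetical numbers).
TP, FP, TN, FN = 9_000, 50, 90_000, 950

recall = TP / (TP + FN)                       # R = TP / (TP + FN)
prevalence = (TP + FN) / (TP + TN + FP + FN)  # P = (TP + FN) / total
precision = TP / (TP + FP)

print(f"recall={recall:.1%}  prevalence={prevalence:.1%}  "
      f"precision={precision:.1%}")
```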

General Background on Recall Formulas

Before I get into the examples and math for ei-Recall, I want to provide more general background. In addition, I suggest that you re-read my short description of an elusion test at the end of Part Three of Visualizing Data in a Predictive Coding Project. It provides a brief description of the other quality control applications of the elusion test for False Negatives. If you have not already done so, you should also read my entire article, In Legal Search Exact Recall Can Never Be Known.

I also suggest that you read John Tredennick’s excellent article: Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, especially Part Two of that article. I give a big Amen to John’s tough problem insights.

For the more technical and mathematically minded, I suggest you read the works of William Webber, including his key paper on this topic, Approximate Recall Confidence Intervals (January 2013, Volume 31, Issue 1, pages 2:1–33) (free version in arXiv), and his many less formal and easier to understand blog posts on the topic: Why confidence intervals in e-discovery validation? (12/9/12); Why training and review (partly) break control sets (10/20/14); Why 95% +/- 2% makes little sense for e-discovery certification (5/25/13); Stratified sampling in e-discovery evaluation (4/18/13); What is the maximum recall in re Biomet? (4/24/13). Special attention should be given to Webber’s recent article on Roitblat’s eRecall, Confidence intervals on recall and eRecall (1/4/15), where it is tested and found deficient on several grounds.

My idea for a recall calculation that includes binomial confidence intervals, like most ideas, is not truly original. It is, as our friend Voltaire puts it, a judicious imitation. For instance, I am told that my proposal to use comparative binomial calculations to determine approximate confidence interval ranges follows somewhat the work of an obscure Dutch medical statistician, P. A. R. Koopman, in the 1980s. See: Koopman, Confidence intervals for the ratio of two binomial proportions, Biometrics 40: 513–517 (1984). Also see: Webber, William, Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (October 2012); Duolao Wang, Confidence intervals for the ratio of two binomial proportions by Koopman’s method, Stata Technical Bulletin, 10-58, 2001.

As mentioned, the recall method I propose here is also similar to that promoted by Herb Roitblat – eRecall – except that it avoids its fundamental defect. I include binomial intervals in the calculations to provide an elusion recall range, and his method does not. Measurement in eDiscovery (2013). Herb’s method relies solely on point projections and disregards the ranges of both the Prevalence and False Negative projections. That is why no statistician will accept Roitblat’s eRecall, whereas ei-Recall has already been reviewed without objection by two of the leading authorities in the field, William Webber and Gordon Cormack.


ei-Recall is also a superior method because it is based on a specific number of relevant documents found at the end of the project, the True Positives (TP). That is not an estimated number. It is not a projection based on sampling where a confidence interval range and more uncertainty are necessarily created. True Positives in ei-Recall is the number of relevant documents in a legal document production (or privilege log). It is an exact number verified by multiple reviews and other quality control efforts set forth in steps six, seven and eight in Electronic Discovery Best Practices (EDBP), and then produced in step nine (or logged).

In a predictive coding review, the True Positives as defined by ei-Recall are the documents predicted relevant, then confirmed to be relevant in second pass reviews, and then produced or logged. (Again see: Step 8 of the EDBP, which I call Protections.) The production is presumed to be a 100% precise production, or at least as close as is humanly possible, and to contain no False Positives. For that reason ei-Recall may not be appropriate in all projects. Still, it could also work, if need be, by estimating the True Positives. The fact that ei-Recall includes interval ranges in and of itself makes it superior to, and more accurate than, any other ratio method.

In the usual application of ei-Recall, only the number of relevant documents missed, the False Negatives, is estimated. The actual number of relevant documents found (TP) is divided by the sum of that same number and the projected number of False Negatives from the null set sample, using both the high end (FNh) and the low end (FNl) of the projected range. This method is summarized by the following formulas:

Formula for the lowest end of the recall range from the null set sample: Rl = TP / (TP+FNh).

Formula for the highest end of the recall range from the null set sample: Rh = TP / (TP+FNl).

This is very different from the approach used by Herb Roitblat for eRecall. Herb’s approach is to sample the entire collection to calculate a point projection of the probable total number of relevant documents in the collection, which I will here call P. He then takes a second random sample of the null set to calculate the point projection of the probable total False Negatives contained in the null set (FN). Roitblat’s approach only uses point projections and ignores the interval ranges inherent in each sample. My approach uses one sample and includes its confidence interval range. Also, as mentioned, my approach uses a validated number of True Positives found at the end of a review project, and not a projection of the probable total number of relevant documents found (P). Although Herb never uses a formula per se in his paper, Measurement in eDiscovery, to describe his approach, if we use the above described definitions the formula for eRecall would seem to be: eR = P / (P + FN). (Note there are other speculations as to what Roitblat really intends here, as discussed in the comments to Webber’s blog on eRecall. One thing we know for sure is that, although he may change the details of his approach, it never includes a recall range, just a spot projection.)
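To make the contrast concrete, the sketch below puts a bare point projection side by side with the ei-Recall range on the same hypothetical project. It is not a faithful implementation of Roitblat's two-sample eRecall (which estimates P from a separate prevalence sample); it simply shows what is lost when the interval is dropped. All numbers, and the Clopper-Pearson interval via scipy, are my own illustrative assumptions.

```python
# Hypothetical contrast: a single point projection of elusion recall versus
# the ei-Recall range computed from the same null set sample.
from scipy.stats import beta

def exact_binomial_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval for k hits observed in a sample of n."""
    alpha = 1.0 - confidence
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

tp = 4_000                 # verified True Positives produced
negatives = 196_000        # null set size
sample_size = 1_534        # random sample of the null set
fn_in_sample = 8           # False Negatives found in the sample

rate = fn_in_sample / sample_size
low, high = exact_binomial_interval(fn_in_sample, sample_size)

point_recall = tp / (tp + rate * negatives)   # single number, no range
rl = tp / (tp + high * negatives)             # ei-Recall low end
rh = tp / (tp + low * negatives)              # ei-Recall high end

print(f"point estimate: {point_recall:.1%}")
print(f"ei-Recall range: {rl:.1%} to {rh:.1%}")
```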

My approach of making two recall calculations, one for the low end and another for the high end, is well worth the slight additional time to create a range. Overall the effort and cost of ei-Recall is significantly less than eRecall because only one sample is used in my method, not two. My method significantly improves the reliability of recall estimates and overcomes the defects inherent in eRecall and other methods that ignore confidence intervals, such as the Basic Ratio Method and Global Method. See e.g.: Grossman & Cormack, Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 306-310.

The use of range values avoids the trap of using a point projection that may be very inaccurate. The point projections of eRecall may be way off from the true value, as was explained in detail in In Legal Search Exact Recall Can Never Be Known. Moreover, ei-Recall fits in well with the overall work flow of my current two-pass, CAL-based (continuous active learning), hybrid, multimodal search and review method.

Recall Calculation Methods Must Include Range

A fuller explanation of Herb Roitblat’s eRecall proposal, and other similar point projection based proposals, should help clarify the larger policy issues at play in the proposed alternative ei-Recall approach.

Again, I cannot accept Herb Roitblat’s approach of using an Elusion sample to calculate recall because he uses the point projections of prevalence and elusion only, and does not factor in the recall interval ranges. My reason for opposing this simplification was set out in detail in In Legal Search Exact Recall Can Never Be Known. It is scientifically and mathematically wrong to use point projections and not include ranges.

I note that industry leader John Tredennick also disagrees with Herb’s approach. See his recent article: Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, Part Two. After explaining Herb’s eRecall, John says this:

Does this work? Not so far as I can see. The formula relies on the initial point estimate for richness and then a point estimate for elusion.

I agree with John Tredennick in this criticism of Herb’s method. So too does Bill Dimm, who has a PhD in Physics and is the founder and CEO of Hot Neuron. Bill summarizes Herb’s eRecall method in his article, eRecall: No Free Lunch. He uses an example to show that eRecall does not work at all in low prevalence situations. Of course, all sampling is challenged by extremely low prevalence, even ei-Recall, but at least my interval approach does not hide the limitations of such recall estimates. There is no free lunch. Recall estimates are just one quality control effort among many.

Maura Grossman and Gordon Cormack also challenge the validity of Herb’s method. They refer to Roitblat’s eRecall as a specious argument. Grossman and Cormack make the same judgment about several other approaches that compare the ratios of point projections and show how they all suffer from a basic mathematical statistical error, which they call the Ratio Method Fallacy. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ supra at 308-309.

In Grossman & Cormack’s Guest Blog: Talking Turkey (e-Discovery Team, 2014), they explained an experiment that they did and reported on in the Comments article, where they repeatedly used Roitblat’s eRecall, the Direct Method, and other methods to estimate recall. They used a review known to have achieved 75% recall and 83% precision, from a collection with 1% prevalence. The results showed that in this review “eRecall provides an estimate that is no better than chance.” That means eRecall was a complete failure as a quality assurance measure.

Although my proposed range method is a comparative Ratio Method, it avoids the fallacy of the other methods criticized by Grossman and Cormack. It does so because it includes binomial probability ranges in the recall calculations and eschews the errors of point projection reliance. It is true that the range of recall estimates using ei-Recall may still be uncomfortably large in some low yield projects, but at least it will be real and honest, and, unlike eRecall, it is better than nothing.

No Legal Economic Arguments Justify the Errors of Simplified Point Projections 

The oversimplified point projection ratio approach can lead to a false belief of certainty for those who do not understand probability ranges inherent in random samples. We presume that Herb Roitblat understands the probability range issues, but he chooses to simplify anyway on the basis of what appears to me to be essentially legal-economic arguments, namely proportionality cost-savings, and the inherent vagaries of legal relevance. Roitblat, The Pendulum Swings: Practical Measurement in eDiscovery.

I disagree strongly with Roitblat’s logic. As one scholar in private correspondence pointed out, Herb appears to fall victim to the classic fallacy of the converse. Herb asserts that “if the point estimate is X, there is a 50% probability that the true value is greater than X.” What *is* true (for an unbiased estimate) is that “if the true value is X, there is a 50% probability that the estimate is greater than X.” Assuming the latter implies the former is the classic fallacy of the converse. Think about it. It is a very good point. For a more obvious example of the fallacy of the converse consider this: “Most accidents occur within 25 miles from home; therefore, you are safest when you are far from home.”

Although I disagree with Herb Roitblat’s logic, I do basically agree with many of his non-statistical arguments and observations on document review, including, for instance, the following:

Depending on the prevalence of responsive documents and the desired margin-of-error, the effort needed to measure the accuracy of predictive coding can be more than the effort needed to conduct predictive coding.

Until a few years ago, there was basically no effort expended to measure the efficacy of eDiscovery. As computer-assisted review and other technologies became more widespread, an interest in measurement grew, in large part to convince a skeptical audience that these technologies actually worked. Now, I fear, the pendulum has swung too far in the other direction and it seems that measurement has taken over the agenda.

There is sometimes a feeling that our measurement should be as precise as possible. But when the measure is more precise than the underlying thing we are measuring, that precision gives a false sense of security. Sure, I can measure the length of a road using a yardstick and I can report that length to within a fraction of an inch, but it is dubious whether the measured distance is accurate to within even a half of a yard.

Although I agree with many of the points of Herb’s legal economic analysis in his article, The Pendulum Swings: Practical Measurement in eDiscovery, I disagree with the conclusion. The quality of the search software, and the legal search skills of attorney-users of this software, have both improved significantly in the past few years. It is now possible for relatively high recall levels to be attained, even including ranges, and even without incurring the extraordinary efforts and costs that Herb and others suggest. (As a side note, please notice that I am not opining on a specific minimum recall number. That is not helpful because it depends on too many variable factors unique to particular search projects. However, I would point out that in the TREC Legal Track studies in 2008 and 2009 the participants, expert searchers all, attained verified recall levels of only 20% to 70%. See The Legal Implications of What Science Says About Recall. All I am saying is that in my experience our recall efforts have improved and are continually improving as our software and skills improve.)

Further, although relevance and responsiveness can sometimes be vague and elusive, as Roitblat points out, and human judgments can be wrong and inconsistent, there are quality control process steps that can be taken to significantly mitigate these problems, including the often overlooked step of better dialogues with the requesting party. Legal search is not so arbitrary an exercise that it is a complete waste of time to try to accurately measure recall.

I disagree with Herb’s suggestion to the contrary based on his evaluation of legal relevance judgments. He reaches this conclusion based on the very interesting study he did with Anne Kershaw and Patrick Oot on a large-scale document review that Verizon did nearly a decade ago. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. In that review Verizon employed 225 contract reviewers and a Twentieth Century linear review method wherein low paid contract lawyers sat in isolated cubicles and read one document after another. The study showed, as Herb summarizes it, that “the reviewers agree with one another on relevance calls only about 50% of the time.” Measurement in eDiscovery at pg. 6. He takes that finding as support for his contention that consistent legal review is impossible, and so there is no need to bother with the finer points of recall intervals.

I disagree. My experience as an attorney making judgments on the relevancy of documents since 1980 tells me otherwise. It is absurd, even insulting, to call legal judgment a mere matter of coin flipping. Yes, there are well-known issues with consistency in legal review judgments in large-scale reviews, but this just makes the process more challenging, more difficult, not impossible.

Although consistent review may be impossible if large teams of contract lawyers do linear review in isolation using yesterday’s technology, that does not mean consistent legal judgments are impossible. It just means the large team linear review process is deeply flawed. That is why the industry has moved away from the approach used by the Verizon review team nearly ten years ago. We are now using predictive coding, small teams of SMEs and contract lawyers, and many new innovative quality control procedures, including soon, I hope, ei-Recall. The large team linear review approach of a decade ago, and other quality factors, were the primary causes of the inconsistencies seen in the Verizon review, not any inherent impossibility of determining legal relevance.

Good Recall Results Are Possible Without Heroic Efforts
But You Do Need Good Software and Good Methods

Even with the consistency and human error challenges inherent in all legal review, and even with the ranges of error inherent in any valid recall calculation, it is, I insist, still possible to attain relatively high recall ranges in most projects. (Again, note that I will not commit to a specific general minimum range.) I am seeing better recall ranges attained in more and more of my projects, and I am certainly not a mythical TAR-whisperer, as Grossman and Cormack somewhat tongue-in-cheek described lawyers who may have extraordinary predictive coding search skills. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ at pg. 298. Any experienced lawyer with technology aptitude can attain impressive results in large-scale document reviews. They just need to use hybrid, multimodal, CAL-type, quality controlled, search and review methods. They also need to use proven, high quality, bona fide predictive coding software. I am able to teach this in practice with bright, motivated, hard-working, technology-savvy lawyers.

Legal search is a new legal skill to be sure, just like countless others in e-discovery and other legal fields. I happen to find the search and review challenges more interesting than the large enterprise preservation problems, but they are both equally difficult and complex. TAR-whispering is probably an easier skill to learn than many others required today in the law.  (It is certainly easier than becoming a dog whisperer like Cesar Millan. I know. I’ve tried and failed many times.)

Think of the many arcane choice of law issues U.S. lawyers have faced for over a century in our 50-state, plus federal law system. Those intellectual problems are more difficult than predictive coding. Think of the tax code, securities, M&A, government regulations, class actions. It is all hard. All difficult. But it can all be learned. Like everything else in the law, large-scale document review just requires a little aptitude, hard work and lots of legal practice. It is no different from any other challenge lawyers face. It just happens to require more software skills, sampling, basic math, and AI intuition than any other legal field.

On the other point of bona fide predictive coding software, while I will not name names, as far as I am concerned the only bona fide software on the market today uses active machine learning algorithms. It does not depend on some kind of passive learning process instead (passive methods can be quite effective, but they are not predictive coding algorithms, and, in my experience, do not provide as powerful a search tool). I am sorry to say that some legal review software on the market today falsely claims to have predictive coding features when, in fact, it does not. It offers only passive learning, more like concept search than AI-enhanced search. With software like that, or even with good software where the lawyers use poor search and review methods, or do not really know what they are searching for (poor relevance scope), the efforts required to attain high recall ranges may indeed be very extensive and thus cost prohibitive, as Herb Roitblat argues. If your tools and/or methods are poor, it takes much longer to reach your goals.

One final point regarding Herb’s argument: I do not think sampling really needs to be as cost prohibitive as he and others suggest. As noted before in In Legal Search Exact Recall Can Never Be Known, one good SME and skilled contract review attorney can carefully review a sample of 1,534 documents for between $1,000 and $2,000. In large review projects that is hardly a cost prohibitive barrier. There is no need to be thinking in terms of small 385 document sample sizes, which create a huge margin of error of 5%. That is what Herb Roitblat and others do when they suggest that sampling is ineffective anyway, so just ignore intervals and ranges. Any large project can afford a full sample of 1,534 documents to cut the interval in half to a 2.5% margin of error. Many can afford much larger samples to narrow the interval range even further, especially if the tools and methods used allow them to attain their recall range goals in a fast and effective manner.
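For readers who want to check the sample size arithmetic, here is a minimal Python sketch (the function name is mine, purely for illustration) of the standard normal-approximation formula for a proportion, n = z² · p(1−p) / E², at the worst-case prevalence of p = 0.5. The simple formula yields about 1,537 rather than 1,534, a figure some online calculators report after minor adjustments; either way, cutting the margin of error in half requires roughly quadrupling the sample.

    import math

    def proportion_sample_size(margin_of_error, z=1.96, p=0.5):
        """Normal-approximation sample size for estimating a proportion:
        n = z^2 * p * (1 - p) / E^2, using worst-case prevalence p = 0.5."""
        return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

    print(proportion_sample_size(0.05))    # ~385 documents for 95% +/- 5%
    print(proportion_sample_size(0.025))   # ~1,537 documents for 95% +/- 2.5%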

John Tredennick, who, like me, is an attorney, also disagrees with Herb’s legal-economic analysis in favor of eRecall, but John proposes a solution involving larger sample sizes, wherein the increased cost burden would be shifted onto the requesting party. Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, Part Two. I do not disagree with John’s assertions in his article, and cost shifting may be appropriate in some cases. It is not, however, my intention to address the cost-shifting arguments here, or the other good points made in John’s article. Instead, my focus in the remaining section of this article will be to provide a series of examples of ei-Recall in action. For me, and I suspect for many of you, seeing a method in action is the best way to understand it.

Summary of the Five Reasons ei-Recall is the new Gold Standard

Before moving on to the examples, I wanted to summarize what we have covered so far and go over the five main reasons ei-Recall is superior to all other recall methods. First, and most important, is the fact that ei-Recall calculates a recall range, and not just one number. As shown by In Legal Search Exact Recall Can Never Be Known, recall statements must include confidence interval range values to be meaningful. Recall should not be based on point projections alone. Therefore any recall calculation method must calculate both a high and a low value. The ei-Recall method I offer here is designed to make the correct high-low interval range calculations. That, in itself, makes it a significant improvement over all point projection recall methods.

The second advantage of ei-Recall is that it uses only one random sample, not two or more. This avoids the compounding of variables, uncertainties, and outlier events inherent in any system that uses multiple chance events, multiple random samples. The costs are also better controlled in a one sample method like this, especially since the one sample is of reasonable size. This contrasts with the Direct Method, which also uses one sample, but the sample has to be insanely large. That is not only very costly, but also increases the probability of human error from inconsistent relevancy adjudications.

The timing of the one sample in ei-Recall is another of its advantages. It is taken at the end of the project when the relevance scope has been fully articulated.

Another key advantage of ei-Recall is that the True Positives used for the calculation are not estimated, and are not projected by random samples. They are documents confirmed to be relevant by multiple quality control measures, including multiple reviews of these documents by humans or computers, and often both.

Finally, ei-Recall has the advantage of simplicity, and ease of use. It can be carried out by any attorney who knows fractions. The only higher math required, the calculation of binomial confidence intervals, can be done by easily available online calculators. You do not need to hire a statistician to make the recall range calculations using ei-Recall.


First Example of How to Calculate Recall Using the ei-Recall Method

Let us begin with the same simple hypothetical used in In Legal Search Exact Recall Can Never Be Known. Here we assume a review project of 100,000 documents. By the end of the search and review, when we could no longer find any more relevant documents, we decided to stop and run our ei-Recall quality assurance test. We had by then found and verified 8,000 relevant documents, the True Positives. That left 92,000 documents presumed irrelevant that would not be produced, the Negatives.

As a side note, the decision to stop may be somewhat informed by running estimates of the possible recall range attained, based on prevalence assumptions from a sample of all documents taken at or near the beginning of the project. The prevalence-based recall range estimate would not, however, be the sole driver of the decision to stop and test. Prevalence-based recall estimates alone can be very unreliable, as shown in In Legal Search Exact Recall Can Never Be Known. That is one of the main reasons for developing the ei-Recall alternative. I explained the thinking behind the decision to stop in Visualizing Data in a Predictive Coding Project – Part Three.

In most projects I would not have stopped the review (proportionality constraints aside) unless I was confident that I had already found all types of strong relevant documents, and all highly relevant documents, even if they are cumulative. I want to find each and every instance of all hot (highly relevant) documents that exist in the entire collection. I will only stop (proportionality constraints aside) when I think the only relevant documents I have not recalled are of an unimportant, cumulative type; the merely relevant. The truth is, most documents found in e-discovery are of this type; they are merely relevant, and of little use to anyone except as a means to find strong relevant, new types of relevant, or highly relevant evidence.

Back to our hypothetical. We take a random sample of 1,534 documents (95% confidence level, +/-2.5% confidence interval) from the 92,000 Negatives. This allows us to estimate how many relevant documents had been missed, the False Negatives.

Assume we found only 5 False Negatives. Conversely, we found that 1,529 of the documents picked at random from the Negatives were in fact irrelevant as expected. They were True Negatives.

The percentage of False Negatives in this sample was thus a low 0.33% (5/1534). Using the normal, but wrong, Gaussian confidence interval, the projected total number of False Negatives in the entire 92,000 Negatives would be between 5 and 2,604 documents (0.33% + 2.5% = 2.83% * 92,000). Using the binomial interval calculation, the range would be from 0.11% to 0.76%. The more accurate binomial calculation eliminates the absurd result of a negative value at the low end of the False Negatives projection (0.33% - 2.5% = -2.17%). The fact that a negative projection arises from using the Gaussian normal distribution demonstrates why the binomial interval calculation should always be used, not the Gaussian, especially in low prevalence collections. From this point forward, in accordance with the ei-Recall method, we will only use the more accurate binomial range calculations. Here the correct range generated by the binomial interval is from 101 (92,000 * 0.11%) to 699 (92,000 * 0.76%) False Negatives. Thus the FNh value is 699, and FNl is 101.
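To make the contrast concrete, here is a short Python sketch (assuming the scipy library is available; the variable names are mine) that reproduces this arithmetic. The crude Gaussian approach of adding and subtracting the fixed 2.5% design interval goes negative, while the exact binomial (Clopper-Pearson) interval stays within sensible bounds.

    from scipy.stats import beta

    n, k = 1534, 5            # elusion sample size and False Negatives found
    negatives = 92_000
    p_hat = k / n             # 0.33% point projection

    # Crude Gaussian approach: point projection +/- the fixed 2.5% design interval
    print(p_hat - 0.025, p_hat + 0.025)     # the lower bound is negative, an absurd result

    # Exact binomial (Clopper-Pearson) 95% interval via beta quantiles
    lo = beta.ppf(0.025, k, n - k + 1)      # ~0.11%
    hi = beta.ppf(0.975, k + 1, n - k)      # ~0.76%
    print(lo, hi)
    print(negatives * lo, negatives * hi)   # roughly 100 to 700 projected False Negatives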

The calculation of the lowest end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 8,000 / (8,000 + 699) = 91.96%

The calculation of the highest end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 8,000 / (8,000 + 101) = 98.75%.

Our final recall range values for this first hypothetical are thus 92% – 99% recall. It was an unusually good result.
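Putting the pieces together, here is a minimal Python sketch of the ei-Recall calculation as described above (again assuming scipy for the exact binomial interval; the function names are mine and not part of any product). Small differences from the rounded figures in the text come from using unrounded interval percentages.

    from scipy.stats import beta

    def binomial_interval(k, n, confidence=0.95):
        """Exact (Clopper-Pearson) binomial confidence interval for k hits in n trials."""
        alpha = 1.0 - confidence
        lo = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
        hi = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
        return lo, hi

    def ei_recall(tp, negatives, sample_size, fn_in_sample, confidence=0.95):
        """ei-Recall range: Rl = TP / (TP + FNh) and Rh = TP / (TP + FNl),
        where FNl and FNh are the low and high projections of False Negatives
        from a single elusion sample of the Negatives."""
        p_lo, p_hi = binomial_interval(fn_in_sample, sample_size, confidence)
        fn_low, fn_high = negatives * p_lo, negatives * p_hi
        return tp / (tp + fn_high), tp / (tp + fn_low)

    # First hypothetical: 8,000 True Positives, 92,000 Negatives,
    # and 5 False Negatives found in an elusion sample of 1,534.
    print(ei_recall(8_000, 92_000, 1534, 5))   # roughly (0.92, 0.99)

The variations that follow can be reproduced by changing only the last argument to 20 or 40.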


Ex. 1 – 92% – 99%

It is important to note that we could have still failed this quality assurance test, in spite of the high recall range shown, if any of the five False Negatives found had been a highly relevant, or unique strong relevant, document. That is in accord with the accept on zero error standard that I always apply to the final elusion sample, a standard having nothing directly to do with ei-Recall. Still, I recommend that the e-discovery community adopt this as a corollary when implementing ei-Recall. I have previously explained this zero error quality assurance protocol on this blog several times, most recently in Visualizing Data in a Predictive Coding Project – Part Three, where I explained:

I always use what is called an accept on zero error protocol for the elusion test when it comes to highly relevant documents. If any are highly relevant, then the quality assurance test automatically fails. In that case you must go back and search for more documents like the one that eluded you and must train the system some more. I have only had that happen once, and it was easy to see from the document found why it happened. It was a black swan type document. It used odd language. It qualified as highly relevant under the rules we had developed, but just barely, and it was cumulative. Still, we tried to find more like it and ran another round of training. No more were found, but still we did a third sample of the null set just to be sure. The second time it passed.

Variations of First Example with Higher False Negatives Ranges

I want to provide two variations of this hypothetical where the sample of the null set, Negatives, finds more mistakes, more False Negatives. Variations like this will provide a better idea of the impact of the False Negatives range on the recall calculations. Further, the first example wherein I assumed that only five mistakes were found in a sample of 1,534 is somewhat unusual. A point projection ratio of 0.33% for elusion is on the low side for a typical legal search project. In my experience in most projects a higher rate of False Negatives will be found, say in the 0.5% to 2% range.

Let us assume for the first variation that instead of finding 5 False Negatives, we find 20. That is a quadrupling of the False Negatives. It means that we found 1,514 True Negatives and 20 False Negatives in the sample of 1,534 documents from the 92,000 document discard pile. This creates a point projection of 1.30% (20 / 1534), and a binomial range of 0.8% to 2.01%. This generates a projected range of total False Negatives of from 736 (92,000 * .8%) to 1,849 (92,000 * 2.01%).

Now let’s see how this quadrupling of errors found in the sample impacts the recall range calculation.

The calculation of the low end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 8,000 / (8,000 + 1,849) = 81.23%

The calculation of the high end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 8,000 / (8,000 + 736) = 91.58%.

Our final recall range values for this variation of the first hypothetical are thus 81% – 92%.

In this first variation the quadrupling of the number of False Negatives found at the end of the project, from 5 to 20, caused an approximate 10% decrease in recall values from the first hypothetical where we attained a recall range of 92% to 99%.


Ex. 2 – 81% – 92%

Let us assume a second variation that instead of finding 5 False Negatives, finds 40. That is eight times the number of False Negatives found in the first hypothetical. It means that we found 1,494 True Negatives and 40 False Negatives in the sample of 1,534 documents from the 92,000 document discard pile. This creates a point projection of 2.61% (40/1534), and a binomial range of 1.87% to 3.53%. This generates a projected range of total False Negatives of from 1,720 (92,000*1.87%) to 3,248 (92,000*3.53%).

The calculation of the low end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 8,000 / (8,000 + 3,248) = 71.12%

The calculation of the high end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 8,000 / (8,000 + 1,720) = 82.30%.

Our recall range values for this second variation of the first hypothetical are thus 71% – 82%.

In this second variation the eightfold increase in the number of False Negatives found at the end of the project, from 5 to 40, caused an approximate 20% decrease in recall values from the first hypothetical, where we attained a recall range of 92% to 99%.


Ex. 3 – 71% – 82%

Second Example of How to Calculate Recall Using the ei-Recall Method

We will again go back to the second example used in In Legal Search Exact Recall Can Never Be Known. The second hypothetical assumes a total collection of 1,000,000 documents and that 210,000 relevant documents were found and verified.

In the random sample of 1,534 documents (95%+/-2.5%) from the 790,000 documents withheld as irrelevant (1,000,000 – 210,000) we assume that only ten mistakes were uncovered, in other words, 10 False Negatives. Conversely, we found that 1,524 of the documents picked at random from the discard pile (another name for the Negatives) were in fact irrelevant as expected; they were True Negatives.

The percentage of False Negatives in this sample was thus 0.65% (10/1534). Using the binomial interval calculation the range would be from 0.31% to 1.2%. The range generated by the binomial interval is from  2,449 (790,000*0.31%) to 9,480 (790,000*1.2%) False Negatives.

The calculation of the lowest end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 9,480) = 95.68%

The calculation of the highest end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 2,449) = 98.85%.

Our recall range for this second hypothetical is thus 96% – 99% recall. This is a highly unusual, truly outstanding result. It is, of course, still subject to the outlier uncertainty inherent in the confidence level. In that sense my labels on the diagram below of “worst” or “best” case scenario are not correct. It could be better or worse in five out of every one hundred samples drawn, in accord with the 95% confidence level. See the discussion near the end of my article In Legal Search Exact Recall Can Never Be Known regarding the role that luck necessarily plays in any random sample. This could have been a lucky draw, but nevertheless, it is just one quality assurance factor among many, and is still an extremely good recall range achievement.


Ex.4 – 96% – 99%

Variations of Second Example with Higher False Negatives Ranges

I now offer three variations of the second hypothetical where each has a higher False Negative rate. These examples should better illustrate the impact of the elusion sample on the overall recall calculation.

Let us first assume that instead of finding 10 False Negatives, we find 20, a doubling of the rate. This means that we found 1,514 True Negatives and 20 False Negatives in the sample of 1,534 documents in the 790,000 document discard pile. This creates a point projection of 1.30% (20/1534), and a binomial range of 0.8% to 2.01%. This generates a projected range of total False Negatives of from 6,320 (790,000*.8%) to 15,879 (790,000*2.01%).


Now let us see how this doubling of errors in the second sample impacts the recall range calculation.

The calculation of the low end of the recall range is: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 15,879) = 92.97% 

The calculation of the high end of the recall range is: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 6,320) = 97.08%.

Our recall range for this first variation of the second hypothetical is thus 93% – 97%.

The doubling of the number of False Negatives from 10 to 20, caused an approximate 2.5% decrease in recall values from the second hypothetical where we attained a recall range of 96% to 99%.


Ex. 5 – 93% – 97%

Let us assume a second variation where instead of finding 10 False Negatives at the end of the project, we find 40. That is a quadrupling of the number of False Negatives found in the second hypothetical. It means that we found 1,494 True Negatives and 40 False Negatives in the sample of 1,534 documents from the 790,000 document discard pile. This creates a point projection of 2.61% (40/1534), and a binomial range of 1.87% to 3.53%. This generates a projected range of total False Negatives of from 14,773 (790,000*1.87%) to 27,887 (790,000*3.53%).

The calculation of the low end of the recall range is now: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 27,887) = 88.28%

The calculation of the high end of the recall range is now: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 14,773) = 93.43%.

Our recall range for this second variation of the second hypothetical is thus 88% – 93%.

The quadrupling of the number of False Negatives from 10 to 40, caused an approximate 7% decrease in recall values from the original where we attained a recall range of 96% to 99%.

Ex. 6 – 88% – 93%

If we do a third variation and increase the number of False Negatives found eightfold, from 10 to 80, this changes the point projection to 5.22% (80/1534), with a binomial range of 4.16% to 6.45%. This generates a projected range of total False Negatives of from 32,864 (790,000*4.16%) to 50,955 (790,000*6.45%).

The calculation of the low end of the recall range is: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 50,955) = 80.47%.

The calculation of the high end of the recall range is: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 32,864) = 86.47%.

Our recall range for this third variation of the second hypothetical is thus 80% – 86%.

The eightfold increase of the number of False Negatives, from 10 to 80, caused an approximate 15% decrease in recall values from the second hypothetical where we attained a recall range of 96% to 99%.


Ex. 7 – 80% – 86%

By now you should have a pretty good idea of how the ei-Recall calculation works, and a feel for how the number of False Negatives found impacts the overall recall range.

Third Example of How to Calculate Recall Using the ei-Recall Method where there is Very Low Prevalence

A criticism of many recall calculation methods is that they fail and become completely useless in very low prevalence situations, say 1%, or sometimes even less. Such low prevalence is considered by many to be common in legal search projects.

Obviously it is much harder to find things that are very rare, such as the famous, and very valuable, Inverted Jenny postage stamp with the upside down plane. These stamps exist, but not many. Still, it is at least possible to find them (or buy them), as opposed to a search for a Unicorn or other complete fiction. (Please, Unicorn lovers, no hate mail!) These creatures cannot be found no matter how many searches and samples you take because they do not exist. There is absolute zero prevalence.

This circumstance sometimes happens in legal search, where one side claims that mythical documents must exist because they want them to. They have a strong suspicion of their existence, but no proof. More like hope, or wishful thinking. No matter how hard you look for such smoking guns, you cannot find them. You cannot find something that does not exist. All you can do is show that you made reasonable, good faith efforts to find the Unicorn documents, and they did not appear. Recall calculations make no sense in crazy situations like that because there is nothing to recall. Fortunately that does not happen too often, but it does happen, especially in the wonderful world of employment litigation.

We are not going to talk further about a search for something that does not exist, like a Unicorn, the zero prevalence. We will not even talk about the extremely, extremely rare, like the Inverted Jenny. Instead we are going to talk about prevalence of about 1%, which is still very low.

In many cases, but not all, very low prevalence like 1%, or less, can be avoided, or at least mitigated, by intelligent culling. This certainly does not mean filtering out all documents that do not have certain keywords. There are other, more reliable methods than simple keywords to eliminate superfluous irrelevant documents, including elimination by file type, date ranges, custodians, and email domains, among other things.

When there is a very low prevalence of relevant documents, this necessarily means that there will be a very large Negatives pool, thus diluting the sampling. There are ways to address the large Negatives sample pool, as I discussed previously. The most promising method is to cull out the low end of the probability rankings where relevant documents should anyway be non-existent.

Even with the smartest culling possible, low prevalence is often still a problem in legal search. For that reason, and because it is the hardest test for any recall calculation method, I will end this series of examples with a completely new hypothetical that considers a very low prevalence situation of only 1%. This means that there will be a large size Negatives pool: 99% of the total collection.

We will again assume a 1,000,000 document collection, and again assume sample sizes using 95% +/-2.5% confidence level and interval parameters. An initial sample of all documents taken at the beginning of the project to give us a rough sense of prevalence for search guidance purposes (not recall calculations), projected a range of relevant documents of from 5,500 to 16,100.

The lawyers in this hypothetical legal search project plodded away for a couple of weeks and found and confirmed 9,000 relevant documents, True Positives all. At this point they are finding it very difficult and time consuming to find more relevant documents. What they do find is just more of the same. They are sophisticated lawyers who read my blog and have a good grasp of the nuances of sampling. So they know better than to simply rely on a point projection of prevalence to calculate recall, especially one based on a relatively small sample of a million documents taken at the beginning of the project. See In Legal Search Exact Recall Can Never Be Known. They know that their recall level could be as low as 56% (9,000/16,100), or perhaps far less, in the event the one sample they took was a confidence level outlier, or there was more concept drift than they thought. It could also be near perfect, 100% recall, when they consider the binomial interval range going the other way; the 9,000 documents they had found were well above the low range of 5,500. But they did not consider that too likely.

They decide to stop the search and take a second 1,534 document sample, but this time of the 991,000 null set (1,000,000 – 9,000). They want to follow the ei-Recall method, and they also want to test for any highly relevant or unique strong relevant documents by following the accept on zero error quality assurance test. They find -1- relevant document in that sample. It is just another more-of-the-same, merely relevant document. They had seen many like it before. Finding a document like that meant that they passed the quality assurance test they had set up for themselves. It also meant that, using the binomial interval for 1/1534, which runs from 0.00% to 0.36%, there is a projected range of False Negatives of between -0- and 3,568 documents (991,000*0.36%). (Actually, a binomial calculator that shows more decimal places than any I have found on the web, something I hope we can fix soon, would not show zero percent, but some very small percentage less than one hundredth of a percent, and thus some small number of documents, not -0-, and thus something slightly less than 100% recall.)
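On the parenthetical point about decimal places, an exact binomial calculation does show a tiny, non-zero lower bound for 1/1534. A quick Python sketch (assuming scipy) illustrates:

    from scipy.stats import beta

    n, k, negatives = 1534, 1, 991_000
    lo = beta.ppf(0.025, k, n - k + 1)     # ~0.0017%, small but not exactly zero
    hi = beta.ppf(0.975, k + 1, n - k)     # ~0.36%
    print(negatives * lo, negatives * hi)  # roughly 16 to 3,570 projected False Negatives
    # With FNl of roughly 16, the high end of recall is 9,000 / 9,016, just under 100%.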

They then took out the ei-Recall formula and plugged in the values to see what recall range they ended up with. They were hoping it was tighter, and more reliable, than the 56% to 100% recall range they had calculated from the first sample alone based on prevalence.

Calculation for the low end of the recall range: Rl = TP / (TP+FNh) = 9,000 / (9,000 + 3,568) = 71.61%.  

Calculation for the high end of the recall range: Rh = TP / (TP+FNl) = 9,000 / (9,000 + 0) = 100%.

The recall range using ei-Recall was 72% – 100%.


Ex. 8 – 72% – 100%

The attorneys’ hopes in this extremely low prevalence hypothetical were met. The 72% – 100% estimated recall range was much tighter than the original 56% – 100%. It was also more reliable because it was based on a sample taken at the end of the project when relevance was well defined. Although this sample did not, in and of itself, prove that a reasonable legal effort had been made, it did strongly support that position. When considering all of the many other quality control efforts they could report, if challenged, they were comfortable with the results. Assuming that they did not miss a highly relevant document that later turns up in discovery, it is very unlikely they will ever have to redo, or even continue, this particular legal search and review project.

Would the result have been much different if they had doubled the sample size, and thus doubled the cost of this quality control effort? Let us do the math and find out, assuming that everything else was the same.

This time the sample is 3,068 documents from the 991,000 null set. They find two relevant documents, False Negatives, of a kind they had seen many times before. This created a binomial range of 0.01% to 0.24%, projecting a range of False Negatives from 99 to 2,378 (991,000 * 0.01% — 991,000 * 0.24%). That creates a recall range of 79% – 99%.

Rl = TP / (TP+FNh) = 9,000 / (9,000 + 2,378) = 79.1%.  

Rh = TP / (TP+FNl) = 9,000 / (9,000 + 99) = 98.91%.


Ex. 9 – 79% – 99%

In this situation, by doubling the sample size, the attorneys were able to narrow the recall range from 72% – 100% to 79% – 99%. But was it worth the effort and the doubling of cost? I do not think so, at least not in most cases. But perhaps in larger cases it would be worth the expense to tighten the range and so increase somewhat the defensibility of your efforts. After all, we are assuming in this hypothetical that the same proportional results would turn up in a sample size double that of the original. The results could have been much worse, or much better. Either way, your results would be more reliable than an estimate based on a sample half that size, and would have produced a tighter range. Also, you may sometimes want to take a second sample of the same size, if you suspect the first was an outlier.

Let us consider one more example, this time with an even smaller prevalence and a larger document collection. This is the hardest challenge of all, a near Inverted Jenny puzzler. Assume a document collection of 2,000,000 and a prevalence based on a first random sample for search-help purposes, where again only one relevant document was found in a sample of 1,534. This suggested there could be as many as 7,200 relevant documents (0.36% * 2,000,000). So in this second low-prevalence hypothetical we are talking about a dataset where the prevalence may be far less than one percent.

Assume next that only 5,000 relevant documents were found, True Positives. A sample of 1,534 from the remaining 1,995,000 documents found -3- relevant documents, False Negatives. The binomial interval for 3/1534 is from 0.04% to 0.57%, producing a projected range of False Negatives of between 798 and 11,372 documents (1,995,000 * .04% — 1,995,000 * 0.57%). Under ei-Recall the recall range measured is 31% – 86%.

Rl = TP / (TP+FNh) = 5,000 / (5,000 + 11,372) = 30.54%.  

Rh = TP / (TP+FNl) = 5,000 / (5,000 + 798) = 86.24%.

31% – 86% is a big range. Most would think too big, but remember, it is just one quality assurance indicator among many.


Ex. 10 – 31% – 86%

The size of the range could be narrowed by a larger sample. (It is also possible to take two samples, and, with some adjustment, add them together as one sample. This is not mathematically perfect, but fairly close, if you adjust for any overlaps, which anyway would be unlikely.) Assume the same proportions where we sample 3,068 documents from 1,995,000 Negatives, and find -6- relevant, False Negatives. The binomial range is 0.07% – 0.43%. The projected number of False Negatives is 1,397 – 8,579 (1,995,000*.07% – 1,995,000*.43%). Under ei-Recall the range is 37% – 78%.

Rl = TP / (TP+FNh) = 5,000 / (5,000 + 8,579) = 36.82%.  

Rh = TP / (TP+FNl) = 5,000 / (5,000 + 1,397) = 78.16%.


Ex. 11 – 37% – 78%

The range has been narrowed, but is still very large. In situations like this, where there is a very large Negatives set, I would suggest taking a different approach. As discussed in Part One, you may want to consider a rational culling down of the Negatives. The idea is similar to that behind stratified sampling. You create a subset, or stratum, of the entire collection of Negatives that has a higher, hopefully much higher, prevalence of False Negatives than the entire set. See e.g. William Webber, Control samples in e-discovery (2013) at pg. 3.

Although Webber’s paper only uses keywords as an example of an easy way to create a stratum, in modern legal search there are a number of methods that can be used to create the strata, only one of which is keywords. I use a combination of many methods that varies with the data set and other factors. I call that a multimodal method. In most cases (but not all), this is not too hard to do, even if you are doing the stratification before active machine learning begins. The non-AI based culling methods that I use, typically before active machine learning begins, include parametric Boolean keywords, concept, key player, key time, similarity, file type, file size, domains, etc.

After the predictive coding begins and ranking matures, you can also use probable relevance ranking as a method of dividing documents into strata. It is actually the most powerful of the culling methods, especially when it comes to predicting irrelevant documents. The second filter level is performed at or near the end of a search and review project. (This is all shown in the two-filter diagram above, which I may explain in greater detail in a future blog.) The second AI based filter can be especially effective in limiting the Negatives size for the ei-Recall quality assurance test. The last example will show how this works in practice.

We will begin this example as before, assuming again 2,000,000 documents where the search finds only 5,000. But this time, before we take a sample of the Negatives, we divide them into two strata. Assume, as we did in the example we considered in Part One, that the predictive coding resulted in a well-defined distribution of ranked documents. Assume that all 5,000 documents found were in the 50%, or higher, probable relevance ranking (shown in red in the diagram). Assume that all of the 1,995,000 presumed irrelevant documents are ranked 49.9%, or less, probable relevant (shown in blue in the diagram). Finally assume that 1,900,000 of these documents are ranked 10% or less probable relevant. That leaves 95,000 documents ranked between 10.1% and 49.9%.

Assume also that we have good reason to believe, based on our experience with the software tool used and the document collection itself, that all, or almost all, False Negatives are contained in the 95,000 document group. We therefore limit our random sample of 1,534 documents to this 95,000 document lower midsection of the Negatives. Finally, assume we now find -30- relevant documents, False Negatives, none of them important.

The binomial range for 30/1534 is 1.32% – 2.78%, and this time the projected number of False Negatives is based on the 95,000 document stratum: 1,254 – 2,641 (95,000*1.32% — 95,000*2.78%). Under ei-Recall the range is 65% – 80%.

Rl = TP / (TP+FNh) = 5,000 / (5,000 + 2,641) = 65.43%.

Rh = TP / (TP+FNl) = 5,000 / (5,000 + 1,254) = 79.95%.
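The arithmetic here is the same as before; only the projection base changes from the full set of Negatives to the 95,000 document stratum. A minimal Python check, assuming scipy:

    from scipy.stats import beta

    tp, stratum, n, k = 5_000, 95_000, 1534, 30
    lo = beta.ppf(0.025, k, n - k + 1)              # ~1.32%
    hi = beta.ppf(0.975, k + 1, n - k)              # ~2.78%
    fn_low, fn_high = stratum * lo, stratum * hi    # roughly 1,254 and 2,641
    print(tp / (tp + fn_high), tp / (tp + fn_low))  # roughly 0.65 and 0.80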

We see that culling down the Negative set of documents in a defensible manner can lead to a much tighter recall range. Assuming we did the culling correctly, the resulting recall range would also be more accurate. On the other hand, if the culling was wrong, based on incorrect presumptions, then the resulting recall range would be less accurate.


Ex. 12 – 65% – 80%

The fact is, no random sampling techniques can provide completely reliable results in very low prevalence data sets. There is no free lunch, but, at least with ei-Recall the bill for your lunch is honest because it includes ranges. Moreover, with intelligent culling to increase the probable prevalence of False Negatives, you are more likely to get a good meal.

Conclusion

There are five basic advantages of ei-Recall over other recall calculation techniques:

  1. Interval Range values are calculated, not just a deceptive point value. As shown by In Legal Search Exact Recall Can Never Be Known, recall statements must include confidence interval range values to be meaningful.
  2. One Sample only is used, not two, or more. This limits the uncertainties inherent in multiple random samples.
  3. End of Project is when the sample of the Negatives is taken for the calculation. At that time the relevance scope has been fully developed.
  4. Confirmed Relevant documents that have been verified as relevant by iterative reviews, machine and human, are used for the True Positives. This eliminates another variable in the calculation.
  5. Simplicity is maintained in the formula by reliance on basic fractions and common binomial confidence interval calculators. You do not need an expert to use it.

I suggest you try ei-Recall. It has been checked out by multiple information scientists and will no doubt be subject to more peer review here and elsewhere. Be cautious in evaluating any criticisms you may read of ei-Recall from persons with a vested monetary interest in the defense of a competitive formula, especially vendors, or experts hired by vendors. Their views may be colored by their monetary interests. I have no skin in the game. I offer no products that include this method. My only goal is to provide a better method to validate large legal search projects, and so, in some small way, to improve the quality of our system of justice. The law has given me much over the years. This method, and my other writings, are my personal payback.

I offer ei-Recall to anyone and everyone, no strings attached, no payments required. Vendors, you are encouraged to include it in your future product offerings. I do not want royalties, nor do I even insist on credit (although you can give it if you wish, assuming you do not make it seem like I endorse your product). ei-Recall is all part of the public domain now. I have no product to sell here, nor do I want one. I do hope, however, to create an online ei-Recall calculator soon. When I do, that too will be a giveaway.

My time and services as a lawyer to implement ei-Recall are not required. Simplicity is one of its strengths, although it helps if you are part of the eLeet. I think I have fully explained how it works in this lengthy article. Still, if you have any non-legal technical questions about its application, send me an email, and I will try to help you out. Gratis of course. Just realize that I cannot by law provide you with any legal advice. All articles in my blog, including this one, are purely for educational purposes, and are not legal advice, nor in any way a solicitation for legal services. Show this article to your own lawyer or e-discovery vendor. You do not have to be 1337 to figure it out (although it helps).