Managing Volume Through Near-Dups
Several factors drive the explosion of electronically stored information (ESI) that forms the nucleus of any litigation.
One is the simple fact that it's easy and fairly inexpensive to store. Most companies' IT organizations make routine backup copies of all users' data, and with offline devices such as portable computers, local storage is essentially automatic.
Another is that email is the language of commerce, and each reply to an email typically carries a copy of the earlier messages, so a single thread adds many copies of the same email to the trail.
A third is the challenge of developing comprehensive records-retention plans, so companies end up implementing a "save everything" strategy.
Emails that are replies to or copies of other emails don't necessarily show up as "duplicate" documents, because their metadata and other information have changed. But they are "near-dups" of an original document. The problem is compounded when attachments are sent to multiple individuals who make comments or edits. The result is a population of documents that are very nearly identical.
It's expensive and inefficient to review every one of these documents separately, and different reviewers have been known to code nearly duplicate documents differently. The sheer volume of ESI makes it almost impossible to devote individual time and effort to every document, but the risks of spoliation and of inadvertently disclosing privileged material mandate that review be careful, methodical, and inclusive.
CAAT has several capabilities to handle this. It can identify, at high volume, documents that are nearly duplicates of one another based on textual similarity, such as edits or forwards of the same message. It groups nearly duplicate documents together and identifies what has changed, so reviewers can read one central document in full and then look only at the differences in the documents that are its near-dups.
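The idea of grouping by textual similarity and then reviewing only the differences can be illustrated with a small sketch. This is not CAAT's algorithm, which is not described here. It assumes a generic technique (word shingles compared with Jaccard similarity) and uses hypothetical names throughout (shingles, jaccard, group_near_dups, changes), along with an arbitrary 0.5 threshold chosen only for illustration.

```python
import difflib


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_near_dups(docs, threshold=0.5):
    """Greedy grouping (an assumption, not CAAT's method).

    Each document joins the first group whose central ("pivot") document
    it resembles closely enough; otherwise it starts a new group.
    Returns a list of (pivot_id, [member_ids]) tuples.
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    groups = []
    for doc_id in docs:
        for pivot_id, members in groups:
            if jaccard(sets[doc_id], sets[pivot_id]) >= threshold:
                members.append(doc_id)
                break
        else:
            groups.append((doc_id, [doc_id]))
    return groups


def changes(pivot_text, other_text):
    """Word-level additions and deletions relative to the pivot document."""
    diff = difflib.ndiff(pivot_text.split(), other_text.split())
    return [d for d in diff if d.startswith(("+ ", "- "))]


# Hypothetical documents, for illustration only.
docs = {
    "original": "Please review the attached draft contract before the Friday meeting with the client",
    "forward": "Please review the attached revised contract before the Friday meeting with the client",
    "unrelated": "Lunch menu for the cafeteria this week includes soup and salad",
}

for pivot_id, members in group_near_dups(docs):
    print(pivot_id, members)
    for member in members[1:]:
        print("  changed in", member, ":", changes(docs[pivot_id], docs[member]))
```

In this sketch a reviewer would read "original" in full and then see only the one-word change in "forward". Comparing every document against every pivot does not scale to large collections; a production system would need an indexing or hashing scheme that this example leaves out.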
CAAT takes this further with conceptual near-duplicate document identification. Such documents may be edits of a single, original document, but the order of sentences or paragraphs, or the inclusion of new material, makes them literally different. Or there may be different authors' notes on the same issues or discussions. These documents are highly similar on a conceptual level, and CAAT will identify them as such. This makes near-duplicate document review even more efficient by grouping all "very-like" documents into a common review set.
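A rough way to see why reordered documents need a different comparison is sketched below. It is a simplified stand-in, not CAAT's conceptual analysis, whose method is not described here: it assumes a plain bag-of-words vector compared with cosine similarity, which ignores word order entirely. The names term_vector and cosine and the small stopword list are hypothetical.

```python
import math
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "we", "should", "for"}


def term_vector(text):
    """Count the non-stopword terms in text, ignoring order and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents: the second is the first with its sentences swapped.
original = "The supplier missed the delivery date. We should review the penalty clause."
reordered = "We should review the penalty clause. The supplier missed the delivery date."
unrelated = "Cafeteria lunch menu includes soup and salad."

print(cosine(term_vector(original), term_vector(reordered)))
print(cosine(term_vector(original), term_vector(unrelated)))
```

The reordered pair scores at or near 1 even though a line-by-line comparison would treat the two texts as different, while the unrelated pair scores 0. A term-count comparison like this one still depends on shared vocabulary, so it would not by itself match different authors' notes that describe the same issue in different words.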
Near-duplicate document detection dramatically increases review speeds and greatly reduces, and possibly eliminates, the inconsistent coding that results when different reviewers handle nearly duplicate documents.

Copyright 2012 Content Analyst Company, LLC All rights reserved.