Using a sociological approach in “e-discovery technologies”

Legal cases can generate a tremendous number of documents that each side needs to examine. With new search technology, legal teams can now go through a lot more data for a lot less money. In one example, “Blackstone Discovery of Palo Alto, Calif., helped analyze 1.5 million documents for less than $100,000.” But within this discussion, the writer suggests that these searches can be done in two ways:

E-discovery technologies generally fall into two broad categories that can be described as “linguistic” and “sociological.”

The most basic linguistic approach uses specific search words to find and sort relevant documents. More advanced programs filter documents through a large web of word and phrase definitions. A user who types “dog” will also find documents that mention “man’s best friend” and even the notion of a “walk.”

The sociological approach adds an inferential layer of analysis, mimicking the deductive powers of a human Sherlock Holmes. Engineers and linguists at Cataphora, an information-sifting company based in Silicon Valley, have their software mine documents for the activities and interactions of people — who did what when, and who talks to whom. The software seeks to visualize chains of events. It identifies discussions that might have taken place across e-mail, instant messages and telephone calls…

The Cataphora software can also recognize the sentiment in an e-mail message — whether a person is positive or negative, or what the company calls “loud talking” — unusual emphasis that might give hints that a document is about a stressful situation. The software can also detect subtle changes in the style of an e-mail communication.

A shift in an author’s e-mail style, from breezy to unusually formal, can raise a red flag about illegal activity.
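
To make the distinction a bit more concrete, here is a rough Python sketch of the two ideas. It is not what Cataphora or any other vendor actually runs. The tiny concept table, the sample messages, and the function names (`linguistic_search`, `who_talks_to_whom`, `loud_talking_score`) are all my own assumptions for illustration, and the "loud talking" measure is a deliberately crude stand-in for whatever the real software does.

```python
from collections import Counter

# Hypothetical concept table: each search word maps to related phrases.
RELATED_PHRASES = {
    "dog": ["dog", "man's best friend", "walk"],
}

# Made-up messages as (sender, recipient, text) tuples.
MESSAGES = [
    ("alice", "bob", "Taking man's best friend out for a walk at noon."),
    ("bob", "carol", "Quarterly numbers attached."),
    ("alice", "bob", "We need to talk about this NOW!!! DELETE the file."),
]


def linguistic_search(query, messages):
    """Return messages whose text mentions the query or a related phrase."""
    phrases = RELATED_PHRASES.get(query.lower(), [query.lower()])
    return [m for m in messages if any(p in m[2].lower() for p in phrases)]


def who_talks_to_whom(messages):
    """Count how many messages each (sender, recipient) pair exchanged."""
    return Counter((sender, recipient) for sender, recipient, _ in messages)


def loud_talking_score(text):
    """Crude emphasis measure: share of words in all caps or containing '!'."""
    words = text.split()
    if not words:
        return 0.0
    loud = [w for w in words if "!" in w or (w.isupper() and len(w) > 1)]
    return len(loud) / len(words)
```

The first function only looks at words; the other two look at people and tone, which is roughly the line the article draws between "linguistic" and "sociological." A real system would presumably need far richer phrase tables and far better signals than counting capital letters.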

So this second technique gets branded as “sociological” because it looks for patterns of behavior and interaction. If you wondered how the programmers set up their code in order to do this kind of analysis, it sounds like some academics have been working on the problem for almost a decade:

[A computer scientist] bought a copy of the database [of Enron emails] for $10,000 and made it freely available to academic and corporate researchers. Since then, it has become the foundation of a wealth of new science — and its value has endured, since privacy constraints usually keep large collections of e-mail out of reach. “It’s made a massive difference in the research community,” Dr. McCallum said.

The Enron Corpus has led to a better understanding of how language is used and how social networks function, and it has improved efforts to uncover social groups based on e-mail communication.
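
As a very simplified illustration of "uncovering social groups based on e-mail communication," here is a Python sketch that treats each sender and recipient pair as a link and then collects everyone who is connected, directly or indirectly. This is plain connected-components grouping, which I am swapping in for the idea; I am not claiming it is the method any of the Enron Corpus researchers used. The names in `EMAIL_PAIRS` and the function `social_groups` are made up for the example.

```python
from collections import defaultdict

# Made-up (sender, recipient) pairs; not drawn from the Enron Corpus.
EMAIL_PAIRS = [
    ("ann", "ben"),
    ("ben", "cat"),
    ("dov", "eve"),
    ("cat", "ann"),
]


def social_groups(pairs):
    """Group people who are linked, directly or indirectly, by e-mail."""
    # Build an undirected "who e-mailed whom" graph.
    neighbors = defaultdict(set)
    for sender, recipient in pairs:
        neighbors[sender].add(recipient)
        neighbors[recipient].add(sender)

    seen = set()
    groups = []
    for person in sorted(neighbors):
        if person in seen:
            continue
        # Walk outward from this person to collect everyone reachable.
        group, stack = set(), [person]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(neighbors[current] - seen)
        groups.append(sorted(group))
    return groups


print(social_groups(EMAIL_PAIRS))
```

With the sample pairs above, ann, ben, and cat end up in one group and dov and eve in another. Anything resembling real research would presumably weigh how often and when people write to each other rather than just whether they ever did.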

Were any sociologists involved in this research to provide input on what the programs should be looking for in human interactions?

This sort of analysis software could be very handy for sociological research when one has hundreds of documents or sources to look through. Of course, the algorithms might have to be changed for specific projects or settings, but I wonder if this sort of software might be widely available in a few years. Would this analysis be better than going through documents one by one in coding software like Atlas.Ti or NVivo?

500 to 1

I contemplated the effects of technological changes on law jobs several weeks ago when I posted a link to news reports about IBM’s Watson winning Jeopardy. The New York Times has written what essentially amounts to a follow-up article, and it’s eye-opening:

Quantifying the employment impact of these new technologies [that help automate the legal discovery process] is difficult. Mike Lynch, the founder of Autonomy, is convinced that “legal is a sector that will likely employ fewer, not more, people in the U.S. in the future.” He estimated that the shift from manual document discovery to e-discovery would lead to a manpower reduction in which one lawyer would suffice for work that once required 500 and that the newest generation of software, which can detect duplicates and find clusters of important documents on a particular topic, could cut the head count by another 50 percent. [emphasis added]

To be sure, 500:1 may just be the talking point of a businessman who is trying to sell his particular solution. Nonetheless, it seems clear that technology like Mr. Lynch’s is already fundamentally altering the economics of the legal profession.  We probably are headed towards a future with fewer lawyers (at least, ones performing discovery-related tasks).

What are some of the broader economic implications? The NYTimes piece also quotes David H. Autor, an economics professor at the Massachusetts Institute of Technology:

“There is no reason to think that technology creates unemployment,” Professor Autor said. “Over the long run we find things for people to do. The harder question is, does changing technology always lead to better jobs? The answer is no.”

Vast worlds of discovery

In case you thought the age of discovery was over, Wired’s Threat Level blog is reporting that George Hotz, the 21-year-old hacker who released the PlayStation 3 jailbreak, has been ordered to surrender

any and all computer hardware and peripherals containing circumvention devices, technologies, programs, parts thereof, or other unlawful material, including but not limited to code and software, hard disc drives, computer software, inventory of CD-ROMS, computer diskettes, or other material containing circumvention devices, technologies, programs, parts thereof, or other unlawful material.

As Hotz’s lawyer put it,

The information sought at issue [the jailbreak code] is less than 100 kilobytes of data. Mr. Hotz has terabytes of storage devices….Impounding his computers, it’s like starting a forest fire to cut down a single tree.

Though the court’s order does seem like overkill, it is unfortunately a typically broad discovery request. Sony may simply be trying to harass Hotz and/or hamper any future work, a theory especially plausible insofar as the court also ordered that Hotz “shall retrieve” the jailbreak he posted. Given the number of websites that have re-posted Hotz’s original code, this would seem to be impossible. As Hotz’s lawyer rather cogently quipped, “Mr. Hotz can’t retrieve the internet.”

Wired has posted the judge’s order here (PDF).