Data mining for red flags indicating corruption

Two sociologists have developed a method for finding red flags of corruption in public databases:

Researchers at the University of Cambridge have developed a series of algorithms that mine public procurement data for “red flags” — signs of the abuse of public finances. Scientists interviewed experts on public corruption to identify the kinds of anomalies that might indicate something fishy.

Their research allowed them to hone in on a series of red flags, like an unusually short tender period. If a request for proposal is issued by the government on a Friday and a contract is awarded on Monday — red flag…

Some of the other red flags identified by the researchers include tender modifications that result in larger contracts, few bidders in a typically competitive industry, and inaccessible or unusually complex tender documents…
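The study's actual models aren't reproduced here, but the logic lends itself to simple rule-based checks over procurement records. Here is a minimal sketch in Python, where the field names and thresholds are my own illustrative assumptions rather than the researchers' definitions:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Contract:
    tender_published: date   # when the request for proposals went out
    contract_awarded: date   # when the contract was awarded
    n_bidders: int           # bidders on this tender
    typical_bidders: int     # bidders this industry usually draws
    original_value: float    # value at award
    final_value: float       # value after any modifications


def red_flags(c: Contract) -> list:
    """Return the red flags raised by a single procurement record."""
    flags = []
    # Unusually short tender period (e.g., announced Friday, awarded Monday).
    if (c.contract_awarded - c.tender_published).days <= 3:
        flags.append("short tender period")
    # Few bidders in an industry that is normally competitive.
    if c.typical_bidders >= 5 and c.n_bidders <= 2:
        flags.append("few bidders")
    # Modifications that substantially enlarge the contract after award.
    if c.final_value > 1.2 * c.original_value:
        flags.append("contract grew after award")
    return flags


# Friday announcement, Monday award, one bidder, value grew 50%: three flags.
print(red_flags(Contract(date(2012, 11, 2), date(2012, 11, 5), 1, 8, 100_000, 150_000)))
```

In the research itself, flags like these are presumably aggregated into the CRI (corruption risk) scores mentioned in the quote below; the sketch above only shows the flagging step.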

“Imagine a mobile app containing local CRI data, and a street that’s in bad need of repair. You can find out when public funds were allocated, who to, how the contract was awarded, how the company ranks for corruption,” explained Fazekas. “Then you can take a photo of the damaged street and add it to the database, tagging contracts and companies.”

This is a good use of data mining: the red flags are defined up front by people who know public corruption, rather than being theoretical explanations constructed after the fact. Why not make use of such public information?

At the same time, simply finding these red flags may not be enough. I could imagine websites that track all of these findings and dog officials and candidates. Yet are these red flags proof of corruption, or just a sign that more digging needs to be done? There could be situations where officials could plausibly justify these anomalies. It could still take persistent effort and media attention to move from simply noting these anomalies to demanding a response.

The real question after the 2012 presidential election: who gets Obama’s database?

President Obama has plenty to deal with in his second term, but many people want an answer to this question: who will be given access to the campaign’s database?

Democrats are now pressing to expand and redeploy the most sophisticated voter list in American political history, beginning with next year’s gubernatorial races in Virginia and New Jersey and extending to campaigns for years to come. The prospect already has some Republicans worried…

The database consists of voting records and political donation histories bolstered by vast amounts of personal but publicly available consumer data, say campaign officials and others familiar with the operation, which was capable of recording hundreds of fields for each voter.

Campaign workers added far more detail through a broad range of voter contacts — in person, on the phone, over e-mail or through visits to the campaign’s Web site. Those who used its Facebook app, for example, had their files updated with lists of their Facebook friends along with scores measuring the intensity of those relationships and whether they lived in swing states. If their last names seemed Hispanic, a key target group for the campaign, the database recorded that, too…

To maintain their advantage, Democrats say they must guard against the propensity of political data to deteriorate in off years, when funding and attention dwindles, while navigating the inevitable intra-party squabbles over who gets access now that the unifying forces of a billion-dollar presidential campaign are gone.
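To make the description above concrete, here is a purely hypothetical sketch of what a single record in such a voter file might contain. The field names are invented from the details in the article; the campaign's actual schema is not public and certainly differs.

```python
from dataclasses import dataclass, field


@dataclass
class VoterRecord:
    voter_id: str
    voting_history: list = field(default_factory=list)    # past elections voted in
    donation_history: list = field(default_factory=list)  # past political contributions
    consumer_data: dict = field(default_factory=dict)     # publicly available consumer attributes
    contacts: list = field(default_factory=list)          # in-person, phone, e-mail, web touches
    facebook_friends: dict = field(default_factory=dict)  # friend -> relationship-intensity score
    lives_in_swing_state: bool = False
    likely_hispanic_surname: bool = False                  # flagged from the surname, per the article


# One invented record, only to show the shape of the data being described.
voter = VoterRecord(
    voter_id="VA-0012345",
    voting_history=["2008 general", "2010 midterm"],
    donation_history=[25.0],
    consumer_data={"magazine_subscriber": "yes"},
    contacts=["2012-09-14 door knock", "2012-10-02 e-mail"],
    facebook_friends={"friend_042": 0.8},
    lives_in_swing_state=True,
)
print(voter.voter_id, len(voter.contacts))
```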

The Obama campaign spent countless hours developing this database and will not give it up lightly. I imagine this could become a more common legacy for winning politicians than getting things done while in office: passing on valuable data about voters and supporters to other candidates. If a winning candidate has good information, others will want to build on it. I don’t see much mention of one way to settle this issue: let political candidates or campaigns pay for the information!

What about the flip side: will anyone use or want the information collected by the Romney campaign? Would new candidates prefer to start over or are there important pieces of data that can be salvaged from a losing campaign?

Activist charged with downloading millions of JSTOR articles

Many academics use databases like JSTOR to find articles from academic journals. However, one user violated the terms of service by downloading millions of articles and is now being charged by the federal government:

Swartz, the 25-year-old executive director of Demand Progress, has a history of downloading massive data sets, both to use in research and to release public domain documents from behind paywalls. He surrendered in July 2011, remains free on bond and faces dozens of years in prison and a $1 million fine if convicted.

Like last year’s original grand jury indictment on four felony counts, (.pdf) the superseding indictment (.pdf) unveiled Thursday accuses Swartz of evading MIT’s attempts to kick his laptop off the network while downloading millions of documents from JSTOR, a not-for-profit company that provides searchable, digitized copies of academic journals that are normally inaccessible to the public…

“JSTOR authorizes users to download a limited number of journal articles at a time,” according to the latest indictment. “Before being given access to JSTOR’s digital archive, each user must agree and acknowledge that they cannot download or export content from JSTOR’s computer servers with automated programs such as web robots, spiders, and scrapers. JSTOR also uses computerized measures to prevent users from downloading an unauthorized number of articles using automated techniques.”

MIT authorizes guests to use the service, which was the case with Swartz, who at the time was a fellow at Harvard’s Safra Center for Ethics.

It sounds like there is some disconnect here: services like JSTOR want to maintain some control over the academic content they provide even as they exist to help researchers find printed scholarly articles. These services can make big money by collating journal articles and requiring libraries to pay for access, so someone like Swartz could download a lot of the articles and then avoid paying for or using JSTOR down the road (though academic users primarily pay through institutions that pass the costs along to users). But what is “a limited number of journal articles at a time”? Using an automated program is clearly out according to the terms of service, but what if a team of undergraduates banded together, downloaded a similar number of articles, and pooled their downloads?
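The indictment doesn't describe JSTOR's "computerized measures" beyond that phrase, but a per-account sliding-window cap is one common way a service might enforce this kind of limit. A minimal sketch, where the window and cap are invented numbers rather than JSTOR's actual policy:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60 * 60   # look back one hour (assumed)
MAX_DOWNLOADS = 100        # per-account cap within that window (assumed)

_history = defaultdict(deque)   # user_id -> timestamps of recent downloads


def allow_download(user_id, now=None):
    """Record a download attempt and return True if the user is under the cap."""
    now = time.time() if now is None else now
    recent = _history[user_id]
    # Discard timestamps that have aged out of the window.
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()
    if len(recent) >= MAX_DOWNLOADS:
        return False   # over the limit: deny, throttle, or flag for review
    recent.append(now)
    return True
```

Notice that a check like this is tied to the individual account, which is exactly why the question about undergraduates pooling their downloads is a live one: each member could stay under the cap while the group collectively matches an automated bulk download.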

If we are indeed headed toward a world of “big data,” which presumably would include the thousands of scholarly articles published each year, we are likely in for some interesting battles in a number of areas over who gets to control, download, and access this data.

Another thought: does moving to open-access academic journals eliminate this issue?

If you want to model the world, look into these online databases

MIT’s Technology Review lists 70 online databases that one could look into in order to model our complex world.

Having used several of the social science databases listed here, I am impressed by several features of such databases:

1. The variety of data one can quickly find. (There is a lot of data being collected in the world today.)

2. The openness of this data to users, rather than access being restricted to the people who collected it.

3. The growing ability to do a quick analysis within the database websites.

To me, this is one of the primary functions of the Internet: making good data (information) on all sorts of subjects available to a wide cross-section of users.
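These sites don't share a single interface, but many offer CSV exports, so the quick analysis in point 3 above can also happen on your own machine. A minimal sketch, with a stand-in inline table where a real export URL would go:

```python
import io

import pandas as pd

# Stand-in for a CSV export from one of the listed databases; in practice this
# would be pd.read_csv("<export URL from the database's download page>").
raw = io.StringIO(
    "country,year,value\n"
    "US,2010,9.6\n"
    "US,2011,8.9\n"
    "DE,2010,7.0\n"
    "DE,2011,5.9\n"
)
df = pd.read_csv(raw)

print(df.describe())                           # quick numeric summary of the columns
print(df.groupby("country")["value"].mean())   # e.g., average value per country
```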

Now, with all of this data out there and available, can we do complex modeling of all of social life or natural life or Earthly life? Helbing’s Earth Simulator, mentioned in this story, sounds interesting…