Searching through millions of paper records of guns, modern crime fighting, and large-scale societies

Key to identifying the man who shot at Donald Trump was a large set of paper records:


Bureau of Alcohol, Tobacco, Firearms and Explosives analysts at a facility in West Virginia search through millions of documents by hand every day to try to identify the provenance of guns used in crimes. Typically, the bureau takes around eight days to track a weapon, though for urgent traces that average falls to 24 hours…

In an era of high-tech evidence gathering, including location data and a trove of evidence from cell phones and other electronic devices used by shooting suspects, ATF agents have to search through paper records to find a gun’s history.

In some cases, those records have even been kept on microfiche or were held in shipping containers, sources told CNN, especially for some of the closed business records like in this case.

The outdated record-keeping system stems from congressional laws that prohibit the ATF from creating searchable digital records, in part because gun rights groups for years have fanned fears that the ATF could create a database of firearm owners and that it could eventually lead to confiscation.

But the urgent ATF trace Saturday proved indispensable in identifying the Pennsylvania shooter, giving authorities a key clue toward his identity in less than half an hour.

On one hand, searching through paper records could appear to be inefficient in the third decade of the twenty-first century. In today’s large-scale societies and systems, the ability to quickly search and retrieve digital records is essential in numerous social and economic sectors.

On the other hand, a large set of paper records is a reminder of the relatively recent shift humans have made to adjust to large populations, in this case specifically in addressing crime. I recently read The Infernal Machine, a story about dynamite, anarchists at the turn of the twentieth century, and the police methods that developed to address the threat of political violence. These changes included systems of records for identifying suspects, such as keeping fingerprints or photos on file.

More broadly, the development of databases and filing systems helped people and institutions keep up with the data they wanted to collect and access. To do fairly basic things in our current world, from getting a driver’s license to voting to accessing health care, requires large databases.

Data mining for red flags indicating corruption

Two sociologists have developed a method for finding red flags of corruption in public databases:

Researchers at the University of Cambridge have developed a series of algorithms that mine public procurement data for “red flags” — signs of the abuse of public finances. Scientists interviewed experts on public corruption to identify the kinds of anomalies that might indicate something fishy.

Their research allowed them to home in on a series of red flags, like an unusually short tender period. If a request for proposal is issued by the government on a Friday and a contract is awarded on Monday — red flag…

Some of the other red flags identified by the researchers include tender modifications that result in larger contracts, few bidders in a typically competitive industry, and inaccessible or unusually complex tender documents…

“Imagine a mobile app containing local CRI data, and a street that’s in bad need of repair. You can find out when public funds were allocated, who to, how the contract was awarded, how the company ranks for corruption,” explained Fazekas. “Then you can take a photo of the damaged street and add it to the database, tagging contracts and companies.”

This is a good use of data mining as it doesn’t require theoretical explanations after the fact. Why not make use of such public information?
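To make the approach concrete, here is a minimal sketch of how a screen for a few of the red flags named above (a very short tender window, a thin bidder pool, contract-enlarging modifications) might look in code. The field names and thresholds are my own assumptions for illustration, not the Cambridge team's actual algorithms or the published CRI criteria.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical procurement record; real public data would carry many more fields.
@dataclass
class Contract:
    contract_id: str
    tender_opened: date
    tender_closed: date
    bidder_count: int
    awarded_value: float
    final_value: float

def red_flags(c: Contract,
              min_tender_days: int = 7,
              min_bidders: int = 3,
              max_overrun: float = 1.10) -> list[str]:
    """Return the names of any red flags this contract trips.

    Thresholds are illustrative guesses, not the researchers' actual cutoffs.
    """
    flags = []
    if (c.tender_closed - c.tender_opened).days < min_tender_days:
        flags.append("short_tender_period")   # e.g., issued Friday, awarded Monday
    if c.bidder_count < min_bidders:
        flags.append("few_bidders")           # thin competition in a normally competitive field
    if c.final_value > c.awarded_value * max_overrun:
        flags.append("contract_enlarged")     # modifications inflated the price
    return flags

# Quick usage example on a made-up record.
example = Contract("2024-0117", date(2024, 3, 1), date(2024, 3, 4), 1, 500_000, 610_000)
print(example.contract_id, red_flags(example))
# -> 2024-0117 ['short_tender_period', 'few_bidders', 'contract_enlarged']
```

The appeal of this kind of screen is that it runs over data governments already publish; the hard part, as the researchers note, is deciding which anomalies actually matter.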

At the same time, simply finding these red flags may not be enough. I could imagine websites that track all of these findings and dog officials and candidates. Yet are these red flags proof of corruption, or just a sign that more digging needs to be done? There could be situations where officials could plausibly justify the anomalies. It would likely still take persistent effort and media attention to move from noting these anomalies to demanding a response.

The real question after the 2012 presidential election: who gets Obama’s database?

President Obama has plenty to deal with in his second term, but many people also want an answer to this question: who will be given access to the campaign’s database?

Democrats are now pressing to expand and redeploy the most sophisticated voter list in American political history, beginning with next year’s gubernatorial races in Virginia and New Jersey and extending to campaigns for years to come. The prospect already has some Republicans worried…

The database consists of voting records and political donation histories bolstered by vast amounts of personal but publicly available consumer data, say campaign officials and others familiar with the operation, which was capable of recording hundreds of fields for each voter.

Campaign workers added far more detail through a broad range of voter contacts — in person, on the phone, over e-mail or through visits to the campaign’s Web site. Those who used its Facebook app, for example, had their files updated with lists of their Facebook friends along with scores measuring the intensity of those relationships and whether they lived in swing states. If their last names seemed Hispanic, a key target group for the campaign, the database recorded that, too…

To maintain their advantage, Democrats say they must guard against the propensity of political data to deteriorate in off years, when funding and attention dwindles, while navigating the inevitable intra-party squabbles over who gets access now that the unifying forces of a billion-dollar presidential campaign are gone.
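As a rough illustration of what a single row in such a voter file might look like as a data structure, here is a tiny, purely hypothetical sketch. All of the field names are my own invention, not the campaign's actual schema, and the real file reportedly held hundreds of fields per voter.

```python
from dataclasses import dataclass, field

# Purely hypothetical voter record, modeled loosely on the kinds of data the article describes.
@dataclass
class VoterRecord:
    voter_id: str
    name: str
    state: str
    vote_history: list[str] = field(default_factory=list)        # e.g., ["2008 general", "2010 midterm"]
    donation_history: list[float] = field(default_factory=list)  # dollar amounts from public filings
    consumer_segments: list[str] = field(default_factory=list)   # purchased commercial-data tags
    facebook_friends: list[str] = field(default_factory=list)    # added if the voter used the campaign app
    friend_intensity: dict[str, float] = field(default_factory=dict)  # relationship-strength score per friend
    in_swing_state: bool = False
    likely_hispanic_surname: bool = False                         # flagged from the surname, per the article

record = VoterRecord("VA-000123", "Jane Doe", "VA")
record.facebook_friends.append("OH-004567")
record.friend_intensity["OH-004567"] = 0.8
```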

The Obama campaign spent countless hours developing this database and will not let it go lightly. I imagine this could become a more common legacy for winning politicians than getting things done while in office: passing on valuable data about voters and supporters to other candidates. If a winning candidate had good information, others will want to build on it. I don’t see much mention of one way to solve this issue: let political candidates or campaigns pay for the information!

What about the flip side: will anyone use or want the information collected by the Romney campaign? Would new candidates prefer to start over or are there important pieces of data that can be salvaged from a losing campaign?

Activist charged for downloading millions of JSTOR articles

Many academics use databases like JSTOR to find articles from academic journals. However, one user violated the terms of service by downloading millions of articles and is now being charged by the federal government:

Swartz, the 25-year-old executive director of Demand Progress, has a history of downloading massive data sets, both to use in research and to release public domain documents from behind paywalls. He surrendered in July 2011, remains free on bond and faces dozens of years in prison and a $1 million fine if convicted.

Like last year’s original grand jury indictment on four felony counts, (.pdf) the superseding indictment (.pdf) unveiled Thursday accuses Swartz of evading MIT’s attempts to kick his laptop off the network while downloading millions of documents from JSTOR, a not-for-profit company that provides searchable, digitized copies of academic journals that are normally inaccessible to the public…

“JSTOR authorizes users to download a limited number of journal articles at a time,” according to the latest indictment. “Before being given access to JSTOR’s digital archive, each user must agree and acknowledge that they cannot download or export content from JSTOR’s computer servers with automated programs such as web robots, spiders, and scrapers. JSTOR also uses computerized measures to prevent users from downloading an unauthorized number of articles using automated techniques.”

MIT authorizes guests to use the service, which was the case with Swartz, who at the time was a fellow at Harvard’s Safra Center for Ethics.

It sounds like there is some disconnect here: services like JSTOR want to maintain some control over the academic content they provide even as they exist to help researchers find printed scholarly articles. Services like JSTOR can make big money by collating journal articles and requiring libraries to pay for access. Thus, someone like Swartz could download a lot of the articles and then avoid paying for or using JSTOR down the road (though academic users primarily pay through institutions that pass the costs along to users). But what is “a limited number of journal articles at a time”? Using an automated program is clearly out according to the terms of service, but what if a team of undergraduates banded together, downloaded a similar number of articles, and pooled their downloads?
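The indictment does not spell out how JSTOR's “computerized measures” worked, but a limit like “a limited number of journal articles at a time” is often enforced as a per-user counter over a sliding time window. Here is a minimal sketch under that assumption; the quota and window values are placeholders, not JSTOR's actual policy.

```python
import time
from collections import defaultdict, deque

# Hypothetical per-user download limiter; JSTOR's real measures are not public.
class DownloadLimiter:
    def __init__(self, max_downloads: int = 100, window_seconds: int = 3600):
        self.max_downloads = max_downloads   # "limited number ... at a time" (assumed value)
        self.window = window_seconds
        self.history = defaultdict(deque)    # user_id -> timestamps of recent downloads

    def allow(self, user_id: str) -> bool:
        now = time.time()
        timestamps = self.history[user_id]
        # Drop downloads that have aged out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_downloads:
            return False                     # over the quota: block or flag the request
        timestamps.append(now)
        return True

limiter = DownloadLimiter(max_downloads=3, window_seconds=60)
print([limiter.allow("guest") for _ in range(5)])   # [True, True, True, False, False]
```

A per-account window like this also makes the undergraduate scenario above concrete: it does nothing about many accounts each staying under the limit.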

If we are indeed headed toward a world of “big data,” which presumably would include the thousands of scholarly articles published each year, we are likely in for some interesting battles in a number of areas over who gets to control, download, and access this data.

Another thought: does moving to open-access academic journals eliminate this issue?

If you want to model the world, look into these online databases

MIT’s Technology Review lists 70 online databases that one could look into in order to model our complex world.

Having used several of the social sciences databases listed here, I am impressed with several features of such databases:

1. The variety of data one can quickly find. (There is a lot of data being collected in the world today.)

2. The openness of this data to users rather than being restricted just to the people who collected the data.

3. The growing ability to do a quick analysis within the database websites.

To me, this is one of the primary functions of the Internet: making good data (information) on all sorts of subjects available to a wide cross-section of users.
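As a small example of what that quick analysis can look like once a dataset is open, here is a sketch using pandas to pull a table from one of these databases and summarize it. The URL and column names are placeholders I made up, not any particular database's real schema; you would substitute the CSV export link and columns of whichever database you are exploring.

```python
import pandas as pd

# Placeholder URL: swap in the CSV export link of an open database from the list.
DATA_URL = "https://example.org/open-database/indicator.csv"

# Many open databases expose their tables as CSV, so a first look takes only a few lines.
df = pd.read_csv(DATA_URL)

# Hypothetical columns: one row per country-year with a numeric indicator value.
summary = (
    df.groupby("country")["value"]
      .agg(["mean", "min", "max"])
      .sort_values("mean", ascending=False)
)
print(summary.head(10))
```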

Now, with all of this data out there and available, can we do complex modeling of all of social life or natural life or Earthly life? Helbing’s Earth Simulator, mentioned in this story, sounds interesting…