Recovery.gov is the U.S. government’s official website that provides easy access to data related to Recovery Act spending and allows for the reporting of potential fraud, waste, and abuse. My AOL colleague Richard Walker wrote recently about how Recovery.gov “Shows The Power Of Transparency In Tracking Federal Spending” since the Recovery Accountability and Transparency Board [RAT Board] has provided “a commendable model of transparency… the tremendous success of the RAT Board is worthy of replication throughout the federal bureaucracy.”
He also mentions how the proposed Digital Accountability and Transparency Act of 2011 (DATA Act) would establish consistent data elements and standards for federal financial information to ensure that reported information is comparable and reliable. He adds that recipient reporting through federalreporting.gov is the most cutting-edge feature of the transparency process and should be an integral part of federal spending accountability.
The Recovery.gov White Paper claims “the result is the current incarnation of Recovery.gov-which, as anyone who has spent significant amounts of time scouring government websites for information will tell you-is perhaps the clearest, richest interactive database ever produced by the American bureaucracy.”
I say good, but first show us all of the data so we can do some real data analysis.
Since the quality and completeness of this data has been questioned, I decided to see for myself by assessing the amount of missing data in each of the 98 columns and by agency.
The scatterplot in the illustration above, created in a data analytics package called Spotfire, shows that a considerable number of columns have far less data than one would expect. (The actual chart extends the plot vertically to more than 500,000 reports, which shows that all reports have at least some data.)
The X axis represents the 98 data element columns that agencies are supposed to report, such as funding agency name, fiscal year, recipient name, recipient zip code, etc. The Y axis represents the individual reports, one per row, for each of the 517,589 filings.
Complete data reporting would show a solid line across the top of the graph, while no reporting would show a solid line across the bottom of the graph. There is more of the latter than the former, so show us all of the missing data. The complete data dictionary is given here.
Fully complete data would be a straight line across the top of this graph at 517,589 rows of data. The actual counts are shown in the Data Elements and Number of Rows table in my Spotfire summary. The filters can be used to select an agency like EPA and see the “count of rows by column” in the summary table. One can then reset the filters and select another agency.
In trying this, one sees that only eight of the 98 columns of data elements have been completed for all 517,589 filings (or rows of data). For example, my former agency, EPA, filed 956 reports (rows of data) but filled in only 47 of the 98 data elements (columns).
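For readers who would rather check this kind of count without Spotfire, the same "count of rows by column" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows are made up, and the column names (such as funding_agency_name) are placeholders modeled on the data elements mentioned above, not the actual headers in the Recovery.gov download.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions, not the real schema.
SAMPLE_CSV = """funding_agency_name,fiscal_year,recipient_name,recipient_zip_code
EPA,2010,Acme Water,
EPA,2010,,20001
DOE,2011,Solar Co,94105
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def filled_counts(rows, columns):
    """For each column, count the rows that hold a non-blank value."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is not None and value.strip():
                counts[col] += 1
    return counts

def complete_columns(rows, columns):
    """Columns filled in on every row (the solid line across the top)."""
    counts = filled_counts(rows, columns)
    return [col for col in columns if counts[col] == len(rows)]

def used_columns(rows, columns):
    """Columns filled in on at least one row."""
    counts = filled_counts(rows, columns)
    return [col for col in columns if counts[col] > 0]

rows = load_rows(SAMPLE_CSV)
columns = list(rows[0].keys())

# Rough equivalent of setting the agency filter before reading the summary table.
epa_rows = [r for r in rows if r["funding_agency_name"] == "EPA"]

print(filled_counts(rows, columns))
print(complete_columns(rows, columns))
print(len(epa_rows), len(used_columns(epa_rows, columns)))
```

Against the real file, one would replace the sample text with the downloaded data and the placeholder column names with those in the data dictionary. A file of roughly 300 MB may be better read row by row than loaded into a list as done here.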
This application succeeds by bringing all of the data into memory on the Web, something the Recovery.gov Web application does not do, so that it can be visualized, sorted, and searched. The data file is about 300 MB and the Spotfire file is about 110 MB.
Brand Niemann is a former senior enterprise architect, U.S. Environmental Protection Agency, and is director and senior data scientist, Semantic Community. He previously built Recovery.gov in the Cloud, where users can enter their ZIP code and get results specific to their location.