Monday, April 2, 2012

Gap Between Public Expectations and Archival Practice: The 1940 U.S. Census

On the day of this writing, the U.S. National Archives and Records Administration (NARA) is facing bad press for a simple reason: it underestimated the drawing power of its own archival records. The non-indexed 1940 U.S. Census, which records those who lived through the Great Depression and provides a snapshot of life the year before Pearl Harbor, was released on Monday, April 2, 2012. However, the system NARA used crashed under the huge number of hits, with no apparent plan or contingency system in place for the Internet traffic. FOX News reported that "Nearly 2 million people flocked to the site in just the first few hours after the Archives posted a searchable database of materials from the 1940 national head count" (http://www.foxnews.com/us/2012/04/02/21-million-still-alive-from-140-census/).

NARA's website and records typically do not receive heavy traffic, which I would argue reflects the limited range of digital records available online, apart from newspapers and some standard historical items from different time periods. Alexa, a website traffic reporting service, gives the following report on the traffic NARA's website has been accustomed to over the past three months: "Archives.gov is ranked #13,659 in the world according to the three-month Alexa traffic rankings. About 45% of visits to it are bounces (one pageview only). The site's visitors view 4.2 unique pages each day on average. The fraction of visits to the site referred by search engines is about 19%. The time spent in a typical visit to Archives.gov is about four minutes, with 38 seconds spent on each pageview." Most visitors to NARA's site are women over the age of 65, which those of us in the archives profession take to mean, for the most part, genealogists.

As NARA releases the 1940 census in the most wired period of world history, it has overlooked a major tenet taught in any basic digital preservation or digital curation graduate archives course: plan for all possible digital access needs and database system requirements, and anticipate the volume of users who could attempt to access the data. NARA is not used to the kind of traffic that high-volume sites such as NBC, Hulu, Amazon, eBay, or Apple handle, and that unfamiliarity is hurting its reputation with the public. What is ironic is that the National Archives is one of the standard bearers for institutions throughout the U.S. on how to model their own digital repositories and how to manage data storage. As NARA calls in a third-party vendor to fix problems with records that until yesterday were maintained as "confidential due to legal privacy restrictions," one has to wonder about the ability of archival institutions to keep pace in the digital world. Most archives debate metadata schemas while failing to plan for proper storage environments, or they manage their online content with pre-formatted archival management systems that have poor search engines. In a world where 12-year-olds can write apps that are used by millions of people, NARA's gaffe is in showing itself out of step with the digital world.

In a poorly planned approach to access, NARA selected Inflection, the five-year-old firm that runs the pay genealogy site Archives.com, along with Familysearch.org, to host the 1940 Census, because NARA does not have the digital repository space or server space to manage the collections. A lot of this has to do with the failure of NARA's 10-year program to develop a national digital repository (see earlier post on national digital repositories on this blog). Interestingly enough, Inflection also runs a search company where individuals can pay to locate current records for individuals. An Archives.com executive made a stunning comment, as reported: "John Spottiswood, executive vice president of site host Archives.com, touted worldwide availability of the massive database to millions of family researchers: 'We just hope not all at the same time.' He may not have gotten his wish" (http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-DC&month=1204&week=a&msg=KsW7eK%2By6%2BeEXO1bri3DcA&user=&pw=). For one of the biggest record releases in over half a century, the company NARA is using hopes that not everyone accesses the records at once!

Of deeper concern for young archival professionals, who have grown used to the proliferation of "cloud computing" talk in digital preservation and digital repositories, is that with cloud storage systems the data is literally just in a cloud, out of the institution's own hands. The technology is too new for the large and vital data that many institutions wish to store in such systems. Many digital curators have raised concerns over cloud storage because cloud systems from different providers and agencies are often incompatible with one another. NARA relied on Archives.com's services for this census records launch, but Archives.com uses the cloud computing system of Amazon.com, a for-profit retailer. Amazon was one of the first public cloud computing services in the world, but in its first few launches, Amazon's music cloud failed. Now, "Spottiswood said it might take time for the Amazon cloud system the site is using to accommodate all users. About three hours after the launch, the Archives blog advised: 'We are working with Amazon to get the site up to speed'" (http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-DC&month=1204&week=a&msg=KsW7eK%2By6%2BeEXO1bri3DcA&user=&pw=).

Somehow, as a young archivist, I fail to trust Amazon.com to back up the records from one of the country's most important and most anticipated census releases. Why, with all the resources invested by NARA and encouraged by government institutions such as IMLS through grants focused on digitization projects, is the U.S. government so frail when it comes to digital storage of historic records? Currently, the U.S. seems to spend more money and server space on monitoring Americans for terrorist activities than it does on its heritage. One of the founding principles of a National Archives is to make records easily and readily accessible to the public. I'm shocked that such a prominent digital curation standard bearer as NARA is not practicing what it preaches. This situation does not bode well for future digital repository systems, nor does it offer any confidence that a majority of the country's digital records will be accessible as we move deeper into the second decade of the 21st century.

1 comment:

  1. There is a difference between doing a poor job on storage/backup of digital records and having your web site crash because it got too much traffic. Yeah, it sucks that the site could not support the number of visitors on Day 1 of the 1940 census being available. But, on the other hand, I have got to applaud them for wanting to make that available, freely and digitally, on the very first day that they were legally able to do so.

    People want everything, and they want it online, and they want it free, and they want it now. You can't always have everything the way you want it. I have heard people in project management say, "Cost, schedule, or quality. Pick the 2 that are most important to you. I'll do what I can about the third."

    I must agree that they should have known that Day 1 access to the 1940 census (for free) would draw a crowd. They either grossly underestimated the size of the crowd or they did what they could (bandwidth-wise) and hoped for the best (which was apparently not good enough in this case).
