Building the POYi Archive

C. Sean Burns

November 5, 2009


The report below was first written on February 3, 2009 and then modified on November 5, 2009, when I was a Master's student at the University of Missouri and learning to code. The report provides an account of the process I used to build the original POYi Archive using Omeka and various custom scripts.

Building the POYi Archive

Building the original POYi Archive using Omeka.

Part I: Manipulating and Uploading Data to the POYi Archive Omeka Database

Pictures of the Year International (POYi) has been holding annual photojournalism documentary contests since 1943 and in the process has collected nearly 40,000 photographs. Although POYi does not hold copyright to these photographs, a condition of the contest allows them, under certain circumstances, to republish the images. Over the last decade or so, they have been digitizing the older photographs and had been using some very limited, proprietary software to keep track of their collection. Their hope was to create a public archive and share these very important images with the world.

POYi Director Rick Shaw contacted Dr. Thomas Kochtanek during the fall of 2007. Dr. Kochtanek is an expert in information storage and retrieval systems and digital libraries and teaches at the School of Information and Library Science at the University of Missouri, Columbia. Over the last decade Dr. Kochtanek has led the team responsible for the successful Truman Library archive, one of the first and largest digital libraries on the Web. Because of that experience with digital libraries, Rick Shaw, seeking to build a digital library for the POYi collection, was put into contact with Dr. Kochtanek.

I began work on POYi's archive in January 2008. The Truman digital library is a custom-built site that predates many present-day content management systems (CMS). Because it is custom built, it is costly to maintain. POYi had limited funding for their project, so we focused our search on open source content management systems. Soon after we began our search, I discovered a digital library CMS called Omeka.

Omeka is an open source CMS sponsored by the Center for History and New Media at George Mason University. It is designed for collections relating to the humanities and cultural artifacts. More information as well as use case scenarios can be found on Omeka's About page.

Omeka is built with a combination of PHP 5 and MySQL 5, along with a little bit of JavaScript. The software is designed to run on Linux using the popular Apache web server. Omeka also provides industry standard Dublin Core metadata elements to describe the items in user collections. The use of these standard and open technologies made Omeka an extremely attractive solution for the POYi archive.

I began building the POYi archive during the spring of 2008. POYi uses a Red Hat server for their Web sites, and we were given a directory on this server to build the Archive. Unfortunately, the current release of Red Hat offers PHP 5.1.6, while Omeka requires PHP 5.2.x or greater. This was an issue for the initial install, but after turning to the Omeka forums, one of the Omeka developers suggested altering the verify.php file in the Omeka install/ directory. This file checks system requirements during the install and informs the user if their system does not meet them. Because PHP 5.1.6 and 5.2.x are close, I replaced the following line (install/verify.php, line 10), hoping this would work. The following code:


    if(version_compare("5.2.x", phpversion(), ">=")) {
    

Became this:


    if(version_compare("5.1.x", phpversion(), ">=")) {
    

Besides this small glitch, installing Omeka took approximately five to ten minutes. The process involved creating a database for Omeka, installing the Omeka source files in the Web directory, proceeding with the Web-based install via the browser, and then uploading content. I should note that extensive text and multimedia documentation for the installation and customization of Omeka already exists on their site.
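
To give a sense of what those steps look like, here is a rough sketch of the shell commands involved. The database name, user, password, paths, and the omeka-0.9.zip file name below are placeholders for illustration, not the values we actually used:


    # create an empty MySQL database and user for Omeka (names are placeholders)
    mysql -u root -p -e "CREATE DATABASE omeka_db CHARACTER SET utf8;"
    mysql -u root -p -e "GRANT ALL ON omeka_db.* TO 'omeka_user'@'localhost' IDENTIFIED BY 'secret';"

    # unpack the Omeka source files into the web directory
    unzip omeka-0.9.zip -d /var/www/html/archive/

    # the rest of the install (admin account, site settings, and so on)
    # happens in the browser at http://example.org/archive/install/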

I should also note that the version of Omeka we used is 0.92. Omeka has since rebuilt their software and the version numbering was reset to 0.10. As of today's date (November 5, 2009), the version is at 1.1. The rebuild has resulted in an almost entirely different application. This will have an effect on our ability to upgrade to a newer version of Omeka once the Red Hat operating system catches up and offers a version of PHP at or beyond 5.2. As of now, I have not investigated how this upgrade process will happen, although Omeka has provided some documentation on their web site that details the process.

Omeka is a modular application built using the Zend Framework. Plugins and themes are built separately from the core Omeka software. One of the most attractive and useful plugins available was the Drop Box plugin. The Drop Box plugin allowed us to upload thousands of photographs in a single sitting. Unfortunately, the Drop Box plugin did not provide a way to apply the corresponding metadata to each photograph at the same time. This was a problem: all the photos from 1943 to 2006 were now uploaded to the site within collections by year, but none of the photos were described. Faced with the unbearably time-consuming prospect of entering each photograph's metadata by hand, I quickly looked for an alternative, faster solution.

Fortunately, since Omeka is built using open source technologies, such as MySQL, the solution soon became obvious. POYi had provided all of their metadata to us in spreadsheet files. These files were exported from the proprietary software they had been using to organize their collection. Our solution involved a basic cleanup of the spreadsheet files: normalizing the data, checking for errant punctuation that would interfere with exporting them as CSV (Comma Separated Value) files, turning those CSV files into MySQL update files, and then updating the Omeka-supplied MySQL database.

The Omeka version we are using for the site has a table in the database called omeka_items (the rebuilt versions of Omeka, from 0.10 through today's 1.1, use a different database structure, and the following procedure would need to be altered to work on them; beyond that, there is nothing to prevent the same essential process, as detailed below, from being used if the database is rebuilt with a new version of Omeka). The omeka_items table contains all the Dublin Core fields, as well as some Omeka-specific fields. These fields are:


    id
    title
    publisher
    language
    relation
    spatial_coverage
    rights
    description
    source
    subject
    creator
    additional_creator
    format
    date
    type_id
    collection_id
    contributor
    rights_holder
    provenance
    citation
    temporal_coverage_start
    temporal_coverage_end
    featured
    public
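
For reference, the same column list can be pulled from a live install. The following is just an illustration; databaseUser and databaseName stand in for the real credentials, as they do in the upload command later in this report:


    # list the columns of the items table directly from MySQL
    mysql -u databaseUser -p -D databaseName -e "DESCRIBE omeka_items;"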

When we first started uploading the data to the database, we described more fields than we did for the later years. We have since disabled viewing those fields on the current Archive site because we do not feel they correctly match the Dublin Core metadata elements. In other words, the final elements we decided to describe were the basic ones: title, publisher, relation, rights, description, creator, and date. We were more confident in the quality of the data we had for those fields than we were for the others. At a later date we would like to upload some of the additional information provided to us by POYi, such as the awards won by the photographs. The types of awards have varied considerably over the years, especially as new media like the Web became popular. Trying to figure out how to normalize the fields for those various awards quickly became rather complicated, and time has been of the essence for this project.

The spreadsheets of metadata and photos supplied to us by POYi were organized by year and we built our collections and worked on the data based on this yearly division. Following is the process we used to make the conversions and upload the data. Development of this project was performed on the Ubuntu GNU/Linux distribution and the scripts used involve the Bash scripting language as well as GNU Awk and Sed. These programming / scripting languages allowed us to manipulate the text after we converted the spreadsheets to CSV files.

Our goal was to match the columns to the MySQL Dublin Core fields we were going to update. When we uploaded the images using the Drop Box plugin, each photograph title was added to the omeka_items table and assigned a primary key--an ID number. Because each photograph had a unique ID number, the process basically involved updating the table by matching the descriptive fields to the appropriate photograph's ID.

In addition to this, the vast majority of the photos provided to us by POYi were named sequentially, beginning with 1.jpg. This was very fortunate because it allowed us to easily match the photograph name to the photograph's ID in the omeka_items table. Unfortunately, the photographs were not consistently named this way. Some photographs, especially in the later years, were part of a group of related photographs and were named accordingly, e.g., 1200a.jpg, 1200b.jpg, 1200c.jpg. Since the project involved nearly 40,000 photographs, matching each photograph to the correct ID was initially the most time-consuming part of this process, but we were later able to refine our approach to make this aspect of the project proceed more quickly. The scripts provided below describe the final process rather than the various intermediary steps we took as we learned how to streamline our work. (Note: those grouped or story sets of photographs are, unfortunately, not necessarily grouped in the POYi Archive. Although a search in the Archive on a particular photographer will return all the photographs of a story, if that photographer has one, it is not obvious that the photographs are related in that way. Any future upgrade of the data should remedy this, conceptually that is, from an archival perspective.)
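
One small shortcut, offered here only as a sketch rather than the exact commands we ran, is to list any file names that are not purely numeric, since those are the story photographs that need special attention:


    # list .jpg files whose base name is not purely numeric,
    # e.g. 1200a.jpg, 1200b.jpg (the grouped story photographs)
    ls *.jpg | grep -v '^[0-9]\+\.jpg$'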

After the photographs were uploaded to the site, we needed a list of the files and their corresponding IDs. To retrieve this list, we logged into MySQL and performed the following command:


    SELECT id, title FROM omeka_items INTO OUTFILE '/tmp/fileList.txt';
    

This provided us with a two column list that looked like so:


    1   1
    2   2
    3   3
    4   4
    

And so on up to around 40,000 lines, give or take several hundred. The first column listed the database unique ID for the photograph and the second column listed the photograph's title, as supplied to us by the POYi folks. This was the master list we used to match photograph files to their corresponding IDs in the MySQL database.

Next we opened up one of the spreadsheets supplied to us. We used the OpenOffice.org Calc spreadsheet program to perform the following steps. The spreadsheets contained a lot of information we were not going to use, for reasons outlined above, so we spent some time cleaning up the data, removing that information, and organizing the rest into the columns we planned to use. Then we used the file and ID list we exported from the MySQL database above to re-sort the spreadsheet according to the file column in that list. This matched the files in the POYi spreadsheet to their proper IDs in the exported list. Being able to do this was a big time saver.

Next we had to make sure all the fields in the spreadsheets were formatted as text. We selected the entire sheet, right clicked, and formatted all the cells as text. Then we replaced all double quotes with single quotes. Since MySQL uses double quotes to delimit the data in the upload file, we had to remove any double quotes that appeared in the spreadsheet fields; for example, a photograph's description (or caption) might quote someone about something. Next we replaced all colons (:) with commas (,), because rather than exporting the spreadsheet as a Comma Separated Value file, we were going to export it as a colon-separated file. Many of the captions (often short narratives) contained commas, so exporting the spreadsheet as a standard Comma Separated Value file would have been meaningless, as the commas in the captions would create extra fields. We saved the file as a CSV file using colons as Field Delimiters and double quotes as Text Delimiters.
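
One sanity check that could be scripted at this point (a sketch, not part of our original workflow) is to confirm that every exported line contains exactly eight colon-separated fields before going any further:


    # count the colon-separated fields on each line; every line should report 8
    awk -F ':' '{ print NF }' omekaItems1.csv | sort | uniq -c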

We had previously prepared a text file listing the fields we were using along with the skeleton of the MySQL update command. We named this file simply fields. The fields file looked like this:


    set title   =
    publisher   =
    relation    =
    rights      =
    description =
    creator     =
    date        =
    where id    =
    

That set of fields matches exactly the fields in our CSV file. Next we had to repeat that set to match the number of lines in the CSV file. To find out how many lines were in the CSV file, we used the following command in Bash:


    wc -l omekaItems1.csv
    

Therefore, if there were 1,000 lines in the CSV file, we had to repeat the set in the fields file an additional 999 times. To do this we opened the fields file in the Vim text editor. Beginning with the first line, set title, we copied the first set, a total of 8 lines, by issuing this command in Vim:


    8yy
    

In this example we need to repeat this set 999 more times. With the cursor moved to the last line of the set (the where id line), we do this in Vim:


    999p
    

Now our fields file has 1,000 sets (8,000 lines total) of the above set of fields.
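
The same repetition could also be scripted instead of done in Vim; a sketch, assuming the single eight-line set is saved in a hypothetical file named fields.one:


    # write 1,000 copies of the eight-line field set into the fields file
    for i in $(seq 1 1000); do cat fields.one; done > fields
    wc -l fields    # should report 8000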

The CSV file we exported needed to be broken down into separate lines for each field. When we exported the file from the spreadsheet, each line contained all the information in the spreadsheet row: title, publisher, etc., and we needed to put each of those descriptions on its own line in order to match the lines in the fields file. Here is what a CSV line would have looked like before processing (using the column names here, but the file contained the actual data):


    "title":"publisher":"relation":"rights":"description":"creator":"date":"id"
    

We needed to put those fields on separate lines and to do so we used a very simple GNU Awk script. Here is the script:


    #!/usr/bin/gawk -f

    # split each colon-delimited record into one field per line
    BEGIN { FS = ":" }

    {print $1}
    {print $2}
    {print $3}
    {print $4}
    {print $5}
    {print $6}
    {print $7}
    {print $8}
    

This file was named parse.awk and made into an executable file. To parse the CSV file and output the results into a new file called omeka.fields, we would issue the command in the BASH shell like so:


    awk -f parse.awk omekaItems1.csv > omeka.fields

Next we had to examine the omeka.fields file for any errors; namely, sometimes the file would contain blank lines. To remove blank lines, we opened the file in the Vim text editor and issued the following command:


    :%g/^$/d
    

Also, to help ensure that we parsed the CSV file correctly, we made sure both files had the same number of lines. At the Bash shell prompt:


    wc -l fields
    wc -l omeka.fields
    

At this point we had to join the fields and the omeka.fields files together. Remember, the fields file looks like this:


    set title   =
    publisher   =
    relation    =
    rights      =
    description =
    creator     =
    date        =
    where id    =
    

And the new omeka.fields file looks like this (with the actual data in place of the field names shown below):


    title
    publisher
    relation
    rights
    description
    creator
    date
    id

And what we needed in order to create the MySQL update script was this:


    set title   = title
    publisher   = publisher
    relation    = relation
    rights      = rights
    description = description
    creator     = creator
    date        = date
    where id    = id

There is a GNU/Linux utility called paste that allowed us to join these two files together and output the result into a new file. The utility is issued from the command line and works like this:


    paste fields omeka.fields > omekaItems

In other words, the syntax is this:


    paste file1 file2 > file3

At this point, we were almost done, but MySQL requires a few more additions to this file before it will work. The MySQL UPDATE command requires the following syntax:


    UPDATE database.table
    set field  = "data",
    field      = "data",
    field      = "data"
    where id   = "ID";

In order to update our omekaItems file, we had to add a blank line above each instance where the line set title appears, and in that blank line we had to put the string UPDATE database.table. To do this we used Sed. The following commands, issued from the Bash shell prompt (or a script), did the trick:


    sed '/set title/{x;p;x;}' omekaItems > file1
    sed 's/^$/update database\.omeka_items/g' file1 > file2

The first line above adds a blank line above each instance of set title and outputs the results into a file named file1. The second line above replaces that blank line with the update information and outputs the results into a file named file2. In the second line database is the name of the database and omeka_items is the name of the table we are updating in that database (the actual database's name has been withheld for basic security reasons).

So now we have a file named file2 that looks like the example below. For clarity's sake, the following is an actual field set from the 57th Annual POYi Competition:


    UPDATE database.omeka_items
    set title = "Tornado"
    publisher = "Associate Press"
    relation = "57th Annual POYi Competition"
    rights = "All Rights Reserved"
    description = "A woman and her two children huddled together under an overpass as a tornado nears outside Newcastle, Okla., May 3rd, 1999."
    creator = "J. Pat Carter"
    date = "1999"
    where id = "30343"

This photograph can be viewed here: http://archive.poyi.org/items/show/30343.

For this MySQL script to work, one last thing remained: the MySQL UPDATE syntax requires commas after most of the lines above (every line except the second-to-last, the date line) and a semicolon after the final line. A Sed script solved this for us. Our script, named sed.comma, looks like this:


    #!/bin/bash

    # append a comma to every assignment line except the date line,
    # and a semicolon to each "where id" line; $1 is the file to process
    sed -e "/title =/s/$/,/g" -e\
    "/publisher =/s/$/,/g" -e\
    "/relation =/s/$/,/g" -e\
    "/rights =/s/$/,/g" -e\
    "/description =/s/$/,/g" -e\
    "/creator =/s/$/,/g" -e\
    "/where id =/s/$/;/g" $1

The $1 at the end lets us pass the file to process as an argument to the script. The file we run this script on is called file2, and at a Bash shell prompt, after making the script executable, we issue the following command:


    ./sed.comma file2 > omekaFinal1.sq
    

This outputs the results into a file named omekaFinal1.sq and the set above becomes this:


    UPDATE database.omeka_items
    set title = "Tornado",
    publisher = "Associate Press",
    relation = "57th Annual POYi Competition",
    rights = "All Rights Reserved",
    description = "A woman and her two children huddled together under an overpass as a tornado nears outside Newcastle, Okla., May 3rd, 1999.",
    creator = "J. Pat Carter",
    date = "1999"
    where id = "30343";

One final step remained before we updated our database. Sometimes the data we received from POYi was incomplete; some photographs, for example, did not have a description. If we tried to update the database with empty fields in our file, MySQL would throw an error. Therefore the last step involved opening the omekaFinal1.sq file in the Vim text editor, searching for empty fields, and replacing them with the "N/A" string. Example Vim code looks like this:


    :%s/description =\t,/description = "N\/A",/g
    

After that was completed, the next step involved uploading the files to the database. These files were often quite a few megabytes in size (the total line count for all the years described came to about 360,000), so we compressed them with bzip2 like so:


    bzip2 omekaFinal1.sq
    

And then we transferred them to the Web server:


    scp omekaFinal1.sq.bz2 user@poyi.org:~/
    

After the file was transferred, we would log into the POYi server using a Secure Shell:


    ssh user@poyi.org
    

And decompressed the file with bunzip2 and updated the database with the following command:


    mysql -u databaseUser -p -D databaseName < omekaFinal1.sq
    

Then we checked the web site to make sure all was well.
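
Beyond eyeballing the site, a quick spot check against the database is another option. This is only a sketch, reusing the placeholder credentials above and the ID of the example photograph:


    # confirm the metadata landed where we expected it
    mysql -u databaseUser -p -D databaseName \
      -e "SELECT title, creator, date FROM omeka_items WHERE id = 30343;"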

Part II: Redesigning the Theme & Miscellany

Omeka and various outside contributors have provided several themes for public use. These themes alter the design and layout of the public interface. One of the themes offered is called From Scratch. This theme provides a basic template for customization, and all the CSS elements, corresponding to the HTML tags in a default Omeka install, are provided in a file called "screen.css". Most of the customization and presentation can be altered by adding CSS declarations to this file.

POYi director Rick Shaw wanted the Archive site to match their regular site as closely as possible. This primarily involved modifying the color scheme and basic layout. Most of this was accomplished by modifying the CSS.

Other modifications required some PHP coding or alteration. For example, users may browse items by clicking on a range of pagination links at the top of each photograph's page. The default range is five: users visiting the site may start on page one and at most skip ahead to page five, or, if they are on page seven, skip ahead to page twelve. Since the POYi Archive contains so many photographs, we felt it would be better to provide a longer range, so we extended it from five to twenty.