Hi Class, this week we're going to talk about metadata - which means information about information -- or the process of describing information.
We're also going to talk about annotating the web, but I'll create a second video for that.
In order to say something about what it means to describe information, we should probably say something about information itself -- and how exactly it's defined.
It turns out that defining the term 'information' in a way that makes the term robust and easy to research is difficult, but for a moment let's define information simply. Let's say that information is a proposition that is true. Although this is a simplistic definition, as a concept it captures an important aspect of what we think information is. An example of this definition in action is in this proposition:
That statement has a binary truth value -- it's either true or false, and so it's an example of a propositional statement that is true -- and because it's true, it's informative.
So what is metadata?
When we create metadata, are goal is to create data about data -- or simply describe information. For example, one of my favorite books is Larry McMurtry's Lonesome Dove. When we start to describe this book, we could focus on the book's title, Lonesome Dove, or the author, Larry McMurtry, or the year it was published, how many pages are in the book, the main characters, and so forth. If were were to collect all this information and place it in a record, we'd have metadata.
Source: The Alamo
The author of that book is Toni Morrison, so Toni Morrison is author metadata for that book.
Source: Tone Morrison
Metadata should also contain location or identifier information. In the past, for a book, we might be interested in the book's call number if we were looking for that book at a library. That call number would act as location or identifier metadata. We might be still interested in call numbers today if we're interested in finding a physical book at a library, but often we're also interested in locating the book on the web, and so a URL may also function as identifier or location metadata.
Thus the URL for that book's record is [http://www.worldcat.org/oclc/9342768][http://www.worldcat.org/oclc/9342768].
So, that URL is location metadata about that book's record on the internet.
The interesting thing about this URL is that it is a permanent URL. Web pages, as you know, often change, update, or get deleted. Everyone, I'm sure, has seen the 404 page not returned error while web browsing. But this is a permanent URL by a named authority, worldcat.org, so this URL is forever, theoretically, as long as the internet exists. Therefore it exists as a location string and a identifier string since it will always point to this record for Agatha Christie's book, The Orient Express.
Source: Agatha Christie
There's another way to think about this. Each of those examples can be thought of as a relationship between an element or variable and a value contained by the element or variable. If you know something about computer programming, then this is all it is:
title = "lonesome dove"
where title is the variable's name (the element) and "lonesome dove" is the stored value. In many cases, the variable names are fixed. So there will be a scheme where title is always that variable's name, creator is always that variable's name, date is always that variable's name, and so forth. So you cannot just create new variable names for these elements, and do things like replace the term 'date' with some other term, like time, or 'creator' with some other term, like 'author'.
So where is metadata? Well, it's all over the place. The header information in an email contains information about the email, and so it contains metadata. E.g.:
These metadata play important roles in administering information and communication on the web -- this is because computers need to process the sending and receiving of emails, and metadata helps this happen.
Metadata is automatically created simply by our use of a lot of different technologies. Social media, email, and other ICTs automatically generate a lot of metadata in the course of operation. As soon as we send or compose an email or tweet, upload a photo to Instagram, post text or a photo to Facebook, or any of a number of social media applications, metadata about that use is created and tied to our identity.
Thus, metadata is social. E.g., as well as the tweet's content, each tweet includes this kind of metadata:
We generate tons of metadata in the process of using the web. As many of you know, this is related to privacy and political issues. Think of, as an example, the recent Facebook / Cambridge Analytica stories, as well as the collection of identity information by Google, purchasing data by Amazon, etc. Many other major websites/companies use this kind of metadata as a part of their business plans.
Metadata has utility. In order for many of these to function:
So we need metadata to classify, categorized, and process all the different kinds of informational objects that we have.
And as I already mentioned, there are privacy issues with metadata that I want you to be aware of, as I'm sure many of you are. So:
Read about this interesting by sociologist Kieren Healy, and how he used metadata from real historical records to show that if the British had the technology and methods we have today, they could have identified Paul Revere as a Patriot. It's a fun read, but it's also a kind of warning. It identifies the need to have a discussion about the privacy protections that we want and need in order to function as a democracy, and how much privacy we should give up to have the kinds of tools that we find useful or enjoyable.
Source: Paul Revere
Metadata is created purposefully too. Here is an example of a record for Agatha Christie's Murder on the Orient Express, which I linked to above when referring to identifier/location metadata. This is a screen shot of the catalog record for this book from worldcat.org, which itself is an online catalog for 1000s of library holdings across the US and other nations.
We see that the author metadata is Agatha Christie. Publisher metadata is Bantam Books and the publication date is 1983 and copyright date is 1960. These dates may be somewhat misleading, as the former date indicates the printing of this edition and the latter date indicates when the copyright was renewed and not the original publication date.
There are other metadata fields to be aware of. The Summary field displays a short summary or description of the work. The Subjects field highlights three terms that are used to describe what the books is about: one on the fictional main character, Poirot. One titled Private Investigators, and the other titled Orient Express. Notice that these are all hyperlinked, such that they function as access points to other works that share those subject terms. As such, these subject links do the work of collating all other works that share one of these subject headings. In a sense, they function as a kind of digital file cabinet.
Dublin Core is an element set created to describe web objects, probably one of the most commonly used element sets to do so, but it can also be used to describe physical objects.
Dublin Core uses 15 metadata elements to help us describe information. Two things I need to point out is that these elements are either optional or repeatable. If there is more than one authors for a work, then you repeat that variable. If you're familiar with any kind of computer programming, you'd often assign multiple values to a variable by creating a list, but here we do not assign a list to elements. Also, these elements are generally all optional, such that you may not need a Publisher element for a work or a Relation element, and so forth.
The definition of the contributor element is "an entity responsible for making contributions to the resource," where resource refers to a work of some type, like a book, a website, a painting, a photograph, a music recording, and so forth. A more specific example would be a person who contributed to the creation of a work, like a music or a TV producer contributes to the creation of a song, an album, or a TV show. That person would be said to have contributed to the work, or the resource.
The definition of the coverage element is "the spatial or temporal topic of the resource ...." Let's say there's a song and the song is about Tokyo. This is spatial topic. The TV show Downton Abbey is about the end of the Edwardian era. This is a temporal topic, but the show also has a spatial component -- Britain.
The definition of the creator element is "an entity or person responsible for making the resource." This is an element meant to broadly capture a number of functions, such as author, painter, artist, musician, composer, etc. So think of creator as a kind of a term that subsumes all these different aspects of what it means to be a creator. For example, Julian Fellowes is one of the persons who created the show Downton Abbey, and thus if I were to describe the TV show Downton Abbey, I would assign Fellowes to the creator field, as well as anyone else who was a primary creator of the show.
The definition of the date element is "a point or period of time associated with an event in the life cycle of the resource." The date can include a range, e.g., 2010 through 2013, but also specify a single point, such as September 26, 2010, which is the start of the show Downton Abbey.
The description element is defined as "an account of the resource," which functions as a kind of a summary. E.g., from Wikipedia, the introduction section of the article on Downton Abbey provides a short summary of the show. This summary could be used in the Description element if we were creating a Dublin Core record for the show, although we wouldn't copy it without also attributing the source of the description to Wikipedia.
Here we refer to "the file format, [the] physical medium, or [the] dimensions of the resource." In this case, you should think of a PDF file, a HTML file for web pages, a DOCX file for Microsoft Word documents, JPG and PNG for photo files. In the case of a physical object, like a printed book, think of the actual dimensions of the book: 23cm x 15cm.
I've mentioned the identifier element before. This is defined as "an unambiguous reference to the resource within a given context." Here I want you to think about a kind of ID number. In the case of a book, whether printed or online, think of the book's ISBN number, which you can find on the opposite side of its title page (called the verso, if you're interested). Or think of an ISSN number for a journal or a magazine, or more broadly, a periodical. A DOI number is used to identify a specific scholarly article.
Some of these identifiers (not all are listed) can be converted to URLs. E.g., if you prefix http://dx.doi.org/ to a DOI that is 10.2002/ece3.1480, you get http://dx.doi.org/10.2002/ece3.1480, then you'll be taken to the web page for that scholarly article.
The definition here is pretty straightforward. The language element is defined as "a language of the resource." E.g., since the TV show Downton Abbey is in English, you'd assign the value 'English' to the Language element. You could create additional Language elements if you were creating a record of the show if you were describing a version that was dubbed in another language, such as Spanish.
The publisher element is the "entity responsible for making the resource available." E.g., for a book, the publisher Harcourt & Brace would be the element value for any book that was published by this company. To continue our example, Downton Abbey has a distributor that we might use for the publisher element, and this distributor is PBS Distribution. Some books or works, such as films, could have multiple publishers or distributors, especially if the work has an international market.
The relation element is a difficult element to kind of wrap your head around, not because it's intellectually difficult to understand, but because it could be better defined. But a good example of something you would describe as relational would be supplemental information, such as a textbook that includes a DVD with multimedia, or a booklet that comes with a music CD.
The rights element refers to "information about rights held in and over the resource." This is specifically referring to intellectual property, such as trademarks, patents, and copyrights.
This element could include a date, or a short description, such as the "copyrighted in X year," or you could state here that the work is in the public domain, if it is, or has a Creative Commons license, if it does.
The definition of the source element is "a related resource from which the described resource is derived."
For example, with books you've probably seen new editions that contain new information or perhaps a new preface, then this is a case where you could use the Source element -- to describe the original source the derived one comes from. I have a copy of a work that was released a couple years ago and it contains a new preface, but otherwise, the main text is the same. In this case, I could use the Source element to refer to the original work that was published decades earlier.
I pointed out the subject issue earlier, when we were talking about the Murder on the Orient Express book. There we had three subject headings referring to the main fictional character of the book, the role that character played in the book (private investigator), and the location or setting of the book, the train Orient Express. Those subjects are a way to describe what the object is about but also help collate all resources that have the same subjects. In order to accomplish this, subject terms are often controlled. E.g., instances of this kind of controlled vocabulary include the Library of Congress Subject Headings. Here are two examples. The term Clouds is a subject heading in the Library of Congress Subject Headings list. If you go to InfoKat, the UK Libraries online catalog, and do a subject search for Clouds, you'll pull up all works that share that subject. Cardiovascular abnormalities is a MeSH subject heading for that topic. MeSH, standing for Medical Subject Headings, is a derivation of the LCSH -- it is like the LCSH, just more specific for medical subject matters. Controlled vocabularies are maintained by committees and organizations of experts. They oversee these list of terms, that added to thesauri -- and they constantly revise these terms -- add new ones when relevant, remove old ones when they go out of style or out of use.
We also see a lot of uncontrolled vocabulary. When you tag someone on Facebook, Twitter, Instagram, you are creating a kind of subject term for that Facebook post, that tweet, or that photo. When you click on those tags, you'll pull up all other posts, tweets, photos that share that same tag. Think of a common example, such as the throwback Thursday tag that's used across all three of these sites.
The definition of the title element is "a name given to the resource." This is fairly straightforward. In our previous example, Downton Abbey is the title of the TV show by that name, and so this is what you'd use in the title element.
The type element is defined as "the nature or genre of the resource." This is a little bit comparable to the subject element, but it's also a little different. So for type you can think of: a piece of fiction, or more specifically, science fiction or children's literature. If a film, perhaps it's a documentary. If music, perhaps it's a punk rock song. Also think of things like biographies, cartoons, or festschrifts, which refers to works that are dedicated to a living person, such as a book composed of 15 chapters dedicated to a person and her artwork, her scholarship, or political contributions, or other contribution to society.
Dublin Core can be implemented in lots of web languages, such as XML, XHTML, and so forth. Here's an example of a Dublin Core record that I created on the fly. It refers to Larry McMurtry's book Lonesome Dove and I used XML markup to code the record. So you see things like dc.title to refer to the title of the book. dc.creator to refer to the author. You see that I repeated the dc.identifier element twice. One identifier refers to the ISBN for the book. The other one refers to the permanent URL that the Library of Congress uses for their record of the book. You also see that I repeated the use of the dc.format element -- one is dedicated to the number of pages of the book and the other is dedicated to the length of the printed book, specifically, this instance of the book. And so forth.
Dublin Core can be fairly straightforward. Unlike librarians, who spend several years mastering the techniques involved in cataloging works, Dublin Core was designed to be friendly to general practitioners. This is not to say that it doesn't take some practice, but overall, it is fairly user friendly.