date: 2013-06-24 07:56
Outline of this post:
Note: Code beginning with a dollar sign ($) signifies that the command was carried out in a Linux terminal / shell (BASH). Otherwise, code was carried out in the text editor Vim 1).
Purpose: The purpose of this post is to begin outlining how to examine higher education salaries – especially salaries paid to administration, faculty, and adjuncts. The eventual goal is to compare these salaries against each other (in a later analysis related to scholarly communication issues). Since this post outlines how to do this with salaries at the University of Missouri and is based on the University of Missouri personnel information provided by the Secretary of State's office, the code and process below cannot necessarily be generalized to other universities. Each state likely provides its public information in different formats, although I haven't examined this. This post primarily serves as a note to self. Most of what is done in Vim below may eventually be automated with a scripting language, such as sed and/or awk. But this is an exploration and Vim provides a nice interface for this.
Start by downloading the public PDF document with University of Missouri personnel information for the 2011 - 2012 year, the latest available data at the time of this writing.
$ wget http://www.sos.mo.gov/BlueBook/2011-2012/10_Personnel.pdf#personnel
This document contains name, position, and salary information. Although it is organized as Last Name, First Name, Position, and Salary, the fields are not so neat. Some persons have more than one position or title and some individual names are more than two words long. So this file will require some cleaning before it can be analyzed. Also, I will eventually remove the first and last names from the file as they are not needed.
First, convert the downloaded PDF to a text file. In this file, the University of Missouri information begins on page 87 and ends on page 151, and the Linux command
pdftotext can cut that specific part out:
pdftotext -f 87 -l 151 10_Personnel.pdf personnel.txt
The MU information begins somewhere around the middle of the left column on the first page and ends partway through the last page. I delete the personnel information from the other institutions that are carried over on these two pages so just MU information is contained in this file. Also, I delete the section titles that mark the beginning of the MU information section and are now at the top of the document (i.e., the text is: University of Missouri, Columbia 65211).
pdftotext command makes a good attempt at making a 1 to 1 copy of the PDF file (but fortunately, it collapses the multi-column PDF into a single column text file) and so the converted text file has page numbers and page headers in it. These are on their own lines. The page numbers begin with a
^L (control L character).
Excerpt (names removed to avoid focusing on any specific individual in this post):
^L804 OFFICIAL MANUAL Last name, First name, resident phys-4th yr, $52,012 Last name, First name, lecturer, $44,502 Last name, First name, admin asst, $26,977 Last name, First name, prof, assoc teach, $50,043 Last name, First name, clincl imprvmnt spclst, $50,128
In Vim, I remove these control characters and page numbers with the following code. Note that I start the substitution with a caret ^ key, which indicates the beginning of a line, and then press *CONTROL-V CONTROL-L* in order to enter the *^L* character in the substitution string.
Salaries are in the last column, but some of these salaries are carried over to the next line and sit by themselves. They all begin with a dollar sign.
Last name, First name, res phys-1st yr, $46,478
Search for these, by escaping the dollar sign, and back them up on the line they belong to. There may be a way to automate this, but right now, as I develop an intuition about this file and its structure, I do this manually. The Vim search string for locating these lines is:
Excerpt, the above line becomes:
Last name, First name, res phys-1st yr, $46,478
The header lines from the PDF need to be removed. There are two of them and the Vim code to remove them is:
:%s/^UNIVERSITY OF MISSOURI// :%s/^OFFICIAL MANUAL//
Next I remove all empty lines:
Some lines are still carried over and they all seem to begin with a lower case letter. Search for all lines beginning with a lower case letter. This search and replace has to be done manually because I'm not yet sure if there's any regularity to this. Nor do I know how to write the kind of code that could automate backing up a line. In any case, in the Vim code below, the
^\l searches for all lines that begin with a lower case letter.
The file is now mostly cleaned. Since the next step involves deleting some columns, I save the file as a master copy and work with a new copy that I save as a comma separated value file (CSV).
$ cp personnel.txt personnel.csv
There's still some cleaning up to do. The end game is to have a two column file with Positions in the first column and Salaries in the second column, and so I need to discard the first and last name columns. I do this in a roundabout way.
The first step is to remove all commas so that each field is separated only by a space. The Vim command to do this is:
Last name, First name, prof, asst clincl dept, $120,788 Last name, First name, farm wrkr, $21,132 Last name, First name, activity aide, $19,063 Last name, First name, food svc wrkr III, $23,920
Last name First name prof asst clincl dept $120788 Last name First name farm wrkr $21132 Last name First name activity aide $19063 Last name First name food svc wrkr III $23920
And then remove all dollar signs from the salaries:
Last name First name prof asst clincl dept 120788 Last name First name farm wrkr 21132
Then I replace the last empty space, just before the salaries, with a comma:
Last name First name prof asst clincl dept,120788 Last name First name farm wrkr,21132
And to mark the first field containing the last name, I replace the first occurrence of a space (which is just after the last name) with a tab:
The lines become:
Last name First name netwk syst anlyst-arch,88539 Last name First name reactor opr II trainee,41454
Now it's easy to remove the first field with the last name since it's marked by a tab while the other fields are still separated by a space. I remove the tab delimited field containing the last name:
And this becomes:
First name netwk syst anlyst-arch,88539 First name reactor opr II trainee,41454
And I repeat these last two commands in order to remove the field with the first names, which has been moved to the first column. So, repeating the above code, I replace the first occurrence of a space with a tab and then remove the new first column:
:%s/ /\t/ :%s/^[^\t]*\t//
netwk syst anlyst-arch,88539 reactor opr II trainee,41454
Finally, I add quotes around the Position field, which is a text field. This isn't strictly necessary, but it helps to mark it:
Now, all that is left is a comma separated file with Position in the first column and Salary in the second column:
"psych aide",26832 "prgm director",150000
I add a header for this file (Position, Salary) and import into R for analysis.
Note: After I imported the file into RStudio, I found that the Salary column was recognized as a factor data type instead of a numeric data type. Based on previous experience, I knew this meant something was wrong with this column. When I converted it to a numeric data type, R notified me that the coercion resulted in some NAs. I located one NA at line 5900 and opened the CSV file in Vim and found that it was a line that carried over to the next. It seems I had missed one line from the cleaning process. I fixed it and re-imported the file in R, which now recognized it as a numeric data type. And the analysis can begin.
The analysis won't be straightforward until I can determine if there's any regularity to how higher administration titles are named or what all the administration titles are. But it's trivial to get some quick information from the data.
First, I can see how many people are employed by MU:
length(personnel$Position)  18894
18,894 people were employed by MU during the 2011 - 2012 academic year (there will certainly be some measurement error with this – as I don't know how people may be excluded or other such details).
How many position titles were there?
There were 3,126 unique positions.
The average number of and salary for assistant, associate, and full professors:
length(personnel$Salary[personnel$Position == "prof asst"]) summary(personnel$Salary[personnel$Position == "prof asst"]) length(personnel$Salary[personnel$Position == "prof assoc"]) summary(personnel$Salary[personnel$Position == "prof assoc"]) length(personnel$Salary[personnel$Position == "professor"]) summary(personnel$Salary[personnel$Position == "professor"])
The number, median and mean salaries, respectively, for:
The top six positions that earn less than or equal to $50,000 per year:
|Nurse Staff II||260|
|Office Support Staff III||249|
|Fellow Post Doctoral||236|
|Patient Service Rep||226|
The top six positions that earn more than $50,000 and less than $100,000 per year:
|Nurse Staff II||223|
|Prof Asst Clinical||106|
|Resident Phys-4th Year||96|
The top six positions that earn $100,000 or more per year:
|Prof Asst Clincl Dept||180|
Manually exploring and removing titles as I find them – the top six positions that make $100,000 or more and that do not have professor in their title:
head(summary(personnel$Position[personnel$Salary >= 100000 & personnel$Position != "professor" & personnel$Position != "prof assoc" & personnel$Position != "prof asst" & personnel$Position != "prof clinical dept" & personnel$Position != "prof assoc clincl dept" & personnel$Position != "prof asst clincl dept" & personnel$Position != "prof curator teach" & personnel$Position != "prof curators"]))
Since Positions other than those related to faculty have more varied names, they won't be represented in the above summaries, which are ranked by n of Positions. So the main work in an analysis of this type will involve collapsing all Positions related to administration into a handful of categories.
More exploratory analysis will come later. I'll have to figure out a way to identify and categorize administrative positions. This will most likely involve creating a database with these title names.