Processing Data: GNU Coreutils (Part 1)
Text Processing
We've touched on a few of these commands already, such as:
touch
cat
echo
pwd
mkdir
rmdir
head
wc
We also have commands for getting data on users:
who
w
Or the local time:
date
Today I want to cover some file-related commands for processing data in a file, specifically:
sort: for sorting lines of text files
uniq: for reporting or omitting repeated lines
cut: for removing sections from each line of files
head: for outputting the first part of files
tail: for outputting the last part of files
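As a quick illustration of head and tail, here is a minimal sketch. The file name numbers.txt and the scratch directory are made up for this example and are not part of the lesson.

```shell
# Work in a scratch directory so nothing in the current directory is touched
# (an assumption for this sketch).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 10-line sample file, one number per line.
seq 1 10 > numbers.txt

head -n 3 numbers.txt    # the first three lines: 1, 2, 3
tail -n 3 numbers.txt    # the last three lines: 8, 9, 10
tail -n +9 numbers.txt   # everything from line 9 onward: 9, 10
```

The plus sign changes the meaning of the number given to tail: -n 3 counts from the end of the file, while -n +9 names the line to start from.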
Let's look at a toy sample file that contains structured data in CSV (comma-separated values) format:
cat operating-systems.csv
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
To get data from the file:
# get the second field, where the fields are separated by a comma ","
cut -d"," -f2 operating-systems.csv
# get the third field
cut -d"," -f3 operating-systems.csv
# sort it, remove duplicate lines, and save it in a separate file
cut -d"," -f3 operating-systems.csv | sort | uniq > os-years.csv
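To try this pipeline end to end, the following sketch recreates the sample file in a scratch directory (the use of mktemp is an assumption for illustration, not part of the lesson) and runs the same steps. Because each comma in the file is followed by a space, the extracted fields keep a leading space. The last command shows why sort comes before uniq.

```shell
# Scratch directory (an assumption for this sketch) so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample file from the lesson.
cat > operating-systems.csv <<'EOF'
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# Extract the year field, sort it, collapse repeated lines, and save the result.
cut -d"," -f3 operating-systems.csv | sort | uniq > os-years.csv
cat os-years.csv

# uniq only collapses repeated lines that are adjacent. Without sort,
# the two 1993 lines are far apart, so both survive and the count stays at 7.
cut -d"," -f3 operating-systems.csv | uniq | wc -l
```

With sort in place, os-years.csv should hold six distinct years, since 1993 appears twice in the input.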
If that CSV file has a header line, then we may want to remove it from the output. First, let's look at the file:
cat operating-systems.csv
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
Say we want the license field data without that first line. For that we need the tail command, where -n +2 starts the output at line 2:
tail -n +2 operating-systems.csv | cut -d, -f2 | sort | uniq > license-data.csv
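The header-skipping step can be checked with a small self-contained sketch. It rebuilds the file with its header line in a scratch directory (an assumption for this example) and confirms that the header text does not end up in the output.

```shell
# Scratch directory (an assumption for this sketch) so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the sample file, this time with a header line.
cat > operating-systems.csv <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# tail -n +2 prints from line 2 onward, which drops the header
# before cut, sort, and uniq ever see it.
tail -n +2 operating-systems.csv | cut -d, -f2 | sort | uniq > license-data.csv
cat license-data.csv

# Count how many lines of the result are the header field; this should print 0.
grep -c "License" license-data.csv
```

Skipping the header first matters because sort would otherwise mix the word License in with the real values.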
Conclusion
In this lesson, we learned how to process and make sense of data held in a text file. We drew upon some commands we learned in prior lessons that help us navigate the command line and create files and directories. We also added commands that let us sort and view data in different ways. The commands we used in this lesson include:
sort: for sorting lines of text files
uniq: for reporting or omitting repeated lines
cut: for removing sections from each line of files
head: for outputting the first part of files
tail: for outputting the last part of files
who: show who is logged on
w: show who is logged on and what they are doing
touch: change file timestamps
cat: concatenate files and print on the standard output
echo: display a line of text
pwd: print name of current/working directory
mkdir: make directories
rmdir: remove empty directories
wc: print newline, word, and byte counts for each file
We also used two types of operators, the pipe and the redirect:
|: redirect standard output of command1 to standard input of command2
>: redirect standard output to a file, overwriting
>>: redirect standard output to a file, appending
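The difference between the three operators can be seen in a short sketch. The file name notes.txt and the scratch directory are made up for this example.

```shell
# Scratch directory (an assumption for this sketch) so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# | sends the output of echo to the input of wc, which counts the words.
echo "one two three" | wc -w

# > creates notes.txt, or overwrites it if it already exists.
echo "first" > notes.txt
echo "second" > notes.txt    # notes.txt now holds only "second"

# >> adds to the end of the file instead of replacing its contents.
echo "third" >> notes.txt

cat notes.txt
```

After these commands, notes.txt should contain two lines, second and third, because the second > replaced the first line and >> then added to the end.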