Linux Systems Administration

Author: C. Sean Burns
Date: 2022-08-12
Email: sean.burns@uky.edu
Website: cseanburns.net
Twitter: @cseanburns
GitHub: @cseanburns

Introduction

This short book was written for my Linux Systems Administration course. The book and course's goals are to provide the very basics about systems administration using Linux, and to teach students:

  1. how to use the command line in order to become more efficient computer users and more comfortable with using computers in general;
  2. how to use command line utilities and programs and to learn what can be accomplished using those programs;
  3. how to administer users and manage software on a Linux server;
  4. how to secure a Linux server; and
  5. the basics of cloud computing.

Finally, this book/course ends by walking students through the process of building a LAMP stack.

About This Book

Since I use this book for my Linux Systems Administration course, which I teach each fall semester, this book will be a live document. I will update the content as I teach it in order to address changes in the technology and to edit for clarity when I discover some aspect of the book causes confusion or does not provide enough information.

This book is not a comprehensive introduction to Linux nor to systems administration. It is designed for an entry level course on these topics and is focused on a select and small range of those topics that have specific pedagogical aims (see above).

The book started off as a series of transcripts and demonstrations. It still has that focus, but I've had a long-term goal to make these transcripts more cohesive. Achieving this became easier when I learned about mdBook.

The content in this book is open access and licensed under the GNU GPL v3.0. Feel free to fork it on GitHub and modify it for your own needs.

History of This Course

I created and started teaching this course in the Fall 2016 semester. I originally used Soyinka's (2016) excellent introduction to Linux administration, and we used VirtualBox and the Fedora Server distribution to practice and learn the material.

However, around 2018 or '19, I moved away from Soyinka's comprehensive book to focus the material on a more limited range of topics. I did this for two reasons. First, most of my students do not become systems administrators, although some have (to my delight). Second, my students have grown up using only graphical user interfaces on one of the two common, commercial operating systems, and consequently have very constrained and limited understandings of how computers work and what can be done with them. In redesigning this course, I wanted to strike a balance between these two problems. I wanted students to acquire enough skills and gain enough confidence to feel comfortable applying for (at least) entry level systems administrator jobs, and more basically, I wanted students to be exposed to a different type of computing environment than what they were used to and that fostered a hacking mentality, in the more benign and playful sense of the word.

I moved us away from using Fedora Server for the Fall 2022 course. Fedora Server is a great and fun operating system, and there's a lot to learn about Linux using it. However, since it is rather bleeding edge, it meant something would break in my demonstrations each semester, and identifying what had changed in Fedora each year made it somewhat of a chore to keep up. I have therefore switched to a less bleeding edge distribution of Linux: a still supported Ubuntu Server LTS release. Based on my personal experience managing servers that run on some version of Ubuntu LTS, I believe this should provide more stability. It helps that Ubuntu Server has a good share of the Linux server market.

The primary reason I moved us away from VirtualBox is that a good number of my students each year use Apple computers, which became a major obstacle when Apple switched to the M1 chip. I originally considered asking those students to use different virtualization software, but it was nice to have all students, regardless of operating system, and myself using the same software. I also considered using something like Docker as a replacement, but decided instead to use Google Cloud. I figured that learning how to use a service like Google Cloud might be more broadly useful to students, and that if we used Docker, we'd have to spend a lot of time installing and configuring that on their laptops. Time is already a constraint in this course, but we'll see how it goes this semester (Fall 2022).

References

Soyinka, W. (2016). Linux administration: A beginner's guide (7th ed.). New York: McGraw-Hill Education. ISBN: 978-0-07-184536-6

History of Unix and Linux

An outline of the history of Unix and Linux.

Location: Bell Labs, part of AT&T (New Jersey), late 1960s through early 1970s

  • Starts with an operating system called Multics.
  • Multics was a time sharing system
    • That is, more than one person could use it at once.
  • But Multics had issues and was slowly abandoned
  • Ken Thompson found an old PDP-7. Started to write UNIX.
    • The ed line editor was written.
    • Pronounced e.d. but generally sounded out.
  • This version of UNIX would later be referred to as Research Unix
  • Dennis Ritchie, the creator of the C programming language, joined Thompson's efforts.

Location: Berkeley, CA (University of California, Berkeley), early to mid 1970s

  • The code for UNIX was not 'free software' but low cost and easily shared.
  • Ken Thompson visited Berkeley and helped install Version 6 of UNIX
  • Bill Joy and others contributed heavily
    • Joy created the vi text editor, the predecessor of the popular Vim editor, wrote many other important programs, and co-founded Sun Microsystems
  • This installation of UNIX would eventually become known as the Berkeley Software Distribution, or BSD.

AT&T

  • Until its breakup in 1984, AT&T was not allowed to profit off patents that were not directly related to its telecommunications businesses.
  • This agreement with the US government helped protect the company from monopolistic charges, and as a result, they could not commercialize UNIX.
  • This changed after the breakup. System V UNIX became the standard bearer of commercial UNIX.

Location: Boston, MA (MIT), early 1980s through early 1990s

  • In the late 1970s, Richard Stallman noticed that software began to become commercialized.
    • As a result, hardware vendors stopped sharing the code they developed to make their hardware work.
  • Software code became eligible for copyright protection with the Copyright Act of 1976
  • Stallman, who thrived in a hacker culture, began to battle against this turn of events.
  • Stallman created the GNU project, the free software philosophy, GNU Emacs, a popular and important text editor, and he wrote many other programs.
  • The GNU project is an attempt to create a completely free, Unix-like operating system called GNU.
  • By the early 1990s, Stallman and others had developed all the utilities needed to have a full operating system, except for a kernel, which they called GNU Hurd.
  • This included the Bash shell, written by Brian Fox.
  • The GNU philosophy includes several propositions that define free software:

The four freedoms, per GNU Project:

  0. The freedom to run the program as you wish, for any purpose (freedom 0).
  1. The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
  2. The freedom to redistribute copies so you can help others (freedom 2).
  3. The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

The Four Freedoms

The Unix wars and the lawsuit, late 1980s through the early 1990s

  • AT&T, after its breakup, began to commercialize Unix, and differences in AT&T Unix and BSD Unix arose.
  • The former was aimed at commercialization, and the latter aimed at researchers and academics.
  • Unix System Laboratories, Inc. (USL, part of AT&T) sued Berkeley Software Design, Inc. (BSDi), a company that sold a BSD-based system, and later the University of California, for copyright and trademark violations.
  • USL did not prevail (the case was settled in 1994), but the lawsuit delayed adoption of BSD Unix.

Linux, Linus Torvalds, University of Helsinki, Finland, early 1990s

  • On August 25, 1991, Linus Torvalds announced that he had started working on a free operating system kernel for the 386 CPU architecture and for his specific hardware.
  • This kernel would later be named Linux.
  • Linux technically refers only to the kernel.
    • An operating system kernel handles startup, devices, memory, resources, etc.
    • A kernel does not provide user land utilities, the kinds of software that people use when using computers.
  • Torvalds' motivation was to learn about OS development but also to have access to a Unix-like system.
    • He already had access to a Unix-like system called MINIX, but MINIX had technical and copyright restrictions.
  • Torvalds has stated that if a BSD or GNU Hurd operating system had been available, then he might not have created the Linux kernel.
  • But Torvalds and others took the GNU utilities and created what is now called Linux or GNU/Linux.

Distributions, early 1990s through today

  • Soon after Linux was released, people began creating their own Linux and GNU based operating systems and distributing them.
  • As such, these Linux operating systems became referred to as distributions.
  • The two oldest distributions that are still in active development are:
    • Slackware
    • Debian

Short History of BSD, 1970s through today

  • Unix version numbers 1-6 eventually led to BSD 1-4.
  • At BSD 4.3, all versions had some AT&T code.
    • Desire to remove this code led to BSD Net/1.
  • All AT&T code was removed by BSD Net/2.
  • BSD Net/2 was ported to the Intel 386 processor.
    • This became 386BSD and was made available in 1992, a year after the Linux kernel was released.
  • 386BSD split into two projects: NetBSD and FreeBSD.
  • NetBSD split into another project: OpenBSD.
  • All three of these BSDs are still in active development.
  • From a bird's eye point of view, they each have different focuses:
    • NetBSD focuses on portability (macOS, NASA)
    • FreeBSD focuses on wide applicability (WhatsApp, Netflix, PlayStation 4, macOS)
    • OpenBSD focuses on security (has contributed a number of very important applications)

macOS is based on Darwin, is technically UNIX, and is partly based on FreeBSD with some code coming from the other BSDs. See Why is macOS often referred to as 'Darwin'? for a short history.

Short History of GNU, 1980s through today

  • The GNU Hurd is still under active development, but it's in a pre-production state.
  • The last release was 0.9 in December 2016.
  • A complete OS based on the GNU Hurd can be downloaded and run. For example: Debian GNU/Hurd

Free and Open Source Licenses

In the free software and open source landscape, there are several important free and/or open source licenses. The two most prominent are the GNU General Public License (GPL), which covers much of the software used in GNU/Linux, and the BSD license, which covers the BSDs. They each take very different approaches to free and/or open source software. The biggest difference is this:

  • Software based on software licensed under the GPL must also be licensed under the GPL. This is referred to as copyleft software, and the idea is to propagate free software.
  • Software based on software licensed under the BSD license may be closed source and primarily must only attribute the original source code and author.

What is Linux?

The Linux Kernel

Technically, Linux is a kernel, and a kernel is a part of an operating system that oversees CPU activity like multitasking, as well as networking, memory management, device management, file systems, and more. The kernel alone does not make an operating system. It needs user land applications and programs, the kind we use on a daily basis, to form a whole, as well as ways for these user land utilities to interact with the kernel.

Linux and GNU

The earliest versions of the Linux kernel were combined with tools, utilities, and programs from the GNU project to form a complete operating system, without necessarily a graphical user interface. This association continues to this day. Additional non-GNU, but free and open source programs under different licenses, have been added to form a more functional and user friendly system. However, since the Linux kernel needs user land applications to form an operating system, and since user land applications from GNU cannot work without a kernel, some argue that the operating system should be called GNU/Linux and not just Linux. This has not gained wide acceptance, though. Regardless, credit is due to both camps for their contribution, as well as many others who have made substantial contributions to the operating system.

Linux Uses

We are using Linux as a server in this course, which means we will use Linux to provide various services. Our first focus is to learn to use Linux itself, but by the end of the course, we will also learn how to provide web and database services. Linux can be used to provide other services that we won't cover in this course, such as:

  • file servers
  • mail servers
  • print servers
  • game servers
  • computing servers

Although it's a small overall percentage, many people use Linux as their main desktop/laptop operating system. I belong in this camp. Linux has been my main OS since the early 2000s. While our work on the Linux server means that we will almost entirely work on the command line, this does not mean that my Linux desktop environment is all command line. In fact, there are many graphical user environments, often called desktop environments, available to Linux users. Since I'm currently using the Ubuntu Desktop distribution, my default desktop environment is called Gnome. KDE is another popular desktop environment, but there are many other attractive and useful ones. And it's easy to install and switch between multiple ones on the same OS.

Linux has become quite a pervasive operating system. Linux powers hundreds of the fastest supercomputers in the world. It, or another Unix-like operating system, is the foundation of most web servers. The Linux kernel also forms the basis of the Android operating system and of Chrome OS. The only place where Linux does not dominate is in the desktop/laptop space.

What is Systems Administration?

Introduction

What is systems administration or who is a systems administrator (or sysadmin)? Let's start off with some definitions provided by the National Institute of Standards and Technology:

An individual, group, or organization responsible for setting up and maintaining a system or specific system elements, implements approved secure baseline configurations, incorporates secure configuration settings for IT products, and conducts/assists with configuration monitoring activities as needed.

Or:

Individual or group responsible for overseeing the day-to-day operability of a computer system or network. This position normally carries special privileges including access to the protection state and software of a system.

See: Systems Administrator @NIST

Specialized Positions

In addition to the above definitions, which broadly define the role, there are a number of related or specialized positions. We'll touch on the first three in this course:

  • Web server administrator:
    • "web server administrators are system architects responsible for the overall design, implementation, and maintenance of Web servers. They may or may not be responsible for Web content, which is traditionally the responsibility of the Webmaster" (Web Server Administrator @NIST).
  • Database administrator:
    • like web admins, and to paraphrase above, database administrators are system architects responsible for the overall design, implementation, and maintenance of database management systems.
  • Network administrator:
    • "a person who manages a network within an organization. Responsibilities include network security, installing new applications, distributing software upgrades, monitoring daily activity, enforcing licensing agreements, developing a storage management program, and providing for routine backups" (Network Administrator @NIST).
  • Mail server administrator:

Depending on where a system administrator works, they may specialize in any of the above administrative areas, or if they work for a small organization, all of the above duties may be rolled into one position. Some of the positions have evolved quite a bit over the last couple of decades. For example, it wasn't too long ago when organizations would operate their own mail servers, but this has largely been outsourced to third-party providers, such as Google (via Gmail) and Microsoft (via Outlook). People are still needed to work with these third-party email providers, but the nature of the work is different than operating independent mail servers.

Certifications

It's not always necessary to get certified as a systems administrator to get work as one, but there might be cases where it is necessary; for example, in government positions or in large corporations. It also might be the case that you can get work as an entry level systems administrator and then pursue certification with the support of your organization.

Some common starting certifications are:

Plus, Google offers, via Coursera, a beginners Google IT Support Professional Certificate that may be helpful.

Associations

Getting involved in associations and related organizations is a great way to learn and to connect with others in the field. Here are a few ways to connect.

LOPSA, or The League of Professional System Administrators, is a non-profit association that seeks to advance the field and membership is free for students.

ACM, or the Association for Computing Machinery, has a number of relevant special interest groups (SIGs) that might be beneficial to systems administrators.

NPA, or the Network Professional Association, is an organization that "supports IT/Network professionals."

Codes of Ethics

Systems administrators manage computer systems that contain a lot of data about us and this raises privacy and competency issues, which is why some have created code of ethics statements. Both LOPSA and NPA have created such statements that are well worth reviewing and discussing.

Keeping Up

Technology changes fast. In fact, even though I teach this course about every year, I need to revise the course each time, sometimes substantially, to reflect changes that have developed over short periods of time. It's also your responsibility, as sysadmins, to keep up, too.

I therefore suggest that you continue your education by reading and practicing. For example, there are lots of books on systems administration. O'Reilly continually publishes on the topic. Red Hat, the maker of the Red Hat Enterprise Linux distribution and sponsor of Fedora Linux and CentOS Linux, provides the Enable Sysadmin site, with new articles on the field each day, authored by systems administrators. Opensource.com, also supported by Red Hat, publishes articles on systems administration. Command Line Heroes is a fun and informative podcast on technology and sysadmin related topics. Linux Journal publishes great articles on Linux related topics.

Conclusion

In this section I provided definitions of systems administrators and also the related or more specialized positions, such as database administrator, network administrator, and others.

I provided links to various certifications you might pursue as a systems administrator, and links to associations that might benefit you and your career.

Technology manages so much of our daily lives, and computer systems store lots of data about us. Since systems administrators manage these systems, they hold a great amount of responsibility to protect them and our data. Therefore, I provided links to two code of ethics statements that we will discuss.

It's also important to keep up with the technology, which changes fast. The work of a systems administrator is much different today than it was ten or twenty years ago, and that surely indicates that it could be much different in another ten to twenty years. If we don't keep up, we won't be of much use to the people we serve.

Using Google Cloud (gcloud)

This section introduces us to Google Cloud (gcloud). We will use this platform to create virtual instances of the Ubuntu Server Linux operating system.

Using gcloud for Virtual Machines

Virtual Machines

Our goal in this section is to create a virtual machine (VM) instance. A VM is basically a virtualized operating system that runs on a host operating system. That host operating system may also be Linux, but it could be Windows or macOS. In short, when we use virtual machines, it means instead of installing an operating system (like Linux, macOS, Windows, etc) on a physical machine, we use virtual machine software to mimic the process. The virtual machine, thus, runs on top of our main OS. It's like an app, where the app is a fully functioning operating system.

In past semesters of this course, we used VirtualBox to create virtual machines with Linux as the virtual operating system. This worked regardless of whether you or I were running Windows, macOS, or Linux as our main operating systems. VirtualBox is freely available virtualization software, and using it let students and me run Linux as a server on our own desktops and laptops without changing the underlying OS on those machines (e.g., Windows, macOS).

However, even though we virtualize an operating system when we run a VM, the underlying operating system and CPU architecture are still important. When Apple, Inc. launched its new M1 (ARM-based) chip in 2020, it created problems for running operating systems built for non-ARM chips (i.e., x86_64) as virtual machines.

Fortunately, we are able to solve that issue using a third-party virtualization platform. In this course, that means we're going to use gcloud (via Google), but there are other options available that you can explore on your own.

Google Cloud / gcloud

Google Account

We need to have a Google account to get started with gcloud. I imagine most of you already have a Google account, but if not, go ahead and create one at https://www.google.com.

Google Cloud (gcloud) Project

Next, you need to use gcloud to create a Google Cloud project. Once you've created that project, you can enable billing for that project, and then install the gcloud software on your local machine.

Follow Step 1 at the top of the Install the gcloud CLI page to create a new project. Also, review the page on creating and managing projects.

When you create your project, you can name it anything, but try to name it something to do with this course. E.g., I am using the name sysadmin-418. Avoid using spaces when naming your project.

Then click on the Create button, and leave the organization field set to No Organization.

Google Billing

The second thing to do is to set up a billing account for your gcloud project. This does mean there is a cost associated with this product, but the good news is that our bills by the end of the semester should only amount to a couple of dollars, at most. Follow Step 2 to enable billing for your new project. See also the page on how to create, modify, or close your self-serve Cloud Billing account.

Install the latest gcloud CLI version

After you have set up billing, the next step is to install gcloud on your local machines. The Install the gcloud CLI page provides instructions for different operating systems.

There are installation instructions for macOS, Windows, Chromebooks, and various Linux distributions. Follow these instructions closely for the operating system that you're using. Note that for macOS, you have to choose among three different CPU/chip architectures. If you have an older macOS machine (before November 2020 or so), it's likely that you'll select macOS 64-bit (x86_64). If you have a newer macOS machine, then it's likely you'll have to select macOS 64-bit (arm64, Apple M1 silicon). It's unlikely that any of you are using a 32-bit macOS operating system. If you're not sure which macOS system you have, then let me know and I can help you determine the appropriate platform. Alternatively, follow these instructions to find your processor information:

  • click on the Apple menu
  • choose About This Mac
  • locate the Processor or Chip information
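
If you prefer the command line, a rough equivalent is to ask the system for its machine architecture with uname. This is only a sketch: the arch_label function name is made up for illustration, and I am assuming that uname -m reports x86_64 on Intel Macs and arm64 on Apple silicon Macs.

```shell
# Map the architecture string from uname to the download labels used above.
# arch_label is a hypothetical helper, not part of macOS or gcloud.
arch_label() {
  case "$1" in
    x86_64) echo "macOS 64-bit (x86_64)" ;;
    arm64)  echo "macOS 64-bit (arm64, Apple M1 silicon)" ;;
    *)      echo "unknown architecture: $1" ;;
  esac
}

# Print the label for the machine this is run on.
arch_label "$(uname -m)"
```

On other systems, such as a Linux VM, uname -m may print a different string, and the function simply reports it as unknown.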

After you have downloaded the gcloud CLI for your particular OS and CPU architecture, you will need to open a command prompt/terminal on your machine to complete the instructions that describe how to install the gcloud CLI. macOS uses the Terminal app, which can be located using Spotlight. Windows users can use the Command Prompt (cmd.exe), which can also be located by searching.

Windows users will download a regular .exe file, but macOS users will download a .tar.gz file. Since macOS is Unix, you can use the mv command to move that file to your $HOME directory. Then you change to $HOME with the cd command, extract the file there using the tar command, and once extracted, change to the directory that it creates with another cd command. For example, if you are downloading the x86_64 version of the gcloud CLI, then you would run the following commands from the directory that holds the download:

mv google-cloud-cli-392.0.0-darwin-x86_64.tar.gz "$HOME"
cd "$HOME"
tar -xzf google-cloud-cli-392.0.0-darwin-x86_64.tar.gz
cd google-cloud-sdk

Modify the above commands, as appropriate, if you're using the M1 version of the gcloud CLI.
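
If the mv, tar, and cd commands are new to you, it may help to rehearse the sequence on a throwaway archive first. The sketch below builds a tiny archive in a scratch directory and then moves, lists, and extracts it. Every name in it (demo-sdk, demo.tar.gz, and the two scratch directories) is made up for the exercise and has nothing to do with the real gcloud download.

```shell
# Practice run of the move / extract / change-directory sequence.
# All file and directory names here are hypothetical stand-ins.
workdir="$(mktemp -d)"     # stands in for the Downloads directory
fakehome="$(mktemp -d)"    # stands in for $HOME

# Build a small archive to practice on.
mkdir "$workdir/demo-sdk"
echo "hello" > "$workdir/demo-sdk/README"
tar -czf "$workdir/demo.tar.gz" -C "$workdir" demo-sdk

mv "$workdir/demo.tar.gz" "$fakehome"   # move the archive
cd "$fakehome"                          # go to where the archive now lives
tar -tzf demo.tar.gz                    # list the contents without extracting
tar -xzf demo.tar.gz                    # extract the archive
cd demo-sdk                             # enter the directory it created

extracted="$(cat README)"
echo "$extracted"
```

Listing an archive with tar -tzf before extracting it is a handy habit, since it shows which directory the archive will create.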

Initializing the gcloud CLI

Once you have downloaded and installed the gcloud CLI program, you need to initialize it on your local machine. Scroll down on the install page to the section titled Initializing the gcloud CLI. In your terminal/command prompt, run the initialization command, per the instructions at the above page:

gcloud init

And continue to follow the above instructions.
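
Once initialization finishes, a quick check can confirm that the gcloud program is on your PATH and show which account and project it is set to use. This is a minimal sketch, not part of the official instructions. It only uses the gcloud version and gcloud config list commands, and the status variable is just an illustrative name.

```shell
# Check whether gcloud is installed before asking it anything.
if command -v gcloud >/dev/null 2>&1; then
  gcloud version        # show the installed CLI components
  gcloud config list    # show the active account and project settings
  status="found"
else
  echo "gcloud was not found on PATH; revisit the install steps"
  status="missing"
fi
```

If the project shown is not the one you created for this course, running gcloud init again lets you pick a different one.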

gcloud VM Instance

Once you've initialized gcloud, log into Google Cloud Console, which should take you to the Dashboard page.

Our first goal is to create a virtual machine (VM) instance. As a reminder, a VM is basically a virtualized operating system. That means instead of installing an operating system (like Linux, macOS, Windows, etc) on a physical machine, software is used to mimic the process.

gcloud offers a number of Linux-based operating systems to create VMs. We're going to use the Ubuntu operating system and specifically the Ubuntu 20.04 LTS version.

Ubuntu is a Linux distribution. A new version of Ubuntu is released every six months. The 20.04 signifies that this is the April 2020 version. LTS signifies Long Term Support. LTS versions are released every two years, and Canonical Ltd., the owner of Ubuntu, provides standard support for LTS versions for five years.

LTS versions of Ubuntu are also more stable. Non-LTS versions of Ubuntu only receive nine months of standard support, and generally apply cutting edge technology, which is not always desirable for server operating systems. Each version of Ubuntu has a code name. 20.04 has the code name Focal Fossa. You can see a list of versions, code names, release dates, and more on Ubuntu's Releases page.
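
The version numbering described above (two-digit year, a dot, two-digit month) can be illustrated with a few lines of shell. This is only a sketch: decode_ubuntu_version is a made-up function name, and it assumes the version string always has the YY.MM form with a year in the 2000s.

```shell
# Split an Ubuntu version string such as 20.04 into its year and month.
# decode_ubuntu_version is a hypothetical helper for illustration only.
decode_ubuntu_version() {
  version="$1"
  year="20${version%%.*}"    # text before the dot, prefixed with 20
  month="${version##*.}"     # text after the dot
  echo "released $year-$month"
}

decode_ubuntu_version 20.04
decode_ubuntu_version 22.10
```

Once you are connected to your VM later in this section, running cat /etc/os-release there should show the version and code name that the server actually runs.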

We will create our VM using the gcloud console. To do so, follow these steps:

  • Click the Select from drop-down list.
  • In the window, select the project that you created earlier.
  • Next, click on Create a VM.
  • Provide a name for your instance.
    • E.g., I chose fall-2022 (no spaces)
  • Under the Series dropdown box, make sure E2 is selected.
  • Under the Machine type dropdown box, select e2-micro (2 vCPU, 1 GB memory)
    • This is the lowest cost virtual machine and perfect for our needs.
  • Under Boot disk, click on the Change button.
  • In the window, select Ubuntu from the Operating system dropdown box.
  • Select Ubuntu 20.04 LTS x86/64
  • Leave Boot disk type set to Balanced persistent disk
  • Disk size should be set to 10 GB.
  • Click on the Select button.
  • Check the Allow HTTP traffic box
  • Finally, click on the Create button to create your VM instance.
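
For reference, the console choices above roughly correspond to a single gcloud compute instances create command. The sketch below only prints such a command so that you can read it; it does not create anything. The print_create_cmd function and the example values passed to it (instance name, zone, and project ID) are assumptions for illustration, so substitute your own. Note also that the console checkbox for HTTP traffic sets up a firewall rule for you, while the http-server tag on its own may not, so treat that part as approximate.

```shell
# Print (do not run) a gcloud command that mirrors the console steps.
# Arguments: $1 = instance name, $2 = zone, $3 = project ID.
# print_create_cmd is a hypothetical helper, not a gcloud feature.
print_create_cmd() {
  cat <<EOF
gcloud compute instances create "$1" \\
  --zone="$2" --project="$3" \\
  --machine-type=e2-micro \\
  --image-family=ubuntu-2004-lts --image-project=ubuntu-os-cloud \\
  --boot-disk-size=10GB --boot-disk-type=pd-balanced \\
  --tags=http-server
EOF
}

# Example values only; the zone and project ID here are placeholders.
print_create_cmd "fall-2022" "us-east1-b" "sysadmin-418"
```

Reading the printed command next to the console form is a good way to see how each dropdown maps to a command line option.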

Connect to our VM

After the new VM has been created, we need to connect to it via the command line. macOS users will connect to it via their Terminal.app. Windows users can connect to it via their command prompt.

Unlike our past ssh sessions, we use a slightly different ssh command to connect to our VMs. The syntax follows this pattern:

gcloud compute ssh --zone "zone-info" "name-info" --project "project-id"

The values in the double quotes in the above command can be located in your Google Cloud console and in your VM instances section.
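
As an illustration, the pattern can be filled in with a small helper that prints the finished command. This is a sketch under assumptions: ssh_cmd is a made-up function, and the zone, instance name, and project ID shown are example values, not ones you should expect to match your own setup.

```shell
# Print the connection command for a given zone, instance, and project.
# ssh_cmd is a hypothetical helper; it does not connect to anything.
# If gcloud is set up, these two commands can help you find the values:
#   gcloud compute instances list      (instance names and zones)
#   gcloud config get-value project    (the active project ID)
ssh_cmd() {
  printf 'gcloud compute ssh --zone "%s" "%s" --project "%s"\n' "$1" "$2" "$3"
}

# Example values only.
ssh_cmd "us-east1-b" "fall-2022" "sysadmin-418"
```

Copy the printed line, with your own values in place, into your terminal to open the connection.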

Update our Ubuntu VM

The VM will include a recently updated version of Ubuntu 20.04, but it may not be completely updated. Thus the first thing we need to do is update our machines. On Ubuntu, we'll use the following two commands, which you should run also:

sudo apt update
sudo apt -y upgrade
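
Some upgrades, such as a new kernel, only take full effect after a reboot. As a small sketch, and assuming the Ubuntu convention of creating the file /var/run/reboot-required when a reboot is pending, you could check for that file after the upgrade finishes. The reboot_needed variable is just an illustrative name.

```shell
# Report whether the system has flagged that a reboot is pending.
# Assumes Ubuntu creates /var/run/reboot-required in that case.
if [ -f /var/run/reboot-required ]; then
  reboot_needed="yes"
else
  reboot_needed="no"
fi
echo "Reboot needed: $reboot_needed"
```

If the answer is yes, sudo reboot restarts the VM, and you can reconnect with the gcloud compute ssh command after a minute or so.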

Then type exit to log out and quit the connection to the remote server.

exit

Snapshots

Lastly, we have installed a pristine version of Ubuntu, but it's likely that we will mess something up as we work on our systems. Or it could be that our systems may become compromised at some point. Therefore, we want to create a snapshot of our newly installed Ubuntu server. This will allow us to restore our server if something goes wrong later.

To get started:

  1. In the left hand navigation panel, click on Snapshots.

  2. At the top of the page, click on Create Snapshot.

  3. Provide a name for your snapshot: e.g., ubuntu-1.

  4. Provide a description of your snapshot: e.g.,

    This is a new install of Ubuntu 20.04.

  5. Choose your Source disk.

  6. Choose a Location to store your snapshot.

    • To avoid extra charges, choose Regional.
    • From the dropdown box, select the same location (zone-info) your VM has
  7. Click on Create

Please monitor your billing for this to avoid costs that you do not want to incur.
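
The location step above depends on knowing the region that your VM's zone belongs to. A zone name is the region name plus a final letter, so the region can be derived by dropping the last hyphenated part. The sketch below shows that and then prints, without running it, a command line equivalent of the console steps. The region_of_zone function and the example zone are assumptions for illustration, and I am also assuming that the boot disk carries the same name as the instance, which you should confirm in the console.

```shell
# Derive a region name from a zone name by removing the trailing "-letter".
# region_of_zone is a hypothetical helper for illustration only.
region_of_zone() {
  echo "${1%-*}"
}

zone="us-east1-b"                      # example zone; use your VM's zone
region="$(region_of_zone "$zone")"
echo "Region: $region"

# Print (do not run) a snapshot command that mirrors the console steps.
echo "gcloud compute snapshots create ubuntu-1 --source-disk=fall-2022 --source-disk-zone=$zone --storage-location=$region"
```

Whichever way you create the snapshot, it should then appear on the Snapshots page in the console.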

Conclusion

Congratulations! You have just completed your first installation of a Linux server.

To summarize, in this section, you learned about and created a VM with gcloud. This is a lot! After this course is completed, you will be able to fire up a virtual machine on short notice and deploy websites and more.

Learning the Command Line

It's obviously more common for people today to learn how to use a computer via a graphical user interface, but graphical user interfaces entail extra software, and the more software we have on a server, the more resources that software consumes, and the more we expose our systems to security risks.

Graphical user interfaces also do not provide a good platform for automation, at least not remotely as well as command line interfaces do. Working on the command line, in what is known as a shell, is in fact programming the computer.

Fortunately, Linux, and many other Unix-like operating systems, have the ability to operate without graphical user interfaces. This is partly the reason why these operating systems have done so well in the server market.

In this section, our focus is learning the command line environment, how to use it, and what it offers.

The Linux Filesystem

In this demo, we will cover:

  • the Linux filesystem and how it is structured and organized, and
  • the basic commands to navigate around and to work with directories and files

The terms directories and folders are synonymous, but as users of primarily graphical user interfaces, you are more likely familiar with the term folders. I will more often use the term directories since that is the command line (text user interface) convention. I will use the term folders when referring to a graphical environment.

Throughout this demonstration, I encourage you to gcloud compute ssh into our remote server and follow along with the commands that I use. See Section 2.1 for details on connecting to the remote server.

Visualizing the Filesystem as a Tree

We will need to work within the filesystem quite a lot in this course, but the term filesystem may refer to different concepts, and it's important to clear that up before we start.

In some cases, a filesystem refers to how data (files) are stored and retrieved on a device like a hard drive, USB drive, etc. For example, macOS uses the Apple File System (APFS) by default, and Windows uses the New Technology File System (NTFS). Linux and other Unix-like operating systems use a variety of filesystems, but presently, the two major ones are ext4 and btrfs. The former is the default filesystem on distributions like Debian and Ubuntu; the latter is the default on the Fedora and openSUSE distributions. Opensource.com has a nice overview of filesystems under this concept.

A filesystem might also be used to refer to the directory structure or directory tree of a system. This concept is related to the prior concept of a filesystem, but it's used here to refer to the location of files and directories on a system. For example, on Windows, the filesystem is identified by a letter, like the C: drive, regardless of whether the disk has an NTFS filesystem or a FAT filesystem. Additional drives (e.g., extra hard drives, USB drives, DVD drives, etc.) will be assigned their own letters (A:, B:, D:, etc.). macOS adheres to a tree-like filesystem like Linux and other Unix-like operating systems. (This is because macOS is UNIX.) In these operating systems, we have a top-level root directory identified by a single forward slash /, and then subdirectories under that root directory. Additional drives (e.g., extra hard drives, USB drives, DVD drives, etc.) are mounted under that root hierarchy and not separately like on Windows. Linux.com provides a nice overview of the most common directory structure that Linux distributions use along with an explanation for the major base-level directories. In this section, we will learn about this type of filesystem.

On Linux, we can visualize the filesystem with the tree command. The tree command, like many Linux commands, can be run on its own or with options, as in the second and third examples below:

  • tree : list contents of directories in a tree-like format
    • tree -dfL 1 : directories only, full path, one level
    • tree -dfL 1 / : list directories only at root / level
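The tree program is not installed by default on every distribution. As a rough stand-in, and assuming a find that supports the -mindepth and -maxdepth options (GNU find does), the following sketch lists the directories directly under / in a way that is comparable to tree -dfL 1 /. The variable name top_level is only for illustration.

```shell
# List only the directories that sit directly under the root directory (/).
# -mindepth 1 skips / itself; -maxdepth 1 stops find from descending any further.
top_level=$(find / -mindepth 1 -maxdepth 1 -type d | sort)
echo "$top_level"
```

Unlike tree, the -type d test does not match symbolic links, so on distributions where a directory such as /bin is a link to another location, it will be missing from this listing.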

The root Directory and its Base Level Directories

As explained on the Linux.com page, here are the major subdirectories under / (root) and a short description of their main purpose:

  • /bin : binary files needed to use the system
  • /boot : files needed to boot the system
  • /dev : device files -- all hardware has a file
  • /etc : system configuration files
  • /home : user directories
  • /lib : libraries/programs needed for other programs
  • /media : where external storage is mounted
  • /mnt : where other filesystems may be mounted
  • /opt : add-on software, often software you compile yourself
  • /proc : files containing info about your computer
  • /root : home directory of superuser
  • /run : used by system processes
  • /sbin : like /bin, binary files that require superuser privileges
  • /srv : contains data for servers
  • /sys : contains info about devices
  • /tmp : temp files used by applications
  • /usr : user binaries, etc that might be installed by users
  • /var : variable files, used often for system logs

Although there are 18 directories listed above that branch off from the root directory, we will use some more often than others. For example, the /etc directory contains system configuration files, and we will use the contents of this directory, along with the /var directory, quite a bit when we set up our web servers, relational database servers, and more later in the semester. The /home directory is where our default home directories are stored, and if you manage a multi-user system, then this will be an important directory to manage.

Source: Linux Filesystem Explained

Relative and Absolute Paths

macOS users have the Finder app to navigate their filesystem, to move files to different folders, to copy files, to trash them, etc. Windows users have File Explorer for these functions. Linux users have similar graphical software options, but all of these functions can be completed on the Linux command line, and generally more efficiently. To get started, we need to learn two things first:

  1. how to specify the locations of files and directories in the filesystem
  2. the commands needed to work with the filesystem

To help specify the locations of files and directories, there are two key concepts to know:

  • absolute paths
  • relative paths

Above we learned about the / root directory and its subdirectories. All sorts of commands, especially those that deal with files and directories (like copying, moving, deleting), require us to specify on the command line the locations of the files and directories. It's common to specify the location in two different ways, by specifying their absolute path (or location) on the filesystem, or the relative path (or location).

To demonstrate, we might want to move around the filesystem. When we first log in to our remote system, our default location will be our home directory, sometimes referred to as $HOME. The path (location) to that directory will be:

/home/USER

Where USER is your username. Therefore, since my username is sean, my home directory is located at:

/home/sean

which we can see specified with the pwd (print working directory) command:

pwd
/home/sean

When I write $HOME, I am referring to a default, environmental variable that points to our home directory. It's variable because, depending on which account we're logged in as, $HOME will point to a different location. For me, then, that will be /home/sean, if I'm logged in as sean. For you it'll point to your home directory.

In my home directory, I have a subdirectory called public_html. The path to that is:

/home/sean/public_html

In a program like Finder (macOS) or File Explorer (Windows), if I want to change my location to that subdirectory (or folder), then I'd double click on its folder icon. On the command line, however, I have to write out the command and the path to the subdirectory. Therefore, starting in my home directory, I use the following command to switch to the public_html subdirectory:

cd public_html

Note that files and directories in Linux are case sensitive. This means that a directory named public_html can co-exist alongside a directory named Public_html. Or a file named paper.txt can co-exist alongside a file named Paper.txt. So be sure to use the proper case when spelling out files, directories, and even commands.

The above is an example of using a relative path, and that command would only be successful if I were first in my $HOME directory. That's because I specified the location of public_html relative to my default ($HOME) location.

I could have also specified the absolute location, but this would be the wordier way. Since the public_html directory is in my $HOME directory, and my $HOME directory is a subdirectory in the /home directory, then to specify the absolute path in the above command, I'd write:

cd /home/sean/public_html

Again, the relative path specified above would only work if I was in my home directory, because cd public_html is relative to the location of /home/sean. That is, the subdirectory public_html is in /home/sean. But specifying the absolute path would work no matter where I was located in the filesystem. For example, if I was working on a file in the /etc/apache2 directory, then using the absolute path (cd /home/sean/public_html) would work. But the relative path (cd public_html) command would not since there is no subdirectory called public_html in the /etc/apache2 directory.
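To try this out without touching a real home directory, here is a small sketch that builds a throwaway directory tree and repeats the steps above. The directory names demo_home and elsewhere are made up for this example, and mktemp -d is a command we have not covered; it simply creates a temporary directory for us to practice in.

```shell
# Build a practice tree in a temporary location.
base=$(mktemp -d)
mkdir -p "$base/demo_home/public_html" "$base/elsewhere"

# Relative path: works because public_html is inside the current directory.
cd "$base/demo_home"
cd public_html
pwd

# The same relative path from a different location fails,
# since there is no public_html subdirectory there.
cd "$base/elsewhere"
cd public_html 2>/dev/null || echo "relative path failed from $(pwd)"

# Absolute path: works no matter where we are located.
cd "$base/demo_home/public_html"
pwd
```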

Finally, you can use the ls command to list the contents of a directory, i.e., the files and subdirectories in a directory:

ls

We will cover this more next.

Conclusion

Understanding relative and absolute paths is one of the more difficult concepts for new command line users to learn, but after time, it'll feel natural. So just keep practicing, and I'll go over this throughout the semester.

In this section, you learned the following commands:

  • tree to list directory contents in a tree-like format
  • cd to change directory
  • pwd to print working directory

You learned different ways to refer to the home directory:

  • /home/USER
  • $HOME
  • ~

You learned about relative and absolute paths. An absolute path starts with the root directory /. Here's an absolute path to a file named paper.txt in my home directory:

  • absolute path: /home/sean/paper.txt

If I were already in my home directory, then the relative path would simply be:

  • relative path: paper.txt

Files and Directories

Basic Directory and File commands

In order to explore the above directories but also to create new ones and work with files, we need to know some basic terminal commands. A lot of these commands are part of the base system called GNU Coreutils, and in this demo, we will specifically cover some of the following GNU Coreutils:

Directory Listing

I have already demonstrated one command: the cd (change directory) command. This will be one of the most frequently used commands in your toolbox.

In our current directory, or once we have changed to a new directory, we will want to learn its contents (what files and directories it contains). We have a few commands to choose from to list contents (e.g., you have already seen the tree command), but the most common command is the ls (list) command. We use it by typing the following two letters in the terminal:

ls

Again, to confirm that we're in some specific directory, use the pwd command to print the working directory.

Most commands can be combined with options. Options provide additional functionality to the base command, and in order to see what options are available for the ls command, we can look at its man(ual) page:

man ls

From the ls man page, we learn that we can use the -l option to format the output of the ls command as a long-list, or a list that provides more information about the files and directories in the working directory. Later in the semester, I will talk more about what the other parts of the output of this option mean.

ls -l

We can use the -a option to list hidden files. In Linux, hidden files are hidden from the base ls command if the files begin with a period. We have some of those files in our $HOME directories, and we can see them like so:

ls -a

We can also combine options. For example, to view all files, including hidden ones, in the long-list format, we can use:

ls -al
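Here is a small sketch that shows the difference between these options in a practice directory. The file names notes.txt and .hidden_notes are made up for this example, and the directory is created with mktemp -d so that nothing in our home directory changes.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"

# Create one regular file and one hidden file (its name begins with a period).
touch notes.txt .hidden_notes

# Plain ls leaves out the hidden file.
visible=$(ls)
echo "$visible"

# ls -a includes it, along with the . and .. entries.
all=$(ls -a)
echo "$all"

# Combined options: long-list format, hidden files included.
ls -al
```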

Basic File Operations

Some basic file operation commands include:

  • cp : copying files and directories
  • mv : moving (or renaming) files and directories
  • rm : removing (or deleting) files and directories
  • touch : change file timestamps (or, create a new, empty file)

These commands also have various options that can be viewed in their respective man pages. Again, command options provide additional functionality to the base command, and are mostly (but not always) prepended with a dash and a letter or number. To see examples, type the following commands, which will launch the manual pages for them. Press q to exit the manual pages, and use your up and down arrow keys to scroll through the manuals:

man cp
man mv
man rm
man touch

The touch command's primary use is to change a file's timestamp; that is, the command updates a file's "access and modification times" (see man touch). For example, let's say we have a file called paper.txt in our home directory. We can see the output here:

ls -l paper.txt
-rw-rw-r-- 1 sean sean 0 Jun 27 00:13 paper.txt

This shows that the last modification time was 12:13AM on June 27.

If I run the touch command on paper.txt and then list the file again, the timestamp will change:

touch paper.txt
ls -l paper.txt
-rw-rw-r-- 1 sean sean 0 Jun 27 00:15 paper.txt

This shows an updated modification timestamp of 12:15AM.

A side effect occurs when we name a file with the touch command but the file does not exist. In that case, the touch command will create an empty file with the name we use. Let's say that I do not have a file named file.txt in my home directory. If I run the ls -l file.txt command, I'll receive an error since the file does not exist. But if I then use the touch file.txt command, and then run ls -l file.txt again, we'll see that the file now exists and that it has a byte size of zero:

ls -l file.txt
ls: cannot access 'file.txt': No such file or directory
touch file.txt
ls -l file.txt
-rw-rw-r-- 1 sean sean 0 Jun 27 00:18 file.txt
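The same sequence can be repeated safely in a practice directory. This is only a sketch: mktemp -d creates a temporary directory, and the [ -s file.txt ] test, which we have not covered, is true only when a file is larger than zero bytes.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"

# file.txt does not exist yet, so ls reports an error.
ls -l file.txt 2>/dev/null || echo "file.txt does not exist yet"

# touch creates it as a new, empty file.
touch file.txt
ls -l file.txt

# Confirm that the new file has a byte size of zero.
[ -s file.txt ] || echo "file.txt exists and is empty"
```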

Here are some ways to use the other three commands and their options:

Copying Files and Directories

To copy an existing file (file1.txt) to a new file (file2.txt):

cp file1.txt file2.txt

Use the -i option to copy that file in interactive mode; that is, to prompt you before overwriting an existing file.

We also use the cp command to copy directories, but to copy a directory and its contents, we need to add the -r (recursive) option.
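Here is a short sketch of copying a file and then a whole directory. The names drafts and drafts-backup are made up for this example. I leave out the -i option here because it waits for us to answer a prompt at the keyboard.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"

# Create a file with one line of text in it.
echo "first draft" > file1.txt

# Copy an existing file to a new file.
cp file1.txt file2.txt

# Copy a file into a subdirectory, keeping its name.
mkdir drafts
cp file1.txt drafts/

# Copy a directory and everything in it with -r (recursive).
cp -r drafts drafts-backup
ls drafts-backup
```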

Moving Files and Directories

The mv command will move an existing file to a different directory, and/or rename the file. For example, from within our home directory (therefore, using relative path names), to move a file named "file.docx" to a subdirectory named "Documents":

mv file.docx Documents/

To rename a file only (keeping it in the same directory), the command looks like this:

mv file.docx newName.docx

To move the file to our Documents/ subdirectory and also rename it, then we'd do this:

mv file.docx Documents/newName.docx

The man page for the mv command also describes an -i option for interactive mode that helps prevent us from overwriting existing files. For example, if we have a file called paper.docx in our $HOME directory, and we have a file named paper.docx in our $HOME/Documents directory, and if these are actually two different papers (or files), then moving the file to that directory will overwrite it without asking. The -i option will prompt us first:

mv -i paper.docx Documents/paper.docx
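To practice the three forms of the mv command without risking real files, here is a sketch that uses a temporary directory. The file name report.docx is made up for this example, and the -i option is left out because it waits for an answer at the keyboard.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"
mkdir Documents
touch file.docx report.docx

# Rename a file only (it stays in the same directory).
mv file.docx newName.docx

# Move a file into Documents/ and keep its name.
mv newName.docx Documents/

# Move a file into Documents/ and rename it in one step.
mv report.docx Documents/final-report.docx

ls Documents
```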

Remove or Delete

Finally, to delete a file, we use the rm command:

rm file.html

Unlike moving a file to the trash bin in your graphical user environment, deleting a file with the rm command makes it very hard to recover. That is, using rm does not mean the file or directory is trashed; rather, it means it was deleted.

Special File Types

For now, let's only cover two commands here:

  • mkdir for creating a new directory
  • rmdir for deleting an empty directory

Like the above commands, these commands also have their own set of options that can be viewed in their respective man pages:

man mkdir
man rmdir

Make or Create a New Directory

We use these commands like we do the ones above. If we are in our $HOME directory, and we want to create a new directory called bin, we do:

mkdir bin 

The bin directory in our $HOME directory is a default location to store our personal applications, or applications (programs) that are only available to us.

And if we run ls, we should see that it was successful.

Delete a Directory

The rmdir command is a bit weird because it only removes empty directories. To remove the directory we just created, we use it like so:

rmdir bin 

However, if you want to remove a directory that contains files or other subdirectories, then you will have to use the rm command along with the -r (recursive) option:

rm -r directory-with-content/
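The difference between rmdir and rm -r is easy to see in a practice directory. In this sketch, the file name notes.txt is made up, and the work happens in a temporary directory created with mktemp -d.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"

# An empty directory can be removed with rmdir.
mkdir bin
rmdir bin

# A directory with a file in it cannot.
mkdir directory-with-content
touch directory-with-content/notes.txt
rmdir directory-with-content 2>/dev/null || echo "rmdir refused: directory is not empty"

# rm -r removes the directory along with its contents.
rm -r directory-with-content/
ls
```

Since rm -r deletes everything under the directory we name, it's worth double checking the path before pressing Enter.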

Printing Text

There are a number of ways to print text to standard output, which is our screen by default in the terminal. We could also redirect standard output to a file, to a printer, or to a remote shell. We'll see examples like that later in the semester. Here let's cover three commands:

  • echo : to print a line of text to standard output
  • cat : to concatenate and write files
  • less : to view files one page at a time

Standard output is by default the screen. When we print to standard output, then by default we print to the screen. However, standard output can be redirected to files, programs, or devices, like actual printers.

To use echo:

echo "hello world"
echo "Today is a good day."

We can also echo variables:

a=4
echo "$a"

cat is listed elsewhere in the GNU Coreutils page. The primary use of the cat command is to join, combine, or concatenate files, but if used on a single file, it has this nice side effect of printing the content of the file to the screen:

cat file.html

If the file is very long, we might want to use what's called a pager. There are a few pagers to use, but the less command is a common one:

less file.html

Like with the man pages, use the up and down arrow keys to scroll through the output, and press q to quit the pager.
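Since the primary use of cat is to concatenate files, here is a sketch that joins two small files. The file names part1.txt and part2.txt are made up for this example, and I use the > operator, which is covered in a later section, to put a line of text in each file.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"

# Create two one-line files.
echo "first line" > part1.txt
echo "second line" > part2.txt

# On a single file, cat prints the contents of that file.
cat part1.txt

# On two files, cat prints them one after the other.
combined=$(cat part1.txt part2.txt)
echo "$combined"
```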

Conclusion

In this demo, we learned about the filesystem or directory structure of Linux, and we also learned some basic commands to work with directories and files. You should practice using these commands as much as possible. The more you use them, the easier it'll get. Also, be sure to review the man pages for each of the commands, especially to see what options are available for each of them.

Basic commands covered in this demo include:

  • cat : display contents of a file
  • cp : copy
  • echo : print a line of text
  • less : display contents of a file by page
  • ls : list
  • man : manual pages
  • mkdir : create a directory
  • mv : move or rename
  • pwd : print name of current/working directory
  • rmdir : delete an empty directory
  • rm : remove or delete a file or directory
  • touch : change file timestamps (or create a new, empty file)
  • tree : list contents of directories in a tree-like format

File Attributes

Identifying Ownership and Permissions

In the last section, we saw that the output of the ls -l command included a lot of extra information besides a listing of file names. The output also listed the owners and permissions for each file and directory.

Each user account on a Linux system (like many operating systems) has a user name and has at least one group membership, and that name and that group membership determine the user and group ownership for all files created under that account.

In order to allow or restrict access to files and directories (for example, to allow other users to read, write to, or run your or others' files), ownership and permissions are set in order to manage that kind of access to those files and directories. There are thus two owners for every file (and directory):

  • user owner
  • group owner

And there are three permission modes that restrict or expand access to each file (or directory) based on user or group membership:

  • (r)ead
  • (w)rite
  • e(x)ecute (as in a program)

I am emphasizing the rwx in the above list of modes because we will need to remember what these letters stand for when we work with file and directory permissions.

Consider the output of ls -l in some public_html directory that contains a single file called index.html:

-rw-rw-r-- 1 sean sean 11251 Jun 20 14:41 index.html

According to the above output, we can parse the following information about the file:

Attributes               ls -l output
File permissions         -rw-rw-r--
Number of links          1
Owner name               sean
Group name               sean
Byte size                11251
Last modification date   Jun 20 14:41
File name                index.html

What's important for us right now are the File permissions row, the Owner name row, and the Group name row.

The Owner and Group names of the index.html file are sean because there is a user account named sean on the system and a group account named sean on the system, and that file exists in the user sean's home directory.

The File permissions row shows:

-rw-rw-r--

Let's ignore the first dash for now. The remaining permissions can be broken down as:

  • rw- (read and write only permissions for the Owner)
  • rw- (read and write only permissions for the Group)
  • r-- (read-only permissions for the World)

We read the output as such (dashes, other than the initial one, signify no permissions):

  • User sean is the Owner and has (r)ead and (w)rite permissions on the file but not e(x)ecute permissions (rw-).
  • Group sean is the Group owner and has (r)ead and (w)rite permissions on the file but not e(x)ecute permissions (rw-).
  • The World can (r)ead the file but cannot (w)rite to the file nor e(x)ecute the file (r--).
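One way to practice reading these sets is to pull them apart on the command line. The following is only a sketch: it creates a practice index.html file, sets the same permissions shown above with the chmod command (covered later in this section), and uses the cut command (covered in the text processing section) to select characters by position. The variable names are made up for this example.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"
touch index.html
chmod 664 index.html

# The first ten characters of ls -l output are the file type and permissions.
perms=$(ls -l index.html | cut -c1-10)

# Characters 2-4 are the Owner set, 5-7 the Group set, and 8-10 the World set.
owner=$(echo "$perms" | cut -c2-4)
group=$(echo "$perms" | cut -c5-7)
world=$(echo "$perms" | cut -c8-10)

echo "Owner: $owner"
echo "Group: $group"
echo "World: $world"
```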

The word write is a classical computing term that means, essentially, to edit and save edits of a file. Today we use the term save instead of write, but remember that they are basically equivalent terms.

Since this is an HTML page for a website, the World permissions allow people to view (read) the file but not write (save) to it nor execute (run) it. Any webpage you view on the internet at least has World mode set to read.

Let's take a look at another file. In our /bin directory, we can see a listing for this program (note that I specify the absolute path of the file named zip):

ls -l /bin/zip
-rwxr-xr-x 1 root   root    212K Feb  2  2021  zip*

Attributes               ls -l output
File permissions         -rwxr-xr-x
Number of links          1
Owner name               root
Group name               root
Byte size                212K
Last modification date   Feb 2 2021
File name                zip*

Since zip is a computer program used to package and compress files, it needs to be e(x)ecutable. That is, users on the system need to be able to run it. But notice that the owner and group names of the file point to the user root. We have already learned that there is a root level in our filesystem. This is the top level directory in our filesystem and is referenced by the forward slash: /. But there is also a root user account. This is the system's superuser. The superuser can run or access anything on the system, and this user also owns most of the system files.

We read the output as such:

  • User root is the Owner and has (r)ead, (w)rite, and e(x)ecute (rwx) permissions on the file.
  • Group root is the Group owner and has (r)ead and e(x)ecute permissions but not (w)rite permissions (r-x)
  • The World has (r)ead and e(x)ecute permissions but not (w)rite (r-x). These permissions allow other users (like you and me) to use the zip program.

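The asterisk at the end of the file name (zip*) simply indicates that this file is an executable; i.e., it is a software program that you can run. The ls command adds this marker when it is run with the -F (classify) option, which some systems enable through a shell alias.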

Finally, let's take a look at the permissions for a directory. On my system, I run the following command in my home directory, which will show the permissions for my /home/sean directory:

ls -ld

And the output is:

drwx--x--- 51 sean sean 4.0K Jun 23 18:35 ./

This shows that:

Attributes               ls -ld output
File permissions         drwx--x---
Number of links          51
Owner name               sean
Group name               sean
Byte size                4.0K
Last modification date   Jun 23 18:35
File name                ./

This is a little different from the previous examples, but let's parse it:

  • Instead of an initial dash, this file has an initial d that identifies this as a directory. Directories in Linux are simply special types of files.
  • User sean has read, write, and execute (rwx) permissions.
  • Group sean has execute (--x) permissions only.
  • The World has no permissions (---).
  • ./ signifies the current directory, which happens to be my home directory, since I ran that command at the /home/sean path.

The takeaway from this set of permissions and the ownership is that only the user sean and those in the group sean, which is just the user sean, can access this home directory.

We might ask why the directory has an e(x)ecutable bit set for the owner and the group if a directory is not an executable file. That is, it's not a program or software. This is so that the owner and the group can access that directory using, for example, the cd (change directory) command. If the directory was not executable, like it's not for the World (---), then it would not be accessible with the cd command, or any other command. In this case, the World (users who are not me) cannot access my home directory.

Changing File Permissions and Ownership

Changing File Permissions

All the files and directories on a Linux system have default ownership and permissions set. This includes new files that we might create as we use our systems. There will be times when we will want to change the defaults, for example, the kinds of defaults described above. There are several commands available to do that, and here I'll introduce you to the two most common ones.

  1. The chmod command is used to change file (and directory) permissions (or file mode bits).
  2. The chown command is used to change a file's (and directory's) owner and group.

The chmod command changes the -rwxrwxrwx part of a file's attributes that we see with the ls -l command. Each one of those bits (the r, the w, and the x) is assigned one of the following octal values:

permission   description      octal value
r            read             4
w            write            2
x            execute          1
-            no permissions   0

There are three octal values for the three sets of permissions represented by -rwxrwxrwx. If I bracket the sets (for demonstration purposes only), they look like this:

-[rwx][rwx][rwx]

The first set describes the permissions for the owner. The second set describes the permissions for the group. The third set describes the permissions for the World.

We use the chmod command and the octal values to change a file or directory's permissions. For each set, we add up the octal values. For example, to make a file read (4), write (2), and executable (1) for the owner only, and zero out the permissions for the group and World, we use the chmod command like so:

chmod 700 paper.txt

We use 7 because 4+2+1=7, and we use two zeroes in the second two places since we're removing permissions for group and World.

If we want to make the file read, write, and executable by the owner, the group, and the world, then we repeat this for each set:

chmod 777 paper.txt

More commonly, we might want to restrict permissions. Here we enable rw- for the owner, and r-- for the group and the World:

chmod 644 paper.txt

We use 6 for the owner because 4+2=6, and we use 4 (read only) for both the group and the World.
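To see the arithmetic and the result together, here is a sketch that adds up the octal values for each set and then checks the file afterward. The variable names are made up for this example, and the cut command (covered in the text processing section) is used only to keep the permissions part of the ls -l output.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"
touch paper.txt

# Add up the octal values for each set: r=4, w=2, x=1.
owner=$((4 + 2))   # rw- is 6
group=4            # r-- is 4
world=4            # r-- is 4

# This is the same as running: chmod 644 paper.txt
chmod "${owner}${group}${world}" paper.txt

# Keep only the first ten characters of the long listing.
perms=$(ls -l paper.txt | cut -c1-10)
echo "$perms"   # prints -rw-r--r--
```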

Changing File Ownership

In order to change the ownership of a file, we use the chown command followed by the name of the owner. We can optionally change the group owner by adding a colon (no spaces) and the name of the group.

We can see what groups we belong to with the groups command. On one system that I have an account on, I am a member of two groups: a group sean (same as my user name on this system), and a group sudo, which signifies that I'm an administrator on this system (more on sudo later in the semester).

groups
sean sudo 

We can only change the user and group ownership of a file or directory if we have administrative privileges (sudo administrative access), or if we share group membership. This means that, unless we have sudo (admin) privileges, we are more likely to change the group name for a file or directory than the user owner. Later in the semester, you will have to do this kind of work (changing the user and group names of files and directories). In the meantime, let's see some examples:

Imagine that my Linux user account belongs to the group sisFaculty, and that there are other users on the Linux system (my colleagues at work) who are also members of this group. If I want to make a directory or file accessible to them, then I can change the group name of a file I own to sisFaculty. Let's call that file testFile.txt. To change only the group name for the file:

chown :sisFaculty testFile.txt

I can generally only change the user owner of a file if I have admin access on a system. In such a case, I might have to use the sudo command (you do not have access to the sudo command on our shared server, but you will have it later on your virtual machines). In this case, I don't need the colon. To change the owner only, say from the user sean to the user tmk:

sudo chown tmk testFile.txt

To change both user owner and group name, we simply specify both names and separate those names by a colon, where the syntax is chown USER:GROUP testFile.txt:

sudo chown tmk:sisFaculty testFile.txt

After using the chown command to change either the owner or group, we should double check the file or directory's permissions and, if needed, update them using the chmod command. Here I make it so that the user owner and the group sisFaculty have (r)ead and (w)rite access to the file. I use sudo because, as the user sean, I'm changing the file permissions for a file that I do not own:

sudo chmod 660 testFile.txt
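The examples above need a sisFaculty group and sudo access, which you may not have yet. As a sketch that should run without either, the following changes the group of a practice file to a group we already belong to, our own primary group. A few things here are assumptions and not part of the demo above: the id -g command prints the numeric ID of our primary group, chown accepts a numeric ID in place of a group name, ls -ln lists numeric IDs, and awk is used only to pick out the fourth field of the output.

```shell
# Work in a temporary practice directory.
demo=$(mktemp -d)
cd "$demo"
touch testFile.txt

# Look up the numeric ID of our primary group.
my_gid=$(id -g)

# Change only the group owner (note the leading colon).
chown ":$my_gid" testFile.txt

# Give the user owner and the group read and write access.
chmod 660 testFile.txt

# Check the result: the fourth field of ls -ln is the numeric group ID.
file_gid=$(ls -ln testFile.txt | awk '{print $4}')
perms=$(ls -l testFile.txt | cut -c1-10)
echo "$file_gid $perms"
```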

Conclusion

In this section, we learned:

  • how to identify file/directory ownership and permissions
  • and how to change file/directory ownership and permissions.

Specifically, we looked at two ways to change the attributes of a file. This includes changing the ownership of a file with the chown command, and setting the read, write, and execute permissions of a file with the chmod command.

The commands we used to change these attributes include:

  • chmod : for changing file permissions (or file mode bits)
  • chown : for changing file ownership

We also used the following commands:

  • ls : list directory contents
    • ls -ld : long list directories themselves, not their contents
  • groups : print the groups a user is in
  • sudo : execute a command as another user

Text Processing: Part 1

One of the more important sets of tools that Linux (as well as other Unix-like) operating systems provide are tools that aid processing and manipulating text. The ability to process and manipulate text, programmatically, is a basic and essential part of many programming languages (e.g., Python, JavaScript, etc.), and learning how to process and manipulate text is an important skill for a variety of jobs including statistics, data analytics, data science, programming, web programming, systems administration, and so forth. In other words, this functionality of Linux (and Unix-like) operating systems essentially means that to learn Linux and the tools that it provides is akin to learning how to program.

Plain text files are the basic building blocks of programs and data. Programs are written in plain text editors, and data is often stored as plain text. Linux offers many tools to examine, manipulate, process, analyze, and visualize data in plain text files.

In this section, we will learn some of the basic tools to examine plain text (i.e., data). We will do some programming later in this class, but for us, the main objective with learning to program aligns with our work as systems administrators. That means our text processing and programming goals will serve our interests in managing users, security, networking, system configuration, and so forth as Linux system administrators.

In the meantime, the goal of this section is to acquaint ourselves with some of the tools that can be used to process text. In this section, we will only cover a handful of text processing programs or utilities, but here is a fairly comprehensive list, and we'll examine some additional ones from this list later in the semester:

  • cat : concatenate files and print on the standard output
  • cut : remove sections from each line of files
  • diff : compare files line by line
  • echo : display a line of text
  • expand : convert tabs to spaces
  • find : search for files in a directory hierarchy
  • fmt : simple optimal text formatter
  • fold : wrap each input line to fit in specified width
  • grep : print lines that match patterns
  • head : output the first part of files
  • join : join lines of two files on a common field
  • look : display lines beginning with a given string
  • nl : number lines of files
  • paste : merge lines of files
  • printf : format and print data
  • shuf : generate random permutations
  • sort : sort lines of text files
  • tail : output the last part of files
  • tr : translate or delete characters
  • unexpand : convert spaces to tabs
  • uniq : report or omit repeated lines
  • wc : print newline, word, and byte counts for each file

We will also discuss two types of operators, the pipe and the redirect. The latter has a version that will write over the contents of a file, and a version that will append contents to the end of a file:

  • | : redirect standard output from command1 to standard input for command2
  • > : redirect standard output to a file, overwriting
  • >> : redirect standard output to a file, appending
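Here is a minimal sketch of all three operators working together. The scratch directory and the file name demo.txt are made up for illustration:

```shell
# Work in a scratch directory so nothing important is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# > creates demo.txt (or overwrites it if it already exists).
echo "first line" > demo.txt

# >> appends to the end of demo.txt instead of overwriting it.
echo "second line" >> demo.txt

# | sends the output of cat to wc, which counts the lines.
lines=$(cat demo.txt | wc -l)
echo "$lines"

# A second > replaces the previous contents entirely.
echo "only line" > demo.txt
cat demo.txt
```

After the last command, demo.txt holds a single line because the final > overwrote the two lines written earlier.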

Today I want to cover a few of the above commands for processing data in a file; specifically:

  • cat : concatenate files and print on the standard output
  • cut : remove sections from each line of files
  • head : output the first part of files
  • sort : sort lines of text files
  • tail : output the last part of files
  • uniq : report or omit repeated lines
  • wc : print newline, word, and byte counts for each file

Let's look at a toy, sample file that contains structured data as a CSV (comma separated value) file. The file contains a list of operating systems (column one), their software license (column two), and the year they were released (column three). We can use the cat command to view the entire contents of this small file:

Command:

cat operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

It's a small file, but we might want the line and word count of the file. To acquire that, we can use the wc (word count) command. By itself, the wc command will print the number of lines, words, and bytes of a file. The following output states that the file contains seven lines, 23 words, and 165 bytes:

Command:

wc operating-systems.csv

Output:

  7  23 165 operating-systems.csv
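If we only need one of these counts, wc can report each one on its own. The following is a small sketch that assumes a scratch copy of the sample data in a temporary directory (the tmp variable and the path are illustrative). Reading the file through < keeps the file name out of the output:

```shell
# Recreate the sample data in a scratch directory (illustrative path).
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# -l counts lines, -w counts words, and -c counts bytes.
line_count=$(wc -l < "$tmp/operating-systems.csv")
word_count=$(wc -w < "$tmp/operating-systems.csv")
byte_count=$(wc -c < "$tmp/operating-systems.csv")

echo "lines: $line_count, words: $word_count, bytes: $byte_count"
```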

We can use the head command to output, by default, the first ten lines of a file. Since our file is only seven lines long, we can use the -n option to change the default number of lines:

Command:

head -n3 operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991

Using the cut command, we can select data from a file. In the first example, I want to select column two (or field two), which contains the license information. Since this is a CSV file, the fields (aka columns) are separated by commas. The -d option tells the cut command to use commas as the separator character. The -f option tells the cut command to select field two. (A CSV file may use other characters as the separator character, like the Tab character or a colon.)

Command:

cut -d"," -f2 operating-systems.csv

Output:

 Proprietary
 BSD
 GPL
 Proprietary
 Proprietary
 Proprietary
 Apache

From there it's trivial to select a different column. In the next example, I select field (or column) three to get the release year:

Command:

cut -d"," -f3 operating-systems.csv

Output:

 2009
 1993
 1991
 2007
 2001
 1993
 2008
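The leading space on each line of the output above is part of the data: the fields in our file are separated by a comma followed by a space, and cut splits only on the comma. As a rough sketch (the scratch file and the variable names are my own), cut can also select several fields at once, and the tr command from the list at the start of this section can delete the stray spaces:

```shell
# Scratch copy of the first three rows of the sample data.
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
EOF

# Select fields one and three; cut keeps the comma between them.
names_and_years=$(cut -d"," -f1,3 "$tmp/operating-systems.csv")
echo "$names_and_years"

# Select field two, then delete every space character with tr.
licenses=$(cut -d"," -f2 "$tmp/operating-systems.csv" | tr -d ' ')
echo "$licenses"
```

Be aware that tr -d ' ' removes every space, including spaces inside a field, so it only suits single-word values like these license names.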

One of the magical aspects of the Linux (and Unix) command line is the ability to pipe and redirect output from one program to another program, and then to a file. By stringing together multiple programs with these operators, we can create small programs that do much more than the simple programs that compose them. In this next example, I use the pipe operator to send the output of the cut command to the sort command, which sorts the lines (by default in lexical order, which for these four-digit years is the same as numerical order). I then pipe that output to the uniq command, which removes duplicate rows, and finally redirect the output to a new file titled os-years.csv. Since the year 1993 appears twice in the original file, it only appears once in the output because the uniq command removed the duplicate:

Command:

cut -d"," -f3 operating-systems.csv | sort | uniq > os-years.csv
cat os-years.csv

Output:

 1991
 1993
 2001
 2007
 2008
 2009
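The same kind of pipeline can count values instead of only de-duplicating them. The sketch below assumes a scratch copy of the sample data (the path and the variable name are illustrative) and uses the -c option of uniq to prefix each line with the number of times it occurs, followed by sort -rn to put the largest count first:

```shell
# Recreate the sample data in a scratch directory (illustrative path).
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# Count how many times each license appears, most frequent first.
license_counts=$(cut -d"," -f2 "$tmp/operating-systems.csv" | sort | uniq -c | sort -rn)
echo "$license_counts"
```

The first line of the result reports that Proprietary appears four times; each of the other three licenses appears once.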

Data files like this often have a header line at the top row that names the data columns. It's useful to know how to work with such files, so let's add a header row to the top of the file. In this example, I'll use the sed command, which we will learn more about in the next lesson. For now, we use sed with the option -i to edit the file, then 1i instructs sed to insert text at line 1. \OS, License, Year is the text that we want inserted at line 1. We wrap the argument within single quotes:

Command:

sed -i '1i \OS, License, Year' operating-systems.csv
cat operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Since the CSV file now has a header line, we want to remove it from the output. Say we want the license field data, but we need to remove that first line. In this case, we need the tail command:

Command:

tail -n +2 operating-systems.csv | cut -d"," -f2 | sort | uniq > license-data.csv
cat license-data.csv

Output:

 Apache
 BSD
 GPL
 Proprietary

The tail command generally outputs the last lines of a file, but the -n +2 option is special. It makes the tail command output a file starting at the second line. We could specify a different number in order to start output at a different line. See man tail for more information.
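To make the difference between the two forms of -n concrete, here is a small sketch on a three-line scratch file (the file name sample.csv is made up for this example):

```shell
# A tiny scratch file: one header line and two data lines.
tmp=$(mktemp -d)
printf 'OS, License, Year\nLinux, GPL, 1991\nFreeBSD, BSD, 1993\n' > "$tmp/sample.csv"

# -n +2 starts output AT line two, which skips the header.
without_header=$(tail -n +2 "$tmp/sample.csv")
echo "$without_header"

# -n 1 (no plus sign) prints only the LAST line.
last_line=$(tail -n 1 "$tmp/sample.csv")
echo "$last_line"
```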

Conclusion

In this lesson, we learned how to process and make sense of data held in a text file. We used some commands that let us select, sort, de-duplicate, redirect, and view data in different ways. Our data file was a small one, but these are powerful and useful commands and operators that would easily make sense of large amounts of data in a file.

The commands we used in this lesson include:

  • cat : concatenate files and print on the standard output
  • cut : remove sections from each line of files
  • head : output the first part of files
  • sort : sort lines of text files
  • tail : output the last part of files
  • uniq : report or omit repeated lines
  • wc : print newline, word, and byte counts for each file

We also used two types of operators, the pipe and the redirect:

  • | : redirect standard output of command1 to standard input of command2
  • > : redirect standard output to a file, overwriting

Text Processing: Part 2

Introduction

In the last section, we covered the cat, cut, head, sort, tail, uniq, and wc utilities.

We also learned about the | pipe operator, which we use to redirect standard output from one command to a second command so that second command can process the output from the first command. An example is:

sort file.txt | uniq

This sorts the lines in a file named file.txt and then prints to standard output only the unique lines (by the way, because uniq only compares adjacent lines, input must be sorted before it is piped to uniq).
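A quick sketch shows why the sort step matters. The file years.txt below is a made-up example in which the duplicate lines are not next to each other:

```shell
# The two 1993 lines are separated by 1991.
tmp=$(mktemp -d)
printf '1993\n1991\n1993\n' > "$tmp/years.txt"

# uniq alone compares neighboring lines only, so nothing is removed.
unsorted_result=$(uniq "$tmp/years.txt")
echo "$unsorted_result"

# Sorting first places the duplicates side by side, so uniq can drop one.
sorted_result=$(sort "$tmp/years.txt" | uniq)
echo "$sorted_result"
```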

We learned about the > and >> redirect operators. They work like the pipe operator, but instead of directing output to a new command for the new command to process, they direct output to a file for saving. As a reminder, the single redirect > overwrites a file or creates a file if it does not exist. The double redirect >> appends to a file or creates a file if it does not exist. It's safer to use the double redirect, but if you are processing large amounts of data, it could also mean creating large files really quickly. If that gets out of hand, then you might crash your system. To build on our prior example, we can add >> to send the output to a new file called output.txt:

sort file.txt | uniq >> output.txt

We have available more powerful utilities and programs to process, manipulate, and analyze text files. In this section, we will cover the following three of these:

  • grep : print lines that match patterns
  • sed : stream editor for filtering and transforming text
  • awk : pattern scanning and text processing language

Grep

The grep command is one of my most often used commands. Basically, grep "prints lines that match patterns" (see man grep). In other words, it's search, and it's super powerful.

grep works line by line. So when we use it to search a file for a string of text, it will return the whole line that matches the string. This line by line idea is part of the history of Unix-like operating systems, and it's super important to remember that most utilities and programs that we use on the commandline are line oriented.

"A string is any series of characters that are interpreted literally by a script. For example, 'hello world' and 'LKJH019283' are both examples of strings." -- Computer Hope. More generally, it's the literal characters that we type. It's data.

Let's consider the file operating-systems.csv, as seen below:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

If we want to search for the string Chrome, we can use grep. Notice that even though the string Chrome only appears once, and in one part of a line, grep returns the entire line.

Command:

grep "Chrome" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009

Be aware that, by default, grep is case-sensitive, which means a search for the string chrome, with a lower case c, would return no results. Fortunately, grep has an -i option, which means to ignore the case of the search string. In the following examples, grep returns nothing in the first search since we do not capitalize the string chrome. However, adding the -i option results in success:

Command:

grep "chrome" operating-systems.csv

Output:

None.

Command:

grep -i "chrome" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009

We can also search for lines that do not match our string using the -v option. We can combine that with the -i option to ignore the string's case. Therefore, in the following example, all lines that do not contain the string chrome are returned:

Command:

grep -vi "chrome" operating-systems.csv

Output:

FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

I used the tail command in the prior section to show how we might remove the header (first) line of a file, but it's an odd use of the tail command, which normally just returns the last lines of a file. Instead, we can use grep to remove the first line. To do so, we use what's called a regular expression, or regex for short. Regex is a method used to identify patterns in text via abstractions. Regular expressions can get complicated, but we can use some easy regex methods.

Let's use a version of the above file with the header line:

Command:

cat operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

To use grep to remove the first line of a file, we can invert our search to select all lines not matching "OS" at the start of a line. Here the caret ^ is a regex indicating the start of a line. Again, this grep command returns all lines that do not match the string os at the start of a line, ignoring case:

Command:

grep -vi "^os" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Alternatively, since we know that the string Year comes at the end of the first line, we can use grep to invert search for that. Here the dollar sign key $ is a regex indicating the end of a line. Like the above, this grep command returns all lines that do not match the string year at the end of a line, ignoring case:

Command:

grep -vi "year$" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
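The anchors are just as useful without the -v option. As a brief sketch on a shortened scratch copy of the file (the path and the variable names are illustrative), we can select the rows whose year is 1993 or the row that begins with Linux:

```shell
# Shortened scratch copy of the sample data, with the header line.
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
OS, License, Year
FreeBSD, BSD, 1993
Linux, GPL, 1991
Windows NT, Proprietary, 1993
EOF

# $ anchors the match to the end of the line.
from_1993=$(grep "1993$" "$tmp/operating-systems.csv")
echo "$from_1993"

# ^ anchors the match to the start of the line.
starts_with_linux=$(grep "^Linux" "$tmp/operating-systems.csv")
echo "$starts_with_linux"
```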

The man grep page lists other options, but a couple of other good ones include:

Get a count of the matching lines with the -c option:

Command:

grep -ic "proprietary" operating-systems.csv

Output:

4

Print only the match and not the whole line with the -o option:

Command:

grep -io "proprietary" operating-systems.csv

Output:

Proprietary
Proprietary
Proprietary
Proprietary

We can simulate a Boolean OR search and print lines matching one or both strings using the -E option. We separate the strings with a vertical bar |. This is similar to a Boolean OR search: a line is returned if it matches at least one of the strings.

Here is an example where only one string returns a true value:

Command:

grep -Ei "bsd|atari" operating-systems.csv

Output:

FreeBSD, BSD, 1993

Here's an example where both strings evaluate to true:

Command:

grep -Ei "bsd|gpl" operating-systems.csv

Output:

FreeBSD, BSD, 1993
Linux, GPL, 1991
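An alternative way to express the same OR search is to repeat the -e option, once per pattern. The sketch below, on a shortened scratch copy of the file (illustrative path and variable names), runs both forms so that we can compare them:

```shell
# Shortened scratch copy of the sample data.
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
Android, Apache, 2008
EOF

# One extended regular expression with the | alternation operator.
with_bar=$(grep -Ei "bsd|gpl" "$tmp/operating-systems.csv")

# Two separate patterns, each introduced by -e.
with_e=$(grep -i -e "bsd" -e "gpl" "$tmp/operating-systems.csv")

echo "$with_bar"
echo "$with_e"
```

Both commands return the FreeBSD and Linux lines.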

By default, grep will return results where the string appears within a larger word, like OS in macOS.

Command:

grep -i "os" operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
iOS, Proprietary, 2007
macOS, Proprietary, 2001

However, we might want to limit results so that we only return results where OS is a complete word. To do that, we can surround the string with special characters:

Command:

grep -i "\<os\>" operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009

Sometimes we want the context for a result; that is, we might want to print lines that surround our matches. For example, print the matching line plus the two lines after the matching line using the -A NUM option:

Command:

grep -i "chrome" -A 2 operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991

Or, print the matching line plus the two lines before the matching line using the -B NUM option:

Command:

grep -i "android" -B 2 operating-systems.csv

Output:

macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

We can combine many of the variations. Here I search for the whole word BSD, case insensitive, and print the line before and the line after the match:

Command:

grep -i -A 1 -B 1 "\<bsd\>" operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991

grep is very powerful, and there are more options listed in its man page.

Note that I enclose my search strings in double quotes. For example: grep "search string" filename.txt. It's not always required to enclose a search string in double quotes, but it's good practice because if your string contains more than one word or empty spaces and is not quoted, the search will fail.

Sed

sed is a type of non-interactive text editor that filters and transforms text (man sed). By default, sed writes its results to standard output, and edits can be redirected (> or >>) to new files or made in-place using the -i option.

Like the other utilities and programs we've covered, including grep, sed works line by line. But unlike grep, sed provides a way to address specific lines or ranges of lines, and then run filters or transformations on those lines. Once the lines in a text file have been identified or addressed, sed offers a number of commands to filter or transform the text at those specific lines.

This concept of the line address is important, but not all text files are explicitly numbered by each line. Below I use the nl command to number lines in our file, even though the contents of the file do not actually display line numbers:

Command:

nl operating-systems.csv

Output:

     1	OS, License, Year
     2	Chrome OS, Proprietary, 2009
     3	FreeBSD, BSD, 1993
     4	Linux, GPL, 1991
     5	iOS, Proprietary, 2007
     6	macOS, Proprietary, 2001
     7	Windows NT, Proprietary, 1993
     8	Android, Apache, 2008

After we've identified the lines in a file that we want to edit, sed offers commands to filter, transform, or edit the text at the line addresses. Some of these commands include:

  • a : appending text
  • c : replace text
  • d : delete text
  • i : inserting text
  • p : print text
  • r : append text from file
  • s : substitute text
  • = : print the current line number

Let's see how to use sed to print line numbers instead of using the nl command. To do so, we use the equal sign = to identify line numbers (although note that it places the line numbers just above each line):

Command:

sed '=' operating-systems.csv

Output:

1
OS, License, Year
2
Chrome OS, Proprietary, 2009
3
FreeBSD, BSD, 1993
4
Linux, GPL, 1991
5
iOS, Proprietary, 2007
6
macOS, Proprietary, 2001
7
Windows NT, Proprietary, 1993
8
Android, Apache, 2008

In the last section, we used the tail command to remove the header line of our file, and above, we used grep to accomplish this task. It's much easier to use sed to remove the header line of the operating-systems.csv. We simply specify the line number (1) and then use the delete command (d). Thus, we delete line 1:

Command:

sed '1d' operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Note that I wrap the sed command in single quotes. This keeps the shell from interpreting any special characters in the command before sed sees them.

If I wanted to make that a permanent deletion, then I would use the -i option, which means that I would edit the file in-place (see man sed):

Command:

sed -i '1d' operating-systems.csv

To refer to line ranges, I add a comma between addresses. Therefore, to delete lines 1, 2, and 3:

Command:

sed '1,3d' operating-systems.csv

Output:

Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
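Line addresses work with the other sed commands, too. In this sketch, which uses a shortened scratch copy of the file (illustrative path and variable names), the -n option suppresses sed's default printing so that the p command prints only the addressed lines, and the special address $ refers to the last line:

```shell
# Shortened scratch copy of the sample data, with the header line.
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
EOF

# Print only lines 2 through 3.
rows_2_to_3=$(sed -n '2,3p' "$tmp/operating-systems.csv")
echo "$rows_2_to_3"

# Delete the last line, whatever its number is.
without_last=$(sed '$d' "$tmp/operating-systems.csv")
echo "$without_last"
```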

I can use sed to find and replace strings. The syntax for this is:

sed 's/regexp/replacement/' filename.txt

The regexp part of the above command can take regular expressions, but simple strings like words work here, too, since they are treated as regular expressions themselves.

In the next example, I use sed to search for the string "Linux", and replace it with the string "GNU/Linux":

Command:

sed 's/Linux/GNU\/Linux/' operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
GNU/Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Because the string GNU/Linux contains a forward slash, and because sed uses the forward slash as a separator, note that I escaped the forward slash with a back slash. This escape tells sed to interpret the forward slash in GNU/Linux literally and not as a special sed character.
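Another way to handle a forward slash in the replacement is to pick a different separator for the s command; sed accepts other characters, such as the vertical bar, in place of the forward slash. A minimal sketch on a one-line scratch file (the file name is made up):

```shell
# One-line scratch file.
tmp=$(mktemp -d)
printf 'Linux, GPL, 1991\n' > "$tmp/linux.csv"

# With | as the separator, the / in GNU/Linux needs no escaping.
renamed=$(sed 's|Linux|GNU/Linux|' "$tmp/linux.csv")
echo "$renamed"
```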

If we want to add new rows to the file, we can append a or insert i text after or at specific lines:

To append text after line 3, use a:

Command:

sed '3a FreeDOS, GPL, 1998' operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
FreeDOS, GPL, 1998
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

To insert at line 3, use i:

Command:

sed '3i CP\/M, Proprietary, 1974' operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
CP/M, Proprietary, 1974
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

Note that the FreeDOS line doesn't appear in the last output. This is because I didn't use the -i option nor did I redirect output to a new file. If we want to edit the file in-place, that is, save the edits, then the commands would look like so:

sed -i '3a FreeDOS, GPL, 1998' operating-systems.csv
sed -i '3i CP\/M, Proprietary, 1974' operating-systems.csv

Instead of using line numbers to specify addresses in a text file, we can use regular expressions as addresses, which may be simple words. In the following example, I use the regular expression 1991$ instead of specifying line 4. The regular expression 1991$ means "lines ending with the string 1991". Then I use the s command to start a find and replace. sed finds the string Linux and then replaces that with the string GNU/Linux. I use the back slash to escape the forward slash in GNU/Linux:

Command:

sed '/1991$/s/Linux/GNU\/Linux/' operating-systems.csv

Output:

OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
GNU/Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

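Here's an example using sed to simply search for a pattern. In this example, I'm interested in searching for all operating systems that were released in or after 2000. In our file, every such year contains the string 20 and no other line does, so the simple pattern 20 is enough here: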

Command:

sed -n '/20/p' operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Android, Apache, 2008

The above would be equivalent to grep "20" operating-systems.csv.
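The pattern 20 matches that string anywhere on a line, so it could return false positives on other data. As a sketch, the scratch file below includes one row that I made up purely to trigger such a false positive (Demo20 is not a real operating system), and a stricter pattern anchors the match to a four-digit year at the end of the line:

```shell
# Scratch data; the Demo20 row is fictional and only illustrates the problem.
tmp=$(mktemp -d)
cat > "$tmp/sample.csv" <<'EOF'
Linux, GPL, 1991
macOS, Proprietary, 2001
Demo20, Fictional, 1985
EOF

# Loose pattern: matches 20 anywhere, including inside the name Demo20.
loose=$(sed -n '/20/p' "$tmp/sample.csv")
echo "$loose"

# Strict pattern: a year from 2000 to 2099 in the last field only.
strict=$(sed -n '/, 20[0-9][0-9]$/p' "$tmp/sample.csv")
echo "$strict"
```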

sed is much more powerful than what I've demonstrated here, and if you're interested in learning more, there are lots of tutorials on the web. Here are a few good ones:

Awk

awk is a complete scripting language designed for "pattern scanning and processing" text. It generally performs some action when it detects some pattern and is particularly suited for columns of structured data (see man awk).

awk works on columns regardless if the contents include structured data (like a CSV file) or not (like a letter or essay). If the data is structured, then that means the data will be formatted in some way. In the last few sections, we have looked at a CSV file. This is structured data because the data points in this file are separated by commas.

For awk to work with columns in a file, it needs some way to refer to those columns. In the examples below, we'll see that columns in a text file are referred to by a dollar sign and then the number of the column $n. So, $1 indicates column one, $2 indicates column two, and so on. If we use $0, then we refer to the entire line. In our example text file, $1 indicates the OS Name column, $2 indicates the License column, $3 indicates the release Year column, and $0 indicates the whole line, that is, all columns.

The syntax for awk is a little different than what we've seen so far. Basically, awk uses the following syntax, where pattern is optional.

awk pattern { action statements }

Let's see some examples.

To print the first column of our file, we do not need the pattern part of the command but only need to state an action statement (within curly braces). In the command below, the action statement is '{ print $1 }'.

Command:

awk '{ print $1 }' operating-systems.csv

Output:

OS,
Chrome
FreeBSD,
Linux,
iOS,
macOS,
Windows
Android,

By default, awk treats whitespace (spaces and tabs) as the field delimiter. That's why in the command above only the terms Windows and Chrome appear in the results even though they should be Windows NT and Chrome OS. It's also why we see commas in the output. To fix this, we tell awk to use a comma as the field separator, instead of the default whitespace. To specify that we want awk to treat the comma as a field delimiter, we use the -F option, and we surround the comma with single quotes:

Command:

awk -F',' '{ print $1 }' operating-systems.csv

Output:

OS
Chrome OS
FreeBSD
Linux
iOS
macOS
Windows NT
Android

By specifying the comma as the field separator, our results are more accurate, and the commas no longer appear either.

Like grep and sed, awk can do search. In this next example, I print the first column of each line that contains the string Linux. Here I am using the pattern part of the command: '/Linux/'.

Command:

awk -F',' '/Linux/ { print $1 }' operating-systems.csv

Output:

Linux

Note how awk does not return the whole line but only the column we asked it to print.

With awk, we can retrieve more than one column, and we can use awk to generate reports, which was part of the original motivation to create this language.

In the next example, I select columns two and one in that order, which is something the cut command cannot do. I also add a space between the columns using the double quotes to surround an empty space, and I modified the field delimiter to include both a comma and a space to get the output that I want:

Command:

awk -F', ' '{ print $2 " " $1 }' operating-systems.csv

Output:

License OS
Proprietary Chrome OS
BSD FreeBSD
GPL Linux
Proprietary iOS
Proprietary macOS
Proprietary Windows NT
Apache Android

I can make output more readable by adding text to print:

Command:

awk -F',' '{ print $1 " was released in" $3 "." }' operating-systems.csv

Output:

OS was released in Year.
Chrome OS was released in 2009.
FreeBSD was released in 1993.
Linux was released in 1991.
iOS was released in 2007.
macOS was released in 2001.
Windows NT was released in 1993.
Android was released in 2008.

Since awk is a full-fledged programming language, it understands data structures, which means it can do math or work on strings of text. Let's illustrate this by doing some math or logic on column 3.

Here I print all of column three:

Command:

awk -F',' '{ print $3 }' operating-systems.csv

Output:

 Year
 2009
 1993
 1991
 2007
 2001
 1993
 2008

Next I print only the parts of column three that are greater than 2005, and then pipe | the output through the sort command to sort the numbers in numeric order:

Command:

awk -F',' '$3 > 2005 { print $3 }' operating-systems.csv | sort

Output:

 2007
 2008
 2009

If I want to print only the parts of column one where column three equals 2007, then I would run this command:

Command:

awk -F',' '$3 == 2007 { print $1 }' operating-systems.csv

Output:

iOS

If I want to print only the parts of columns one and three where column 3 equals 2007:

Command:

awk -F',' '$3 == 2007 { print $1 $3 }' operating-systems.csv

Output:

iOS 2007

Or, print the entire line where column three equals 2007:

Command:

awk -F',' '$3 == 2007 { print $0 }' operating-systems.csv

Output:

iOS, Proprietary, 2007

I can print only those lines where column three is greater than 2000 and less than 2008:

Command:

awk -F',' '$3 > 2000 && $3 < 2008 { print $0 }' operating-systems.csv

Output:

iOS, Proprietary, 2007
macOS, Proprietary, 2001

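Even though we wouldn't normally sum years, let's add up column three to demonstrate how summing works in awk. Here the expression sum += $3 serves as the pattern, so awk prints the running total after each row. The header row prints nothing because its third field is not a number and leaves the total at zero, which awk treats as false: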

Command:

awk -F',' 'sum += $3 { print sum }' operating-systems.csv

Output:

2009
4002
5993
8000
10001
11994
14002
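If we only want the final total, one common approach is to do the addition in the action and print the result in an END block, which awk runs after it has read the last line. In this sketch (the scratch path and the variable name are illustrative), the pattern NR > 1 also skips the header line explicitly, since NR holds the current line number:

```shell
# Recreate the sample data, with the header line, in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/operating-systems.csv" <<'EOF'
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
EOF

# Add column three for every line after the header; print once at the end.
total=$(awk -F',' 'NR > 1 { sum += $3 } END { print sum }' "$tmp/operating-systems.csv")
echo "$total"
```

This prints 14002, the last value of the running total shown above.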

Here are a few basic string operations. First, print column one in upper case:

Command:

awk -F',' '{ print toupper($1) }' operating-systems.csv

Output:

OS
CHROME OS
FREEBSD
LINUX
IOS
MACOS
WINDOWS NT
ANDROID

Or print column one in lower case:

Command:

awk -F',' '{ print tolower($1) }' operating-systems.csv

Output:

os
chrome os
freebsd
linux
ios
macos
windows nt
android

Or, get the length of each string in column one:

Command:

awk -F',' '{ print length($1) }' operating-systems.csv

Output:

2
9
7
5
3
5
10
7

We can add additional logic. The double ampersands && indicate a Boolean/Logical AND. The exclamation point ! indicates a Boolean/Logical NOT. In the next example, I print only those lines where column three is greater than 1990, and the line has the string "BSD" in it:

Command:

awk -F',' '$3 > 1990 && /BSD/ { print $0 }' operating-systems.csv

Output:

FreeBSD, BSD, 1993

Now I reverse that, and print only those lines where column three is greater than 1990 and the line DOES NOT have the string "BSD" in it:

Command:

awk -F',' '$3 > 1990 && !/BSD/ { print $0 }' operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

The double vertical bar || indicates a Boolean/Logical OR. The next command prints only those lines that contain the string "Proprietary" or the string "Apache", or it would print both if both strings were in the text:

Command:

awk -F',' '/Proprietary/ || /Apache/ { print $0 }' operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

I can take advantage of regular expressions. If the file that I was looking at was large, and if I wasn't sure that some fields would be upper or lower case, then I could use regular expressions to consider both possibilities. That is, by adding [pP] and [aA], awk will check for both the words Proprietary and proprietary, and Apache and apache.

Command:

awk -F',' '/[pP]roprietary/ || /[aA]pache/ { print $0 }' operating-systems.csv

Output:

Chrome OS, Proprietary, 2009
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
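Bracket expressions only cover the first letter. When the case of a whole field is unpredictable, another option is to lower-case the field with the tolower function shown earlier and compare the result against a lower-case pattern using the ~ (match) operator. In the sketch below, I deliberately altered the case of the license values in a scratch file to show the effect:

```shell
# Scratch data with the license case deliberately changed for this example.
tmp=$(mktemp -d)
cat > "$tmp/mixed-case.csv" <<'EOF'
Linux, GPL, 1991
iOS, proprietary, 2007
Android, APACHE, 2008
EOF

# Lower-case column two, then test it against either word.
matches=$(awk -F',' 'tolower($2) ~ /proprietary|apache/ { print $1 }' "$tmp/mixed-case.csv")
echo "$matches"
```

The command prints iOS and Android, regardless of how their licenses are capitalized.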

awk is a full-fledged programming language. It provides conditionals, control structures, variables, etc., and so I've only scratched the surface. If you're interested in learning more, then check out some of these tutorials:

Conclusion

The Linux (and other Unix-like OSes) command line offers a lot of utilities to examine data. Prior to this lesson, we covered a few of them that help us get parts of a file and then pipe those parts through other commands or redirect output to files. We can use pipes and redirects with grep, sed, and awk. If needed, we may be able to avoid using the basic utilities like cut, wc, etc., if we want to learn more powerful programs like grep, sed, and awk.

It's fun to learn and practice these. Despite this, you do not have to become a sed or an awk programmer. Like the utilities that we've discussed in prior lectures, the power of programs like these is that they're on hand and easy to use as one-liners. If you want to get started, the resources listed above can guide you.

Review

Here is a review of commands and concepts that we have covered so far.

Commands

We have covered the following commands so far:

Command  Example                     Explanation
tree     tree -dfL 1                 list directories, full path, one level
cd       cd ~                        change to home directory
         cd /                        change to root directory
         cd bin                      change to bin directory from current directory
pwd      pwd                         print working / current directory
ls       ls ~                        list home directory contents
         ls -al                      list long format and hidden files in current directory
         ls -dl                      list the current directory itself in long format
man      man ls                      open manual page for the ls command
         man man                     open manual page for the man command
cp       cp * bin/                   copy all files in current directory to bin subdir
mv       mv oldname newname          rename file oldname to newname
         mv olddir bin/newdir        move olddir to bin subdir and rename to newdir
rm       rm oldfile                  delete file named oldfile
         rm -r olddir                delete directory olddir and its contents
touch    touch newfile               create a file called newfile
         touch oldfile               modify timestamp of file called oldfile
mkdir    mkdir newdir                create a new directory called newdir
rmdir    rmdir newdir                delete directory called newdir if empty
echo     echo "hello"                print "hello" to screen
cat      cat data.csv                print contents of file called data.csv to screen
         cat data1.csv data2.csv     concatenate data1.csv and data2.csv to screen
less     less file                   view contents of file called file
sudo     sudo command                run command as superuser
chown    sudo chown root:root file   change owner and group of file to root
chmod    chmod 640 file              change permissions of file to -rw-r-----
         chmod 775 somedir           change permissions of somedir to drwxrwxr-x
groups   groups user                 print the groups the user is in
wc       wc -l file                  print number of lines of file
         wc -w file                  print number of words of file
head     head file                   print top ten lines of file
         head -n3 file               print top three lines of file
tail     tail file                   print bottom ten lines of file
         tail -n3 file               print bottom three lines of file
cut      cut -d"," -f2 data.csv      print second column of file data.csv
sort     sort -n file                sort file by numerical order
         sort -rn file               sort file by reverse numerical order
         sort -df file               sort file by dictionary order and ignore case
uniq     uniq file                   report or omit repeated lines in sorted file
         uniq -c file                report count of duplicate lines in sorted file

In addition to the above commands, we also have pipelines using the |. Pipelines send the standard output of one command to a second command (or more). The following command sorts the contents of a file and then sends the output to the uniq command to remove duplicates:

sort file | uniq

Redirection uses the > or the >> to redirect output of a command to a file. A single > will overwrite the contents of a file. A double >> will append to the contents of a file.

Redirect the output of the ls command to a file called dirlist:

ls > dirlist

Append the date to the end of the file dirlist:

date >> dirlist
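
As a minimal sketch of how pipelines and redirection combine, the following works in a temporary directory so that nothing in $HOME is touched. The file names fruits.txt and report.txt, and the data in them, are made up for illustration:

```bash
# Work in a scratch directory created just for this example.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Create a small sample file (hypothetical data).
printf 'pear\napple\npear\nfig\napple\npear\n' > fruits.txt

# Pipeline plus redirection: sort, drop duplicates, and save the result.
sort fruits.txt | uniq > report.txt       # > overwrites report.txt
unique_count="$(wc -l < report.txt)"

# Append the output of a second command to the same file.
echo "done" >> report.txt                 # >> appends to report.txt
final_count="$(wc -l < report.txt)"

cat report.txt
```

Running the sort line a second time would replace the contents of report.txt, while running the echo line a second time would add another line to it.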

Paths

I introduced the concept of absolute and relative paths in section 2.3. In this session, the goal is to revisit paths (locations of files and directories in the filesystem), and provide some examples. This will be important as we proceed to Bash scripting and other tasks going forward.

Change Directories

The cd command is used to change directories. When we log in to our systems, we will find ourselves in our $HOME directory, which is located at /home/USER.

To change to the root directory, type:

pwd
/home/sean
cd /
pwd
/

From there, to change to the /bin directory:

cd bin
pwd
/bin

To change to the previous working directory:

cd -
pwd
/

To go home quickly, just enter cd by itself:

cd
pwd
/home/sean

To change to the public_html directory:

cd public_html
pwd
/home/sean/public_html

To change to the directory one level up:

cd ..
pwd
/home/sean

Make Directories

Sometimes we'll want to create new directories. To do so, we use the mkdir command.

To make a new directory in our $HOME directory:

pwd
/home/sean
mkdir documents
cd documents
pwd
/home/sean/documents
cd
pwd
/home/sean

To make more than one directory at the same time, where the second or additional directories are nested, use the -p option:

mkdir -p photos/2022
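
To see what the -p option changes, here is a small sketch that runs in a temporary directory. The variable names plain_result and repeat_result are only there to record what happened:

```bash
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# Without -p, mkdir fails because the parent directory photos/ does not exist yet.
if mkdir photos/2022 2>/dev/null; then
  plain_result="created"
else
  plain_result="failed"
fi

# With -p, mkdir creates photos/ first and then photos/2022.
mkdir -p photos/2022

# Running it again is not an error: -p also accepts directories that already exist.
mkdir -p photos/2022 && repeat_result="ok"

echo "plain mkdir: ${plain_result}; mkdir -p again: ${repeat_result}"
```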

Remove or Delete Files and Directories

To remove a file, we use the rm command. If the file is in a subdirectory, specify the relative path:

pwd
/home/sean
rm public_html/index.html

To remove a file in a directory one level up, use the .. notation. For example, if I'm in my documents directory, and I want to delete a file in my home (parent) directory:

cd documents
pwd
/home/sean/documents
rm ../file.txt

Alternatively, I could use the tilde as shorthand for $HOME:

rm ~/file.txt

To remove a file nested in multiple subdirectories, just specify the path (absolute or relative).

rm photos/2022/05/22/IMG_2022_05_22.jpg

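Remember that the rm command deletes files permanently (there is no trash can on the command line), and with the -r option it deletes directories and their contents. Use it with caution, or with the -i option, which asks for confirmation before each removal.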
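
The following sketch, run in a temporary directory with made-up file names, shows how rm and rmdir treat files and directories differently:

```bash
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
mkdir -p photos/2022
touch photos/2022/image.jpg notes.txt

rm notes.txt                       # removes a regular file

# rm by itself refuses to remove a directory ...
rm photos 2>/dev/null || rm_plain="refused"
# ... and rmdir refuses because the directory is not empty.
rmdir photos 2>/dev/null || rmdir_result="refused"

# rm -r removes the directory and everything below it.
rm -r photos

ls -A        # prints nothing: the scratch directory is empty again
```

Because rm -r removes a whole tree in one step, it is worth running pwd and ls first to confirm where you are and what you are about to delete.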

Copy Files or Directories

Let's say I want to copy a file in my $HOME directory to a nested directory:

cp file.txt documents/ICT418/homework/

Or, we can copy a file from one subdirectory to another. Here I copy a file in my ~/bin directory to my ~/documents directory. The ~ (tilde) is shorthand for my $HOME directory.

cp ~/bin/file.txt ~/documents/
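
The examples above copy single files. To copy a directory, cp needs the -r (recursive) option. Here is a sketch in a temporary directory, where the backup directory is a hypothetical destination:

```bash
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
mkdir -p documents/ICT418 backup
touch documents/ICT418/homework.txt

# cp by itself skips directories and reports an error ...
cp documents backup/ 2>/dev/null || plain_cp="skipped"

# ... so use -r to copy the directory together with its contents.
cp -r documents backup/

ls -R backup
```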

Move or Rename Files or Directories

Let's say I downloaded a file to my ~/Downloads directory, and I want to move it to my ~/documents directory:

mv ~/Downloads/article.pdf ~/documents/

Or, let's say we rename it in the process:

mv ~/Downloads/article.pdf ~/documents/article-2022.pdf

We can also move directories. Since the commandline is case-sensitive, let's say I rename the documents directory to Documents:

mv ~/documents ~/Documents

Conclusion

Use this page as a reference to the commands that we have covered so far.

Scripting the Command Line

We have learned some of the many commands available on the Linux command line as well as how to navigate around the filesystem. Now we can begin to learn how to use command line text editors in order to write Bash scripts.

Text editors

Working on the command line means writing a lot of commands. But there will be times when we want to save some of the commands that we write in order to re-use them later, or we might want to develop the commands into a script (i.e., program) because we might want to automate a process. The shell is great for writing one-off commands, so-called one-liners, but it's not a great place to write multi-line or very long commands. Therefore it can be helpful to write and save our commands in a text editor. In this lesson, we'll learn about three text editors: ed, vim, and nano. Of these, I'll encourage you to use nano, but I want you to know something about ed and vim because ed, even if not often used, is historically important to the Unix and Linux ecosystem. (I use ed almost daily.) Vim, which is my everyday editor, is important, widely used, and under active development to this day. If you want to use Vim, I'd encourage you to do so, but know that it's not required because it takes some time and consistent practice to get good at it.

Another thing to keep in mind is that the shell that we are working with is called bash, and bash is a full-fledged programming language. That means that when we write a simple command, like cd public_html, we are programming. It makes sense that the more programming that we do, the better we'll get at it. This requires more sophisticated environments to help manage our programs than the command line prompt can provide. Text editors fulfill that role.

As we learn more about how to do systems administration with Linux, we will need to edit configuration files, too. Most configuration files exist in the /etc directory. For example, later in the semester we will install the Apache Web Server, and we will need to edit Apache's configuration files in the process. We could do this using some of the tools that we've already covered, like sed and awk, but it'll make our lives much easier to use a text editor.

In any case, in order to save our commands or edit text files, a text editor is very helpful. Programmers use text editors to write programs, but because programmers often work in graphical user environments, they may often use graphical text editors or graphical IDEs. As systems administrators, it would be unusual to have a graphical user interface installed on a server. The servers that we manage will contain limited or specific software that serves the server's main purpose. Additional software on a server that is not relevant to the main function of a server only takes up extra disk space, consumes valuable computing resources, and increases the server's attack surface.

As stated, although ed and vim are difficult, they are very powerful editors. I use both daily, and am in fact using vim to write this. I believe they are both worth learning; however, for the purposes of this course, I think it's more important that you are aware of them. If you wish to learn more, there are lots of additional tutorials on the web on how to use these fine, esteemed text editors.

ed

ed is a line editor that is installed by default on many Linux distributions. Ken Thompson created ed in the late 1960s to write the original Unix operating system. It was used with teletypewriters (TTYs) and printers rather than with computer monitors, because monitors were still uncommon. The lack of a visual display, like a monitor, is the reason that ed(1) was written as a line editor. If you visit that second link, you will see the terminal interface from those earlier days. It is the same basic interface you are using now when you use your terminal applications, which are virtualised versions of those old teletypewriters. I think this is a testament to the power of the terminal: that advanced computer users still use the same basic technology today.

In practice, when we use a line editor like ed, the main process of entering text is like any other editor. The big difference is when we need to manipulate text. In a graphical text editor, if we want to delete a word or edit some text, we might backspace over the text or highlight a word and delete it. In a line editor, we manipulate text by referring to lines or across multiple lines and then run commands on the text in those line(s). This is much the same process we followed when we covered grep, sed, and awk, and especially sed, and it should not surprise you that these are related.

To operationalize this, like in sed, each line has an address. The address for line 7 is 7, and so forth. Line editors like ed are command driven. There is no menu to select from at the top of the window, and in fact, when we use ed to open an existing file, the text in the file isn't even printed on the screen. If a user wants to delete a word, or print (to screen) some text, the user has to command the line editor to print the relevant line by specifying its address and issuing a command to delete the word on that line, or print the line. Line editors also work on ranges of lines, including all the lines in the file, just like sed does.

In fact, many of the commands that ed uses are also used by sed, since sed is based on ed. To compare:

Command         | sed | ed
append text     | a   | a
replace text    | c   | c
delete text     | d   | d
insert text     | i   | i
print text      | p   | p
substitute text | s   | s
print w/ line # | =   | n

However, there are big differences that mainly relate to the fact that ed is a text editor and sed is not. For example, here are some commands that mostly make sense in ed as a text editor. sed can do some of these tasks, where it makes sense (e.g., we don't quit sed), but sometimes in a non-trivial way.

Command             | ed only
edit file           | e
join lines          | j
copies lines        | t
moves lines         | m
undo                | u
saves file          | w
quits ed            | q
Quits ed w/o saving | Q

There are other differences, but these are sufficient for our purposes.

Let's see how to use ed to open a file, and print the content with and without line numbers.

ed operating-systems.csv
183
1,$p
OS, License, Year
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
1,$n
1	OS, License, Year
2	Chrome OS, Proprietary, 2009
3	FreeBSD, BSD, 1993
4	Linux, GPL, 1991
5	iOS, Proprietary, 2007
6	macOS, Proprietary, 2001
7	Windows NT, Proprietary, 1993
8	Android, Apache, 2008

Using ed, another way to remove the header line of the operating-systems.csv file is to specify the line number (1) and then the delete command (d), just like in sed. This becomes a permanent change if I save the file with the w (write) command:

1d
1,$p
Chrome OS, Proprietary, 2009
FreeBSD, BSD, 1993
Linux, GPL, 1991
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008

To refer to line ranges, I add a comma between addresses. Therefore, to delete lines 1, 2, and 3, and then quit without saving:

1,3d
,p
iOS, Proprietary, 2007
macOS, Proprietary, 2001
Windows NT, Proprietary, 1993
Android, Apache, 2008
Q

Note that with sed, in order to make a change in-place, we need to use the -i option. But with ed, we save changes with the w command.
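
The difference is easy to check with sed itself. This sketch assumes GNU sed, which is what Linux distributions ship (BSD and macOS versions of sed expect an argument after -i), and it uses a shortened, made-up copy of the data in a temporary directory:

```bash
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
printf 'OS, License, Year\nLinux, GPL, 1991\nFreeBSD, BSD, 1993\n' > os.csv

# Without -i, sed prints the edited text and leaves the file alone.
sed '1d' os.csv
lines_before="$(wc -l < os.csv)"     # the file still has 3 lines

# With -i (GNU sed syntax), the file itself is rewritten.
sed -i '1d' os.csv
lines_after="$(wc -l < os.csv)"      # the file now has 2 lines

echo "before: ${lines_before}, after: ${lines_after}"
```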

I can use ed to find and replace strings. The syntax is the same as it is in sed. I'll start with a fresh version of the file:

ed operating-systems.csv
183
1,$s/Linux/GNU\/Linux/

If we want to add new rows to the file, we can append a or insert i text after or before specific lines. To append text after line 3, use a. We enter a period on a new line to leave input mode and return to command mode:

3a
FreeDOS, GPL, 1998
.

Because we enter input mode when using the a, i, or c commands, we enter a period . on a line by itself to revert to command mode.

To insert before line 2, so that the new text becomes line 2, use i:

2i
CP/M, Proprietary, 1974
.

Like sed, we can also find and replace using regular expressions instead of line numbers. I start a new ed session to reload the file to start fresh:

ed operating-systems.csv
183
/Linux/s/Linux/GNU\/Linux/
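
The escaped slash in GNU\/Linux can be hard to read. As a sketch using sed, which shares this substitution syntax, we can pick a different delimiter for the s command so that the slash needs no backslash. The two-line os.csv file here is a shortened, made-up copy of the data:

```bash
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
printf 'FreeBSD, BSD, 1993\nLinux, GPL, 1991\n' > os.csv

# Same edit as above: on lines matching /Linux/, substitute the first Linux.
escaped="$(sed '/Linux/s/Linux/GNU\/Linux/' os.csv)"

# With | as the delimiter, the slash in the replacement is an ordinary character.
readable="$(sed '/Linux/s|Linux|GNU/Linux|' os.csv)"

echo "$readable"
```

Both commands produce the same text; the second form is only easier on the eyes when the pattern or replacement contains slashes, such as file paths.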

Of course, ed can be used to write and not simply edit files. Let's start fresh. In the following session, I'll start ed, enter append mode a, write a short letter, exit append mode ., name the file f, write w (save) the file, and quit q:

ed
a
Dear Students,

I hope you find this really interesting.
Feel free to practice and play on the command line,
as well as use tools like ed, the standard editor.

Sincerely,
Dr. Burns
.
f letter.txt
w
q

It's good to know something about ed not just for historical reasons, but also because the line editing technology developed for it is still in use today, and is a basic part of the vim text editor, which is a very widely used application.

vim

The vim text editor is an improved version of the vi text editor and is in fact called Vi IMproved. (The original vi text editor is usually available via the nvi editor these days. nvi is a rewrite of the original.) vim is a visual editor. It is multi-modal like ed and is a direct descendant through vi. Due to this genealogy, vim can use many of the same commands as ed does when vim is in command mode. Like ed, we can start vim at the Bash prompt with or without a file name. Here I will open the letter.txt file with vim. The default mode is command mode:

vim letter.txt
Dear Students,

I hope you find this really interesting.
Feel free to practice and play on the command line,
as well as use tools like ed, the standard editor.

Sincerely,
Dr. Burns

To enter insert mode, I can type i or a for insert or append mode. There isn't any difference on an empty file, but on a file that has text, i will start insert mode where the cursor lies, and a will start insert mode just after the cursor. Once in insert mode, you can type text as you normally would and use the arrow keys to navigate around the file.

To return to command mode in vim, you press the Esc key. And then you can enter commands like you would with ed, using the same syntax.

Unlike ed, when in command mode, the commands we type are not placed wherever the cursor is, but at the bottom of the screen. Let's first turn on line numbers to know which address is which, and then we'll replace ed with Ed. Note that I precede these commands with a colon:

:set number
:5s/ed/Ed/

One of the more powerful things about both ed and vim is that I can call Bash shell commands from the editors. Let's say that I wanted to add the date to my letter file. To do that, Linux has a command called date that will return today's date and time. To call the date command within Vim and insert the output into the file, I press Esc to enter command mode (if I'm not already in it), enter a colon, type r for the read into buffer command, then enter the shell escape command, which is an exclamation point !, and then the Bash shell date command:

:r !date
Dear Students,

I hope you find this really interesting.
Feel free to practice and play on the command line,
as well as use tools like Ed, the standard editor.
Thu Jun 30 02:44:08 PM EDT 2022

Sincerely,
Dr. Burns

Since the last edit I made was to replace ed with Ed on line 5, vim inserted the date after that line, so the date is now on line 6. To move that date line to the top of the letter, I can use the move m command and move it to line 0, which is the top of the file:

:6m0
Thu Jun 30 02:44:08 PM EDT 2022
Dear Students,

I hope you find this really interesting.
Feel free to practice and play on the command line,
as well as use tools like Ed, the standard editor.

Sincerely,
Dr. Burns

Although you can use the arrow keys and Page Up/Page Down keys to navigate in vim and vi, by far the most excellent thing about this editor is to be able to use the j,k,l,h keys to navigate around a file:

  • j moves down line by line
  • k moves up line by line
  • l moves right letter by letter
  • h moves left letter by letter

Like the other commands, you can precede these keys with a number (a count). To move 2 lines down, you type 2j, and so forth. vi and vim have had such a powerful impact on software development that you can in fact use these same keystrokes to navigate a number of sites such as Gmail, Facebook, Twitter, and more.

To save the file and exit vim, return to command mode by pressing the Esc key, and then write and quit:

:wq

The above barely scratches the surface. There are whole books on these editors, as well as websites, videos, etc., that explore them, and especially vim, in more detail. But now that you have some familiarity with them, you might find this funny: Ed, man! !man ed.

nano

The nano text editor is the user-friendliest of these text editors but still requires some adjustment as a new commandline user. The friendliest thing about nano is that it is modeless, which is what you're already accustomed to using, because it can be used to enter text and manipulate text without changing to insert or command mode. It is also friendly because, like many graphical text editors and software, it uses control keys to perform its operations. The tricky part is that the control keys are assigned to different keystroke combinations than what many graphical editors (or word processors) use by convention today. For example, instead of Ctrl-c or Cmd-c to copy, in nano you press the M-6 key (press Alt, Cmd, or Esc key and 6) to copy. Then to paste, you press Ctrl-u instead of the more common Ctrl-v. Fortunately, nano lists the shortcuts at the bottom of the screen.

The shortcuts listed need some explanation, though. The caret mark (^) is shorthand for the keyboard's Control (Ctrl) key. Therefore, to save a file under a name (Save As), we write out the file by pressing Ctrl-o. The M- key is also important, and depending on your keyboard configuration, it may correspond to your Alt, Cmd, or Esc keys. To search for text, you press ^W. If your goal is to copy, then press M-6 to copy a line. Move to where you want to paste the text, and press Ctrl-u to paste.

For the purposes of this class, that's all you really need to know about nano. Use it and get comfortable writing in it. Some quick tips:

  1. nano file.txt will open and display the file named file.txt.
  2. nano by itself will open to an empty page.
  3. Save a file by pressing Ctrl-o.
  4. Quit and save by pressing Ctrl-x.
  5. Be sure to follow the prompts at the bottom of the screen.

Conclusion

In prior lessons, we learned how to use the Bash interactive shell and how to view, manipulate, and edit files from that shell. In this lesson, we learned how to use several command line text editors. Editors allow us to save our commands, create scripts, and in the future, edit configuration files.

The commands we used in this lesson include:

  • ed : line-oriented text editor
  • vim : Vi IMproved, a programmer's text editor
  • nano : Nano's ANOther editor, inspired by Pico

Regular Expressions

Oftentimes, as systems administrators, we will need to search the contents of a file, like a log file. One of the commands that we use to do that is the grep command. We have already discussed using the grep command, which is not unlike doing any kind of search, such as in Google. The command simply involves running grep along with the search string and against a file.

Multiword strings

It's a good habit to enclose search strings within quotes, but this is especially important if we search for multiword strings. In these cases, we must enclose them in quotes.

Command:

cat cities.csv

Output:

City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777

Command:

grep "San Antonio" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718

Whole words, case sensitive by default

As a reminder, grep commands are case-sensitive by default, and since the city names in cities.csv are capitalized, if I run the above command with the city name in lowercase, then grep will return nothing:

Command:

grep "san antonio" cities.csv

In order to tell grep to ignore case, I need to use the -i option. We also want to make sure that we enclose our entire search string within double quotes.

This is a reminder for you to run man grep and to read through the documentation and see what the various options exist for this command.

Command:

grep -i "san antonio" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718

Whole words by the edges

To search whole words, we can use special characters to match strings at the start and/or the end of words. For example, note the output if I search for cities in California in my file by searching for the string ca. Since this string appears in Chicago, then that city matches my grep search:

Command:

grep -i "ca" cities.csv

Output:

Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

To limit results to only CA, we can enclose our search string in \b, which matches the empty string at the edge of a word:

Command:

grep -i "\bca\b" cities.csv

Output:

Los Angeles, CA   | 3898747     | 1781
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

We can reverse that output and look for strings within other words by using \B, which matches only where there is no word edge. Here is an example of searching for the string ca within words:

Command:

grep -i "\Bca\B" cities.csv

Output:

Chicago, IL       | 2746388     | 1780
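
For whole-word searches, GNU grep also offers the -w option. As a sketch, the following recreates cities.csv as a temporary file and checks that -w and the \b form return the same lines for this search:

```bash
f="$(mktemp)"
cat > "$f" <<'EOF'
City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777
EOF

boundary="$(grep -i '\bca\b' "$f")"   # word edges written out in the pattern
word="$(grep -iw 'ca' "$f")"          # -w: match whole words only

echo "$word"
```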

Bracket Expressions and Character Classes

In conjunction with the grep command, we can also use regular expressions to search for more general patterns in text files. For example, we can use bracket expressions and character classes to search for patterns in the text. Here again using man grep is very important because it includes instructions on how to use these regular expressions.

Bracket expressions

From man grep on bracket expressions:

A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list. If the first character of the list is the caret ^ then it matches any character not in the list ... For example, the regular expression [0123456789] matches any single digit.

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters

To see how this works, let's search the cities.csv file for letters matching A, B, or C. Specifically in the following command I use a hyphen to match any characters in the range A, B, C. The output does not include the cities Houston or Dallas since neither of those lines contains a capital A, B, or C character:

Command:

grep "[A-C]" cities.csv 

Output:

City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

Bracket expressions, inverse searches

When placed directly after the opening bracket, the caret acts as a Boolean NOT. The following command matches any character not in the range A, B, C:

Command:

grep "[^A-C]" cities.csv

However, the output includes all lines, since every line contains at least one character that is not A, B, or C:

Output:

City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777
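
If the goal is to list only the lines that contain no capital A, B, or C at all, the bracket expression should stay as it is and the match should be inverted with grep's -v option. A sketch, again recreating cities.csv as a temporary file:

```bash
f="$(mktemp)"
cat > "$f" <<'EOF'
City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777
EOF

# [^A-C] asks: is there any character that is not A, B, or C? (true for every line)
# -v "[A-C]" asks: which lines have no A, B, or C anywhere?
no_abc="$(grep -v '[A-C]' "$f")"

echo "$no_abc"
```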

Process substitution

We can confirm that the only difference between the outputs of the two commands is that the second includes Houston and Dallas by comparing them using process substitution. This technique lets a command read the standard output of other commands as if each were a file. Here I use the diff command to compare the output of both grep commands:

Command:

diff <(grep "[A-C]" cities.csv) <(grep "[^A-C]" cities.csv)

The diff output shows that the second grep command includes the two lines below that are not in the output of the first grep command:

Output:

4a5
> Houston, TX       | 2304580     | 1837
8a10
> Dallas, TX        | 1288457     | 1856

The output of the diff command is nicely explained in this Stack Overflow answer.

Try this command for an alternate output:

diff -y <(grep "[A-C]" cities.csv) <(grep "[^A-C]" cities.csv)
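
Process substitution is useful whenever a command expects file names but we want to hand it the output of other commands. As a sketch with two made-up files, list1.txt and list2.txt, here I compare their contents regardless of the order of the lines:

```bash
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
printf 'b\na\nc\n' > list1.txt
printf 'c\nb\na\nd\n' > list2.txt

# Each <(...) is replaced by a file name that diff can read from,
# so no temporary sorted copies are needed.
# diff exits with a non-zero status when its inputs differ, hence the || true.
difference="$(diff <(sort list1.txt) <(sort list2.txt) || true)"

echo "$difference"
```

The only line reported is d, which appears in list2.txt but not in list1.txt.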

Our ranges may be alphabetical or numerical. The following command matches any numbers in the range 1,2,3:

Command:

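grep "[1-3]" cities.csv

Since every line in the file contains at least one of the digits 1, 2, or 3, the above command returns all lines. To invert the search, we can use the following grep command. This matches any character that is not a digit, and since every line also contains letters, spaces, or punctuation, it again returns all lines:

Command:

grep "[^0-9]" cities.csv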

Bracket expressions, caret preceding the bracket

We saw in a previous section that the caret ^ indicates the start of a line; however, we learned above that inside a bracket expression it inverts the list of characters. To use the caret to signify the start of a line, it must precede the opening bracket. For example, the following command matches any lines that start with the upper case letters within the range of N, O, P:

Command:

grep "^[N-P]" cities.csv

Output:

New York City, NY | 8804190     | 1624
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701

And we can reverse that with the following command, which returns all lines that do not start with N, O, or P:

Command:

grep "^[^N-P]" cities.csv

Output:

City              | 2020 Census | Founded
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777

Character classes

Character classes are special types of predefined bracket expressions. They make it easy to search for general patterns. From man grep on character classes:

Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means the character class of numbers and letters ...

Here I search for anything that matches the Founded column. Specifically, I search for a blank (a space or a tab) [[:blank:]], a four digit string [[:digit:]]{4}, and an end of line $. The {4} means "The preceding item is matched exactly 4 times" (man grep), and the number 4 can be replaced with any relevant number:

Command:

grep -Eo "[[:blank:]][[:digit:]]{4}$" cities.csv 

Output:

 1624
 1781
 1780
 1837
 1881
 1701
 1718
 1769
 1856
 1777

In the above command, the [[:blank:]] can be excluded and we'd still retrieve the years because we've included the dollar sign to mark the end of the line, but I include it here for demonstration purposes. Note that I also added two options. The -E option is required for the {4} repetition syntax (character classes themselves work without it), and the -o option prints only the matching part of each line.
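
To see which part of the pattern depends on -E, here is a sketch that runs the search both ways on a temporary copy of cities.csv. In grep's basic syntax (no -E), the repetition braces are written with backslashes, while the character class is written the same way in both:

```bash
f="$(mktemp)"
cat > "$f" <<'EOF'
City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777
EOF

extended="$(grep -Eo '[[:digit:]]{4}$' "$f")"    # extended syntax: {4}
basic="$(grep -o '[[:digit:]]\{4\}$' "$f")"      # basic syntax: \{4\}

echo "$basic"
```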

Anchoring

As seen above, outside of bracket expressions and character classes, we use the caret ^ to mark the beginning of a line. We can also use the $ to match the end of a line. Using either (or both) is called anchoring. Anchoring works in many places. For example, to search all lines that start with capital D through L:

Command:

grep "^[D-L]" cities.csv

Output:

Los Angeles, CA   | 3898747     | 1781
Houston, TX       | 2304580     | 1837
Dallas, TX        | 1288457     | 1856

And all lines that end with the numbers 4, 5, or 6:

Command:

grep "[4-6]$" cities.csv

Output:

New York City, NY | 8804190     | 1624
Dallas, TX        | 1288457     | 1856

We can use both anchors in our grep commands. The following searches for any lines starting with capital letters ranging from D through L and ending with the numbers from 4 through 6. The single dot stands for any character, and the asterisk means "The preceding item will be matched zero or more times" (man grep).

Command:

grep "^[D-L].*[4-6]$" cities.csv

Output:

Dallas, TX        | 1288457     | 1856

Repetition

If we want to use regular expressions to identify repetitive patterns, then we can use repetition operators. As we saw above, the most useful one is the * asterisk. But there are other options. In some cases, we need to add the -E option to extend grep's regular expression functionality.

Here, the preceding item S is matched one or more times:

Command:

grep -E "S+" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
San Jose, CA      | 983489      | 1777

In the next search, the preceding item l is matched exactly 2 times:

Command:

grep -E "l{2}" cities.csv

Output:

Dallas, TX        | 1288457     | 1856

Finally, in this example, the preceding item 7 is matched at least two times or at most three times:

Command:

grep -E "7{2,3}" cities.csv

Output:

San Jose, CA      | 983489      | 1777
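
Repetition operators also apply to bracket expressions and character classes, not only to single letters. As a sketch on a temporary copy of cities.csv, here I count the cities whose population has seven digits, and then find the one city with a six digit population. The second pattern needs the surrounding pipe and spaces, because six digits in a row also occur inside every seven digit number:

```bash
f="$(mktemp)"
cat > "$f" <<'EOF'
City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777
EOF

# -c prints the number of matching lines instead of the lines themselves.
seven_digits="$(grep -Ec '[[:digit:]]{7}' "$f")"

# \| is a literal pipe here; the trailing space rules out longer numbers.
six_digits="$(grep -E '\| [[:digit:]]{6} ' "$f")"

echo "${seven_digits} cities have seven digit populations"
echo "$six_digits"
```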

OR searches

We can use the vertical bar | to do a Boolean OR search. In a Boolean OR statement, the statement is True if either one part is true, the other part is true, or both are true. In a search statement, this means that at least one part of the search is true.

The following will return lines for each city because they both appear in the file:

Command:

grep -E "San Antonio|Dallas" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718
Dallas, TX        | 1288457     | 1856

The following will match San Antonio even though Lexington does not appear in the file:

Command:

grep -E "San Antonio|Lexington" cities.csv

Output:

San Antonio, TX   | 1451853     | 1718
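
An OR search can also be written without the vertical bar by giving grep several patterns, each introduced with the -e option. A sketch on a temporary copy of cities.csv, checking that both forms return the same lines:

```bash
f="$(mktemp)"
cat > "$f" <<'EOF'
City              | 2020 Census | Founded
New York City, NY | 8804190     | 1624
Los Angeles, CA   | 3898747     | 1781
Chicago, IL       | 2746388     | 1780
Houston, TX       | 2304580     | 1837
Phoenix, AZ       | 1624569     | 1881
Philadelphia, PA  | 1576251     | 1701
San Antonio, TX   | 1451853     | 1718
San Diego, CA     | 1381611     | 1769
Dallas, TX        | 1288457     | 1856
San Jose, CA      | 983489      | 1777
EOF

alternation="$(grep -E 'San Antonio|Dallas' "$f")"          # one pattern with |
multiple="$(grep -e 'San Antonio' -e 'Dallas' "$f")"        # two separate patterns

echo "$multiple"
```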

Conclusion

We covered a lot in this section on grep and regular expressions.

We specifically covered:

  • multiword strings
  • whole word searches and case sensitivity
  • bracket expressions and character classes
  • anchoring
  • repetition
  • Boolean OR searches

Even though we focused on grep, many of these regular expressions work across many programming languages.

See Regular-Expression.info for more in-depth lessons on regular expressions.

Bash Scripting

It's time to get started on Bash scripting. So far, we've been working on the Linux commandline. Specifically, we have been working in the Bash shell. Wikipedia refers to Bash as a command language, and by that it means that Bash is used as a commandline language but also as a scripting language. The main purpose of Bash is to write small applications/scripts that analyze text (e.g., log files) and automate jobs, but it can be used for a variety of other purposes.

Variables

One of the most important abilities of any programming or scripting language is to be able to declare a variable. Variables enable us to attach some value to a name. That value may be temporary, and it's used to pass information to other parts of a program.

In Bash, we declare a variable with the name of the variable, an equal sign, and then the value of the variable within double quotes. Do not insert spaces around the equal sign. In the following code snippet, which can be entered on the commandline, I create a variable named name and assign it the value Sean. I create another variable named backup and assign it the value /media. Then I use the echo and cd commands to test the variables:

name="Sean"
backup="/media"
echo "My name is ${name}"
echo "${backup}"
cd "${backup}"
pwd
cd

Variables may include values that may change given some context. For example, if we want a variable to refer to today's day of week, we can use command substitution, which "allows the output of a command to replace the command name" (see man bash). Thus, the output at the time this variable is set will differ if it is set on a different day.

today="$(date +%A)"
echo "${today}"
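
Command substitution is handy for building names that change from day to day. In this sketch, the variable names stamp and archive and the backup- file name are made up for illustration:

```bash
# Store today's date in ISO format, e.g. 2022-08-12.
stamp="$(date +%Y-%m-%d)"

# Use the variable inside a longer string to build a file name.
archive="backup-${stamp}.tar.gz"
echo "${archive}"

# Here the braces are needed: without them, Bash would read "$stamp_old"
# as a reference to a different variable named stamp_old.
old_name="${stamp}_old"
echo "${old_name}"
```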

The curly braces are not strictly necessary, but they offer benefits when we start to use things like array variables. See:

For example, let's look at basic brace expansion, which can be used to generate arbitrary strings:

echo {1..5}
echo {5..1}
echo {a..l}
echo {l..a}

Another example: using brace notation, we can generate multiple sub-directories at once. Start off in your home directory, and:

mkdir -p homework/{drafts,notes}
cd homework
ls

But more than that, they allow us to deal with arrays (or lists). Here I create a variable named seasons, which holds an array, or multiple values: winter spring summer fall. Bash lets me access parts of that array:

seasons=(winter spring summer fall)
echo "${seasons[@]}"
echo "${seasons[1]}"
echo "${seasons[2]}"
echo "${seasons[-1]}"
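Two details about arrays are worth a quick sketch. Bash arrays are indexed starting at 0, so ${seasons[1]} above is spring rather than winter, and ${#seasons[@]} returns the number of elements. The variable names count, first, and index below are only for illustration:

```shell
seasons=(winter spring summer fall)

# Number of elements in the array.
count="${#seasons[@]}"
echo "There are ${count} seasons."

# Indexing starts at 0, so index 0 is the first element.
first="${seasons[0]}"
echo "The first season is ${first}."

# "${!seasons[@]}" expands to the list of indexes (0 1 2 3).
for index in "${!seasons[@]}" ; do
  echo "${index}: ${seasons[${index}]}"
done
```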

See Parameter expansions for more advanced techniques.

Conditional Expressions

Whether working on the commandline, or writing scripts in a text editor, it's sometimes useful to be able to write multiple commands on one line. There are several ways to do that. We can include a list of commands on one line in Bash where each command is separated by a semicolon:

cd ; ls -lt

But we can use conditional expressions and apply logic with && (Logical AND) or || (Logical OR).

Here, command2 is executed if and only if command1 is successful:

command1 && command2

Here, command2 is executed if and only if command1 fails:

command1 || command2

Example:

cd documents && echo "success"
cd documents || echo "failed"
# combine them:
cd test && pwd || echo "no such directory"
mkdir test
cd test && pwd || echo "no such directory"
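Here is a self-contained sketch of the same idea that you can run without touching your home directory. It assumes the mktemp command is available, and the variable names workdir, before, and after are made up. Each test runs inside $( ), which is a subshell, so the cd does not change the directory of your current shell:

```shell
# Work in a throwaway directory so nothing in your home directory changes.
workdir="$(mktemp -d)"
cd "${workdir}" || exit 1

# The directory "test" does not exist yet, so cd fails and || runs.
before="$(cd test 2>/dev/null && echo "success" || echo "no such directory")"
echo "${before}"

# After creating the directory, cd succeeds and && runs instead.
mkdir test
after="$(cd test && echo "success" || echo "no such directory")"
echo "${after}"
```

One caution: command1 && command2 || command3 is not a true if/else. If command1 succeeds but command2 fails, then command3 will run too.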

Shebang or Hashbang

When we start to write scripts, the first thing we add is a shebang at line one. We can do so a couple of ways:

#!/usr/bin/env bash

The first one should be more portable, but alternatively, you could put the direct path to Bash:

#!/usr/bin/bash

Looping

There are several looping methods in Bash, including: for, while, until, and select. The for loop is often very useful.

for i in {1..5} ; do
  echo "${i}"
done

With that, we can create a rudimentary timer:

for i in {1..5} ; do
  echo "${i}" && sleep 1
done

We can loop through our seasons variable:

seasons=(winter spring summer fall)
for i in "${seasons[@]}" ; do
  echo "I hope you have a nice ${i}"
done
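The for loop walks through a list, but the while loop mentioned above keeps running for as long as a condition is true. Here is a minimal sketch of a countdown; the variable names count and countdown are made up, and the [[ test it relies on is covered in the Testing section:

```shell
count=3
countdown=""

# Keep looping as long as count is greater than 0.
while [[ "${count}" -gt 0 ]] ; do
  countdown="${countdown}${count} "
  count=$((count - 1))
done

echo "${countdown}liftoff"
```

The until loop works the same way but in reverse: it runs until its condition becomes true.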

Testing

Sometimes we will want to test certain conditions. There are two parts to this: we can use the if, then, and else commands, and we can also use the double square brackets: [[. There are a few ways to get documentation on these commands. See the following:

man test
help test
help [
help [[
help if

We can test integers:

if [[ 5 -ge 3 ]] ; then
  echo "true"
else
  echo "false"
fi

Reverse it to return the else statement:

if [[ 3 -ge 5 ]] ; then
  echo "true"
else
  echo "false"
fi

We can test strings:

if [[ "$HOME" = "$PWD" ]] ; then
 echo "You are home."
else
 echo "You are not home, but I will take you there."
 cd "$HOME"
 pwd
fi

We can test file conditions. Let's first create a file called paper.txt and a file called paper.bak. We will add some trivial content to paper.txt but not to paper.bak. The following if statement will test if paper.txt has a more recent modification date, and if so, it'll back up the file with the cp command and echo back its success:

if [[ "$HOME/paper.txt" -nt "$HOME/paper.bak" ]] ; then
  cp "$HOME/paper.txt" "$HOME/paper.bak" && echo "Paper is backed up."
fi
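If you want to try the -nt (newer than) test without creating files in your home directory, here is a sketch that sets everything up in a temporary directory. It assumes mktemp is available; the variable names workdir and status are made up. I use touch -t to give paper.bak an old timestamp so that paper.txt is clearly the newer file:

```shell
workdir="$(mktemp -d)"

# Create an empty backup with an old modification time (1 Jan 2020).
touch -t 202001010000 "${workdir}/paper.bak"

# Create the paper now, so it is newer than the backup.
echo "Some trivial content." > "${workdir}/paper.txt"

if [[ "${workdir}/paper.txt" -nt "${workdir}/paper.bak" ]] ; then
  cp "${workdir}/paper.txt" "${workdir}/paper.bak" && status="Paper is backed up."
else
  status="Backup is already current."
fi
echo "${status}"
```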

Here's a script that prints info depending on which day of the week it is:

day1="Tue"
day2="Thu"
day3="$(date +%a)"

if [[ "$day3" = "$day1" ]] ; then
  printf "\nIf %s is %s, then class is at 9:30am.\n" "$day3" "$day1"
elif [[ "$day3" = "$day2" ]] ; then
  printf "\nIf %s is %s, then class is at 9:30am.\n" "$day3" "$day2"
else
  printf "\nThere is no class today.\n"
fi

Resources

I encourage you to explore some useful guides and cheat sheets on Bash scripting:

Summary

In this demo, we learned about:

  • creating and referring to variables
  • conditional expressions with && and ||
  • adding the shebang or hashbang at the beginning of a script
  • looping with the for statement
  • testing with the if statement

These are the basics. I'll cover more practical examples in upcoming demos, but note that mastering the basics requires understanding a lot of the commands and paths that we have covered so far in class. So keep practicing.

Managing the System

Now that we have the basics of the command line interface down, it's time to learn some systems administration. In this section, we learn how to expand storage space, create new user and group accounts and manage those accounts, install and remove software, and manage that software and other processes.

Expanding Storage

I'm sure all or most of you have needed extra disk storage at some point (USB drives, optical disks, floppies???). Such needs are no different for systems administrators, who often are responsible for managing, monitoring, or storing large amounts of data.

The disk that we created for our VM is small (10 GB), and that's fine for our needs, albeit quite small in many real world scenarios. To address this, we can add a persistent disk that is much larger. In this section, we will add a disk to our VM, format it, and mount it onto the VM's filesystem. Extra storage does incur extra cost. So at the end of this section, I will show you how to delete the extra disk to avoid that if you want.

We will essentially follow the Google Cloud tutorial to add a non-boot disk to our VM, but with some modification.

Add a persistent disk to your VM

Note: the main disk used by our VM is the boot disk. The boot disk contains the software required to boot the system. All of our computers (desktops, laptops, tablets, phones, etc.), regardless of which operating system they run, have some kind of boot system.

Creating a Disk

In the Google Cloud console, visit the Disks page in the Storage section, which should be here:

Create a disk

And then follow these steps:

  1. Under Name, add a preferred name or leave the default.
  2. Under Description, add text to describe your disk.
  3. Under Location, leave or choose Single zone.
    • We are not concerned about data safety.
    • If we were, then we would select other options here.
  4. Under Source, select Blank disk.
  5. Under Disk settings, select Balanced persistent disk.
  6. Under Size, change this to 10GB.
    • You can actually choose larger sizes, but be aware that disk pricing is $0.10 per GB.
    • At that cost, 100 GB = $10 / month.
  7. Click on Enable snapshot schedule.
  8. Under Encryption, make sure Google-managed encryption key is selected.
  9. Click Create to create your disk.

Adding the Disk to our VM

Now that we have created our disk, we need to attach it to our VM so that it's available to the operating system. Conceptually, this process is like inserting a new USB drive into our computer.

To add the new disk to our VM, follow these steps:

  1. Visit the VM instances page.
  2. Click on the check box next to your virtual machine.
    • That will convert the Name of your VM into a hyperlink.
  3. Click on that Name.
    • That will take you to the VM instance details page.
  4. Click on the Edit button at the top of the details page.
  5. Under the Additional disks section, click on + ATTACH EXISTING DISK.
  6. A panel will open on the right side of your browser.
  7. Click on the drop down box and select the disk, by name, you created.
  8. Leave the defaults as-is.
  9. Click on the SAVE button.
  10. Then click on the SAVE button on the details page.

If you return to the Disks page in the Storage section, you will now see that the new disk is in use by our VM.

Formatting and Mounting a Non-Boot Disk

Formatting Our Disk

In order for our VM to make use of the extra storage, the new drive must be formatted and mounted. Different operating systems use different filesystem formats. You may already know that macOS uses the Apple File System (APFS) by default and that Windows uses the New Technology File System (NTFS). Linux is no different, but uses different file systems than macOS and Windows, by default. There are many formatting technologies that we can use in Linux, but we'll use the ext4 (fourth extended filesystem) format, since this is recommended by Google Cloud and is also a stable and common one for Linux.

In this section, we will closely follow the steps outlined under the Formatting and mounting a non-boot disk on a Linux VM section. I replicate those instructions below, but I highly encourage you to read through the instructions on Google Cloud and here:

  1. Use the gcloud compute ssh command that you have previously used to connect to your VM.
    • Alternatively, you can ssh to your VM via your browser:
      • Click on the VM instances page.
      • Under the Connect column, select Open in browser window next to SSH.
  2. When you have connected to your VM's command line, run the lsblk command.
    • Ignore the loop devices.
    • Instead, you should see sda and sdb under the NAME column outputted by the lsblk command.
    • sda represents your main disk.
      • sda1, sda14, sda15 (may be slightly different for you) represent the partitions of the sda disk.
      • Notice the MOUNTPOINT for sda1 is /, or the root level of our filesystem.
    • sdb represents the attached disk we just added.
      • After we format and mount this drive, sdb will also have a mountpoint. Because we format the whole disk rather than partitioning it, there will be no sdb1.

To format our disk for the ext4 filesystem, we will use the mkfs.ext4 command (see man mkfs.ext4 for details). The instructions tell us to run the following command (please read the Google Cloud instructions closely; it's important to understand these commands as much as possible and not just copy and paste them):

sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/DEVICE_NAME

But replace DEVICE_NAME with the name of our device. My device's name is sdb, which we saw with the output of the lsblk command; therefore, the specific command I run is:

sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb

Mounting Our Disk

Now that our disk has been formatted in ext4, I can mount it.

Note: to mount a disk simply means to make the disk's filesystem available so that we can access and store files on the disk. Whenever we insert a USB drive, a DVD, etc. into our computers, the OS should mount that disk automatically so that we can access and use it. Conversely, when we remove those drives, the OS unmounts them. In Linux, the commands for these are mount and umount. Note that the command is umount, not unmount.

You will recall that we discussed filesystems earlier, and that the term is a bit confusing since it refers to both the directory hierarchy and the formatting type (e.g., ext4). In that prior section, I discussed how, when we attach a new drive in Windows, whether it's a USB drive, a DVD drive, an additional disk drive, or an external drive, Windows gives the new drive a letter, like A:, B:, D:, etc. I also mentioned that, unlike Windows, in Linux and Unix (e.g., macOS), when we add an additional disk, its filesystem gets added onto our existing one. That is, it becomes part of the directory hierarchy under the / top level. In practice, this means that we have to create the mountpoint for our new disk, and we do that with the mkdir command. The Google Cloud documentation instructs us to use the following command:

sudo mkdir -p /mnt/disks/MOUNT_DIR

And to replace MOUNT_DIR with the directory we want to create. Since my added disk is named disk-1, I'll call it that:

sudo mkdir -p /mnt/disks/disk-1

Now we can mount the disk to that directory. Per the Google Cloud instructions, and given that my added drive has the device name sdb, I use the following command:

sudo mount -o discard,defaults /dev/sdb /mnt/disks/disk-1

We also need to change the permissions on the mountpoint to grant access to additional users:

sudo chmod 777 /mnt/disks/disk-1

We can test that it exists and is accessible with the lsblk and the cd commands. The lsblk command should show that sdb is mounted at /mnt/disks/disk-1, and we can cd (change directory) to it:

cd /mnt/disks/disk-1
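Another quick check is the df command, which reports the size and free space of the filesystem that holds a given directory. Below is a small sketch; the function name report_usage is my own invention and not part of the Google Cloud instructions:

```shell
# Report disk usage for a directory, or complain if the directory is missing.
report_usage() {
  local dir="$1"
  if [[ -d "${dir}" ]] ; then
    # The "Mounted on" column shows which filesystem holds the directory.
    df -h "${dir}"
  else
    echo "${dir} does not exist" >&2
    return 1
  fi
}

# The boot disk, mounted at the root of the filesystem.
report_usage /

# The new disk. If "Mounted on" shows / here, the disk is not mounted.
report_usage /mnt/disks/disk-1 || echo "Did you create the mountpoint?"
```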

Automounting Our Disk

Our disk is mounted, but if the computer (VM) gets rebooted, we would have to re-mount the additional drive manually. In order to avoid this and automount the drive upon reboot, we need to edit the file /etc/fstab.

Note that the file is named fstab and that it's located in the /etc directory. Therefore the full path is /etc/fstab

The fstab file is basically a configuration file that provides information to the OS about the filesystems the system can mount. The standard information fstab contains includes the name (or label) of the device being mounted, the mountpoint (e.g., /mnt/disks/disk-1), the filesystem type (e.g., ext4), and various other mount options. See man fstab for more details. For devices to mount upon boot up automatically, they have to be listed in this file. That means we need to edit this file on our VM.

Again, here we're following the Google Cloud instructions:

Before we edit system configuration files, however, always create a backup. We'll use the cp command to create a backup of the fstab file.

sudo cp /etc/fstab /etc/fstab.backup

Next we use the blkid command to get the UUID (universally unique identifier) number for our new device. Since my device is /dev/sdb, I'll use that:

sudo blkid /dev/sdb

The output should look something like this:

/dev/sdb: UUID="3bc141e2-9e1d-428c-b923-0f9vi99a1123" TYPE="ext4"

We need to add that value to /etc/fstab plus the standard information that file requires. The Google Cloud documentation explicitly guides us here. We'll use nano to make the edit:

sudo nano /etc/fstab

And then add this line at the bottom:

UUID=3bc141e2-9e1d-428c-b923-0f9vi99a1123 /mnt/disks/disk-1 ext4 discard,defaults,nofail 0 2

And that's it! If you reboot your VM, or if your VM rebooted for some reason, the extra drive we added should automatically mount upon reboot. If it doesn't, then it may mean that the drive failed, or that there was an error (i.e., typo) in the configuration.
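Since a typo in /etc/fstab is the most likely source of trouble, it helps to remember that each entry has six fields separated by whitespace: the device, the mountpoint, the filesystem type, the mount options, the dump flag, and the fsck pass number. The sketch below builds the line from variables and counts its fields. The variable names are made up, and the UUID is the example value from above, so replace it with your own:

```shell
# Values from this section; replace them with your own.
uuid="3bc141e2-9e1d-428c-b923-0f9vi99a1123"
mount_dir="/mnt/disks/disk-1"

# The six fields: device, mountpoint, type, options, dump, fsck pass.
fstab_line="UUID=${uuid} ${mount_dir} ext4 discard,defaults,nofail 0 2"
echo "${fstab_line}"

# Count the whitespace-separated fields; a complete entry has six.
field_count="$(echo "${fstab_line}" | awk '{ print NF }')"
echo "Fields: ${field_count}"

# On the VM itself, after editing /etc/fstab, the following command asks
# mount to process the entries in the file, so that errors surface now
# rather than at the next reboot:
# sudo mount -a
```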

Delete the Disk

You are welcome to keep the disk attached to the VM, but if you do not want to incur any charges for it, which would be about $1 / month at 10 GB, then we can delete it.

To delete the disk, first delete the line that we added in /etc/fstab, then unmount the disk, and then delete the disk in the gcloud console.

To unmount the disk, we use the umount command:

sudo umount /mnt/disks/disk-1

Then we need to delete the disk in gcloud.

  1. Go to the VM instances page.
  2. Click on the check box next to the VM.
  3. Click on the name, which should be a hyperlink.
  4. This goes to the VM instances detail page.
  5. Click on the Edit button at the top of the page.
  6. Scroll down to the Additional disks section.
  7. Click the edit (looks like a pencil) button.
  8. In the right-hand pane that opens up, select Delete disk under the Deletion rule section.
  9. Scroll back to the Additional disks section.
  10. Click on the X to detach the disk.
  11. Click on Save.
  12. Go to the Disks section in the left-hand navigation pane.
  13. Check the disk to delete, and then Delete it.
  14. Click on the Snapshots section in the left-hand navigation pane.
  15. Check the disk snapshot to delete, and then Delete it.
    • Be sure you don't delete your VM here but just your disk.

Conclusion

In this section we learned how to expand the storage of our VM by creating a new virtual drive, adding it to our VM, formatting the drive in the ext4 filesystem format, mounting the drive at /mnt/disks/disk-1, and then editing /etc/fstab to automount the drive.

In addition to using the gcloud console, the commands we used in this section include:

  1. ssh : to connect to the remote VM
  2. sudo : to run commands as the administrator
  3. mkfs.ext4 : to create an ext4 filesystem on our new drive
  4. mkdir -p : to create multiple directories under /mnt
  5. mount : to mount manually the new drive
  6. umount : to unmount manually the new drive
  7. chmod : to change the mountpoint's file permission attributes
  8. cd : to change directories
  9. cp : to copy a file
  10. nano : to use the text editor nano to edit /etc/fstab

Managing Users and Groups

In some cases we'll want to provide user accounts on the servers we administer, or we'll want to set up servers for others to use. The process of creating accounts is fairly straightforward, but there are a few things to know about how user accounts work.

The passwd file

The /etc/passwd file contains information about the users on your system. There is a man page that describes the file, but man pages are divided into sections (see man man), and the man page for the passwd file is in section 5. Therefore in order to read the man page for the /etc/passwd file, we run the following command:

man 5 passwd

Before we proceed, let's take a look at a single line of the file. Below I'll show the output for a made up user account:

grep "peter" /etc/passwd
peter:x:1000:1000:peter,,,:/home/peter:/bin/bash

The line starting with peter is a colon-separated line. That means that the line is composed of multiple fields, each separated by a colon.

man 5 passwd tells us what each field indicates. The first field is the login name, which in this case is peter. The second field, marked x, marks the password field. This file does not contain the password, though. The passwords, which are hashed and salted, for users are stored in the /etc/shadow file, which can only be read by the root user (or using the sudo command).

Hashing a file or a string of text is a process of running a hashing algorithm on the file or text. If the file or string is copied exactly, byte for byte, then hashing the copy will return the same value. If anything has changed about the file or string, then the hash value will be different. By implication, this means that if two users on a system use the same password, then the hash of each will be equivalent. Salting is a process of adding random data to the file or string before it is hashed. Each password will have a unique and mostly random salt added to it. This means that even if two users on a system use the same password, salting their passwords will result in unique hash values.
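We can see both ideas on the commandline with the sha256sum command. This is only an illustration: Linux does not hash account passwords by piping them to sha256sum, and the password and salt strings below are made up. The sketch assumes sha256sum is installed, which is the case on Ubuntu:

```shell
# Hashing the same text twice gives the same value.
hash_one="$(printf '%s' "correct horse" | sha256sum | cut -d ' ' -f 1)"
hash_two="$(printf '%s' "correct horse" | sha256sum | cut -d ' ' -f 1)"

# Adding a different (made-up) salt to each copy changes the result.
salted_one="$(printf '%s' "salt-a:correct horse" | sha256sum | cut -d ' ' -f 1)"
salted_two="$(printf '%s' "salt-b:correct horse" | sha256sum | cut -d ' ' -f 1)"

if [[ "${hash_one}" = "${hash_two}" ]] ; then
  echo "Unsalted hashes match."
fi

if [[ "${salted_one}" != "${salted_two}" ]] ; then
  echo "Salted hashes differ."
fi
```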

The third field indicates the user's numerical ID, and the fourth field indicates the user's group ID. The fifth field repeats the login name here, but it also serves as a comment field. Comments are added using certain commands (discussed later). The sixth field identifies the user's home directory, which is /home/peter. The seventh field identifies the user's default shell, which is /bin/bash.
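To make the seven fields concrete, here is a sketch that splits the example line for peter into named variables. The variable names are my own; the line itself is the made-up entry shown earlier. Setting IFS to a colon just for the read command tells Bash to split on colons:

```shell
entry="peter:x:1000:1000:peter,,,:/home/peter:/bin/bash"

# Split the line on colons into seven named fields.
IFS=: read -r login password uid gid comment home shell <<< "${entry}"

echo "Login name:     ${login}"
echo "Password field: ${password}"
echo "User ID:        ${uid}"
echo "Group ID:       ${gid}"
echo "Comment:        ${comment}"
echo "Home directory: ${home}"
echo "Default shell:  ${shell}"
```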

The user name or comment field merely repeats the login name here, but it can hold specific types of information. We can add comments using the chfn command. Comments include the user's full name, their home and work phone numbers, their office or room number, and so forth. To add a full name to user peter's account, we use the -f option:

sudo chfn -f "Peter Parker" peter

The /etc/passwd file is a standard Linux file, but some things will change depending on the Linux distribution. For example, the user and group IDs above start at 1000 because peter is the first human account on the system. This is a common starting numerical ID nowadays, but it could be different on other Linux distributions. The home directory could be different on other systems, too; for example, the default could be located at /usr/home/peter. Also, other shells exist besides bash, like zsh, which is now the default shell on macOS; so other systems may default to different shell environments.

The shadow file

The /etc/passwd file does not contain any passwords but a simple x to mark the password field. Passwords on Linux are stored in /etc/shadow and are hashed with sha512, which is indicated by $6$. You need to be root to examine the shadow file or use sudo:

The fields are (see man 5 shadow):

  • login name (username)
  • encrypted password
  • date of last password change (in days since 1/1/1970)
  • minimum password age
  • maximum password age
  • password warning period
  • password inactivity period
  • account expiration date
  • a reserved field
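The date fields in this file are stored as a count of days since 1/1/1970, which is not easy to read. Here is a sketch that converts such a count into a calendar date. The value 19216 is made up, and the sketch assumes GNU date (the version on Ubuntu), which accepts a number of seconds after an @ sign:

```shell
# A made-up value for the date of last password change.
last_change_days=19216

# Convert days to seconds, then ask date to format that moment in UTC.
last_change_date="$(date -u -d "@$((last_change_days * 86400))" +%F)"
echo "Password last changed on ${last_change_date}"
```

On a real system, the chage command can do this conversion for us: sudo chage -l peter lists the password aging information for the user peter in a readable form.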

The /etc/shadow file should not be directly edited. To set, for example, a warning that a user's password will expire, we would use the passwd command (see man passwd for options). The following command would make it so the user peter is warned that their password will expire in 14 days:

sudo passwd -w 14 peter

The group file

The /etc/group file holds group information about the entire system (see man 5 group). The file can be viewed by anyone on a system, by default, but there is also a groups command (see man groups) that will return the groups for a user. Running the groups command by itself will return your own memberships.

Management Tools

There are different ways to create new users and groups, and the following list includes most of the utilities to help with this. Note that, based on the names of the utilities, some of them are repetitive.

  • useradd (8) - create a new user or update default new user information
  • usermod (8) - modify a user account
  • userdel (8) - delete a user account and related files
  • groupadd (8) - create a new group
  • groupdel (8) - delete a group
  • groupmod (8) - modify a group definition on the system
  • gpasswd (1) - administer /etc/group and /etc/gshadow
  • adduser.conf (5) - configuration file for adduser(8) and addgroup(8)
  • adduser (8) - add a user or group to the system
  • deluser (8) - remove a user or group from the system
  • delgroup (8) - remove a user or group from the system
  • chgrp (1) - change group ownership

The numbers within parentheses above indicate the man section. Therefore, to view the man page for the userdel command:

man 8 userdel

Practice

Modify default new user settings

Let's modify some default user account settings for new users, and then we'll create a new user account.

Before we proceed, let's review a directory and a configuration file that establish some default settings:

  • /etc/skel
  • /etc/adduser.conf

The /etc/skel directory serves as a template for the home directories of new users. Whatever files or directories exist in /etc/skel at the time a new user account is created are copied into the new user's home directory. We can view what those are using the following command:

ls -a /etc/skel/

The /etc/adduser.conf file defines the default parameters for new users. It's in this file where the default starting user and group IDs are set, where the default home directory is located (e.g., in /home/), where the default shell is defined (e.g., /bin/bash), where the default permissions are set for new home user directories (e.g., 0755) and more.

Let's change some defaults for /etc/skel. We need to use sudo [command] or use su to become the root user. I prefer to use sudo [command] since this is a bit safer than becoming root. Let's edit the default .bashrc file:

sudo nano /etc/skel/.bashrc

We want to add these lines at the end of the file. This file is a configuration file for /bin/bash, and will be interpreted by Bash. Therefore, lines starting with a hash mark are comments:

# Dear New User,
#
# I have made the following settings
# to make your life a bit easier:
#
# make "c" a shortcut for "clear"
alias c='clear'

Use nano again to create a README file. This file will be added to the home directories of all new users. Add any welcome message you want to add, plus any guidelines for using the system.

sudo nano /etc/skel/README

Add new user account

After writing (saving) and exiting nano, we can go ahead and create a new user named linus.

sudo adduser linus

We'll be prompted to enter a password for the new user, plus comments (full name, phone number, etc). Any of these can be skipped by pressing enter. You can see from the output of the grep command below that I added some extra information:

grep "linus" /etc/passwd
linus:x:1003:1004:Linus Torvalds,333,555-123-4567,:/home/linus:/bin/bash

Let's modify the minimum days before the password can be changed, and the maximum days of the password's lifetime:

sudo passwd -n 90 linus
sudo passwd -x 180 linus

You can see these values by grepping the shadow file:

sudo grep "linus" /etc/shadow

Add users to a new group

Because of the default configuration defined in /etc/adduser.conf, the linus user only belongs to a group of the same name. Let's create a new group that both linus and peter belong to. For that, we'll use the -a option for the gpasswd command. We'll also make the user peter the group administrator using the -A option (see man gpasswd for more details).

sudo groupadd developers
sudo gpasswd -a peter developers
sudo gpasswd -A peter developers
sudo gpasswd -a linus developers
grep "developers" /etc/group

Note: if a user is logged in when you add them to a group, they need to logout and log back in before the group membership goes into effect.
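The line that grep returns from /etc/group has four colon-separated fields: the group name, a password placeholder, the group ID, and a comma-separated list of members. The sketch below splits a hypothetical line into those parts. The GID of 1005 is made up, and yours will likely differ:

```shell
# A hypothetical /etc/group entry for the developers group.
entry="developers:x:1005:peter,linus"

# Split the entry on colons into its four fields.
IFS=: read -r group_name password gid members <<< "${entry}"

# Split the comma-separated member list into an array.
IFS=, read -r -a member_list <<< "${members}"

echo "${group_name} (GID ${gid}) has ${#member_list[@]} members:"
for member in "${member_list[@]}" ; do
  echo "  ${member}"
done
```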

Create a shared directory

One of the benefits of group membership is that members can work in a shared directory.

Let's make the /srv/developers a shared directory. The /srv directory already exists, so we only need to create the developers subdirectory:

sudo mkdir /srv/developers

We'll have to change the default permissions, which are currently set to 0755:

ls -ld /srv
ls -ld /srv/developers

Now we can change ownership of the directory:

sudo chgrp developers /srv/developers

The directory ownership should now reflect that it's owned by the developers group:

ls -ld /srv/developers

In order to allow group members to read and write to the above directory, we need to use the chmod command in a way we haven't yet. Specifically, we add a leading 2, which sets the group identity (setgid) bit so that new files created in the directory inherit the directory's group. The 770 indicates that the user and group owners of the directory have read, write, and execute permissions for the directory:

sudo chmod 2770 /srv/developers

Now either linus or peter can add, modify, and delete files in the /srv/developers directory.
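If you want to see what the leading 2 does without using sudo, here is a sketch that applies the same mode to a directory in a temporary location. It assumes mktemp is available; the names workdir, shared, and setgid_status are made up:

```shell
workdir="$(mktemp -d)"
mkdir "${workdir}/shared"
chmod 2770 "${workdir}/shared"

# The -g test is true when the setgid bit is set.
if [[ -g "${workdir}/shared" ]] ; then
  setgid_status="setgid is set"
else
  setgid_status="setgid is not set"
fi
echo "${setgid_status}"

# In the long listing, an "s" in the group's execute position marks setgid.
ls -ld "${workdir}/shared"
```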

User account and group deletion

You can keep the additional user and group on your system, but know that you can also remove them. The deluser and delgroup commands offer great options and may be preferable to the other utilities (see man deluser or man delgroup).

If we want to delete the new user's account and the new group, these are the commands to use. The first command will create an archival backup of linus' home directory and also remove the home directory and any files in it.

sudo deluser --backup --remove-home linus
sudo delgroup developers

Managing Software

Introduction

Many modern Linux distributions offer some kind of package manager to install, manage, and remove software. These package management systems interact with curated and audited central repositories of software that are collected into packages. They also provide a set of tools to learn about the software that exists in these repositories.

If package management seems like an odd concept to you, it's just a way to manage software installation, and it's very similar to the way that Apple and Google distribute software via the App Store and Google Play.

On Debian based systems, which includes Ubuntu, we use apt, apt-get, and apt-cache to manage most software installations. For most cases, you will simply want to use the apt command, as it is meant to combine the functionality commonly used with apt-get and apt-cache.

We can also install software from source code or from pre-built binaries. On Debian and Ubuntu, for example, we might want to install (if we trust it) pre-built binaries distributed on the internet as .deb files. These are comparable to .dmg files for macOS and to .exe files for Windows. When installing .deb files, though, we need to use the dpkg command.

Installing software from source code often involves compiling the software. It's usually not difficult to install software this way, but it can become complicated to manage software that's installed from source code simply because it means managing dependencies and keeping a close eye on new versions of the software.

Another way to install software (I know, there's a lot) is to use the snap command. This is a newer way of packaging programs that involves packaging all of a program and all of its dependencies into a single container. The main point of snap seems to be aimed at IoT and embedded devices, but it's perfectly usable and preferable (in some scenarios) on the desktop because the general aim is end users and not system administrators. See the snap store for examples.

You might also want to know that some programming languages provide their own mechanisms to install packages. In many cases, these packages may be installed with the apt command, but the packages that apt will install tend to be older (but more stable) than the packages that a programming language will install. For example, Python has the pip or pip3 command to install and remove Python libraries. The R programming language has the install.packages(), remove.packages(), and the update.packages() commands to install R libraries.

Despite all these ways to install, manage, remove, and update software, we will focus on using the apt command, which is pretty straightforward.

APT

Let's look at the basic apt commands.

apt update

Before installing any software, we need to update the index of packages that are available for the system.

sudo apt update

apt upgrade

The above command will also state if there is software on the system that is ready for an upgrade. If any upgrades are available, we run the following command:

sudo apt upgrade

We may know a package's name when we're ready to install it, but we also may not. To search for a package, we use the following syntax:

apt search [package-name]

Package names will never have spaces between words. Rather, if a package name has more than one word, each word will be separated by a hyphen.

In practice, say I'm curious if there are any console based games:

apt search ncurses game

I added ncurses to my search query because the ncurses library is often used to create console-based applications.

apt show

The above command returned a list that included a game called ninvaders, which seems to be a console-based Space Invaders like game. To get additional information about this package, we use the apt show [package-name] command:

apt show ninvaders

apt install

It's quite simple to install the package called ninvaders:

sudo apt install ninvaders

apt remove or apt purge

To remove an installed package, we can use either the apt remove or the apt purge commands. Sometimes when a program is installed, configuration files get installed with it in the /etc directory. The apt purge command will remove those configuration files but the apt remove command will not. Both commands are offered because sometimes it is useful to keep those configuration files.

sudo apt remove ninvaders

Or:

sudo apt purge ninvaders

apt autoremove

Most large programs require other software to run. These other packages are called dependencies. The apt show [package-name] command will list a program's dependencies. However, when we remove software with the prior two commands, the dependencies, even if no longer needed, are not necessarily removed. To remove them (which frees up disk space), we do:

sudo apt autoremove

apt history

Unfortunately, the apt command does not provide a way to get a history of how it's been used on a system, but a log of its activity is kept. We can review that log with the following command:

less /var/log/apt/history.log
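The log can get long, so it is often handy to pull out only the lines that record which commands were run. Those lines begin with Commandline:. The sketch below runs grep against a small sample file so that it works anywhere. The sample entries and the variable names are made up; on your VM you would point grep at /var/log/apt/history.log instead:

```shell
# Create a small, made-up sample in the style of the apt history log.
sample_log="$(mktemp)"
cat > "${sample_log}" <<'EOF'
Start-Date: 2022-08-12  10:00:00
Commandline: apt install ninvaders
End-Date: 2022-08-12  10:00:05
EOF

# Keep only the lines that record the commands that were run.
commands_run="$(grep "^Commandline:" "${sample_log}")"
echo "${commands_run}"

rm -f "${sample_log}"
```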

Daily Usage

This all may seem complicated, but it's really not. For example, to keep my systems updated, I run the following two commands on a daily or near daily basis:

sudo apt update
sudo apt upgrade

Conclusion

There are a variety of ways to install software on a Linux or Ubuntu system. The common way to do it on Ubuntu is to use the apt command, which was covered in this section.

We'll come back to this command often because we'll soon install and set up a complete LAMP (Linux, Apache, MariaDB, and PHP) server. Until then, I encourage you to read through the manual page for apt:

man apt

Using systemd

Introduction

When computers boot up, obviously some software manages that process. On Linux and other Unix or Unix-like systems, this is usually handled via an init system. For example, macOS uses launchd and many Linux distributions, including Ubuntu, use systemd.

systemd does more than handle the startup process. It also manages various services and connects the Linux kernel to various applications. In this section, we'll cover how to use systemd to manage services and to review log files.

Manage Services

When we install complicated software, like a web server (e.g., Apache2, Nginx), an SSH server (e.g., OpenSSH), or a database server (e.g., MariaDB or MySQL), then it's helpful to have commands that manage that service (the web service, the SSH service, the database service, etc).

For example, the ssh service is installed by default on our gcloud servers, and we can check its status with the following systemctl command:

systemctl status ssh

The output tells us a few things. The line beginning with Loaded tells us that the SSH service is configured. At the end of that line, it also tells us that it is enabled, which means that the service will automatically start when the system gets rebooted or starts up.

The line beginning with Active tells us that the service is active (running) and for how long. It has to say this since I'm connecting to the machine using ssh. If the service were not active (running), then I wouldn't be able to log in remotely. We also can see the process ID (PID) for the service as well as how much memory it's using.

At the bottom of the output, we can see the most recent log entries. We can view more of those log entries using the journalctl command. By default, running journalctl by itself will return all log entries, but we can specify that we're interested in log entries only for the ssh service. We can specify using the PID number. Replace NNN with the PID number attached to your ssh service:

journalctl _PID=NNN

Or we can specify by service, or more specifically, its unit name:

journalctl -u ssh

Use Cases

Later we'll install the Apache web server, and we will use systemctl to manage some aspects of this service.

In particular, we will use the following commands to:

  1. check the state of the Apache service,
  2. configure the Apache service to auto start on reboot,
  3. start the service,
  4. reload the service after editing its configuration files, and
  5. stop the service.

In order, these work out to:

systemctl status apache2
sudo systemctl enable apache2
sudo systemctl start apache2
sudo systemctl reload apache2
sudo systemctl stop apache2

systemctl is a big piece of software, and there are other arguments the command will take. See man systemctl for details.

Examine Logs

As mentioned, the journalctl command is part of the systemd software suite, and it is used to monitor system logs.

It's really important to monitor system logs. They help identify any problems in the system or with various services. For example, by monitoring the log entries for ssh, I can see all the attempts to break into the server. Or if the Apache2 web server malfunctions for some reason, which might be because of a configuration error, the logs will indicate how to identify the problem.

If we type journalctl at the command prompt, we are presented with the logs for the entire system. These logs can be paged through by pressing the space bar, the page up/page down keys, or the up/down arrow keys, and they can also be searched by pressing the forward slash / and then entering a search keyword. To exit out of the pager, press q to quit.

journalctl

It's much more useful to specify the field and to declare an option when using journalctl. See the following man pages for details:

man systemd.journal-fields
man journalctl

There are many fields and options we can use, but as an example, we see that there is an option to view the more recent entries first (which is not the default):

journalctl -r

Or we can view log entries in reverse order, for the current user's services, and since the last boot with the following options:

journalctl -r --user -b 0

Or for the system:

journalctl -r --system -b 0

I can look more specifically at the log entries for a service by using the -u option with journalctl:

journalctl -u apache2

I can follow the logs in real-time (press ctrl-c to quit the real-time view):

journalctl -f

Useful Systemd Commands

You can see more of what systemctl or journalctl can do by reading through their documentation:

man systemctl
man journalctl

You can check if a service is enabled:

systemctl is-enabled apache2

You can reboot, poweroff, or suspend a system (suspending a system mostly makes sense for laptops and not servers):

systemctl reboot
systemctl poweroff
systemctl suspend

To show configuration file changes to the system:

systemd-delta

To list real-time control group process, resource usage, and memory usage:

systemd-cgtop

To search for failed processes/services:

systemctl --state failed

To list services:

systemctl list-unit-files -t service

To examine boot time:

systemd-analyze

Conclusion

This is a basic introduction to systemd, which is a suite of software that helps us boot a system, manage services, and monitor logs.

We'll put what we've learned into practice when we set up our LAMP servers.

Networking and Security

Even if we do not work as network administrators, system administrators need to know network basics. In this section, we cover TCP/IP and other protocols related to the internet protocol suite, how to protect our systems both locally and from external threats, and how to create backups of our systems in case of disaster.

Networking and TCP/IP

An important function of a system administrator is to set up, configure, and monitor a network. This may range from planning, configuring, and connecting the devices on a local area network, to planning and implementing a large network that interfaces with an outside network, to monitoring networks for various sorts of attacks, such as denial of service attacks.

In order to prepare for this type of work, we need at least a basic understanding of how the internet works and how local devices interact with the internet. In this section, we will focus mostly on internet addressing, but we will also devote some space to TCP and UDP, two protocols for transmitting data.

Connecting two or more devices together nowadays involves the TCP/IP or the UDP/IP protocols, which are part of the internet protocol suite. This suite is a kind of expression of the more generalized OSI communication model.

The internet protocol suite is generally framed as a series of layers beginning with a lower layer, the link layer, that interfaces with internet capable hardware, to the highest layer, the application layer.

The link layer describes the local area network. Devices connected locally, e.g., via Ethernet cables, comprise the link layer. The link layer connects to the internet layer. Data going into or out of a local network must be negotiated between these two layers.

The internet layer makes the internet possible by enabling the transmission of data among multiple networks. (The internet is, in fact, a network of networks). The primary characteristic of the internet layer is the IP address, which currently comes in two versions: IPv4 (32 bit) and IPv6 (128 bit). IP addresses are used to locate hosts on a network.

The transport layer makes the exchange of data on the internet possible. There are two dominant protocols attached to this layer: UDP and TCP. Very generally, UDP is used when the integrity of data is less important than how quickly it reaches its destination. For example, streaming video, VOIP, and online gaming are often transported via UDP because the loss of some pixels or some audio is acceptable. TCP is used when the integrity of the data is important. If the data cannot be transmitted without error, then the data won't reach its final destination until the error is corrected.

The application layer provides the ability to use the internet in particular ways. For example, the HTTP protocol enables the web, which is simply an application on the internet. The SMTP, IMAP, and POP protocols control email exchange. DNS is a system that maps IP addresses to domain names. In this book, we use SSH, also part of the application layer, to connect to remote computers.

The term application here simply means that these protocols provide the functionality for applications. They are not themselves considered user applications, like a web browser.

The Internet Protocol Suite

ARP (Address Resolution Protocol)

ARP (Address Resolution Protocol) is a protocol at the link layer and is used to map network addresses, like an IP address, to ethernet addresses, also called the MAC (Media Access Control) address or the hardware address. Routers use MAC addresses to enable communication inside networks (within subnets or local area networks) so that computers within a local network can talk to each other. Networks are designed so that IP addresses are associated with MAC addresses before systems can communicate over a network. Every one of your internet capable devices, whether your smartphone, your laptop, or your internet connected toaster, has a MAC address.

To get ARP info for a system, we use the ip command, which uses regular options (like -b) and names specific objects. To get the MAC address for a specific computer, we can use the following command, where ip is the command and a or link are considered objects (see man ip for details):

ip a

On my home system, the above command produces three numbered sections of output. The first section refers to the lo or loopback device. This is a special device that allows the computer to communicate with itself. It always has an IPv4 address of 127.0.0.1. The next section on my home machine refers to the ethernet card. Currently, I'm connected via wifi, and so this section reports the MAC address for that ethernet card plus some other information, such as whether the device is down or up. Since there's no physical cable connecting my machine to the router, this section reports DOWN. The third section on my home system refers to the wifi card. Since this is UP (or active), it reports the internal IP address (e.g., 192.168.0.4), plus the MAC address, and other details. The internal address is different from the machine's external address, which might be something like 159.3.45.2.

We can get just the link object information with the following command:

ip link

The following two commands help identify parts of the local network (or subnet) and the routing table.

ip neigh
ip route

The ip neigh command produces the ARP cache, basically the other systems your system is aware of on the local network. The ip route command displays the routing table and can also be used to define how data is routed on the network. Both of these commands are more commonly used on Linux-based routers.

These details enable the following scenario: A router gets configured to use a specific network address when it's brought online. It searches the subnetwork for connected MAC addresses that are assigned to wireless cards or ethernet cards. It then assigns each of those MAC addresses an available IP address based on the network address.

Internet Layer

IP (Internet Protocol)

The Internet Protocol, or IP, address is used to uniquely identify a host on a network and place that host at a specific location (its IP address). If that network is subnetted (i.e., routed), then a host will have a subnet or private IP address that will not be directly exposed to the Internet. Remember this: there are public IP addresses that are distinct from private IP addresses. Public IP addresses are accessible on the internet. Private IP addresses are not, but they are accessible on subnets or local area networks.

Private IP address ranges are reserved address ranges, which means no public internet device will have an IP address within these ranges. The private address ranges include:

Start Address   End Address
10.0.0.0        10.255.255.255
172.16.0.0      172.31.255.255
192.168.0.0     192.168.255.255

If you have a router at home, and look at the IP address for any of your devices connected to that router, like your phone or computer, you will see that it will have an address within one of the ranges above. For example, it might have an IP address beginning with 192.168.X.X. This is a standard IP address range for a home router. The 10.X.X.X private range can assign many more IP addresses on its network, which is why you'll see that IP range on bigger networks, like a university's network. We'll talk more about subnetwork sizes shortly.
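To make the ranges above concrete, here is a minimal sketch that checks whether an IPv4 address falls inside one of them. The is_private function is a hypothetical helper written for this example, not a standard command, and it assumes a well-formed dotted address. It only tests the three private ranges, so other special addresses (like the loopback address) are simply reported as not private.

```shell
#!/bin/sh
# is_private: hypothetical helper (not a standard tool) that reports
# whether an IPv4 address falls in one of the three private ranges.
is_private() (
    # Split the dotted address into its four octets.
    IFS=.
    set -- $1
    a=$1 b=$2
    if [ "$a" -eq 10 ]; then
        echo "private"          # 10.0.0.0 to 10.255.255.255
    elif [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then
        echo "private"          # 172.16.0.0 to 172.31.255.255
    elif [ "$a" -eq 192 ] && [ "$b" -eq 168 ]; then
        echo "private"          # 192.168.0.0 to 192.168.255.255
    else
        echo "not private"
    fi
)

is_private 192.168.0.4
is_private 159.3.45.2
```

Only the first two octets need to be checked because each private range is defined by its leading octets.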

Example Private IP Usage

Let's say my office computer's IP address is 10.163.34.59/24 via a wired connection. My office neighbor has an IP address of 10.163.34.65/24 via their wired connection. Both IP addresses are private because they fall within the 10.0.0.0 to 10.255.255.255 range. And it's likely they both exist on the same subnet since they share the first three octets: 10.163.34.XX.

However, if we both, using our respective wired connected computers, searched Google for what's my IP address, we would see that we share the same public IP address, which will be something like 128.163.8.25. That is a public IP address because it does not fall within the ranges listed above.

Without any additional information, therefore, we know that all traffic coming from our computers and going out to the internet looks like it's coming from the same IP address (128.163.8.25). And in reverse, all traffic coming from outside our network first goes to 128.163.8.25 before it's routed to our respective computers via the router.

Let's say I also have a laptop in my office, and that it has a wireless connection. When I check with ip a, I find that the laptop has the IP address 10.47.34.150/16. You can see there's a different pattern with this IP address. The reason it has a different pattern is because this laptop is on a different subnet even though it's physically sitting next to the wired computer. This wireless subnet was configured to allow more hosts to connect to it since it must allow for more devices (i.e., laptops, phones, etc). When I search Google for my IP address from this laptop, it reports 128.163.238.148, indicating that UK (the University of Kentucky) owns a range of public IP addresses.

Here's a visual diagram of what this network looks like:

Network diagram
Fig. 1. This figure contains a network switch, which is used to route traffic within a subnet. The switch relies solely on MAC addresses and not IP addresses to determine the location of devices on its subnet. The router acts as the interface between the private network and the public network and is managing two subnets: a wired and a wireless one.

Using the ip Command

The ip command can do more than provide us information about our network. We can also use it to turn a connection to the network on or off (and more). Here we disable and then enable a connection on a machine. Note that enp0s3 is the name of my network card/device. Yours might have a different name.

sudo ip link set enp0s3 down
sudo ip link set enp0s3 up

Transport Layer

The internet layer does not transmit content, like web pages or video streams. This is the work of the transport layer. As discussed previously, the two most common transport layer protocols are TCP and UDP.

TCP, Transmission Control Protocol

TCP or Transmission Control Protocol is responsible for the transmission of data and for making sure the data arrives at its destination without errors. If there are errors, the data is re-transmitted or halted in case of some failure. Much of the data sent over the internet is sent using TCP.

UDP, User Datagram Protocol

The UDP or User Datagram Protocol performs a similar function to TCP, but it does not error check and data may get lost. UDP is useful for conducting voice over internet calls or for streaming video, such as through YouTube, which uses a type of UDP transmission called QUIC that has built-in encryption.

TCP and UDP Headers

The above protocols send data in TCP packets or UDP datagrams, but these terms may be used interchangeably. Packets for both protocols include header information to help route the data across the internet. TCP includes ten fields of header data, and UDP includes four fields.

We can see this header data using the tcpdump command, which requires sudo or being root to use. The first part of the IP header contains the source address, then comes the destination address, and so forth. Aside from a few other parts, this is the primary information in an IP header.

You should use tcpdump on your local computer and not on your gcloud instance. First we identify the IP number of a host, which we can do with the ping command, and then run tcpdump:

ping -c1 www.uky.edu
sudo tcpdump host 128.163.35.46

While that's running, we can type that IP address in our web browser, or enter www.uky.edu, and watch the output of tcpdump.

TCP headers include port information and other mandatory fields for both source and destination servers. The SYN, or synchronize, message is sent when a source or client requests a connection. The ACK, or acknowledgment, message is sent in response, along with a SYN message, to acknowledge the request for a connection. Then the client responds with an additional ACK message. This is referred to as the TCP three-way handshake. In addition to the header info, TCP and UDP packets include the data that's being sent (e.g., a webpage) and error checking if it's TCP.

Ports

TCP and UDP connections use ports to bind internet traffic to specific IP addresses. Specifically, a port number, which is carried in the TCP or UDP header, associates traffic with the process behind an application, such as a web service or outgoing email. That is, ports provide a way to distinguish and filter internet traffic (web, email, etc) through an IP address. For example, traffic going to 10.0.5.33:80 is HTTP traffic for the web service, since HTTP is commonly associated with port 80. Note that the port info is attached to the end of the IP address via a colon.

Common ports include:

  • 21: FTP
  • 22: SSH
  • 25: SMTP
  • 53: DNS
  • 80: HTTP
  • 143: IMAP
  • 443: HTTPS
  • 587: SMTP Secure
  • 993: IMAP Secure

There's a complete list of the 318 default ports/protocols on your Linux systems. It's located in the following file:

less /etc/services

And to get a count of the ports, we can invert grep to exclude lines that start with a pound sign or are empty:

grep -Ev "^#|^$" /etc/services | wc -l
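The same file can also answer the question of which service is registered for a given port. The following is a minimal sketch: service_for_port is a hypothetical helper name, and the optional third argument, which names the file to search, is an assumption added here so that the function can be tried against a copy of the file.

```shell
#!/bin/sh
# service_for_port: hypothetical helper that prints the service name(s)
# registered for a port number and protocol.
# Usage: service_for_port PORT PROTOCOL [FILE]
# FILE defaults to /etc/services.
service_for_port() {
    # In each entry, field 1 is the service name and field 2 looks
    # like "22/tcp". Print the name when field 2 matches exactly.
    awk -v want="$1/$2" '$2 == want { print $1 }' "${3:-/etc/services}"
}

# Only run the examples if the file exists on this system.
if [ -r /etc/services ]; then
    service_for_port 22 tcp
    service_for_port 53 udp
fi
```

Matching on the whole second field avoids false hits, since a plain search for 22 would also match ports like 220 or 2222.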

See also the Wikipedia page: List of TCP and UDP port numbers

IP Subnetting

Let's now return to the internet layer and discuss one of the major duties of a systems administrator: subnetting.

Subnets are used to carve out smaller and more manageable subnetworks out of a larger network. They are created using routers that have this capability (e.g., commercial use routers) and certain types of network switches.

Private IP Ranges

When subnetting local area networks, we work with the private IP ranges:

Start Address   End Address
10.0.0.0        10.255.255.255
172.16.0.0      172.31.255.255
192.168.0.0     192.168.255.255

It's important to be able to work with IP addresses like those listed above in order to subnet; and therefore, we will need to learn a bit of IP math along the way.

IP Meaning

An IPv4 address is 32 bits (8 x 4), or four bytes, in size. In human readable context, it's usually expressed in the following, decimal-based, notation style:

  • 192.168.1.6
  • 172.16.3.44

Each set of numbers separated by a dot is referred to as an octet. An octet is a group of 8 bits. Eight bits equal a single byte. By implication, 8 gigabits equals 1 gigabyte, and 8 megabits equals 1 megabyte. We use these symbols to note the terms:

Term    Symbol
bit     b
byte    B
octet   o

Each bit is represented by either a 1 or a 0. For example, the first address above in binary is:

  • 11000000.10101000.00000001.00000110 is 192.168.1.6

Or:

Byte       Decimal Value
11000000   192
10101000   168
00000001   1
00000110   6

IP Math

When doing IP math, one easy way to do it is to simply remember that each bit in each of the above bytes is a placeholder for the following values:

128 64 32 16 8 4 2 1

Alternatively, from low to high:

base-2   Output
2^0      1
2^1      2
2^2      4
2^3      8
2^4      16
2^5      32
2^6      64
2^7      128

In binary, 192 is equal to 11000000. It's helpful to work backward. For IP addresses, all octets are 255 or less (256 total, from 0 to 255) and therefore do not exceed 8 bits or places. To convert the integer 192 to binary:

1 * 2^7 = 128
1 * 2^6 =  64 (128 + 64 = 192)

Then STOP. There are no values left, and so the rest are zeroes. Thus: 11000000

Our everyday counting system is base-10, but binary is base-2, and thus another way to convert binary to decimal is to multiply each bit (1 or 0) by the power of base two of its placeholder and then add the results:

(0 * 2^0) =   0
(0 * 2^1) =   0
(0 * 2^2) =   0
(0 * 2^3) =   0
(0 * 2^4) =   0
(0 * 2^5) =   0
(1 * 2^6) =  64
(1 * 2^7) = 128
            ---
Sum       = 192
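The sum of placeholder values can be sketched as a small shell function. The name bin2dec is hypothetical and chosen only for this example. It walks the bits from right to left, starting at the 2^0 placeholder, and adds the placeholder's value whenever the bit is 1.

```shell
#!/bin/sh
# bin2dec: hypothetical helper that converts a string of bits to decimal
# by multiplying each bit by the value of its placeholder and summing.
bin2dec() (
    bits=$1
    place=1     # 2^0, the value of the rightmost placeholder
    total=0
    while [ -n "$bits" ]; do
        rest=${bits%?}          # everything before the last character
        bit=${bits#"$rest"}     # the last character (lowest remaining bit)
        total=$(( total + bit * place ))
        place=$(( place * 2 ))  # move one placeholder to the left
        bits=$rest
    done
    echo "$total"
)

bin2dec 11000000
bin2dec 10101000
```

The two calls print 192 and 168, the first two octets of 192.168.1.6.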

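Another way to convert to binary: work through the placeholder values from largest to smallest and try to subtract each one from what remains of the number. If the placeholder is less than or equal to what remains, then the bit equals 1 and we subtract it. Otherwise, the bit equals 0 and we move to the next placeholder. So: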

  • 192 - 128 = 64 -- therefore the first bit is equal to 1.
  • Now take the leftover and subtract it:
  • 64 - 64 = 0 -- therefore the second bit is equal to 1.

Since there is nothing remaining, the rest of the bits equal 0.
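Here is a minimal sketch of the subtraction method as a shell function. The name dec2bin is hypothetical, and the function assumes its argument is a single octet, that is, a whole number from 0 to 255.

```shell
#!/bin/sh
# dec2bin: hypothetical helper that converts one octet (0 to 255) to
# eight bits using the subtraction method.
dec2bin() (
    remainder=$1
    bits=""
    for place in 128 64 32 16 8 4 2 1; do
        if [ "$remainder" -ge "$place" ]; then
            bits="${bits}1"                     # placeholder fits: bit is 1
            remainder=$(( remainder - place ))
        else
            bits="${bits}0"                     # placeholder too big: bit is 0
        fi
    done
    echo "$bits"
)

dec2bin 192
dec2bin 6
```

For 192, the loop subtracts 128 and then 64, which leaves nothing, and so the remaining six bits are all 0.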

Subnetting Examples

Subnetting involves dividing a network into two or more subnets. When we subnet, we first identify the number of hosts, aka, the size, we will require on the subnet. For starters, let's assume that we need a subnet that can assign at most 254 IP addresses to the devices attached to it via the router.

In order to do this, we need two additional IP addresses: the subnet mask and the network address/ID. The network address identifies the network and the subnet mask marks the boundary between the network and the hosts. Knowing or determining the subnet mask allows us to determine how many hosts can exist on a network. Both the network address and the subnet mask can be written as IP addresses, but these IP addresses cannot be assigned to computers on a network.

When we have determined these IPs, we will know the broadcast address. This is the last IP address in a subnet range, and it also cannot be assigned to a connected device/host. The broadcast address is used by a router to communicate to all connected devices on the subnet.

For our sake, let's work through this process backwards; that is, we want to identify and describe a network that we are connected to. Let's work with two example private IP addresses that exist on two separate subnets.

Example IP Address 1: 192.168.1.6

Using the private IP address 192.168.1.6, let's derive the network mask and the network address (or ID) from this IP address. First, convert the decimal notation to binary. State the mask, which is /24, or 255.255.255.0. And then derive the network address using a bitwise logical AND operation:

11000000.10101000.00000001.00000110 IP              192.168.1.6
11111111.11111111.11111111.00000000 Mask            255.255.255.0
-----------------------------------
11000000.10101000.00000001.00000000 Network Address 192.168.1.0

Note the mask has 24 ones followed by 8 zeroes. That 24 is used as CIDR notation:

  • 192.168.1.6/24

For Example 1, we thus have the following subnet information:

Type           IP
Netmask/Mask   255.255.255.0
Network ID     192.168.1.0
Start Range    192.168.1.1
End Range      192.168.1.254
Broadcast      192.168.1.255
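
The bitwise AND used above can also be carried out with shell arithmetic. This is a sketch under stated assumptions rather than a standard tool: subnet_info and int_to_ip are hypothetical helper names, the address is assumed to be well formed with no leading zeroes in its octets, and the prefix length is assumed to be between 1 and 32.

```shell
#!/bin/sh
# int_to_ip: hypothetical helper that prints a 32-bit integer in
# dotted decimal notation.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# subnet_info: hypothetical helper that prints the network ID and the
# broadcast address for an IPv4 address and a CIDR prefix length.
subnet_info() (
    prefix=$2
    IFS=.
    set -- $1                   # split the address into four octets
    # Pack the four octets into one 32-bit integer.
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    # The mask is $prefix ones followed by zeroes.
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    network=$(( ip & mask ))                         # bitwise AND
    broadcast=$(( network | (~mask & 0xFFFFFFFF) ))  # host bits set to 1
    echo "Network ID: $(int_to_ip "$network")"
    echo "Broadcast:  $(int_to_ip "$broadcast")"
)

subnet_info 192.168.1.6 24
```

The broadcast address is the network address with every host bit set to 1, which is why the mask is inverted before the OR operation.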

Example IP Address 2: 10.160.38.75

For example 2, let's start off with a private IP address of 10.160.38.75 and a mask of /24:

00001010.10100000.00100110.01001011 IP               10.160.38.75
11111111.11111111.11111111.00000000 Mask            255.255.255.0
-----------------------------------
00001010.10100000.00100110.00000000 Network Address   10.160.38.0

Type           IP
Netmask/Mask   255.255.255.0
Network ID     10.160.38.0
Start Range    10.160.38.1
End Range      10.160.38.254
Broadcast      10.160.38.255

Example IP Address 3: 172.16.1.62/24

For example 3, let's start off with a private IP address of 172.16.1.62 and a mask of /24:

10101100.00010000.00000001.00111110 IP                172.16.1.62
11111111.11111111.11111111.00000000 Mask            255.255.255.0
-----------------------------------
10101100.00010000.00000001.00000000 Network Address    172.16.1.0

Type           IP
Netmask/Mask   255.255.255.0
Network ID     172.16.1.0
Start Range    172.16.1.1
End Range      172.16.1.254
Broadcast      172.16.1.255

Determine the Number of Hosts

To determine the number of hosts on a CIDR /24 subnet, we look at the start and end ranges. In all three of the above examples, the start range begins with X.X.X.1 and ends with X.X.X.254. Therefore, there are 254 maximum hosts allowed on these subnets because 1 to 254, inclusive of 1 and 254, is 254.

Example IP Address 4: 10.0.5.23/16

The first three examples show instances where the CIDR is set to /24. This only allows 254 maximum hosts on a subnet. If the CIDR is set to /16, then we can theoretically allow 65,534 hosts on a subnet.

For example 4, let's start off then with a private IP address of 10.0.5.23 and a mask of /16:

00001010.00000000.00000101.00010111 IP Address: 10.0.5.23
11111111.11111111.00000000.00000000 Mask:       255.255.0.0
-----------------------------------------------------------
00001010.00000000.00000000.00000000 Network ID: 10.0.0.0

Type           IP
IP Address     10.0.5.23
Netmask/Mask   255.255.0.0
Network ID     10.0.0.0
Start Range    10.0.0.1
End Range      10.0.255.254
Broadcast      10.0.255.255

Since the last two octets/bytes now vary, we count the possible values of each octet and multiply them. Therefore, the number of hosts is:

Octet    First Value   Last Value   Count
Third    10.0.0.X      10.0.255.X   256
Fourth   10.0.X.0      10.0.X.255   256

  • Number of Addresses = 256 x 256 = 65536
  • Subtract Network ID (1) and Broadcast (1) = 2 IP addresses
  • Number of Usable Hosts = 256 x 256 - 2 = 65534
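
The same count can be computed for other prefix lengths. This is a minimal sketch that assumes the usual rule of reserving the network ID and the broadcast address, and so it is only meaningful for prefixes of /30 or shorter. The usable_hosts function name is hypothetical.

```shell
#!/bin/sh
# usable_hosts: hypothetical helper that prints how many addresses can
# be assigned to hosts on a subnet with the given CIDR prefix length.
usable_hosts() {
    # 32 - prefix bits are left over for hosts. Subtract 2 for the
    # network ID and the broadcast address, which cannot be assigned.
    echo $(( (1 << (32 - $1)) - 2 ))
}

usable_hosts 24
usable_hosts 16
```

These two calls print 254 and 65534, which match the /24 and /16 examples above.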

IPv6 subnetting

We're not going to cover IPv6 subnetting, but if you're interested, this is a nice article: IPv6 subnetting overview

Conclusion

As a systems administrator, it's important to have a basic understanding of how networking works, and the basic models used to describe the internet and its applications. System administrators have to know how to create subnets and defend against various network-based attacks.

In order to acquire a basic understanding, this section covered topics that included:

  • the internet protocol suite
    • link layer
    • internet layer
    • transport layer
  • IP subnetting
    • private IP ranges
    • IP math

In the next section, we extend upon this and discuss the domain name system (DNS) and domain names.

DNS and Domain Names

The DNS (domain name system) is referred to as the phone book of the internet, and it's responsible for mapping IP addresses to memorable names. Thus, instead of having to remember:

https://128.163.35.46

We can instead remember this:

https://www.uky.edu

System administrators need to know about DNS because they may be responsible for administering a domain name system on their network, and/or they may be responsible for setting up and administering web site domains. Either case requires a basic understanding of DNS.

DNS Intro Videos

To help you get started, watch these two YouTube videos. The first one provides an overview of the DNS system:

How a DNS Server (Domain Name System) works

The second video illustrates how to use a graphical user interface to create and manage DNS records.

DNS Records

And here is a nice intro to recursive DNS:

https://www.cloudflare.com/learning/dns/what-is-recursive-dns/

FQDN: The Fully Qualified Domain Name

The structure of the domain name system is like the structure of the UNIX/Linux file hierarchy; that is, it is like an inverted tree.

The fully qualified domain name includes a period at the end of the top-level domain. Your browser is able to supply that dot since we often don't use it when typing website addresses.

Thus, for Google's main page, the FQDN is:

FQDN: www.google.com.

And the parts include:

.           root domain
com         top-level domain
google.     second-level domain
www.        third-level domain
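
To illustrate how the parts are read from right to left, here is a minimal sketch that splits a fully qualified domain name into its levels. The fqdn_levels function is a hypothetical helper written for this example. It uses only shell parameter expansion and does not query DNS.

```shell
#!/bin/sh
# fqdn_levels: hypothetical helper that prints the parts of a fully
# qualified domain name, starting at the root and working leftward.
fqdn_levels() (
    name=${1%.}         # drop the trailing dot that stands for the root
    level=1
    echo "root: ."
    while [ -n "$name" ]; do
        label=${name##*.}               # the text after the last dot
        echo "level $level: $label"
        case $name in
            *.*) name=${name%.*} ;;     # remove that label and its dot
            *)   name="" ;;             # that was the leftmost label
        esac
        level=$(( level + 1 ))
    done
)

fqdn_levels www.google.com.
```

Level 1 is the top-level domain, level 2 is the second-level domain, and so on, which mirrors the order in which DNS servers are consulted.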

This is important to know so that you understand how the Domain Name System works and which DNS servers are responsible for their part of the network.

Root Domain

The root domain is managed by root name servers. These servers are listed on the IANA, the Internet Assigned Numbers Authority, website, but are managed by multiple operators. The root servers manage the root domain, alternatively referred to as the zone, or the . at the end of the .com., .edu., etc.

Alternative DNS Root Systems

It's possible to have alternate internets by using outside root name servers. This is not common, but it happens. Read about a few of them here:

Russia, as an example, has threatened to use its own alternate internet based on a different DNS root system. This would essentially create a large, second internet. You can read about it in this IEEE Spectrum article.

Top Level Domain (TLD)

We are all familiar with top level domains. Specific examples include:

  • generic TLD names:
    • .com
    • .gov
    • .mil
    • .net
    • .org
  • and ccTLD, country code TLDs
    • .ca (Canada)
    • .mx (Mexico)
    • .jp (Japan)
    • .uk (United Kingdom)
    • .us (United States)

We can download a list of those top level names from IANA, and get a total count of 1,487 (as of August 2022):

wget https://data.iana.org/TLD/tlds-alpha-by-domain.txt
sed '1d' tlds-alpha-by-domain.txt | wc -l

Second Level Domain Names

In the Google example, the second level domain is google. The second level domain and the TLD, along with any further subdomains, together form the fully qualified domain name. Other examples include:

  • redhat in redhat.com
  • debian in debian.org
  • wikipedia in wikipedia.org
  • uky in uky.edu
  • twitter in twitter.com

Third Level Domain Names / Subdomains

When you've purchased (leased) a top and second level domain like ubuntu.com, you can choose whether to add third level domains. For example: www is a third level domain or subdomain. If you owned example.org, you could dedicate a machine or a cluster of machines to www.example.org that resolve to a different location, or www.example.org could resolve to the second-level domain itself. That is:

  • www.debian.org can point to debian.org

It could also point to a separate server, such that debian.org and www.debian.org would be two separate servers with two separate websites or services, just like maps.google.com points to a different site than mail.google.com. Both maps and mail are subdomains of google.com. Although this is not common with third-level domains that start with www, it is common with others.

For example, with hostnames that are not www:

  • google.com resolves to www.google.com
  • google.com does not resolve to:
    • drive.google.com, or
    • maps.google.com, or
    • mail.google.com

This is because those other three provide different, but specific services.

DNS Paths

A recursive DNS server, which is usually managed by an ISP, is the first DNS server to be queried in the DNS system. This is the resolver server in the first video above. This server queries itself (recursive) to check if the domain to IP mapping has been cached (remembered/stored) in its system.

If it hasn't been cached, then the DNS query is forwarded to a root server. There are thirteen root servers.

echo {a..m}.root-servers.net.

Those root servers will identify the next server to query, depending on the top level domain (.com, .net, .edu, .gov, etc.). If the site ends in .com or .net, then the next server might be something like: a.gtld-servers.net. Or if the top level domain ends in .edu, then: a.edu-servers.net.. If the top level domain ends in .gov, then: a.gov-servers.net.. And so forth.

Those top level domain servers should know where to send the query next. In many cases, the next path is to send the query to a custom domain server. For example, Google's custom name servers are: ns1.google.com to ns4.google.com. UK's custom name servers are: sndc1.net.uky.edu and sndc2.net.uky.edu. Finally, those custom name servers will know the IP address that maps to the domain.

We can use the dig command to query the non-cached DNS paths. Let's say we want to follow the DNS path for google.com. We can start by querying any root server. In the output, we want to pay attention to the QUERY field, the ANSWER field, and the AUTHORITY SECTION. We keep digging until the ANSWER field returns a number greater than 0. The following commands query one of the root servers, which points us to one of the authoritative servers for .com sites, which points us to Google's custom name server, which finally provides an answer: in fact, six answers, or six IP addresses that all map to google.com.

dig @e.root-servers.net google.com
dig @a.gtld-servers.net google.com
dig @ns1.google.com google.com

Alternatively, we can query UK's:

dig @j.root-servers.net. uky.edu
dig @b.edu-servers.net. uky.edu
dig @sndc1.net.uky.edu. uky.edu

We can also get this path information using dig's +trace option:

dig google.com +trace

There are a lot of ways to use the dig command, and you can test and explore them on your own.
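Since we keep digging until the ANSWER field is greater than 0, it can help to pull that number out of the output. The sketch below assumes the header line looks like the sample shown, which is how dig has printed it on the systems I've used. The name answer_count is hypothetical, and the sample line is made up for illustration.

```shell
# Read dig output on standard input and print the ANSWER count from its header.
# answer_count is a hypothetical helper; the header format is an assumption.
answer_count() {
  sed -n 's/.*ANSWER: \([0-9][0-9]*\),.*/\1/p' | head -n 1
}

# A made-up header line in the style that dig prints:
printf '%s\n' ';; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 27' | answer_count
```

On a machine with network access, we could pipe a real query through it, such as dig @ns1.google.com google.com | answer_count, and move on to the next server whenever the result is 0.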

DNS Record Types

In the dig command output above, you will see various fields.

  • SOA: Start of Authority: describes the site's DNS entries
  • IN: Internet: the record class, which appears in nearly every record
  • NS: Name Server: states which name server provides DNS resolution
  • A: Address record: maps a hostname to an IPv4 address
  • AAAA: Address record: maps a hostname to an IPv6 address

For example, an A record in the output looks like this:

dig google.com
google.com.     IN      A       142.251.32.78

Other record types include:

  • PTR: Pointer Record: provides mapping from IP Address to Hostname
  • MX: Mail exchanger: the MX record maps your email server.
  • CNAME: Canonical name: used so that one domain name may act as an alias for another domain name. For example, if someone visits www.example.org but no separate server is set up for www, then a CNAME record can point www.example.org to example.org.
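PTR records are stored under a special name built from the IP address: the four octets are reversed and in-addr.arpa is appended. Here is a small sketch of that transformation for IPv4 addresses. The name reverse_name is hypothetical.

```shell
# Build the name that a PTR lookup queries for an IPv4 address.
# reverse_name is a hypothetical helper; it prints nothing for malformed input.
reverse_name() {
  # Split the four octets on dots, then print them in reverse order.
  echo "$1" | awk -F. 'NF == 4 { print $4 "." $3 "." $2 "." $1 ".in-addr.arpa." }'
}

reverse_name 128.163.35.46
```

We rarely need to build this name by hand because dig -x 128.163.35.46 constructs it for us, but seeing it helps make sense of the QUESTION SECTION in the output of a reverse lookup.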

DNS Toolbox

It's important to be able to troubleshoot DNS issues. To do that, we have a few utilities available. Here are examples and you should read the man pages for each one:

host: resolve hostnames to IP Address; or IP addresses to hostnames

man -f host
host (1) - DNS lookup utility
host uky.edu
host 128.163.35.46
host -t MX uky.edu
host -t MX dropbox.com
host -t MX netflix.com
host -t MX wikipedia.org

dig: domain information groper -- get info on DNS servers

man -f dig
dig (1) - DNS lookup utility
dig uky.edu
dig uky.edu MX
dig www.uky.edu CNAME

nslookup: query internet name servers

man -f nslookup
nslookup (1) - query Internet name servers interactively
nslookup
> uky.edu
> yahoo.com
> exit

whois: determine ownership of a domain

man -f whois
whois (1) - client for the whois directory services
whois uky.edu | less

resolv.conf: local resolver configuration; shows which DNS servers your system queries

man -f resolv.conf
resolv.conf (5) - resolver configuration file
cat /etc/resolv.conf
resolvectl status
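If we only want the name server addresses from that file, a short filter will do. This is a sketch and list_nameservers is a hypothetical name. It prints the second field of every line that starts with nameserver, and it defaults to /etc/resolv.conf when no file is given.

```shell
# Print the name server addresses listed in a resolv.conf style file.
# list_nameservers is a hypothetical helper, not a standard command.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Only run against the real file if it exists and is readable.
if [ -r /etc/resolv.conf ]; then
  list_nameservers
fi
```

On Ubuntu systems that use systemd-resolved, this will likely print 127.0.0.53, which is a local stub resolver. In that case, resolvectl status is the better way to see the upstream DNS servers.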

Conclusion

In the same way that phones have phone numbers, servers on the internet have IP addresses. Since we're only human, we don't remember every phone number that we dial or every IP address that we visit. In order to make such things human friendly, we use names instead.

Nameservers and DNS records act as the phone book and phone book entries of the internet. Note that I refer to the internet and not the web here. There is more at the application layer than the HTTP/HTTPS protocols, and so other types of servers, e.g., mail servers, may also have domain names and IP addresses to resolve.

In this section, we covered the basics of DNS that include:

  • FQDN: the Fully Qualified Domain Name
  • Root domains
  • Top level domains (TLDs) and Country Code TLDs (ccTLDs)
  • Second level and third level domains/subdomains
  • DNS paths, and
  • DNS record types

We'll come back to this material when we set up our websites.

Local Security

Introduction

Most security issues come from the network, but we also need to secure a system from insider attacks. We can do that by setting appropriate file permissions and by making sure users on a system do not have certain kinds of access (e.g., sudo access) to some utilities. For example, the /usr/bin/gcc program is the GNU C and C++ compiler. If users have unrestricted access to that compiler, then it's possible for them to compile programs that compromise the system.

In the next section, we'll cover how to set up a firewall, but in this section, we'll learn how to set up a chroot jail.

As we all know, the Linux file system has a root directory /, and under this directory are other directories like /home, /bin, and so forth. A chroot (change root) jail is a way to create a pseudo root directory at some specific location in the directory tree, and then build an environment in that pseudo root directory that offers some applications. Once that environment is set up, we can then confine one or more user accounts to that pseudo root directory, and when they log in to the server, they will only be able to see (e.g., with the cd command) what's in that pseudo root directory and only be able to use the applications that we've made available in that chroot.

Thus, a chroot jail is a technology used to change the "apparent root / directory for a user or a process" and confine that user to that location on the system. A user or process that is confined to the chroot jail cannot easily see or access the rest of the file system and will have limited access to the binaries (executables/apps/utilities) on the system. From its man page:

chroot (8) - run command or interactive shell with special root directory

Although a chroot jail is not foolproof as a security measure, it does have some useful security use cases. Some use chroot to contain DNS servers, for example.

chroot is also the conceptual basis for some kinds of virtualization technologies that are common today, like Docker.

Chroot a Current User

In this tutorial, we are going to create a chroot for a human user account.

  1. Let's create a new user. After we create the new user, we will chroot that user going forward.

    sudo adduser vader
    
  2. Next, we chroot vader into a new directory. That directory will be located at /opt/chroot. Note that the root directory for our regular users is /, but user vader's root directory will be different: /opt/chroot, even if they can't tell. We also want to check the permissions of the new directory and make sure it's owned by root. If not, use sudo chown root:root /opt/chroot to set it.

    sudo mkdir /opt/chroot
    ls -ld /opt/chroot
    
  3. Now we set up available binaries for the user. We'll only allow bash for now. To do that, we'll create a /usr/bin directory in /opt/chroot, and copy bash to that directory.

    which bash
    sudo mkdir -p /opt/chroot/usr/bin
    sudo cp /usr/bin/bash /opt/chroot/usr/bin/
    
  4. Large software applications have dependencies (aka, libraries). We need to copy those libraries to our chroot for the applications, like Bash, to function. To identify libraries needed by bash, we use the ldd command:

    ldd /usr/bin/bash
    

    Use the locate command to identify the locations of the libraries. The ones we need should be located in the /usr/lib/x86_64-linux-gnu/ directory. The locate command might need to be installed and updated first. Here I show how to use it to locate one of the dependencies, but there is more than one for you to locate:

    sudo apt install mlocate
    sudo updatedb
    locate libtinfo.so.6
    ...
    ...
    ...
    

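    Create a /opt/chroot/lib/x86_64-linux-gnu/ directory for the libraries. We'll name the library directory after the originals to stay consistent with the main environment. Repeat the copy for every library that ldd lists, including the dynamic loader (on 64-bit Ubuntu, /lib64/ld-linux-x86-64.so.2), which belongs in /opt/chroot/lib64/.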

    sudo mkdir -p /opt/chroot/lib/x86_64-linux-gnu
    sudo cp /usr/lib/x86_64-linux-gnu/libtinfo.so.6 \
      /opt/chroot/lib/x86_64-linux-gnu/
    
  5. Create and test the chroot

    sudo chroot /opt/chroot/
    bash-5.1# ls
    bash: ls: command not found
    bash-5.1# help
    bash-5.1# dirs
    bash-5.1# cd usr/bin/
    bash-5.1# dirs
    bash-5.1# cd /lib64/
    bash-5.1# dirs
    bash-5.1# cd ..
    bash-5.1# exit
    
  6. Create a new group called chrootjail. We can add users to this group that we want to jail. Instructions are based on linuxconfig.org.

    sudo groupadd chrootjail
    sudo usermod -a -G chrootjail vader
    groups vader
    
  7. Edit /etc/ssh/sshd_config to direct users in the chrootjail group to the chroot directory. Add the following lines at the end of the file. Then restart the ssh server.

    sudo nano /etc/ssh/sshd_config
    Match group chrootjail
                ChrootDirectory /opt/chroot/
    

    Exit nano, and restart ssh:

    sudo systemctl restart sshd
    
  8. Test the ssh connection for the vader user.

    ssh vader@localhost
    -bash-5.1$ ls
    -bash: ls: command not found
    exit
    

    That works as expected. The user vader is now restricted to a special directory and has limited access to the system or to any utilities on that system.
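Step 4 above copies one library by hand. As a rough sketch, the function below copies a binary plus every library path that ldd reports into a jail directory, keeping the same directory layout. The name copy_with_libs is hypothetical, it expects an absolute path to the binary, and it assumes that ldd prints absolute library paths, as it does on Ubuntu.

```shell
# copy_with_libs BINARY JAIL
# Copy BINARY and every shared library that ldd reports into JAIL,
# preserving each file's directory path. A hypothetical helper, not a standard tool.
copy_with_libs() {
  bin=$1
  jail=$2
  mkdir -p "$jail$(dirname "$bin")"
  cp -L "$bin" "$jail$bin"
  # ldd prints absolute paths for the libraries and for the dynamic loader;
  # grep -o keeps only those paths and skips entries without one.
  ldd "$bin" 2>/dev/null | grep -o '/[^ ]*' | while read -r lib; do
    mkdir -p "$jail$(dirname "$lib")"
    cp -L "$lib" "$jail$lib"
  done
}
```

Because /opt/chroot is owned by root, the function would need to run from a root shell, for example: copy_with_libs /usr/bin/bash /opt/chroot. I'd try it against a scratch directory first and inspect the result with find before pointing it at the real jail.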

Exercise

By using the ldd command, you can add additional binaries for this user. As an exercise, use the ldd command to locate the libraries for the nano editor, and make nano available to the user vader in the chrooted directory.

Conclusion

Systems need to be secure from the inside and out. In order to secure a system from the inside, users should be given access and permissions only as needed.

In this section, we covered how to create a chroot jail for a new user on our system. The jail confines the user to this pseudo location and provides the user limited access to that file system and to the software on the system.

We can jail non-human users, too. Any user listed in /etc/passwd can be jailed, and most users listed in that file are services.

Jailing a human user may not be necessary. On a multi-user system, proper education and training about the policies and uses of the system may be all that's needed. But in case a stricter environment is needed, now you know how to create a chroot jail.

Firewalls and Backups

Google Cloud Firewall and Ubuntu's UFW

A firewall program allows or denies connections for incoming (ingress) or outgoing (egress) traffic. Traffic can be controlled by link layer, e.g., a network interface such as an ethernet or wireless card; by IP layer, e.g., IPv4 or IPv6 addresses or address ranges; by transport layer, e.g., TCP, UDP, etc.; or by application layer via port numbers, e.g., HTTP (port 80), HTTPS (port 443), SSH (port 22), SMTPS (port 465), etc. Firewalls have other abilities, too. For example, they can limit the number of attempts to connect.

As a side note, physical, bare metal servers may have multiple ethernet network interface cards (NICs). Each NIC would, of course, have its own MAC address, and therefore would be assigned different IP addresses. Thus, at the link layer, incoming connections can be completely blocked on one card and then outgoing connections can be completely blocked on the other. This is a made up scenario. In practice, the firewall rules in place are whatever makes sense for the person or organization creating them.

To control these types of connections, firewalls apply rules. A rule may block all incoming connections, but then allow SSH traffic through port 22 via TCP, and then further restrict SSH connections to a specific IP range. And/or, another rule may block all incoming, unencrypted HTTP connections through port 80, but allow all incoming, encrypted HTTPS connections through port 443.
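To give a feel for how such rules are written, the sketch below prints a set of ufw commands that match the rules just described. It prints them instead of running them. The helper names are hypothetical, and the address range 203.0.113.0/24 is a documentation placeholder that you would replace with your own trusted range.

```shell
# Print, rather than run, ufw commands that match the rules described above.
# show and example_rules are hypothetical helpers; 203.0.113.0/24 is a placeholder range.
show() { echo "sudo $*"; }

example_rules() {
  show ufw default deny incoming                               # block all incoming traffic
  show ufw allow from 203.0.113.0/24 to any port 22 proto tcp  # SSH from one range only
  show ufw deny 80/tcp                                         # block unencrypted HTTP
  show ufw allow 443/tcp                                       # allow encrypted HTTPS
}

example_rules
```

Printing first is a deliberate choice. If a firewall that denies incoming traffic is enabled before SSH is allowed, we can lock ourselves out of a remote server, so it's worth reading the rules over before running any of them.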

Let's briefly cover two ways to define firewall rules. When we set up our LAMP servers in the last part of this course, we'll need to implement some rules to allow outside connections to our server.

LAMP originally referred to Linux, Apache, MySQL, and PHP; these four technologies create a web server. Technically, only Linux (or some other OS) and Apache (or some other web server software) are needed to serve a website. PHP and MySQL provide additional functionality, like the ability for a website to interact with a relational database. The M in LAMP may also refer to MariaDB, which is a fully open source clone of MySQL. We'll use MariaDB later in this course.

First, our Google Cloud instance is pre-populated with default firewall rules at the network level, and the documentation provides an overview of these rules. Second, Ubuntu uses a firewall called ufw, which can be used to control additional connections at the operating system level. (Please read the documentation at those three links.)

It's important to know that these two firewalls provide protection at different traffic stops, so to speak. By that I mean, a Google Cloud firewall rule may allow SSH (port 22) traffic to a server instance, but if Ubuntu's ufw firewall blocks port 22 connections at the server level, then SSH traffic won't pass through. In other words, incoming connections must pass through the network firewall first, and then pass through the server firewall second. Outgoing connections must pass through the server firewall first, and then the network firewall second.

It's also important to know that Ubuntu's ufw firewall is disabled by default. In fact, it may be overkill to use both Google Cloud's firewall and Ubuntu's ufw, or it may not. It simply depends on our needs and our circumstances.

We'll return to firewalls and put some rules into practice when we work on our LAMP setup.

Backups

Catastrophes (natural, physical, criminal, or out of negligence) happen, and as a systems administrator, you may be required to have backup strategies to mitigate data loss.

How you back up depends on the machine. If I am managing physical hardware, for instance, and I want to back up a physical disk to another physical disk, then that requires a specific tool. However, if I am managing virtual machines, like our Google Cloud instance, then that requires a different tool. Therefore, in this section, I will briefly cover both scenarios.

rsync

If we were managing bare metal machines, then we might use a program like rsync to back up physical disk drives. rsync is a powerful program. It can copy disks, directories, and files. It can copy files from one location, and send the copies, encrypted, to a remote server.

For example, let's say I mount an external hard drive to my filesystem at /mnt/backup. To copy my home directory, I'd use:

rsync -av /home/me/ /mnt/backup/

where /home/me/ is the source directory, and /mnt/backup/ is the destination directory.

Syntax matters here. If I include the trailing slash on the source directory, then rsync will copy everything in /home/me/ to /mnt/backup/. However, if I leave the trailing slash off, like so:

rsync -av /home/me /mnt/backup/

then the result will be that the directory me/ will be copied to /mnt/backup/me/.

Let's see this in action. Say I have two directories. In the tmp1/ directory, there are two files: file1 and file2. The tmp2/ directory is empty. To copy file1 and file2 to tmp2/, I include the trailing slash on the source directory:

ls tmp1/
file1 file2
rsync -av tmp1/ tmp2/
ls tmp2
file1 file2

However, if I leave that trailing slash off the source directory, then the tmp1/ directory itself will get copied into tmp2/:

ls tmp1
file1 file2
rsync -av tmp1 tmp2/
ls tmp2/
tmp1/
ls tmp2/tmp1/
file1 file2

rsync can also send a source directory to a directory on a remote server, and the directory and files being copied will be encrypted on the way. To do this, we use ssh style syntax:

rsync -av tmp1/ USER@REMOTE:~/tmp2/

For example:

rsync -av tmp1/ linus@222.22.33.333:~/tmp2/

In fact, not only do I use rsync to back up my desktop computer to external hard drives, I also use a command like the above to copy local web projects to remote servers.

Delete Option

rsync has a --delete option. Adding this option means that rsync will synchronize the destination directory with the source directory, including deletions. For example, if I have already backed up tmp1/ to tmp2/, then delete file1 from tmp1/, and then run rsync with the --delete option, rsync will also delete file1 from tmp2/. This is how that looks:

ls tmp1/
file1 file2
rsync -av tmp1/ tmp2/
ls tmp2/
file1 file2
rm tmp1/file1
ls tmp1/
file2
rsync -av --delete tmp1/ tmp2/
ls tmp2
file2

Backups are no good if we don't know how to restore a backup to a disk. To restore with rsync, we just swap the source directory and the destination directory:

rsync -av tmp2/ tmp1/
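Because --delete and restores can both remove or overwrite files, it's worth previewing a run before committing to it. Here is a minimal sketch that uses rsync's --dry-run option. The function name preview_sync is hypothetical.

```shell
# preview_sync SRC DST
# Show what rsync would copy or delete, without changing anything.
# preview_sync is a hypothetical wrapper around rsync's --dry-run option.
preview_sync() {
  rsync -av --delete --dry-run "$1" "$2"
}
```

For example, preview_sync tmp1/ tmp2/ lists the files that would be transferred or deleted. If the list looks right, run the same rsync command without --dry-run.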

Google Cloud

Since our instance on Google Cloud is a virtual machine, we can use the Google Cloud console to create snapshots of our instance. A snapshot is a copy of a virtual machine at the time the snapshot was taken. What's great about taking a snapshot is that the result is basically a file of a complete operating system. Since it's a file, it can itself be used in other projects or used to restore a machine to the time the snapshot was taken.

Snapshots may also be used to document or reproduce others' work. For example, if I worked with programmers, as a systems administrator, I might help a programmer share snapshots of a virtual machine with other programmers. Those other programmers could then restore the snapshot in their own instances, and see and run the original work in the environment it was created in.

Taking snapshots in Google Cloud is very straightforward, but since snapshots take up extra storage, they will accrue extra costs. Since we want to avoid that for now, please see the following documentation for how to take a snapshot in Google Cloud:

Create and manage disk snapshots

Conclusion

In this section, we covered firewalls and backups. Since we're running an Ubuntu server on Google Cloud, we have Google Cloud options for creating firewall rules at the network level and for backing up disks as snapshots, and we have Ubuntu options for creating firewall rules at the OS level and for backing up disks using commands like rsync.

How we go about either depends entirely on our needs or on our organization's needs. But knowing that these options exist, and why we have them, provides quite a bit of utility.

Creating a LAMP Server

In this section, we learn how to set up a LAMP (Linux, Apache, MariaDB, PHP) stack. This stack enables us to create a web server that provides extra functionality via PHP and MariaDB. Even if we do not become web server administrators, knowing how to set up a LAMP stack is not only fun, but it's also a valuable skill to have.

Installing the Apache Web Server

Introduction

Apache is an HTTP server, otherwise called web server software. Other HTTP server software exists. Another big one is nginx. An HTTP server essentially makes files on a computer available to others who are able to establish a connection to the computer and view the files with a web browser.

It's important to understand the basics of an HTTP server, and therefore I ask you to read Apache's Getting Started page before proceeding with the rest of this section. Each of the main sections on that page describes the important elements that make up and serve a website, including

  • clients, servers, and URLs
  • hostnames and DNS
  • configuration files and directives
  • web site content
  • log files and troubleshooting

Installation

Before we install Apache, we need to update our systems first.

sudo apt update
sudo apt -y upgrade

Once the machine is updated, we can install Apache2 using apt. First we'll use apt search to identify the specific package name. I already know that a lot of results will be returned, so let's pipe the apt search command through head to look at the initial results:

apt search apache2 | head

The package that we're interested in happens to be named apache2 on Ubuntu. This is not a given. On other distributions, like Fedora, the Apache package is called httpd. To learn more about the apache2 package, let's examine it with the apt show command:

apt show apache2

Once we've confirmed that apache2 is the package that we want, we install it with the apt install command. Press Y to agree to continue after running the command below:

sudo apt install apache2

Basic checks

One of the things that makes Apache2, and some other web servers, powerful is the library of modules that extend Apache's functionality. We'll come back to modules soon. For now, we're going to make sure the server is up and running, configure some basic things, and then create a basic web site.

To start, let's use systemctl to acquire some info about apache2 and make sure it is enabled and running:

systemctl list-unit-files apache2.service
systemctl status apache2

The output of the first command shows that apache2 is enabled, which means that it will start running automatically if the computer gets rebooted.

The output of the second command also shows that apache2 is enabled and that it is active (running).

Creating a web page

Since apache2 is up and running, let's look at the default web page.

There are two ways we can look at the default web page. We can use a command line web browser. There are a number available, but I like w3m.

We can also use our regular web browsers and view the site by entering the IP address of the server in our browser URL bar.

To check with w3m, we have to install it first:

sudo apt install w3m

Once it's installed, we can visit our default site using the loopback IP address (aka, localhost). From the command line on our server, we can run either of these two commands:

w3m 127.0.0.1
w3m localhost

We can also get the subnet/private IP address using the ip a command, and then use that with w3m. For example, if ip a showed that my NIC has an IP address of 10.0.1.1, then I could use w3m with that IP address:

w3m 10.0.1.1

If apache2 installed and started correctly, then you should see the following text at the top of the screen:

Apache2 Ubuntu Default Page
It works!

To exit w3m, press q and then y to confirm exit.
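A browser isn't the only way to confirm that the server is answering. The sketch below reads HTTP response headers and prints the status code, where 200 means the page was served. The name http_status is hypothetical, and the sample response is made up for illustration.

```shell
# Read HTTP response headers on standard input and print the status code.
# http_status is a hypothetical helper, not a standard command.
http_status() {
  # The status line looks like "HTTP/1.1 200 OK"; the code is the second field.
  head -n 1 | awk '{ print $2 }'
}

# A made-up response in the shape that a web server returns:
printf 'HTTP/1.1 200 OK\r\nServer: Apache\r\n\r\n' | http_status
```

On the server itself, and assuming curl is installed, curl -sI http://localhost/ | http_status should print 200 if apache2 is running.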

To view the default web page using a regular web browser, like Firefox, Chrome, Safari, Edge, etc., you need to get your server's public IP address. To do that, log into the Google Cloud Console. In the left hand navigation panel, hover your cursor over the Compute Engine link, and then click on VM instances. You should see your External IP address in the table on that page. You can copy that external IP address or simply click on it to open it in a new browser tab. Then you should see the graphical version of the Apache2 Ubuntu Default Page.

Please take a moment to read through the text on the default page. It provides important information about where Ubuntu stores configuration files and what those files do, and document roots, which is where website files go.

Create a Web Page

Let's create our first web page. The default page described above provides the location of the document root at /var/www/html. When we navigate to that location, we'll see that there is already an index.html file located in that directory. This is the Apache2 Ubuntu Default Page that we described above. Let's rename that index.html file, and create a new one:

cd /var/www/html/
sudo mv index.html index.html.original
sudo nano index.html

If you know HTML, then feel free to write some basic HTML code to get started. Otherwise, you can re-type the content below in nano, and then save and exit out.

<html>
<head>
<title>My first web page using Apache2</title>
</head>
<body>

<h1>Welcome</h1>

<p>Welcome to my web site. I created this site using the Apache2 HTTP server.</p>

</body>
</html>

If you have our site open in your web browser, reload the page, and you should see the new text.

You can still view the original default page by specifying its name in the URL. For example, if your external IP address is 55.222.55.222, then you'd specify it like so:

http://55.222.55.222/index.html.original
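If you'd rather not retype the page in an editor, a here-document can generate it. This is only a sketch. The function name make_index is hypothetical, and it prints the same HTML as above with the title passed in as an argument.

```shell
# make_index TITLE
# Print a basic HTML page to standard output, using TITLE as the page title.
# make_index is a hypothetical helper, not a standard command.
make_index() {
  cat <<EOF
<html>
<head>
<title>$1</title>
</head>
<body>

<h1>Welcome</h1>

<p>Welcome to my web site. I created this site using the Apache2 HTTP server.</p>

</body>
</html>
EOF
}

make_index "My first web page using Apache2"
```

Since the document root is owned by root, the output can be piped through sudo tee to save it: make_index "My first web page using Apache2" | sudo tee /var/www/html/index.html > /dev/null.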

User Directories

You may have visited sites in the past that have a tilde in the URL and look like this:

http://example.com/~user/

These are called user directories, and they provide an additional document root that's located in users' home directories in a directory called public_html. This is the default document root for user directories, but the default can be changed to a different location. Please read the documentation on what's called the Apache Module mod_userdir before proceeding.

By default, users with accounts on the server need to have a public_html directory in their home directories, and Apache2 needs to be configured to serve sites from those directories. For example, for the user linus, they should have the following file path available:

/home/linus/public_html/

Enable mod_userdir

The configuration file for mod_userdir is located in /etc/apache2/mods-available/ and is named userdir.conf. Files in this directory are modules that are available to Apache2 but that are not enabled (i.e., they're turned off) by default. We can view the userdir.conf file with the less command:

less /etc/apache2/mods-available/userdir.conf

The default configuration does not need to be modified. Therefore, all we need to do is enable this module. To do that, we use the a2enmod Apache2 command (see man a2enmod for details):

sudo a2enmod userdir

After enabling, we need to restart the HTTP service, and we can also check its status:

sudo systemctl restart apache2
systemctl status apache2

Create a User Directory Website

Let's say I am logged in as the user linus on the system and will use that account to test if the user directory is working. First, let's go home. For me, as the user linus, that would be /home/linus/, and I just have to type in the cd command and press Enter:

cd

Now I need to create a public_html directory in my home directory (make sure you're in your home directory!), and change into that directory:

mkdir public_html
cd public_html

By default, Apache2 looks for a file named index.html in the document root. I'll create that and add some basic HTML to it:

nano index.html

And in that file:

<html>
<head>
<title>My home site</title>
</head>
<body>

<p>This is my home site.</p>

</body>
</html>

Now simply add /~linus/ to your external IP address in your browser's URL bar. Like so (of course, replace the external IP address with your external IP address and the username with the username that you're using):

http://55.222.55.222/~linus/

Note that this process is pretty easy but that it will be different on other distributions. For example, the Fedora distribution has different Apache2 defaults. Also, on some distributions, we might need to change the directory permissions before this will work. By default, Ubuntu sets the permissions on our home directories to:

drwxr-xr-x

That means that any user can view the contents of our home directories. And Ubuntu sets directories created with mkdir in the home directory with these permissions by default:

drwxrwxr-x

These default settings make those directories world readable, but other distributions do not default to those permissions. If the last r-x was set to ---, then we would need to use the chmod command to make these directories executable and readable before files in our public_html directory could be accessed in a browser.
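Here is a quick way to test whether a directory is readable and executable by others, which is what Apache2 needs in order to serve files from it. This is a sketch that assumes GNU stat, which Ubuntu provides. The name others_can_read is hypothetical.

```shell
# others_can_read DIR
# Succeed if the "other" permission bits on DIR include both r and x.
# others_can_read is a hypothetical helper; stat -c is a GNU coreutils option.
others_can_read() {
  # stat prints a mode string like drwxr-xr-x; characters 8 and 10 belong to "other".
  case $(stat -c '%A' "$1") in
    ???????r?x*) return 0 ;;
    *)           return 1 ;;
  esac
}

if others_can_read "$HOME"; then
  echo "home directory is readable by others"
else
  echo "home directory is not readable by others"
fi
```

If the check fails for the home directory or for public_html, then chmod o+rx on that directory would open it up, at the cost of letting other users on the system see its contents.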

Conclusion

In this section, we learned about the Apache2 HTTP server. We learned how to install it on Ubuntu, how to use systemd (systemctl) commands to check its default status, how to create a basic web page in /var/www/html, how to view that web page using the w3m command line browser and with our regular graphical browser, how to enable the user directory module, and how to repeat the steps above to create a website in our home directories.

In the next section, we will learn how to turn our sites into applications by installing PHP and enabling the relevant PHP modules.

Installing and Configuring PHP

Introduction

Client-side programming languages, like JavaScript, are handled by the browser. Major browsers like Firefox, Chrome, Safari, Edge, etc. all include JavaScript engines that use just-in-time compilers to execute the JavaScript code (Mozilla has a nice description of the process.) From an end user's perspective, you basically install JavaScript when you install a web browser.

PHP, on the other hand, is a server-side programming language, which means it must be installed on the server in order to be used. From a system or web administrator's perspective, this means that not only does PHP have to be installed on a server, but it must also be configured to work with the HTTP server, which in our case is Apache2.

The main use of PHP is to interact with databases, like MySQL, MariaDB, PostgreSQL, etc., in order to create dynamic page content. This is our goal in the last part of this class. To accomplish this, we have to:

  1. Install PHP and relevant Apache2 modules
  2. Configure PHP and relevant modules to work with Apache2
  3. Configure PHP and relevant modules to work with MariaDB

Install PHP

As normal, we will use apt install to install PHP and relevant modules and then restart Apache2 using the systemctl command:

sudo apt install php libapache2-mod-php
sudo systemctl restart apache2

We can check its status and see if there are any errors:

systemctl status apache2

Check Install

To check that it's been installed and that it's working with Apache2, we can create a small PHP file in our web document root. To do that, we change to the /var/www/html/ directory, and create a file called info.php:

cd /var/www/html/
sudo nano info.php

In that file, add the following text, then save and close the file:

<?php
phpinfo();
?>

Now visit that file using the external IP address for your server. For example, in Firefox, Chrome, etc., go to (be sure to replace the IP below with your IP address):

http://55.333.55.333/info.php

You should see a page that provides system information about PHP, Apache2, and the server. The top of the page should look like Figure 1 below:

PHP install page
Fig. 1. A screenshot of the title of the PHP install page.

Basic Configurations

By default, when Apache2 serves a web page, it looks for and serves a file titled index.html, even if it does not display that file in the URL bar. Thus, http://example.com/ actually resolves to http://example.com/index.html in such cases.

However, if our plan is to provide for PHP, we want Apache2 to default to a file titled index.php instead and to the index.html file as backup. To configure that, we need to edit the dir.conf file in the /etc/apache2/mods-enabled/ directory. In that file there is a line that starts with DirectoryIndex. The first file in that line is index.html, and then there are a series of other files that Apache2 will look for in the order listed. If any of those files exist in the document root, then Apache2 will serve those before proceeding to the next. We simply want to put index.php first and let index.html be second on that line.

cd /etc/apache2/mods-enabled/
sudo nano dir.conf

And change the line to this:

DirectoryIndex index.php index.html index.cgi index.pl index.xhtml index.htm

Whenever we make a configuration change, we can use the apachectl command to check our configuration:

apachectl configtest

If we get a Syntax OK message, we can reload the Apache2 configuration and restart the service:

sudo systemctl reload apache2
sudo systemctl restart apache2

Now create a basic PHP page. cd back to the document root directory and use nano to create and open an index.php file:

cd /var/www/html/
sudo nano index.php

Creating an index.php File

Let's now create an index.php page, and add some HTML and PHP to it. The PHP can be a simple browser detector. Change to the /var/www/html/ directory, and use sudo nano to create and edit index.php. Then add the following code:

<html>
<head>
<title>Browser Detector</title>
</head>
<body>
<p>You are using the following browser to view this site:</p>

<?php
echo $_SERVER['HTTP_USER_AGENT'] . "\n\n";

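// get_browser() needs the browscap directive set in php.ini;
// without it, PHP emits a warning and returns false.
$browser = get_browser(null, true);
print_r($browser);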
?>
</body>
</html>

Next, save the file and exit nano. In your browser, visit your external IP address site:

http://55.333.55.333/

Although your index.html file still exists in your document root, Apache2 now returns the index.php file instead. However, if for some reason the index.php was deleted, then Apache2 would revert to the index.html file since that's what's next in the dir.conf DirectoryIndex line.
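
To make the fallback order concrete, here is a small sketch that mimics the lookup in the shell. It does not reproduce how Apache2 is implemented. The pick_index function is a name I made up, and it runs against a scratch directory rather than the real document root:

```shell
# pick_index DIR NAME...
# Prints the first NAME that exists as a file in DIR, in the order given.
pick_index() {
  pick_dir=$1
  shift
  for pick_name in "$@"; do
    if [ -f "$pick_dir/$pick_name" ]; then
      echo "$pick_name"
      return 0
    fi
  done
  return 1
}

# A scratch directory stands in for /var/www/html.
docroot=$(mktemp -d)
touch "$docroot/index.html" "$docroot/index.php"

# With both files present, index.php is found first.
first=$(pick_index "$docroot" index.php index.html index.cgi)

# After deleting index.php, the lookup falls through to index.html.
rm "$docroot/index.php"
fallback=$(pick_index "$docroot" index.php index.html index.cgi)

echo "$first then $fallback"
```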

Conclusion

In this section, we installed PHP and configured it to work with Apache2. We also created a simple PHP test page that reported our browser user agent information on our website.

In the next section, we'll learn how to complete the LAMP stack by adding the MariaDB relational database to our setup.

Installing and Configuring MariaDB

Introduction

We started our LAMP stack when we installed Apache2 on Linux, and then we added extra functionality when we installed and configured PHP to work with Apache2. In this section, our objective is to complete the LAMP stack and install and configure MariaDB, a (so-far) compatible fork of the MySQL relational database.

If you need a refresher on relational databases, the MariaDB website can help. See: Introduction to Relational Databases.

It's also good to review the documentation for any technology that you use. MariaDB has good documentation and getting started pages.

Install and Set Up MariaDB

In this section, we'll learn how to install, set up, secure, and configure the MariaDB relational database so that it works with the Apache2 web server and the PHP programming language.

First, let's install MariaDB Community Server, and then log into the MariaDB shell under the MariaDB root account.

sudo apt install mariadb-server mariadb-client

This should also start and enable the database server, but we can check if it's running and enabled using the systemctl command:

systemctl status mariadb

Next we need to run a post installation script called mysql_secure_installation (that's not a typo) that sets up the MariaDB root password and performs some security checks. To do that, run the following command, and be sure to save the MariaDB root password you create:

sudo mysql_secure_installation

Again, here is where you create a root password for the MariaDB database server. Be sure to save that and not forget it! When you run the above script, you'll get a series of prompts to respond to like below. Press enter for the first prompt, press Y for the prompts marked Y, and input your own password. Since this server is exposed to the internet, be sure to use a complex password.

Enter the current password for root (enter for none):
Set root password: Y
New Password: XXXXXXXXX
Re-enter new password: XXXXXXXXX
Remove anonymous users: Y
Disallow root login remotely: Y
Remove test database and access to it: Y
Reload privilege tables now: Y
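
If you'd rather not invent a complex password yourself, one option is to generate a random one on the command line. This is a sketch that uses head and base64 from GNU coreutils. The variable name password is only for illustration:

```shell
# 18 random bytes encode to exactly 24 base64 characters (no padding).
password=$(head -c 18 /dev/urandom | base64)

# Print it once so it can be copied into a password manager.
echo "$password"
echo "length: ${#password}"
```

Whatever method you use, store the password somewhere safe before you enter it at the prompt.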

We can log in to the database to test it. To do so, we have to become the root Linux user, which we can do with the following command:

sudo su

Note: we need to generally be careful when we enter commands on the command line, because it's a largely unforgiving computing environment. But we need to be especially careful when we are logged in as the Linux root user. This user can delete anything, including files that the system needs in order to boot and operate.

After we are root, we can log in to MariaDB, run the show databases; command, and then exit with the \q command:

root@hostname:~# mariadb -u root
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 47
Server version: 10.3.34-MariaDB-0ubuntu0.20.04.1 Ubuntu 20.04

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
3 rows in set (0.002 sec)

Note: If we are logging into the root database account as the root Linux user, we don't need to enter our password.

Create and Set Up a Regular User Account

We need to reserve the root MariaDB user for special use cases and instead create a regular MariaDB user, or more than one MariaDB user, as needed.

To create a regular MariaDB user, we use the create command. In the command below, I'll create a new user called webapp with a complex password within the single quotes at the end (marked with a series of Xs here for demo purposes):

MariaDB [(none)]> create user 'webapp'@'localhost' identified by 'XXXXXXXXX';

If the prompt returns a Query OK message, then the new user should have been created without any issues.

Create a Practice Database

As the root database user, let's create a new database for a regular, new user.

The regular user will be granted all privileges on the new database, including all its tables. Instead of granting all privileges, we could limit the user to specific privileges, including: CREATE, DROP, DELETE, INSERT, SELECT, UPDATE, and GRANT OPTION. Such privileges may be called operations or functions, and they allow MariaDB users to use and modify the databases, where appropriate. For example, we may want to limit the webapp user to only be able to use SELECT commands. It totally depends on the purpose of the database and our security risks.

MariaDB [(none)]> create database linuxdb;
MariaDB [(none)]> grant all privileges on linuxdb.* to 'webapp'@'localhost';
MariaDB [(none)]> show databases;
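
As an aside, here is a sketch of what a read-only setup for webapp might look like. Don't run it if you're following along, because the later steps need webapp to create tables and insert records. The statements are written to a scratch file from a regular shell, and sql_file is just a variable name I chose:

```shell
# Write the statements to a scratch file instead of typing them at the prompt.
sql_file=$(mktemp)
cat > "$sql_file" <<'EOF'
-- Replace the broad grant with a read-only one.
revoke all privileges on linuxdb.* from 'webapp'@'localhost';
grant select on linuxdb.* to 'webapp'@'localhost';
show grants for 'webapp'@'localhost';
EOF

# Review the statements before using them.
cat "$sql_file"

# As the Linux root user, the file could then be fed to the client (not run here):
#   mariadb -u root < "$sql_file"
```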

Exit out of the MariaDB database as the root MariaDB user, and then exit out of the root Linux user account, and you should be back to your normal Linux user account:

MariaDB [(none)]> \q
root@hostname:~# exit

Note: relational database keywords are often written in all capital letters. As far as I know, this is simply a convention to make the code more readable. However, in most cases I'll write the keywords in lower case letters. This is simply because, by convention, I'm super lazy.

Logging in as Regular User and Creating Tables

We can start doing MariaDB work. As a reminder, we've created a new MariaDB user named webapp and a new database for webapp that is called linuxdb. When we run the show databases command as the webapp user, we should see the linuxdb database (and only the linuxdb database). Note below that I use the -p option. This instructs MariaDB to request the password for the webapp user, which is required to log in.

mariadb -u webapp -p
MariaDB [(none)]> show databases;
MariaDB [(none)]> use linuxdb;

A database is not worth much without data. In the following code, I create and define a new table for our linuxdb database. The table will be called distributions, and it will contain data about various Linux distributions (name of distribution, distribution developer, and founding date).

MariaDB [(linuxdb)]> create table distributions
    -> (
    -> id int unsigned not null auto_increment,
    -> name varchar(150) not null,
    -> developer varchar(150) not null,
    -> founded date not null,
    -> primary key (id)
    -> );
Query OK, 0 rows affected (0.07 sec)

MariaDB [(linuxdb)]> show tables;
MariaDB [(linuxdb)]> describe distributions;

Congratulations! Now create some records for that table.

Adding records into the table

We can populate our linuxdb database with some data. We'll use the insert command to add our records into our distributions table:

MariaDB [(linuxdb)]> insert into distributions (name, developer, founded) values
    -> ('Debian', 'The Debian Project', '1993-09-15'),
    -> ('Ubuntu', 'Canonical Ltd.', '2004-10-20'),
    -> ('Fedora', 'Fedora Project', '2003-11-06');
Query OK, 3 rows affected (0.004 sec)
Records: 3  Duplicates: 0  Warnings: 0
MariaDB [(linuxdb)]> select * from distributions;

Success! Now let's test our table.

Testing Commands

We will complete the following tasks to refresh our MySQL/MariaDB knowledge:

  • retrieve some records or parts of records,
  • alter the table structure so that it will hold more data,
  • update records,
  • delete a record, and
  • add records:

MariaDB [(linuxdb)]> select name from distributions;
MariaDB [(linuxdb)]> select founded from distributions;
MariaDB [(linuxdb)]> select name, developer from distributions;
MariaDB [(linuxdb)]> select name from distributions where name='Debian';
MariaDB [(linuxdb)]> select developer from distributions where name='Ubuntu';
MariaDB [(linuxdb)]> select * from distributions;
MariaDB [(linuxdb)]> alter table distributions
    -> add packagemanager char(3) after name;
MariaDB [(linuxdb)]> describe distributions;
MariaDB [(linuxdb)]> update distributions set packagemanager='APT' where id='1';
MariaDB [(linuxdb)]> update distributions set packagemanager='APT' where id='2';
MariaDB [(linuxdb)]> update distributions set packagemanager='DNF' where id='3';
MariaDB [(linuxdb)]> select * from distributions;
MariaDB [(linuxdb)]> delete from distributions where name='Debian';
MariaDB [(linuxdb)]> insert into distributions
    -> (name, packagemanager, developer, founded) values
    -> ('Debian', 'APT', 'The Debian Project', '1993-09-15'),
    -> ('CentOS', 'YUM', 'The CentOS Project', '2004-05-14');
MariaDB [(linuxdb)]> select * from distributions;
MariaDB [(linuxdb)]> select name, packagemanager
    -> from distributions
    -> where founded < '2004-01-01';
MariaDB [(linuxdb)]> select name from distributions order by founded;
MariaDB [(linuxdb)]> \q
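
The where and order by clauses on the founded column work because dates written as YYYY-MM-DD sort the same way as text and as dates. The sketch below only mimics those two queries with awk and sort. It uses four sample rows modeled on the distributions table and does not talk to MariaDB at all:

```shell
# Sample rows: name|packagemanager|developer|founded
rows='Ubuntu|APT|Canonical Ltd.|2004-10-20
Fedora|DNF|Fedora Project|2003-11-06
Debian|APT|The Debian Project|1993-09-15
CentOS|YUM|The CentOS Project|2004-05-14'

# Like: select name, packagemanager from distributions where founded < '2004-01-01';
before_2004=$(printf '%s\n' "$rows" | awk -F'|' '$4 < "2004-01-01" { print $1, $2 }')

# Like: select name from distributions order by founded;
by_founded=$(printf '%s\n' "$rows" | LC_ALL=C sort -t'|' -k4,4 | cut -d'|' -f1)

echo "$before_2004"
echo "$by_founded"
```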

Install PHP and MySQL Support

The next goal is to complete the connection between PHP and MariaDB so that we can use both for our websites.

First install PHP support for MariaDB. The php-mysql package provides the mysqli extension that our PHP scripts will use later in this section.

sudo apt install php-mysql

And then restart Apache2 and MariaDB:

sudo systemctl restart apache2
sudo systemctl restart mariadb

Create PHP Scripts

In order for PHP to connect to MariaDB, it needs to authenticate itself. To do that, we will create a login.php file in /var/www/html. We also need to change the group ownership of the file and its permissions so that the file can be read by the Apache2 web server but not by the world, since this file will store password information.

cd /var/www/html/
sudo touch login.php
sudo chmod 640 login.php
sudo chown :www-data login.php
ls -l login.php
sudo nano login.php
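
If the 640 mode is unfamiliar, we can try it on a scratch file first and read the result back with stat. This is a sketch that assumes GNU coreutils. It skips the chown step because the www-data group may not exist on every machine:

```shell
# Practice on a scratch file rather than the real login.php.
scratch=$(mktemp)
chmod 640 "$scratch"

# stat -c is GNU coreutils syntax: %a is the octal mode, %A the symbolic form.
mode=$(stat -c '%a' "$scratch")
symbolic=$(stat -c '%A' "$scratch")
echo "$mode $symbolic"
```

The symbolic form reads as: the owner can read and write, the group can only read, and everyone else has no access.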

In the file, add the following credentials. If you used a different database name than linuxdb and a different username than webapp, then you need to substitute your names below. You need to use your own password where I have the Xs:

<?php // login.php
$db_hostname = "localhost";
$db_database = "linuxdb";
$db_username = "webapp";
$db_password = "XXXXXXXXX";
?>

Next we create a new PHP file for our website. This file will display HTML but will primarily be PHP interacting with our MariaDB distributions database.

Create a file titled distros.php.

sudo nano distros.php

Then copy over the following text (I suggest you transcribe it, especially if you're interested in learning a bit of PHP, but you can simply copy and paste it into the nano buffer):

<html>
<head>
<title>MySQL Server Example</title>
</head>
<body>

<?php

// Load MySQL credentials
require_once 'login.php';

// Establish connection
$conn = mysqli_connect($db_hostname, $db_username, $db_password) or
  die("Unable to connect");

// Open database
mysqli_select_db($conn, $db_database) or
  die("Could not open database '$db_database'");

// QUERY 1
$query1 = "show tables from $db_database";
$result1 = mysqli_query($conn, $query1);

$tblcnt = 0;
while($tbl = mysqli_fetch_array($result1)) {
  $tblcnt++;
}

if (!$tblcnt) {
  echo "<p>There are no tables</p>\n";
}
else {
  echo "<p>There are $tblcnt tables</p>\n";
}

// Free result1 set
mysqli_free_result($result1);

// QUERY 2
$query2 = "select name, developer from distributions";
$result2 = mysqli_query($conn, $query2);

$row = mysqli_fetch_array($result2, MYSQLI_NUM);
printf ("%s (%s)\n", $row[0], $row[1]);
echo "<br/>";

$row = mysqli_fetch_array($result2, MYSQLI_ASSOC);
printf ("%s (%s)\n", $row["name"], $row["developer"]);

// Free result2 set
mysqli_free_result($result2);

// Query 3
$query3 = "select * from distributions";
$result3 = mysqli_query($conn, $query3);

while($row = $result3->fetch_assoc()) {
  echo "<p>Owner " . $row["developer"] . " manages distribution " . $row["name"] . ".</p>";
}

mysqli_free_result($result3);

$result4 = mysqli_query($conn, $query3);
while($row = $result4->fetch_assoc()) {
  echo "<p>Distribution " . $row["name"] . " was released on " . $row["founded"] . ".</p>";
}

// Free result4 set
mysqli_free_result($result4);

/* Close connection */
mysqli_close($conn);

?>

</body>
</html>

Save the file and exit out of nano.

Test Syntax

After you save the file and exit the text editor, we need to test the PHP syntax. If there are any errors in our PHP, these commands will show the line numbers that are causing errors or leading up to errors. The first command should output nothing if all is well. The second command should output HTML if all is well:

sudo php -f login.php
sudo php -f distros.php
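
The php command also has an -l option that checks syntax without running the script. Here is a small sketch built around it. The lint_php function is a name I made up, and it skips the check if php is not installed so that the example runs anywhere:

```shell
# lint_php FILE: check syntax only, without running the script.
lint_php() {
  if ! command -v php >/dev/null 2>&1; then
    echo "php is not installed; skipping $1"
    return 0
  fi
  php -l "$1"
}

# Demonstration on a scratch file with valid syntax.
sample=$(mktemp)
printf '%s\n' '<?php' '$greeting = "hello";' 'echo $greeting;' > "$sample"
lint_php "$sample"
```

On the server, we would call it on the files in /var/www/html/. Checking login.php needs sudo because its permissions keep regular users from reading it.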

Conclusion

Congratulations! If you've reached this far, you have successfully created a LAMP stack. In the process, you have learned how to install and set up MariaDB, how to create MariaDB root and regular user accounts, how to create a test database with play data for practicing, and how to connect this with PHP for display on a webpage.

In regular applications of these technologies, there's a lot more involved, but completing the above process is a great start to learning more.

Conclusion

I consider this book to be a live document. Perhaps, then, this is version 0.8, or something like that. In any case, it will be continually updated throughout the year but probably more often before and during the fall semesters when I teach my Linux Systems Administration course.

This book in no way is meant to provide a comprehensive overview of systems administration nor of Linux. It's meant to act as a starting point for those interested in systems administration, and it's meant to get students, many of whom grew up using only graphical user interfaces, familiar with command line environments. In that respect, this book, and the course that I teach, is aimed at empowering students to know their technology and become comfortable and more experienced with it, especially the behind-the-scenes stuff. That said, I'm proud that some of my students have gone on to become systems administrators. Other courses in our program and their own work and internships have probably contributed more to that motivation, but I know that this course has been a factor.

If you're not a student in our program but have stumbled upon this book, I hope it's helpful to you, too. This is, in fact, why I've made it available on my website and not simply dropped it in my course shell.

C. Sean Burns, PhD
August 13, 2022