Pro Terminal Commands: Working with Awk


Awk is a pro terminal command for data processing built right into your Mac, but that’s not all it is. It also has the power and capabilities of a general-purpose programming language. And while you won’t be writing the next big app in awk, you get access to an incredible range of functionality within the language.

The awk language is built to take input in the form of data and run actions based on that input. When processing text, awk will beat out a general-purpose language for simplicity and directness. Awk is also an interpreted language, so there’s no separate compile step as there is with a compiled language like C.

Awk’s basic syntax

When you call awk from the command line, you’ll follow the basic scheme below:

awk 'pattern { action }
     pattern { action }
     ...' file

Awk reads the file one line at a time and executes the action on every line that matches the pattern. In the absence of a specified file, awk reads from standard input. Pattern-matching is extremely flexible, supporting detailed and complex matching structures. Let’s consider a simple example below:

awk '/com/ { print $0 }' emails

This one-line awk program prints any line in the file emails that contains the string com. The $0 specifies the entire line, but printing the entire line is also awk’s default behavior when it finds a pattern match. If we had run only awk '/com/' emails, we would receive the same output.

Sharp-eyed readers will notice the / around com. Those forward slashes specify the beginning and end of a regular expression. In this case, our regular expression just matches the literal string com.
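To try this without a real emails file, here is a minimal sketch. The addresses and the temporary file are made up for illustration; the two awk commands are the ones discussed above.

```shell
# Build a small sample file; these addresses are invented for the example.
emails=$(mktemp)
printf '%s\n' 'ann@example.com' 'bob@example.org' 'cy@example.com' > "$emails"

# With an explicit action: print the whole line ($0) on a match
with_action=$(awk '/com/ { print $0 }' "$emails")

# With the action left out: awk prints the matching line by default
without_action=$(awk '/com/' "$emails")

echo "$with_action"
rm -f "$emails"
```

Both commands print the two .com addresses and skip the .org one.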

Printing fields

One of awk’s most useful features is identifying field separators and parsing data based on fields. This allows users to print out specific cells of data in long documents, like CSVs you might open in Excel. To start out, we will use the simple /etc/passwd file for this example.

Open the passwd file in TextEdit and you’ll see lines of data separated by colons.

Those colons are the field separators. They tell a parsing program like awk or Excel when to begin and end cells. Each colon stops the previous field and starts a new one, just like periods in English both indicate the end of a sentence and provide for the beginning of a new sentence.

We need to tell awk what character to use as a field separator; otherwise, it will default to white space. And you’ll notice there’s virtually no white space in the passwd file.

awk -F":" '{ print $1 }' /etc/passwd

The flag -F indicates that the following character (: in this example) should be interpreted as the field separator. With the fields identified, awk can manipulate the data contained within those fields. In the above command, awk prints the first field, specified by $1. If you think of these fields as an array, this particular array starts at 1, because $0 is reserved for the entire line, as we saw above.
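If you’d rather not poke at the real passwd file yet, the sketch below feeds awk two made-up records in the same colon-separated layout. It also shows what happens when the -F flag is left off.

```shell
# Two invented records laid out like lines from /etc/passwd
sample='alice:*:501:20:Alice Example:/Users/alice:/bin/zsh
bob:*:502:20:Bob Example:/Users/bob:/bin/bash'

# With -F":" the first field is the account name
with_flag=$(printf '%s\n' "$sample" | awk -F":" '{ print $1 }')

# Without -F, awk splits on white space, so $1 runs up to the first space
without_flag=$(printf '%s\n' "$sample" | awk '{ print $1 }')

echo "$with_flag"
echo "$without_flag"
```

The first command prints alice and bob. The second prints everything up to the first space on each line, which is rarely what you want from this kind of file.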

We’re not just limited to one field at a time either. Nor are we forced to print only text from the data source. If we specify more than one field, awk will print more than one field.

awk -F":" '{ print $4 " " $5}' /etc/passwd

Run that command in Terminal, and each line of output will show the fourth and fifth fields of one passwd entry, separated by a space.

Let’s look at that awk program in more detail:

awk -F":" '{ print $4 " " $5}' /etc/passwd

The program prints the fourth ($4) and fifth ($5) fields of the passwd file. The : symbol will be used as a field separator (-F":").

This program also prints a space between the two fields to make the output more legible. Eagle-eyed readers will notice the double quotes (") around the space. Awk treats anything inside double quotes as a literal string, printing it exactly as written rather than interpreting it as part of the program. Placing that string between $4 and $5 joins all three into a single piece of output.
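To see what the quoted space is doing, compare three variations on one made-up record. The comma form is an extra not covered above: a comma in print tells awk to insert its output separator, which is a single space by default.

```shell
# One invented record in the passwd layout
record='alice:*:501:20:Alice Example:/Users/alice:/bin/zsh'

# A literal space between the fields
spaced=$(printf '%s\n' "$record" | awk -F":" '{ print $4 " " $5 }')

# No literal: the two fields are run together
joined=$(printf '%s\n' "$record" | awk -F":" '{ print $4 $5 }')

# A comma: awk inserts its default output separator (a space)
comma=$(printf '%s\n' "$record" | awk -F":" '{ print $4, $5 }')

echo "$spaced"   # 20 Alice Example
echo "$joined"   # 20Alice Example
echo "$comma"    # 20 Alice Example
```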

If we want to get a little more daring with our pretty print, we can label our fields and include tabs between them to create columns:

awk -F":" '{ print "user: " $5 "\t\t " "directory: " $6 }' /etc/passwd

This includes labels for identification in our output and separates fields with the \t character to insert tabs. In the passwd file, the fifth field holds the account’s full name or description and the sixth holds its home directory. And as with any command, we can save the output of awk to a new file using the redirection operator (>), the greater-than sign.

awk -F":" '{ print "user: " $5 "\t\t " "directory: " $6 }' /etc/passwd > users.txt
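Here is a small, self-contained sketch of the same idea: labels, a tab, and output redirected to a file. The records, the label text, and the temporary output file are all stand-ins chosen for the example.

```shell
# Invented records in the passwd layout
sample='alice:*:501:20:Alice Example:/Users/alice:/bin/zsh
bob:*:502:20:Bob Example:/Users/bob:/bin/bash'

outfile=$(mktemp)

# Label two fields, separate them with a tab, and save the result
printf '%s\n' "$sample" |
  awk -F":" '{ print "name: " $5 "\t" "home: " $6 }' > "$outfile"

# Nothing was printed to the screen; the output is in the file
saved=$(cat "$outfile")
echo "$saved"
rm -f "$outfile"
```

Note that > replaces the contents of the target file every time you run the command.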

Selecting and printing data is the meat and potatoes of the awk command. And we don’t have to just print numbered fields. We can also search for data within those fields using regular expressions.

If you’re unfamiliar, regular expressions are a search tool. By setting up a regular expression pattern, we can find any text that matches the defined pattern. These patterns can get outrageously complex and, unfortunately, cryptic. Take the command below, for example, which uses a regular expression to match US phone numbers in a standard format. Looking at it, would you guess what it does without being told? Regular expressions are powerful, but their syntax is often opaque. Even experienced users need a syntax refresher after a break from regex.

This program prints any line in the file contacts that consists of a properly formatted phone number. Awk uses POSIX extended regular expressions, which don’t include the \d and \s shorthands found in some other tools, so the pattern spells them out as [0-9] and [[:space:]]. Some older awk builds also lack the {n} repetition syntax.

awk '/^(\+[0-9]{1,2}[[:space:]])?\(?[0-9]{3}\)?[[:space:].-][0-9]{3}[[:space:].-][0-9]{4}$/ { print }' contacts

If you’re new to regular expressions, this Stack Overflow post is a good starting point. You can also read a textbook on regex if you’re interested in the finer points. Returning users can check out this regex cheat sheet.
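To experiment with a phone-number pattern, the sketch below runs one against a few made-up lines. It is deliberately written with only the most widely supported syntax: [0-9] for a digit, a literal space inside the brackets, and repeated classes in place of {3}, since awk does not understand \d or \s and some builds do not handle {n} either. Treat it as one possible pattern, not the definitive one.

```shell
# Invented sample lines; only the first three are full phone numbers
contacts='555-123-4567
(555) 123-4567
+1 555.123.4567
555-1234
call me maybe'

# Optional country code, optional parentheses, then 3-3-4 digits
# separated by a space, a dot, or a hyphen
matches=$(printf '%s\n' "$contacts" | awk '
/^(\+[0-9][0-9]? )?\(?[0-9][0-9][0-9]\)?[ .-][0-9][0-9][0-9][ .-][0-9][0-9][0-9][0-9]$/ { print }
')

echo "$matches"
```

The ^ and $ anchors mean the whole line has to be a phone number, so the short number and the plain text line are skipped.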

Expanding Awk’s matching power

As expected of a programming language, awk can use operators to compare things. You have your standard comparison operators ==, <, >, <=, >=, and !=. You also have access to the awk-specific operators ~ and !~, which mean “matches” and “does not match,” respectively.
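A quick sketch of a numeric comparison next to the two matching operators. The fruit list is invented for the example, and each command relies on awk’s default action of printing the whole line.

```shell
# Invented data: a name and a count, separated by a space
stock='apple 3
banana 12
cherry 7'

# Numeric comparison on the second field
over_five=$(printf '%s\n' "$stock" | awk '$2 > 5')

# First field matches the regular expression /an/
has_an=$(printf '%s\n' "$stock" | awk '$1 ~ /an/')

# First field does not match /an/
no_an=$(printf '%s\n' "$stock" | awk '$1 !~ /an/')

echo "$over_five"   # banana and cherry
echo "$has_an"      # banana
echo "$no_an"       # apple and cherry
```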

Awk Command Examples

awk 'length($0) > 80' data

The above program prints all lines longer than eighty characters in the file data by counting the length of the line (remember, $0) with the built-in length function. We need not include a print command here. When no action is specified, awk will print the full line ($0) whenever it finds a pattern match. It’s like an automatic '{ print $0 }' appended to each command until you specify otherwise.
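The same check is easy to try on a couple of sample lines. This sketch uses a threshold of 20 instead of 80 purely so the example text stays short.

```shell
# Two invented lines: one of 10 characters, one of 31
lines='short line
this line is a good deal longer'

# No action given, so awk prints each line that passes the test
long_lines=$(printf '%s\n' "$lines" | awk 'length($0) > 20')

# The length of every line, for comparison
lengths=$(printf '%s\n' "$lines" | awk '{ print length($0) }')

echo "$long_lines"
echo "$lengths"
```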

$1 == "user" { print }

Note that we’ve left off the awk command and the quotation marks: we’re assuming this is inside an awk program at this point.

This command will print all lines where the first field equals the string “user.” The $0 is not necessary because print alone will output the whole line. Furthermore, without an -F flag, awk will use white space as the default field separator.
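It can help to see how == differs from ~ here. In this sketch, with made-up data, == demands that the first field be exactly user, while ~ /user/ accepts any first field that contains user.

```shell
# Invented data, separated by white space
people='user alice
admin bob
user carol
username dave'

# Exact comparison: the first field must be the string "user"
exact=$(printf '%s\n' "$people" | awk '$1 == "user" { print }')

# Regular expression match: the first field only has to contain "user"
loose=$(printf '%s\n' "$people" | awk '$1 ~ /user/ { print }')

echo "$exact"
echo "$loose"
```

The exact comparison skips the username line; the regular expression match includes it.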

$5 ~ /root/ { print $3 }

This command will print the third field ($3) whenever the fifth field ($5) matches the regular expression /root/, that is, whenever the fifth field contains the string root. The pattern acts as the condition and the action as the consequence, which gives us the if/then relationship that’s crucial to programming.

{
  if ( $5 ~ /root/ ) {
    print $3
  }
}

We can also write our commands in more familiar C-like fashion, using curly braces, indentation, and line breaks. This command does exactly the same thing as the one above, just with a cleaner and more legible presentation. Legibility and organization are especially important for long awk files saved as separate documents. So how do we save our awk programs?
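To convince yourself that the two styles are equivalent, run both against the same input. The five-column sample data below is invented for the purpose.

```shell
# Invented data with five white-space-separated fields
services='svc1 x 10 y root
svc2 x 20 y guest
svc3 x 30 y rootless'

# Pattern { action } form
pattern_form=$(printf '%s\n' "$services" | awk '$5 ~ /root/ { print $3 }')

# C-like form: an action with no pattern, and an if statement inside it
if_form=$(printf '%s\n' "$services" | awk '{
  if ($5 ~ /root/) {
    print $3
  }
}')

echo "$pattern_form"
echo "$if_form"
```

Both print 10 and 30, because /root/ matches root as well as rootless.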

Saving awk programs in files

Awk programs can be saved in files with their very own awk extension. This allows for complex, editable, repeatable programs. To create an awk program file, write your awk program in a plain text file and save it with the extension .awk. Then execute the file in the command line like so:

awk -f ~/scripts/program.awk data

When using the -f flag (mind the lower case), awk grabs the specified file and runs it as an awk program. The commands in that program will process the file data.
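As a rough sketch of the whole round trip, the commands below write a one-rule program to a file, create a small data file, and run one against the other. The temporary directory, file contents, and sample data are assumptions made for the example.

```shell
workdir=$(mktemp -d)

# Save a tiny awk program to its own file
cat > "$workdir/program.awk" <<'EOF'
# Print the first field of every line that contains "com"
/com/ { print $1 }
EOF

# Invented data to process
printf '%s\n' 'ann@example.com work' 'bob@example.org home' > "$workdir/data"

# -f (lower case) tells awk to read the program from the file
result=$(awk -f "$workdir/program.awk" "$workdir/data")

echo "$result"
rm -rf "$workdir"
```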

If you desire, you can begin your awk programs with #!/usr/bin/awk -f in the manner of a shell script. This allows text editors like vim to provide syntax coloring, and once the file is marked executable with chmod +x, it also lets you run the program directly, just like a shell script. As with many programming languages, the # symbol is the comment specifier in awk, so awk itself treats this first line as a comment and ignores it.
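Here is one way that might look in practice. The script name is hypothetical, and because awk does not live at the same path on every system, this sketch looks the path up with command -v instead of hard-coding /usr/bin/awk.

```shell
workdir=$(mktemp -d)
script="$workdir/first-field.awk"   # hypothetical file name

# Write the #! line using wherever awk lives on this machine
printf '#!%s -f\n' "$(command -v awk)" > "$script"

# Append the program itself
cat >> "$script" <<'EOF'
# Print the first field of each line
{ print $1 }
EOF

chmod +x "$script"   # mark the file as executable

# Run the file directly; no "awk -f" needed on the command line
first_fields=$(printf '%s\n' 'alice 501' 'bob 502' | "$script")

echo "$first_fields"
rm -rf "$workdir"
```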

When you write awk programs, header and footer actions can be helpful. An action attached to BEGIN runs once, before awk reads any input; an action attached to END runs once, after the last line has been processed. BEGIN is a good place to set the field separator, and END is a good place for totals, summaries, or cleanup.

BEGIN { FS=":" } # indicates that : is the field separator for the program.

#operations

END { print "You're done" } # prints a joyful message for the user

Awk comments start with # and last until the end of their line. Awk has no multi-line comment syntax, so each line of a longer comment needs its own # symbol.
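Putting BEGIN, a main rule, and END together might look like the sketch below. The records are invented, and count is simply a variable name chosen for the example; awk variables start out empty, so count++ works without any setup.

```shell
# Invented records in the passwd layout
sample='alice:*:501:20:Alice Example:/Users/alice:/bin/zsh
bob:*:502:20:Bob Example:/Users/bob:/bin/bash'

report=$(printf '%s\n' "$sample" | awk '
BEGIN { FS = ":"; print "Accounts:" }   # runs once, before any input is read
{ print "  " $1; count++ }              # runs for every line of input
END { print count " total" }            # runs once, after the last line
')

echo "$report"
```

The output is a heading, one indented account name per line, and a final line reading 2 total.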

Further Exploration

Since awk is a full programming language, one post can’t hope to capture everything about it. Fortunately, some bright folks have written books all about the language’s use. You can explore the GNU documentation for awk for a detailed web-based guide.

More traditionally minded users can thumb through The Awk Programming Language, the textbook written by the developers of the program. Fun fact: the programming language’s odd name is an acronym of their names: Alfred Aho, Peter Weinberger, and Brian Kernighan.



Alexander Fox
