Development

complete-awk-guide

Captain Salem 16 min read

Beginners Guide to AWK Programming

AWK is a Turing-complete pattern matching programming language. The name AWK is derived from the family names of its three authors: Alfred Aho, Peter Weinberger and Brian Kernighan. AWK is often associated with sed, which is a UNIX command line tool. However, sed is more appropriate for one line UNIX shell commands and is typically used only for text processing.

The basic function of awk is to search files for lines (or other units of text) that contain certain patterns. When a line matches one of the patterns, awk performs specified actions on that line. awk continues to process input lines in this way until it reaches the end of the input files.

AWK is great for data reporting, analysis, and extraction and supports arrays, associative arrays, functions, variables, loops, and regular expressions. Current Linux systems use improved versions of the original AWK utility. The main enhancement to these AWK variants is support for a larger set of built-in functions and variables. The most widely used variants of AWK are: Gawk, Mawk, and Nawk.

In this tutorial, we will use the GNU version of AWK - Gawk

There are many practical uses of AWK. For example, you can use AWK and the history command to find your top 10 most frequently issued commands:

history | awk '{CMD[$2]++;count++;} END {for (a in CMD)print CMD[a] " "CMD[a]/count*100 " % " a;} ' | grep -v "./" | column -c3 -s " " -t | sort -rn | head -n10

This tutorial assumes familiarity with programming language concepts and is meant to provide an overview of some basic elements of the AWK programming language. Here are some of the concepts you will learn.

AWK Basics

In this section you will learn basics of the AWK programming language, including:

  • How to execute AWK from the command line with one-off commands and by storing AWK code in files.
  • Creating and using variables, arrays, and functions.
  • Special patterns, like BEGIN and END.

A pattern in AWK controls the execution of rules and a rule is executed when its pattern is a match for the current input record.

Run an AWK Program

A program in AWK can be written via the command line or by executing a file containing the program. If you want to reuse your code, it is better to store it in a file. AWK reads input from standard input or from files specified as command line arguments. Input is divided into individual records and fields. By default, new lines are parsed as a record and whitespace is parsed as a field. After a record is read, it is split into fields. AWK does not alter the original input.

The next two sections will walk you through creating a Hello World! program that you will run, both as a one-off program on the command line, and as reusable code saved in a file.

Hello World! - Command Line

When an AWK program contains the BEGIN pattern without another special pattern, AWK will not expect any further command line input and exit. Typically, when an AWK program is executed on the command line, without the BEGIN special pattern, AWK will continue to expect input until you exit by typing CTRL-D. The example Hello World! program below will print and immediately exit.

Execute the command as follows:

awk 'BEGIN { print "Hello World!" }'

The output will be as follows:

Hello World!

Hello World! - Input File

In this section, you will create an AWK program in an input file and then run it from the command line.

  1. You can also run AWK programs as executable scripts.

Open test.awk and add a bash script line to the top of the file and save it without the .awk extension.

#!/usr/bin/awk -f
BEGIN { print "Hello World!" }

The output will be as follows:

Hello World!

On the command line, run the helloworld.awk program. The -f option tells AWK to expect a source file as the program to run.

awk -f helloworld.awk

Create a file called helloworld.awk with the following content:

BEGIN { print "Hello World!" }

The #!/usr/bin/awk -f line defines the start of script execution.

The output will resemble the following:

Hello World!

Execute the helloworld program:

./helloworld

Add execute permissions to file:

chmod +x test

Variables in AWK

AWK supports built-in and user defined variables. Built-in variables are native to AWK, whereas user defined variables are ones you define.

Built-in Variables

AWK has many built-in variables that are automatically initialized. Some of the most important ones are the following:

Variable Definition
NF Holds the number of fields in the current input record. Each record can have a different number of fields.
FS Defines the input field separator. The default value is a whitespace, but it also matches any sequence of spaces and tabs. Additionally, any number of leading or trailing whitespaces and tabs are ignored. If the value of FS is set to the null string, then each character in the current line becomes a separate field.
FILENAME Stores the filename of the current input file. You cannot use FILENAME inside a BEGIN block, because there are no input files being processed.
NR Keeps track of the total number of records that have been read so far.
FNR Stores the total number of records that have been read from the current input file.
IGNORECASE Tells AWK whether or not to ignore case in all of its comparisons or regular expressions. If IGNORECASE stores a non-zero or null value, then AWK will ignore case.
ARGC Holds the number of command line arguments.
ARGV Stores the actual command line arguments of an AWK program.

User Defined Variables

User defined variables can store numeric or string values. AWK dynamically assigns variables a type based on the variable’s initial value. User defined variables, by default, are initialized to the empty string. If you convert a variable from a string to a number, the default value is zero. You can convert a string to a number and vice versa as long as the string can be converted to a valid number. It is important to keep in mind that AWK is not a type safe programming language, since this can sometimes generate bugs.

To initialize variables within an input file, you can use the form myvariable = "myvar" for strings and myvariable = 10 for numeric values. Create a file named count.awk and add the following content:

BEGIN {
count = 10
print count
}

To run this file, switch back to the command line and execute the following command:

awk -f count.awk

Your output should display:

10

You can set a variable via the command line using the -v option. This command will initialize the variable count and print its value:

awk -v count=8 'BEGIN { print count }'

Special Patterns

AWK uses patterns to control how a rule should be executed against an input record. The two main categories of patterns in AWK are regular expressions and expressions. Regular expressions use a special format to target specific sets of strings, while expressions encompass various ways to target patterns in AWK, like comparison expressions that may utilize regular expressions. Special patterns in AWK include reserved keywords that perform special actions within your AWK programs. The sections below discuss the special patterns BEGIN, END, BEGINFILE, and ENDFILE.

BEGIN and END

BEGIN and END are executed only once: before receiving any input and after processing all input, respectively. In this way, they can be used to perform startup and cleanup actions in your AWK programs.

Although it is not required to use BEGIN and END at the beginning and end of your AWK programs, it is considered good practice to do so. Additionally, you can include multiple BEGIN and END blocks in one program.

If an AWK program uses only BEGIN rules without any other code, the program terminates without reading any of the specified input. However, if an AWK program contains only END rules without any additional code, all the specified input is read. This is necessary in case the END rule references the FNR and NR variables.

BEGINFILE and ENDFILE

BEGINFILE and ENDFILE only work with gawk.

Two other patterns with special functionality are BEGINFILE and ENDFILE. BEGINFILE is executed before AWK reads the first record from a file, whereas ENDFILE is executed after AWK is done with the last record of a file.

ENDFILE is convenient for recovering from I/O errors during processing. The AWK program can pass control to ENDFILE, and instead of stopping abnormally it sets the ERRNO variable to describe the error that occurred. AWK clears the ERRNO variable before it starts processing the next file. Similarly, the nextfile statement – when used inside BEGINFILE – allows gawk to move to the next data file instead of exiting with a fatal error and without executing the ENDFILE block.

As an example, create a file named testfile.awk:

   ```bash
   BEGIN {
       numberOfFiles = 0
       }
   BEGINFILE {
       print "New file", FILENAME
   ```

    # Check if there is an error while trying to read the file
    if (ERRNO) {
        print "Cannot read", FILENAME, "– processing next file!"
        nextfile
    }

}

ENDFILE {
    numberOfFiles++
}

END {
    print "Total number of files processed: ", numberOfFiles
}
  • This program showcases the usage of BEGIN, END, BEGINFILE, and ENDFILE by printing the total number of files read as well as the filename of each file.
  • If there is a problem while reading a file, the code will report it.
  • Printing the filename is done with the help of the FILENAME variable.

The output will be similar to the following example. The program does not stop abnormally when it does not find an input file and provides a useful error message.

New file hw.awk
Cannot read hw.awk – processing next file!
New file beginfile.awk
New file givenLine.awk
Cannot read givenLine.awk – processing next file!
New file doesNotExist
Cannot read doesNotExist – processing next file!
Total number of files processed:  1

Execute the file with the following command:

gawk -f hw.awk testfile.awk givenLine.awk doesNotExist

Looping in AWK

AWK supports for, do-while, and while loops that behave similarly to control flow statements in other programming languages. Loops execute code contained within a code block as many times as specified in the control flow statement. To illustrate loops in AWK, a working example is provided below.

Create and save a file named loops.awk:

1. BEGIN {
   for (i = 0; i < ARGC; i++)
       printf "ARGV[%d] = %s\n", i, ARGV[i]

   k = 0
   while ( k < ARGC ) {
       printf "ARGV[%d] = %s\n", k, ARGV[k]
       k++
   }

   m = 0
   do {
       printf "ARGV[%d] = %s\n", m, ARGV[m]
       m++
   } while ( m < ARGC )
   }

END {
    for (i = 0; i < 10; i++)
        printf "%d ", i
    printf "\n"
}
  • The program uses the value of the ARGC built-in variable to control how many times to loop through each separate block of code. The result will vary depending on how many command line arguments you pass to AWK when executing the program.
  • The for loop after the END special pattern will print numbers from 0 - 9.

The output will be similar to the following:

ARGV[0] = awk
ARGV[0] = awk
ARGV[0] = awk
0 1 2 3 4 5 6 7 8 9

Execute the loops.awk input program with the following command:

echo "" | awk -f loops.awk

Arrays

AWK does not require array indices to be consecutive integers. Instead, strings and numbers may be used. This is because AWK uses string keys internally to represent an array’s indices, and so arrays in AWK are more like associative arrays that store a collection of pairs. Unlike other programming languages, you do not need to declare an array and its size before using it, and new pairs can be added at any time. The file below serves to illustrate the behavior of arrays in AWK.

Create the file arrays.awk:

BEGIN {
a[0] = 1;
a[1] = 2;
a[2] = 3;
a[3] = 4;

for (i in a)
    print "Index:", i, "with value:", a[i];
    print "Adding two elements and deleting a[0]";

    a["One"] = "One_value";
    a["Two"] = "Two_value";
    delete a[0];

for (i in a)
    print "Index:", i, "with value:", a[i];

if (a["1"] == a[1])
    printf "a[1] = a[\"1\"] = %s\n", a["1"];
}
  • The program creates the a[] array and initializes it with four separate numeric values.
  • The for block will loop through the array and print the current index and value.
  • It then adds two new elements to array a[] that use string indices instead of numbers.
  • It demonstrates how to delete an element from an array by deleting the a[0] element.
  • Finally, the if statement evaluates if a["1"] and a[1] are equivalent. Since AWK stores all array elements as string keys, both indices point to the same array element and the code in the if statement executes.
  1. The output will look similar to the following:```bash Index: 0 with value: 1 Index: 1 with value: 2 Index: 2 with value: 3 Index: 3 with value: 4 Adding two elements and deleting a[0] Index: Two with value: Two_value Index: One with value: One_value Index: 1 with value: 2 Index: 2 with value: 3 Index: 3 with value: 4 a[1] = a[“1”] = 2

Run the program with the following command:

awk -f arrays.awk

The order of the array indices may be out of order. This is because arrays in AWK are associative and not assigned in blocks of contiguous memory.

Functions

Like most programming languages, AWK supports user-defined functions and ships with several useful built-in functions. This section will provide examples demonstrating how to use both types of functions.

Predefined Functions

AWK’s built-in functions provide mechanisms for string manipulation, numeric operations, and I/O functions to work with files and shell commands. The example below utilizes the built-in numeric functions rand() and int() to show how to call built-in functions.

  1. Create and save a file named rand.awk:```awk BEGIN { while (i < 20) { n = int(rand()*10); print “value of n:”, n; i++; } }
  • The rand.awk program uses the rand() function to generate a random number and stores it in the n variable. By default, rand() returns a random number between 0 and 1. To generate numbers larger than 1, the program multiplies the returned random number by 10.
  • AWK’s int() function rounds the result of the rand() function to the nearest integer.
  1. The output will resemble the following:```bash value of n: 2 value of n: 2 value of n: 8 value of n: 1 value of n: 5 value of n: 1 value of n: 8 value of n: 1

Execute the rand.awk program with the following command:

awk -f rand.awk

User Defined Functions

The AWK programming language allows you to define your own functions and call them throughout an AWK program file. A function definition must include a name and can include a parameter list. Function names can only contain a sequence of letters, digits, and underscores. The function name cannot begin with a digit. In the example below, you will declare a function definition and utilize it within the AWK program.

Create and save the myFunction.awk file:

function isnum(x) { return(x==x+0) }

function sumToN(n) {
    sum = 0
    if (n < 0) { n = -n }
    if ( isnum(n) ) {
        for (j = 1; j <= n; j++)
            sum = sum + j
    } else { return -1 }
    return sum
}
{
    for (i=1; i<=NF; i++)
        print $i, "\t:", sumToN($i)
}
  • The user defined function sumToN() takes a single parameter n and uses a for loop to increment its value and stores it in the sum variable.
  • The program will take command line input, and pass it as a parameter to the sumToN() function and print the calculated sum.

Your output will resemble the example below. If you use a different set of numbers, your output will differ from the example.

10 : 55
12 : 78

Execute myFunction.awk with the following command:

echo "10 12" | awk -f myFunction.awk

Practical Examples

This section of the guide provides a variety of practical examples to further demonstrate the AWK programming language. You can try out each example on your own Linux machine or expand on the examples for your own specific needs.

Printing

Printing a Given Line from a File

To use AWK to print a given line from a text file, create and save the givenLine.awk file:

{
if (NR == line)
    print $0;
}
  • This program will print out the record that corresponds to the value passed to the line variable. The program will require input either from the command line or from a file.
  • You should pass the value of the line variable to the AWK program as a command line argument using the -v option.

This time the output is as follows:

sum = 0

Execute givenLine.awk again, passing line 4:

awk -v line=4 -f givenLine.awk myFunction.awk

By executing the givenLine.awk program as follows, it will print out the first line found in the myFunction.awk program written in the previous section. (You could similarly pass it any text file.)

awk -v line=1 -f givenLine.awk myFunction.awk

The output will resemble the following:

function isnum(x) { return(x==x+0) }

Printing Two Given Fields from a File

In this example, the AWK program will print the values of the first and third fields of any text file.

Create and save the file field1and3.awk:

{
    print $1, $3;
}

Create and save the file words.txt:

one two three

Execute field1and3.awk passing words.txt as input:

awk -f field1and3.awk words.txt

The output will print only the first and third words (fields) contained in the file:

one three

You can also execute the contents of field1and3.awk on the command line and pass words.txt as input:

awk '{print $1, $3}' words.txt

Counting

Counting Lines

The following example AWK program will count the number of lines that are found in the given text file(s).

FNR stores the total number of records that have been read from the current input file.

Create and save the countLines.awk file:

{
    if (FNR==1)
        print "Processing:", FILENAME;
    }

END {
    print "Read", NR, "records in total";
}
  • The use of FNR makes sure that the filename of each processed file will be printed only once.
  • END makes sure that the results will be printed just before AWK finishes executing countLines.awk.
  1. Create and save the data.txt file. This file will be passed to AWK as input for processing.
one
two
three
4
6
seven not eight

Execute countLines.awk with the following command, passing data.txt as input:

Execute countLines.awk with the following command, passing data.txt as input:

awk -f countLines.awk data.txt

The output will resemble the following:

Processing: data.txt
Read 7 records in total

Execute countLines.awk with multiple files for processing. You can use words.txt from the previous exercise.

awk -f countLines.awk data.txt words.txt
  1. You should see a similar output:```bash Processing: data.txt Processing: words.txt Read 8 records in total

Counting Lines with a Specific Pattern

The following AWK code uses the variable n to count the number of lines that contain the string three:

awk '/three/ { n++ }; END { print n+0 }'
  • The code above tells AWK to execute n++ each time there is a match to the /three/ regular expression.
  • When the processing is done, the code in END is executed. This code prints the current value of n converted to a number by adding the numeral zero.
  1. The output will look as follows:```bash 2 #### Counting Characters

Execute the example code and pass dataFile.txt as input:

awk '/three/ { n++ }; END { print n+0 }' dataFile.txt

Create a file named dataFile.txt to pass to AWK as input for processing:

one
two
three
four
three
two
one

In this example, the countChars.awk file calculates the number of characters found in an input file.

Create and save the file countChars.awk:

BEGIN {
    n = 0;
    }
{
    if (FNR==1)
        print "Processing:", FILENAME;

    n = n + length($0) + 1;
}

END {
    print "Read", n, "characters in total";
}
  • This program makes use of the built-in string function length(), which returns the number of characters in a string. In the case of the program, the string will be provided by the entirety of the current record, which is indicated by $0.
  • The + 1 appended to the length() function is used to account for the new line character that each line includes.

Execute countChars.awk by running the following command and pass it the countLines.awk file from the previous exercise.

awk -f countChars.awk countLines.awk

The output will look similar to the following:

Processing: countLines.awk
Read 110 characters in total

Execute countChars.awk with multiple files to process as follows:

awk -f countChars.awk countLines.awk field1and3.awk

Output

Processing: countLines.awk
Processing: field1and3.awk
Read 132 characters in total

Calculating Word Frequencies

This example demonstrates some of the advanced capabilities of AWK. The file wordFreq.awk reads a text file and counts how many times each word appears in the text file using associative arrays.

Create and save the file wordFreq.awk:

{
    for (i= 1; i<=NF; i++ ) {
        $i = tolower($i)
        freq[$i]++
    }
    }

END {
    for (word in freq)
        print word, ":", freq[word]
}
  • wordFreq.awk uses a for loop to traverse through an input file and add each record to the freq[] array.
  • The tolower() built-in string function is used to ensure the program does not count the same word multiple times based on differences in case, e.g., seven and Seven are not counted as different words.
  • Before the program exits, the END block prints out each word and its frequency with the input file.

Create and save the file wordFreq.txt to use as an input file.

one two
three one four seven Seven
One Two TWO
one three five

Execute the wordFreq.awk program and pass wordFreq.txt as input:

awk -f wordFreq.awk wordFreq.txt | sort -k3rn

The sort -k3rn command is used to sort the output of wordFreq.awk based on a numeric sort in reverse order.

The output will resemble the following:

one : 4
two : 3
seven : 2
three : 2
five : 1
four : 1

Updating Docker Images

Use the following series of piped commands to update all Docker images found on your local machine to their latest version:

docker images | grep -v REPOSITORY | awk '{print $1}' | xargs -L1 docker pull
  • In this example, AWK is just a piece of the entire command. AWK does the job of extracting the first field from the result of executing the docker images command.

Finding

Finding the Top-10 Commands of your Command History

Use The following shell command to find your top 10 most used commands by piping the output of history to AWK as input:

history | awk '{CMD[$2]++;count++;} END {for (a in CMD)print CMD[a] " "CMD[a]/count*100 " % " a;} ' | grep -v "./" | column -c3 -s " " -t | sort -rn | head -n10
  • First, the command executes the history command to be used as AWK’s input.
  • This is processed by a complex awk command that calculates the number of times each command appears in history by considering the second field of each record. This is the field that corresponds to the previously issued commands. These values are stored in the CMD[] associative array.
  • At the same time, the total number of commands that have been processed are stored in the count variable.
  • The frequency of each command is calculated with the CMD[a]/count*100 statement and printed on the screen along with the command name.
  • The formatting and the sorting of the output is handled by the grep, column, sort, and head command line utilities.

Your output should resemble the following:


2318  18.4775    %  git
1224  9.75688    %  ll
1176  9.37425    %  go
646   5.14946    %  docker
584   4.65524    %  cat
564   4.49582    %  brew
427   3.40375    %  lenses-cli
421   3.35592    %  cd
413   3.29215    %  vi
378   3.01315    %  rm

Finding the Number of Records that Appear More than Once

This program’s logic utilizes the behavior of AWK associative arrays. The associative array’s keys are the entire lines of the passed input. This means that if a line appears more than once, it will be found in the associative array and will have a value that is different from the default, which is 0.

  1. The output will look similar to the following:``` Found 42 lines with records that already existed

Execute the command again, passing the file twice to itself:

awk -f nDuplicates.awk nDuplicates.awk nDuplicates.awk

The output will look similar to the following:

Found 5 lines with duplicate records.

Execute the nDuplicates.awk file and pass the file to itself as input:

awk -f nDuplicates.awk nDuplicates.awk

Create and save the file nDuplicates.awk:

BEGIN {
    total = 0;
    }

{
    i = tolower($0);
    if (freq[i] == 1) {
        total = total + 2;
    } else if (freq[i] > 1) {
        total++;
    }
        freq[i]++;
}

END {
    print "Found", total, "lines with duplicate records.";
}

You can learn more about AWK and its alternative here: https://www.gnu.org/software/gawk/

NOTE: Tutorial Inspired By Linode Docs

Conclusion

We hope you enjoyed this tutorial. Feel free to share and check out Linode Cloud Services to learn how you expand your cloud, Linux, and DevOps knowledge (Not Sponsored!!).

Thanks for reading, catch you in the next one.

Share
Comments
More from Cloudenv

Cloudenv

Developer Tips, Tricks and Tutorials.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to Cloudenv.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.