Installation

We are going to play with the sales.csv file and the output of the ls -l command in this chapter.

  • Search for all the lines which contain J

The command below is equivalent to grep "J" sales.csv

    code ⏰  awk '/J/ {print}' sales.csv
    Jan,100
    June,1000
    July,10

Note: If you don't know what the ls -l command does, please read up on the bash shell before proceeding.

  • Print the files that are larger than 300 bytes

      book ⏰  ls -l | awk '$5>300 {print}'
    
      -rw-r--r--  1 admin  staff     403 Aug 28  2016 CONTRIBUTORS.md
      -rw-r--r--  1 admin  staff    5313 Mar 25  2017 README.md
      -rw-r--r--  1 admin  staff    1641 Jul 11 23:53 SUMMARY.md
      -rw-r--r--  1 admin  staff  230613 Aug 23  2016 cover.jpg
      -rw-r--r--  1 admin  staff  692088 Aug 23  2016 cover.xcf

Explanation:

The above command prints a line only when the condition holds. The condition (the pattern) always goes outside the curly braces but inside the single quotes.

However, if one really wants the condition inside the curly braces, they can use the if statement, which we will explore later.
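
As a minimal sketch of that equivalence, the snippet below runs the same size check in both forms. The three "name size" lines are hypothetical sample input, not data from this chapter.

```shell
# Hypothetical sample input: three "name size" pairs.
input='a.txt 120
b.txt 450
c.txt 900'

# Condition outside the braces (pattern form).
pattern_form=$(printf '%s\n' "$input" | awk '$2 > 300 {print}')

# Same condition inside the braces, using an if statement.
if_form=$(printf '%s\n' "$input" | awk '{ if ($2 > 300) print }')

echo "$pattern_form"
echo "$if_form"
```

Both commands print the b.txt and c.txt lines, so the choice between the two forms is mostly a matter of readability.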

  • Print the files that have the character c somewhere in their name

      awk book ⏰  ls -l | awk '$NF ~ /c/ {print}'
      drwxr-xr-x  5 admin  staff     160 Jul 12 21:41 code
      -rw-r--r--  1 admin  staff  230613 Aug 23  2016 cover.jpg
      -rw-r--r--  1 admin  staff  692088 Aug 23  2016 cover.xcf

Explanation:

If the last field of ls -l contains c then print the entire line.

  • Print the files which don't have c in their name

      awk book ⏰  ls -l | awk '$NF !~ /c/ {print}'

The meaning of ~:

The ~ operator tests whether a field matches a regular expression. Here, we used it to match against /c/, i.e. the character c.

Explanation:

If the last field of ls -l does not contain c then print the entire line. The ! mark denotes the negation.

Note: We have used $NF, but with the ls -l layout shown here you can equally use $9, as long as the file name contains no spaces.
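
A small sketch of ~, !~ and $NF on a single line, assuming the nine-field ls -l layout shown in this chapter (the line is copied from the listing above; the variable names are only for illustration):

```shell
# One line in the ls -l layout used in this chapter.
line='-rw-r--r--  1 admin  staff  230613 Aug 23  2016 cover.jpg'

# $NF is the last field; for this nine-field line it is the same as $9.
last=$(printf '%s\n' "$line" | awk '{print $NF}')
ninth=$(printf '%s\n' "$line" | awk '{print $9}')

# ~ is true when the regex matches; !~ is true when it does not.
matches=$(printf '%s\n' "$line" | awk '{ if ($NF ~ /c/) print "yes"; else print "no" }')
negated=$(printf '%s\n' "$line" | awk '{ if ($NF !~ /c/) print "yes"; else print "no" }')

echo "$last $ninth $matches $negated"
```

For any given line, exactly one of ~ and !~ is true. $NF keeps working when the number of fields changes, which a fixed field number does not.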

  • Run two commands at the same time (Logical OR)

      awk book ⏰  ls -l | awk '$NF ~ /c/ {print} $5>300 {print}'
      -rw-r--r--  1 admin  staff     403 Aug 28  2016 CONTRIBUTORS.md
      -rw-r--r--  1 admin  staff    5313 Mar 25  2017 README.md
      -rw-r--r--  1 admin  staff    1641 Jul 11 23:53 SUMMARY.md
      -rw-r--r--@ 1 admin  staff      60 Jul 11 23:52 awkbook.code-workspace
      drwxr-xr-x  5 admin  staff     160 Jul 12 21:41 code
      -rw-r--r--  1 admin  staff  230613 Aug 23  2016 cover.jpg
      -rw-r--r--  1 admin  staff  230613 Aug 23  2016 cover.jpg
      -rw-r--r--  1 admin  staff  692088 Aug 23  2016 cover.xcf
      -rw-r--r--  1 admin  staff  692088 Aug 23  2016 cover.xcf
      drwxr-xr-x  8 admin  staff     256 Jul 12 23:21 manuscript

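Note that placing two rules side by side does not make it a logical AND. awk simply runs both rules on every line, so a line that satisfies both conditions is printed twice (cover.jpg and cover.xcf above).

We can combine the two conditions into a single rule, which prints each matching line only once: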

    awk book ⏰  ls -l | awk '$NF ~ /c/ ||  $5>300 {print}'

Here, || stands for logical OR.

Explanation:

We have two conditions in the above command and one block of code. When either of the two conditions is true, the line is printed.
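
To see the difference between two separate rules and a single || rule, here is a small sketch on hypothetical "name size" input (not the directory listing from this chapter):

```shell
# Hypothetical input: name in $1, size in $2.
input='code 160
cover.jpg 230613
README.md 5313
notes 10'

# Two independent rules: a line that satisfies both is printed twice.
two_rules=$(printf '%s\n' "$input" | awk '$1 ~ /c/ {print} $2 > 300 {print}')

# One rule with ||: each matching line is printed once.
one_rule=$(printf '%s\n' "$input" | awk '$1 ~ /c/ || $2 > 300 {print}')

echo "$two_rules"
echo "---"
echo "$one_rule"
```

Here cover.jpg appears twice in the first output and once in the second, while notes appears in neither.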

  • Logical AND

      awk book ⏰  ls -l | awk '$NF ~ /c/ &&  $5>300 {print}'
      -rw-r--r--  1 admin  staff  230613 Aug 23  2016 cover.jpg
      -rw-r--r--  1 admin  staff  692088 Aug 23  2016 cover.xcf

Explanation:

Here, the line will be printed only if the file name, i.e. the last field, has c in it and the file's size is greater than 300 bytes.

  • File names starting with a

      awk book ⏰  ls -l | awk '$NF ~ /^a/  {print}'
      -rw-r--r--@ 1 admin  staff      60 Jul 11 23:52 awkbook.code-workspace 

Explanation:

^ matches the start of the string being tested (here, the last field). /^a/ matches all strings that start with a.

  • File names ending with md

      awk book ⏰  ls -l | awk '$NF ~ /\.md$/  {print}'
      -rw-r--r--  1 admin  staff     403 Aug 28  2016 CONTRIBUTORS.md
      -rw-r--r--  1 admin  staff    5313 Mar 25  2017 README.md
      -rw-r--r--  1 admin  staff    1641 Jul 11 23:53 SUMMARY.md

Explanation:

$ matches the end of the string being tested (here, the last field). /\.md$/ matches all lines where the file name ends with .md. We had to escape the . with a backslash because an unescaped . is a regex metacharacter that matches any single character.
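
The effect of escaping the dot is easiest to see on a few made-up names. This is only a sketch; the names below are hypothetical and chosen so that the two patterns disagree.

```shell
# Hypothetical file names, one per line.
names='README.md
notes.md
cmd
amd'

# Unescaped dot: "any character followed by md at the end".
unescaped=$(printf '%s\n' "$names" | awk '/.md$/ {print}')

# Escaped dot: a literal ".md" at the end.
escaped=$(printf '%s\n' "$names" | awk '/\.md$/ {print}')

# ^ anchors the match at the start instead.
starts_with_a=$(printf '%s\n' "$names" | awk '/^a/ {print}')

echo "$unescaped"
echo "---"
echo "$escaped"
echo "---"
echo "$starts_with_a"
```

The unescaped pattern also lets cmd and amd through, which is why the backslash matters.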

  • BEGIN, BODY, END

      ls -l | awk 'BEGIN { print "This is BEGIN"} { print "BODY", $0 } END {print "This is END"}'

Explanation: There are three kinds of blocks in awk: BEGIN, BODY and END. When you execute the above statement, you'll see output like

    This is BEGIN
    BODY total 4256
    BODY -rw-r--r--   1 admin  staff     2389 Jul 15 23:26 README.md
    BODY -rw-r--r--   1 admin  staff      130 Jul 12 23:58 SUMMARY.md
    BODY -rw-r--r--@  1 admin  staff       60 Jul 11 23:52 awkbook.code-workspace
    BODY -rw-r--r--   1 admin  staff      119 Dec 12  2016 book.json
    BODY drwxr-xr-x  13 admin  staff      416 Jul 15 23:31 code
    BODY -rw-r--r--@  1 admin  staff  2159800 Jul 12 23:51 gawk.pdf
    BODY drwxr-xr-x   9 admin  staff      288 Jul 15 23:43 manuscript
    This is END

How did it work?

  • BEGIN: executed once before the first input line

  • BODY: executed for each line in the input file

  • END: executed once after the last input line

  • Find out the total size of all files

      awk book ⏰  ls -l | awk 'BEGIN {sum=0} {sum+=$5} END{print sum}'
      2163202

Explanation:

We initialised the variable sum in the BEGIN block, which gets executed before the first line is processed.

In the BODY block, we added $5 (the size of a file) to sum. This block gets executed for every input line.

In the END block, we printed the result, because END is executed once after the last line of the input.
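
As a side note, awk treats a variable that has never been assigned as 0 in arithmetic, so the BEGIN block is there for clarity rather than necessity. A short sketch on hypothetical "name value" input:

```shell
# Hypothetical input: a label in $1 and a number in $2.
input='x 100
y 200
z 300'

# Explicit initialisation in BEGIN.
with_begin=$(printf '%s\n' "$input" | awk 'BEGIN {sum=0} {sum+=$2} END {print sum}')

# No BEGIN block: sum starts out as 0 on its own.
without_begin=$(printf '%s\n' "$input" | awk '{sum+=$2} END {print sum}')

# In END, NR still holds the number of lines read, so an average is one division away.
average=$(printf '%s\n' "$input" | awk '{sum+=$2} END {print sum/NR}')

echo "$with_begin $without_begin $average"
```

Both sums come out as 600 and the average as 200.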

  • Find the sum of sales for all lines that have J in them

      awk -F"," 'BEGIN{sales=0} /J/ {sales+=$2} END {print sales}' sales.csv

Explanation:

Here, we have changed the matching condition and the column being summed. The -F"," option sets the field separator to a comma, so $2 is the sales figure.

  • Printing data into separate files according to Financial Quarters

A financial year has four quarters, Q1 to Q4, of three months each. What we want to do is split our file into four files, one per quarter, depending on the month.

    code ⏰  awk -F"," '/Jan/||/Feb/||/Mar/ {print}' sales.csv > Q1.csv
    code ⏰  cat Q1.csv
    Jan,100
    Feb,200
    March,100

Explanation:

We have already seen logical OR earlier. Here, we search for Jan, Feb and Mar (which also matches March) and redirect the matching lines into the Q1.csv file.

Exercise: Write the code for the other three quarters. This code is a bit cumbersome; in the later chapters, when we get to more advanced awk, we'll learn how to do all this in one line of awk code.
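
One possible shortening, offered as a sketch rather than as the approach used later in the book: the regular expression itself supports alternation with |, so the three patterns can be folded into one. The four input lines are a shortened copy of sales.csv.

```shell
# A shortened copy of the sales data.
input='Jan,100
Feb,200
March,100
April,50'

# Three patterns joined with the logical OR operator.
with_or=$(printf '%s\n' "$input" | awk '/Jan/||/Feb/||/Mar/ {print}')

# One pattern using regex alternation.
with_alternation=$(printf '%s\n' "$input" | awk '/Jan|Feb|Mar/ {print}')

echo "$with_or"
echo "---"
echo "$with_alternation"
```

Both forms select the same three lines; April is left out in either case.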

  • Split the data according to sales

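If the sales of a particular month are 500 or less, write the line to 500.csv; if greater than 500 and less than 1000, to 1000.csv; any value above 1000 goes to large.csv. Note that, with the conditions written this way, a value of exactly 1000 (June) lands in none of the three files; change < 1000 to <= 1000 if you want it in 1000.csv.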

    code ⏰  awk -F"," '$2 <=500 {print}' sales.csv > 500.csv
    code ⏰  cat 500.csv
    Jan,100
    Feb,200
    March,100
    April,50
    July,10
    August,20
    September,456
    October,134

    code ⏰  awk -F"," '$2 > 500 && $2 < 1000 {print}' sales.csv > 1000.csv
    code ⏰  cat 1000.csv
    May,900
    November,934
    December,545

    code ⏰  awk -F"," '$2 > 1000 {print}' sales.csv > large.csv
    code ⏰  cat large.csv
    month,sales

Explanation:

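The issue here is that we don't have any sales above 1000, yet we still have an entry in the large file. That is because we made a very silly mistake: not removing the header. The value "sales" is not a number, so awk compares it with 1000 as a string, and "sales" sorts after "1000". For any file, the header only gives information about the columns, so we need to skip it before processing the data.

Typically, you can use sed to delete the header. Note that -i edits the file in place, and that BSD/macOS sed expects a backup suffix after -i (for example sed -i '' '1d' sales.csv).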

    code ⏰  sed -i '1d' sales.csv

But because we are learning awk, we'll use awk to delete the header, or rather, ignore the header while processing the text.

    code ⏰  awk 'NR>1 {print}' sales.csv

Explanation:

NR: a built-in awk variable which stores the number of the input line (record) currently being processed.

    code ⏰  awk '{print NR, $0}' sales.csv
    1 month,sales
    2 Jan,100
    3 Feb,200
    4 March,100
    5 April,50
    6 May,900
    7 June,1000
    8 July,10
    9 August,20
    10 September,456
    11 October,134
    12 November,934
    13 December,545

Explanation:

So when we say NR>1, awk will give all lines except the first. We can then redirect the output to any file we want. For that matter, you can use any condition with NR; if you want all even-numbered lines, use the condition NR%2==0.
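
To tie NR>1 back to the large.csv problem, here is a sketch on a shortened, hypothetical copy of sales.csv written to a scratch directory, so the real file is left alone:

```shell
# Scratch directory with a four-line sample file.
dir=$(mktemp -d)
cat > "$dir/sales.csv" <<'EOF'
month,sales
Jan,100
May,900
June,1000
EOF

# Without NR>1 the header slips through: "sales" is compared with 1000 as a string.
with_header=$(awk -F"," '$2 > 1000 {print}' "$dir/sales.csv")

# With NR>1 the header is skipped, and no data row exceeds 1000.
without_header=$(awk -F"," 'NR > 1 && $2 > 1000 {print}' "$dir/sales.csv")

echo "without NR>1: [$with_header]"
echo "with NR>1:    [$without_header]"
rm -r "$dir"
```

The first command prints the header line, while the second prints nothing.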

  • Print all lines longer than 10 characters

      code ⏰  awk 'length > 10 {print}' sales.csv
      month,sales
      September,456
      October,134
      November,934
      December,545
    
      code ⏰  awk 'length($0) > 10 {print}' sales.csv
      month,sales
      September,456
      October,134
      November,934
      December,545

Note: length($0) is equivalent to length. Both of them calculate the length of a line.

As we want to skip the header, we use the following command.

    code ⏰  awk 'NR>1 && length($0) > 10 {print}' sales.csv
    September,456
    October,134
    November,934
    December,545

Explanation:

All the lines longer than 10 characters, except the header.

I encourage the reader to experiment with NR and length to understand them better. Both are very important concepts.

  • Finding the length of the longest line in a file

      code ⏰  awk 'BEGIN {max=0} {if (length($0)>max) max = length($0)} END {print "max length is ", max}' sales.csv
      max length is  13

Explanation:

Create a variable called max.

If the length of the current line is greater than max, store it in max. Do this until the end of the file.

At the end, print the variable.

The if condition is executed inside the BODY block.

    {if (condition) {statement}}

Note: The if above contains just one statement. Since awk allows multiple statements in a block, separated by semicolons or newlines, you can have as many statements as you like.

    code ⏰  awk '{if (length($0)>max) {max = length($0)}; print "HELLO"} END {print "max length is ", max}' sales.csv

Then, the END block gets executed after the last input line and we print the max length. In the above example, HELLO is printed quite unnecessarily for every line, but it shows that you can include other statements after an if statement.
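
The same search can also be written with the condition outside the braces, and the block can remember the line itself as well as its length. This is a sketch on three lines from sales.csv; the variable name longest is our own choice.

```shell
# Three lines taken from the sales data.
input='Jan,100
September,456
May,900'

# Pattern form: the block runs only when the current line beats the best so far.
result=$(printf '%s\n' "$input" | awk '
    length($0) > max { max = length($0); longest = $0 }
    END { print max, longest }')

echo "$result"
```

This prints 13 followed by September,456, because the two statements inside the block run together whenever a longer line is found.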

  • Ignore empty lines

Add multiple empty lines to sales.csv and then save the file. I added two, one after Jan,100 and one after Feb,200

    code ⏰  awk -F"," 'NF>1 {print}' sales.csv
    month,sales
    Jan,100
    Feb,200
    March,100
    April,50
    May,900
    June,1000
    July,10
    August,20
    September,456
    October,134
    November,934
    December,545

Explanation:

This removed the blank lines which we had inserted.

To understand how this works, we must understand how NF works.

    code ⏰  awk -F"," '{print NF, $0}' sales.csv
    2 month,sales
    2 Jan,100
    0
    2 Feb,200
    0
    2 March,100
    2 April,50
    2 May,900
    2 June,1000
    2 July,10
    2 August,20
    2 September,456
    2 October,134
    2 November,934
    2 December,545

We have two fields in December,545, which is why NF is 2 for this line. For a blank line, NF is 0, so the condition NF>1 filters it out.
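
One detail worth checking for yourself is a line that looks empty but contains spaces. The sketch below uses hypothetical input with one truly empty line and one line holding two spaces, and prints NF with and without -F",".

```shell
# Four lines: data, an empty line, a line of two spaces, data.
input=$(printf 'Jan,100\n\n  \nFeb,200')

# With a comma separator, the spaces-only line counts as one field.
nf_comma=$(printf '%s\n' "$input" | awk -F"," '{print NF}')

# With the default separator, whitespace is skipped, so that line has no fields.
nf_default=$(printf '%s\n' "$input" | awk '{print NF}')

# NF>1 with the comma separator drops both kinds of blank line.
kept=$(printf '%s\n' "$input" | awk -F"," 'NF>1 {print}')

echo "$nf_comma"
echo "---"
echo "$nf_default"
echo "---"
echo "$kept"
```

With -F"," the field counts are 2, 0, 1, 2, which is why NF>1 rather than NF>0 is the safer test for this file.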

  • Print number of lines of a file

      code ⏰  awk 'END {print NR}' sales.csv
      15
      code ⏰  wc -l sales.csv
      15 sales.csv

Explanation:

After the last line is processed, NR will hold the line number of the last line. Thus we print it in the END block to get the number of lines in a file.
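
Since NR counts every line, blank ones included, a count of only the non-empty lines needs its own counter. A sketch on hypothetical four-line input (the counter name n is arbitrary):

```shell
# Four lines, one of them empty.
input=$(printf 'month,sales\nJan,100\n\nFeb,200')

# NR in END: every line counts.
total=$(printf '%s\n' "$input" | awk 'END {print NR}')

# Count only lines that have at least one field.
nonblank=$(printf '%s\n' "$input" | awk 'NF > 0 { n = n + 1 } END {print n}')

echo "$total $nonblank"
```

This prints 4 and 3.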

  • Print even or odd lines

      code ⏰  awk 'NR % 2 == 0 {print}' sales.csv
      Jan,100
      Feb,200
      March,100
      May,900
      July,10
      September,456
      November,934

Explanation:

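We have already seen this above: NR % 2 == 0 selects the even-numbered lines, and NR % 2 == 1 selects the odd-numbered ones. The output reflects the two blank lines we added to sales.csv earlier, which shift the line numbers.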
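
A quick sketch of both cases on five hypothetical one-letter lines:

```shell
# Five lines, numbered 1 to 5 by NR.
input='a
b
c
d
e'

even=$(printf '%s\n' "$input" | awk 'NR % 2 == 0 {print}')
odd=$(printf '%s\n' "$input" | awk 'NR % 2 == 1 {print}')

echo "$even"
echo "---"
echo "$odd"
```

The even selection gives b and d; the odd selection gives a, c and e.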

  • Print the number of files that were updated in a particular month

First, find out in which months your files were updated.

    awk book ⏰  ls -l | awk '{print $6}' | sort | uniq
    Dec
    Jul

Run the above command in your terminal. In my terminal, I have files updated only in Dec and Jul, so I'll use one of those in the below command

    awk book ⏰  ls -l | awk '{if ($6 == "Jul") { x = x + 1 }} END {print x}'
    6
    awk book ⏰  ls -l | grep Jul | wc -l
    6

The first command can be rewritten as

    awk book ⏰  ls -l | awk '$6 == "Jul" { x = x + 1} END {print x}'

Note: Do check the output of ls -l on your terminal. For me, $6 is the month in which the file was last updated; it may well be a different field in your case.
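
One small caveat with this counter: if no line matches, x is never assigned and print x shows an empty line rather than 0. Printing x + 0 forces a number. The sketch below uses two lines copied from the listing earlier in this chapter.

```shell
# Two lines in the ls -l layout from this chapter.
listing='-rw-r--r--  1 admin  staff  2389 Jul 15 23:26 README.md
-rw-r--r--  1 admin  staff   119 Dec 12  2016 book.json'

# A month that occurs once.
jul=$(printf '%s\n' "$listing" | awk '$6 == "Jul" { x = x + 1 } END { print x + 0 }')

# A month that never occurs: with and without the "+ 0".
jan=$(printf '%s\n' "$listing" | awk '$6 == "Jan" { x = x + 1 } END { print x + 0 }')
jan_raw=$(printf '%s\n' "$listing" | awk '$6 == "Jan" { x = x + 1 } END { print x }')

echo "Jul: $jul, Jan: $jan, Jan without +0: [$jan_raw]"
```

The last value is an empty string, which is easy to miss when the result is fed to another command.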
