Bash Magic - Word Frequency

Calculating the frequency of each word in a regular text file using Bash

In this post we'll look at Bash and its knack for providing easy solutions to relatively complex problems. Let's consider the classic problem of calculating the frequency of each word in a plain text file, input.txt. Words can appear in uppercase or lowercase letters, and we count them case-insensitively. We assume that words are separated by any number of spaces or line breaks.

Given the following input text, for example:

input.txt
docker Kubernetes  
  AWS  kubernetes aws
    Docker aws  
EKS   eks Azure

We expect the following output:

aws 3
kubernetes 2
eks 2
docker 2
azure 1

With Bash, we can write a simple one-liner that produces this result. Here's the command:

tr '[:upper:]' '[:lower:]' < input.txt | tr -s ' ' '\n' | sort | uniq -c | sort -r | awk '{print $2, $1}'

We'll look at the anatomy of the command next.
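
To try the one-liner on the sample text, the input can be recreated first. The following is a minimal sketch; the temporary directory and the `result` variable are assumptions made for illustration and are not part of the original command.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory (the location is arbitrary).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# printf keeps the leading and trailing spaces exactly as in the sample.
printf '%s\n' \
  'docker Kubernetes  ' \
  '  AWS  kubernetes aws' \
  '    Docker aws  ' \
  'EKS   eks Azure' > input.txt

# Run the one-liner from the post and keep its output in a variable.
result=$(tr '[:upper:]' '[:lower:]' < input.txt | tr -s ' ' '\n' | sort | uniq -c | sort -r | awk '{print $2, $1}')
printf '%s\n' "$result"
```

Running the script should print the five lines of the expected output shown earlier.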

The command explained

Let's lay out our command into several lines for easier reference:

tr '[:upper:]' '[:lower:]' < input.txt | \
tr -s ' ' '\n' | sort | \
uniq -c | sort -r | \
awk '{print $2, $1}'

Line 1 transforms the input text to contain lowercase letters only:

tr '[:upper:]' '[:lower:]' < input.txt

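Line 2 splits the text on the space character ' ' by replacing each space with a newline, so that every word lands on its own line. The -s flag squeezes the resulting runs of newlines into a single one, and sort then orders the words alphabetically: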

tr -s ' ' '\n' | sort 

The resulting output is:

aws
aws
aws
azure
docker
docker
eks
eks
kubernetes
kubernetes
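
The split above only treats the space character as a separator, and it would leave one empty line at the top if the file started with a space. If the input might also contain tabs, one possible variant is sketched below. The function name `split_words` and the sample string are hypothetical; the sketch assumes a `tr` that pads the shorter second set with its last character, as the GNU and BSD versions do.

```shell
split_words() {
  # Turn every whitespace character (space, tab, newline) into a newline
  # and squeeze runs of newlines into one. grep then drops the single
  # empty line that remains when the input begins with whitespace.
  tr -s '[:space:]' '\n' | grep -v '^$'
}

# Sample with a leading space, a tab and an empty line.
printf ' docker\taws\n\n  aws docker\n' | split_words
```

For this sample the function prints docker, aws, aws and docker, each on its own line.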

Line 3 collapses adjacent identical lines into a single entry and prefixes each entry with its number of occurrences. The output is then sorted in reverse order, so the highest counts come first:

uniq -c | sort -r

The output, with the leading padding that uniq -c adds omitted, is:

3 aws
2 kubernetes
2 eks
2 docker
1 azure
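
One caveat: sort -r compares lines as text, so the ordering above relies on the way uniq -c pads its counts. A numeric sort is a more defensive choice once counts reach two digits. The sketch below uses two hand-written, unpadded count lines (hypothetical data, not taken from input.txt) to show the difference.

```shell
# Two count lines without padding, for illustration only.
counts=$(printf '9 docker\n10 aws\n')

# Text comparison: "9" sorts after "1", so 9 ends up ahead of 10.
text_order=$(printf '%s\n' "$counts" | sort -r)

# Numeric comparison, highest count first.
numeric_order=$(printf '%s\n' "$counts" | sort -rn)

printf 'sort -r:\n%s\n\nsort -rn:\n%s\n' "$text_order" "$numeric_order"
```

With sort -r the line "9 docker" comes first, while sort -rn lists "10 aws" first. Swapping sort -r for sort -rn in the pipeline should leave the sample output unchanged, since the counts there are all single digits.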

Finally, line 4 swaps the two fields of each line, printing the word first and the number of occurrences second:

awk '{print $2, $1}'

We get our final output as:

aws 3
kubernetes 2
eks 2
docker 2
azure 1

This solution may not be the most efficient one, but it's relatively simple and gets the job done without extensive coding.
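
As one possible alternative, the counting can be done in a single awk pass, leaving only the final ordering to sort. This is a sketch, not a benchmarked replacement; the function name `word_freq` is an assumption made for illustration. The sort keys are chosen to reproduce the tie order of the output above (reverse alphabetical among equal counts).

```shell
word_freq() {
  # $1: path to the input file
  awk '{
         # Default field splitting copes with spaces, tabs and empty lines.
         for (i = 1; i <= NF; i++)
           count[tolower($i)]++
       }
       END {
         for (word in count)
           print word, count[word]
       }' "$1" | sort -k2,2nr -k1,1r   # count descending, then word in reverse order
}
```

Calling word_freq input.txt on the sample file should print the same five lines as the original pipeline.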