This lesson is in the early stages of development (Alpha version)

Text Processing Tools & Pipeline

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do we process text-based information using UNIX tools?

  • How do we build a processing pipeline by combining UNIX tools?

Objectives
  • Learning basic UNIX tools to process text data, such as filtering, sorting, and selecting data.

  • Learning how UNIX tools can be combined to make a pipeline.

In the previous episode, we learned to use UNIX commands to manipulate files and directories. However, UNIX also contains versatile tools that allow us to process text-only data. One such tool is cat: it concatenates the contents of one or more text files.

In this episode …

We will learn several tools that will make us productive using only the UNIX shell at our fingertips: echo, wc, head, tail, cut, sort, uniq, grep. We will also learn how to use output redirection and pipes to combine these basic tools into useful new ones.

Background: Processing Spam Emails with UNIX Tools

In this lesson, we will be processing a large number of spam emails. Spam and other types of unwanted emails are not only an annoying problem in our digital lives: they can also present a threat to our data security, for example by delivering malicious links and/or maliciously malformed data to gain unauthorized access to our devices.

Mr. Holmes from the Deep Threat Research Group is studying the demographic statistics of spam emails collected from all over the world. He wants to collect the originating IP addresses of these emails, which he knows can be gathered from the emails themselves. We will learn how to do this in a later episode. For now, let us assume that the IP addresses have already been harvested, along with the countries associated with those IP addresses. Mr. Holmes has obtained massive tables in text format that look like this:

1998/03/890929468.24864.txt|204.31.253.89|US|United States
1998/03/890929472.24865.txt|153.37.75.113|CN|China
1998/03/890929475.24866.txt|153.37.88.4|CN|China
1998/03/890929479.24867.txt||Fail to get source IP|
1998/03/890929482.24868.txt|153.36.90.245|CN|China
1998/03/890929485.24869.txt|209.84.113.62|US|United States
1998/03/890929489.24870.txt|153.37.97.151|CN|China
1998/03/890929492.24871.txt|198.81.17.36|US|United States
1998/03/890929496.24872.txt|198.81.17.41|US|United States
1998/03/890929499.24873.txt|207.158.157.36|US|United States
...
1998/12/914438100.19914.txt|159.226.5.151|CN|China
1998/12/914561519.28497.txt|209.149.111.45|US|United States
1998/12/914690993.5712.txt|202.84.12.129|CN|China
1998/12/914945890.7710.txt|199.174.210.45|US|United States
1998/12/914949946.7826.txt|203.120.165.212|SG|Singapore
1998/12/914950939.7970.txt|198.54.223.1|ZA|South Africa
1998/12/914951833.7981.txt|204.177.236.127|US|United States
1998/12/915115558.12813.txt|206.175.96.141|US|United States
1998/12/915115559.12813.txt|206.175.96.141|US|United States
1998/12/915115560.12813.txt|206.175.101.93|US|United States

This table has four columns, which we shall give the following names:

  1. file name: the path of the email file;

  2. IP address: the originating IP address harvested from the email;

  3. country code: the two-letter code of the country associated with that IP address;

  4. country name: the full name of that country.

Each line forms a single record corresponding to a particular spam email. (The path of the file contains the year and month the spam was received, followed by a numeric sequence that forms the email file name.) The columns are separated by vertical bar (|) characters.

Investigative Questions

The results in the table above are not that insightful yet. Mr. Holmes wants to further analyze these to obtain some insight. He came up with the following questions.

  1. For a given year Y (say, Y=1998), how many spam emails came from each country X? Sort these counts by the number of emails per country to find the ten leading countries from which spam was sent out.

  2. Are there “major spamming centers”, defined as IP addresses which come up most frequently in the table above?

  3. With respect to the major contributing countries, is there a trend observed across the years? For example, if the U.S. turned out to be the top spam contributor in 1998, was it still number one in 2008?

To answer these questions, we have to drill deep into the data. Let us learn some nifty UNIX tools to help Mr. Holmes answer them!

Go to Your results Folder

Please go to the ~/CItraining/module-hpc/results directory where you will be working throughout this episode. It contains the harvested IP addresses which are suspected to be the origins of the spam emails from years 1998 through 2000:

$ cd ~/CItraining/module-hpc/results
$ ls -l
-rw-r--r-- 1 USER_ID users  62003 Sep 12 13:29 1998.dat
-rw-r--r-- 1 USER_ID users  73854 Sep 12 14:15 1999.dat
-rw-r--r-- 1 USER_ID users 168175 Sep 12 14:15 2000.dat
-rw-r--r-- 1 USER_ID users  12580 Sep 12 13:29 countries-1998.txt
-rw-r--r-- 1 USER_ID users  15131 Sep 12 14:15 countries-1999.txt
-rw-r--r-- 1 USER_ID users  34792 Sep 12 14:15 countries-2000.txt
-rw-r--r-- 1 USER_ID users  14657 Sep 12 13:29 ip-1998.txt
-rw-r--r-- 1 USER_ID users  17979 Sep 12 14:15 ip-1999.txt
-rw-r--r-- 1 USER_ID users  40227 Sep 12 14:15 ip-2000.txt
-rwxr-xr-x 1 USER_ID users    270 Sep 12 14:14 make-1998.sh
-rwxr-xr-x 1 USER_ID users    270 Sep 12 14:14 make-1999.sh
-rwxr-xr-x 1 USER_ID users    341 Sep 12 14:15 make-2000.sh

For example, the 1998.dat file contains the harvested IP addresses and the associated countries for the 1998 spams, as described earlier.

When you receive a new dataset, it is always good to ask some initial questions, even before doing any analysis:

  1. What is the size of the dataset?

  2. If the dataset is in tabular format, how many rows are there?

  3. What does the dataset look like?

We will answer these initial questions as we introduce the individual commands. Then we will make these commands work together to answer Mr. Holmes’ investigative questions posed earlier.

Printing Out a Message

The UNIX shell provides an easy way to output (print out) a message:

$ echo 'Hello, world!'
Hello, world!

(The message is enclosed in single quotes because it contains whitespace and an exclamation mark, which have special meaning to the shell.) This simple capability is very useful when we create a script to reuse a sequence of UNIX commands, which we will elaborate on in an upcoming episode.

wc — Counting Lines, Words, Bytes

The wc command counts the number of lines, words, and bytes in the given files (from left to right, in that order). Let’s try this now:

$ wc 1998.dat
1097  2052 62003 1998.dat

There are 1097 lines, 2052 words (separated by whitespace characters), and 62003 bytes in 1998.dat.

Exercise

Find out the lines/words/bytes statistics for all the *.dat files in the results directory.
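
One way to do this (a sketch; it assumes your current working directory is still the results directory) is to pass a shell wildcard to wc so that it reports on all three files at once, followed by a total line (output not shown here):

$ wc *.dat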

If we run wc -l instead of just wc, the output shows only the number of lines per file:

$ wc -l 1998.dat 1999.dat 2000.dat
  1097 1998.dat
  1309 1999.dat
  2872 2000.dat
  5278 total

The last line shows the total number of lines of all the files. This only shows up when more than one file is fed to wc.

Activity: Extra Options for wc

We already tried out wc with the -l option. Now let’s try some other options.

  • wc -c
  • wc -m
  • wc -w
  • wc -L

Could you explain what the flags mean? What are they short for?

Hint: Read the wc manual page or the help provided by wc --help.

Selecting Parts of Text Data

Frequently, Mr. Holmes wants to peek into parts of a text file to see what the data looks like, but without printing the whole file onto the terminal. Imagine printing and reading through 20,000 lines!

head and tail — Output the Lines at the Top or Bottom of a File

The head command displays the first N lines of a text file. The converse is the tail command: It displays the last N lines of a file. By default, N is 10.

Let’s take a peek at the beginning of the 1998.dat:

$ head 1998.dat
1998/03/890929468.24864.txt|204.31.253.89|US|United States
1998/03/890929472.24865.txt|153.37.75.113|CN|China
1998/03/890929475.24866.txt|153.37.88.4|CN|China
1998/03/890929479.24867.txt||Fail to get source IP|
1998/03/890929482.24868.txt|153.36.90.245|CN|China
1998/03/890929485.24869.txt|209.84.113.62|US|United States
1998/03/890929489.24870.txt|153.37.97.151|CN|China
1998/03/890929492.24871.txt|198.81.17.36|US|United States
1998/03/890929496.24872.txt|198.81.17.41|US|United States
1998/03/890929499.24873.txt|207.158.157.36|US|United States

Hmmm…the dataset begins with March 1998 emails.

Let’s try tail to check out the end of the table:

$ tail 1998.dat
1998/12/914438100.19914.txt|159.226.5.151|CN|China
1998/12/914561519.28497.txt|209.149.111.45|US|United States
1998/12/914690993.5712.txt|202.84.12.129|CN|China
1998/12/914945890.7710.txt|199.174.210.45|US|United States
1998/12/914949946.7826.txt|203.120.165.212|SG|Singapore
1998/12/914950939.7970.txt|198.54.223.1|ZA|South Africa
1998/12/914951833.7981.txt|204.177.236.127|US|United States
1998/12/915115558.12813.txt|206.175.96.141|US|United States
1998/12/915115559.12813.txt|206.175.96.141|US|United States
1998/12/915115560.12813.txt|206.175.101.93|US|United States

The records in this file are in chronological order, so you can see the latest date stamps at the bottom of the file.

We can specify how many lines we would like to get with head and tail, using the -n option. Here is an example:

$ head -n 20 1998.dat
1998/03/890929468.24864.txt|204.31.253.89|US|United States
1998/03/890929472.24865.txt|153.37.75.113|CN|China
1998/03/890929475.24866.txt|153.37.88.4|CN|China
1998/03/890929479.24867.txt||Fail to get source IP|
1998/03/890929482.24868.txt|153.36.90.245|CN|China
1998/03/890929485.24869.txt|209.84.113.62|US|United States
1998/03/890929489.24870.txt|153.37.97.151|CN|China
1998/03/890929492.24871.txt|198.81.17.36|US|United States
1998/03/890929496.24872.txt|198.81.17.41|US|United States
1998/03/890929499.24873.txt|207.158.157.36|US|United States
1998/03/890929562.24883.txt|208.29.152.2|US|United States
1998/03/890929566.24884.txt|210.61.114.1|TW|Taiwan, Province of China
1998/03/890929569.24885.txt|206.175.229.130|US|United States
1998/03/890929572.24886.txt|207.115.33.46|US|United States
1998/03/890956849.27937.txt||Malformed IP address|
1998/03/891002827.28090.txt||Fail to get source IP|
1998/03/891020025.3222.txt|206.212.231.88|US|United States
1998/03/891020028.3223.txt|206.175.101.79|US|United States
1998/03/891020032.3224.txt|205.232.128.185|US|United States
1998/03/891020035.3225.txt|205.232.128.185|US|United States

EXERCISE: Do the same for tail.
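
For instance, the analogous tail command below prints the last 20 records (output not shown here):

$ tail -n 20 1998.dat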

cut — Extracting Sections from Text Lines

Suppose Mr. Holmes now just wants to extract the IP addresses from the table above. This is easy to do with the cut command. cut selects, or "cuts out", certain sections of each line in the input file. It works very well for tabular data like our 1998.dat.

By default, cut expects the items (also called fields or columns) on each line to be separated by the Tab character. A character used in this way is called a delimiter. You can use the -d option to specify a custom delimiter. We need to use the -f option to specify which column (field) to pick. The IP address lies in the second column of every line. Let’s try this now:

$ cut -d "|" -f 2 2000.dat
63.17.146.248
63.38.73.109
63.38.73.109
200.28.31.2
... (lines omitted) ...
$ cut -d "|" -f 3 2000.dat
US
US
US
CL
... (lines omitted) ...

EXERCISE: Extract the countries from the table.
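
A possible answer, assuming "countries" here means the full country names in the fourth field (output not shown here):

$ cut -d "|" -f 4 2000.dat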

Redirection, Filters, Pipes

The tools above are cool, but quite often we want to save the output to a file. How can we do this? In this subsection, we introduce output redirection using the > and >> operators. Then we will introduce the UNIX pipe, which is a special type of redirection: it connects the output of one program to the input of another program. This is a very handy feature of UNIX that enables us to build complex pipelines to process our data.

The > and >> Operators — Output Redirection

The > operator is used to redirect the output of a command to a file. By default, a program’s output is printed to your terminal; that’s why you can see it at all. The syntax for the > operator is as follows:

COMMAND [ARGUMENTS]... > OUTPUT_FILE

Suppose we have this command:

$ echo Hello world.
Hello world.

To save this output to hello.txt, we add a few bits to the command line:

$ echo Hello world. > hello.txt

This second echo didn’t print anything to the terminal, because we redirected its output to a file named hello.txt. Check the (new) contents of hello.txt using the cat command.

The > operator does two things:

  1. it creates the output file if it does not exist yet (or empties it if it already exists);

  2. it sends the command’s output into that file instead of printing it to the terminal.
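
We can see the emptying (overwriting) behavior with a quick experiment. (This is just a sketch; demo.txt is a throwaway file name, not part of the dataset.)

$ echo first line > demo.txt
$ echo second line > demo.txt
$ cat demo.txt
second line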

What if hello.txt already exists, and we do not want to delete its existing contents? Rather, we want to append the command output to hello.txt? In this situation, we need to use the >> operator instead:

$ echo hello again >> hello.txt
$ cat hello.txt
Hello world.
hello again

Creating IP List and Country List

From the original 1998.dat data file,

  • extract all the IP addresses into a file named ip-1998.txt;

  • extract all the countries (full names, not two-letter codes) into a file named countries-1998.txt.

Solution

$ cut -d "|" -f 2 1998.dat > ip-1998.txt
$ cut -d "|" -f 4 1998.dat > countries-1998.txt

Do the same for years 1999 and 2000.
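
For 1999, for instance, the commands are direct analogues of the ones above (the 2000 case is identical apart from the year):

$ cut -d "|" -f 2 1999.dat > ip-1999.txt
$ cut -d "|" -f 4 1999.dat > countries-1999.txt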

The | Operator — Pipes

What if we want to get the first 5 countries listed in 1998.dat? In this case, we actually need to do two steps:

  1. extract the country column (the fourth field) using cut;

  2. keep only the first 5 lines of that output using head.

The pipe operator (|) can combine the two commands into one long command:

$ cut -d "|" -f 4 1998.dat  |  head -n 5
United States
China
China

China

Do not confuse the first vertical bar (which is quoted and therefore a literal string passed to cut as the delimiter) with the actual pipe operator (the second, unquoted vertical bar).

A pipeline can be arbitrarily long: a chain of three or more commands is not unusual. Furthermore, the output of the last command can be redirected to a file:

$ cut -d "|" -f 4 1998.dat  |  head -n 5  >  first-five.txt

UNIX Tools as Filters

Many UNIX tools share a common characteristic: they read input from one or more files and write output to the standard output. These include cat, head, tail, cut, as well as tools we will introduce shortly: sort and grep. Did you notice that in the last example, the head command was used without specifying an input filename? In this case, head reads from the standard input. (Without the pipe operator, standard input is read from the terminal, i.e., from the keyboard.) The head command still performs the very same action as with an input file, i.e. printing the first N lines. Programs that (can) read input from standard input and print the processed output to standard output are often called filters.

In daily life, the term “filter” refers to a process or device that does not allow unwanted elements to pass through (think of an air filter or a water filter). A UNIX filter may do more than just that standard “filtering” action. Some filters transform the input into a different kind of output; sort, which we will learn shortly, is an example of this. Still other filters perform statistical or aggregating operations; wc is a good example.

UNIX filters, combined with the pipe operator, allow multiple tools to be chained to form a pipeline: the data flows from one tool to the next, and each filter processes the data in a particular way. We will see and use this feature repeatedly soon.
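
As a small illustration of such a chain (one possible sketch using only tools already introduced), the following strings three filters together: cut extracts the country column, head keeps the first 20 entries, and wc -l counts how many lines remain:

$ cut -d "|" -f 4 1998.dat | head -n 20 | wc -l
20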

Sorting

The sort command does what it says: it reads text lines from one or more files, sorts the lines, and prints the sorted data. sort has many options to tweak how the lines are sorted. Some notable examples:

  • -n sorts numerically instead of alphabetically;

  • -r sorts in reverse (descending) order;

  • -u prints only one copy of each distinct line;

  • -k sorts by a particular field (column) of each line.

Please refer to the sort manual page for more details.

Let’s try some sort commands:

$ sort ip-1998.txt
... (lots of blank lines) ...
000.000.000.000
10.200.50.100
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
... (lines omitted) ...
38.30.134.7
38.30.141.28
38.30.22.156
38.30.22.202
38.30.22.93
38.9.32.2
4.12.29.235
4.4.18.88

Notice Anything Interesting?

ip-1998.txt contains the harvested IP addresses from the 1998 spam emails. Do you notice anything interesting by sorting these IP addresses? Use less to examine the sorted IP addresses and discuss your observation with someone near you.

Solution

Here are some sample observations; you may notice additional ones.

  1. There are a lot of blank lines! These correspond to the failure to harvest any meaningful IP address from some emails. (This is due to a limitation in the current email analysis tool, which is beyond the scope of this training. Interested readers can find this analysis tool in the Spam_analyser directory.)

  2. There are some IP addresses that were identified as the origin of multiple spam emails. In the result shown above, we notice that 12.14.24.5 appears nine times in the sorted list. That is not the top sending IP address, by the way. Can you find the most used IP address?

  3. The sort command did not sort the IP addresses in the numerical order that we intuitively expect.

Determination of Ordering

By default, sort uses the alphabetical order, also known as lexicographic order, to determine how to compare and order the strings (i.e. the contents of the text lines). In a nutshell, lexicographic ordering is the natural ordering of texts as we usually encounter in dictionaries or indexes.

Determining How Characters Are Sorted

In computers, characters are actually represented (encoded) as numbers. These numbers are called the ordinal values of the characters. The most widely used representations today are ASCII and Unicode. ASCII is an old standard in which only 128 (or, in extended variants, 256) different characters can be represented. The first 128 characters defined by ASCII remain a universal standard even today. Of notable interest:

character(s)             number representation in ASCII
tab                      9
new line (line feed)     10
space (white space)      32
0 – 9                    48 – 57
A – Z                    65 – 90
a – z                    97 – 122

Please consult the ASCII character table to learn the encoding of characters in ASCII. This number representation is used to determine the order of characters when sorting text. For A through Z, it gives the natural order as we know it (e.g. bargain is placed before begin). But it also means that lowercase letters are placed after the uppercase letters (so Begin comes before bargain), and that digits are placed before letters. Another consequence of lexicographic ordering is that the character sequence 100 is placed before 9, because the character 1 comes before 9: the numerical value does not matter.

This explains why 38.9.32.2 appears before 4.12.29.235, and 4.12.29.235 before 4.4.18.88 in the sort output above.

Changing Sort Order

The -n option alters sort’s behavior: when it encounters digits (e.g. 5, 74, 235), that part of the text is converted into a numerical value, which is then compared to determine the ordering of the data.

The -r option will cause sort to print the data in descending order.

Let’s observe how the output changes when using sort -n:

$ sort -n ip-1998.txt
... (lots of blank lines) ...
000.000.000.000
4.12.29.235
4.4.18.88
10.200.50.100
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
... (lines omitted) ...
210.225.159.74
210.34.0.18
210.34.0.18
210.61.114.1
210.61.114.1
210.69.7.197
216.0.22.11
226.232.201.8

Also observe the change in the output when you use the -r option.
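
For instance, combining both options and piping the result into head shows the ten numerically largest entries first (output not shown here):

$ sort -n -r ip-1998.txt | head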

uniq — Unique Line Matching

The sort command does a nice job ordering IP addresses, but we notice a lot of duplicates. The uniq command is useful for removing duplicates: it prints the lines of a text file, skipping repeated lines. It does so by detecting duplicate lines that are adjacent. (This adjacency requirement is an important feature of uniq, as we shall see.) Let’s try this now:

$ uniq countries-1998.txt
United States
China

China
United States
China
United States
Taiwan, Province of China
United States

United States
Taiwan, Province of China
United States

United States
Lebanon
Canada
United States
Thailand
United States
... (lines omitted) ...

If uniq is supposed to print unique lines, why were some countries still mentioned several times?

What Happened?

Compare the contents of the original countries-1998.txt file to the output of uniq countries-1998.txt command above. Are they the same? If they are different, in what way?

Solution

It turns out that uniq only eliminates duplicates for identical lines that are located next to each other. So for text data that looks like this:

United States
China
China

China
United States
China
United States
United States
United States
United States
Taiwan, Province of China
United States
United States

uniq will delete the first duplicate of China (line 3), as well as duplicates of United States in lines 9–11 and line 14. This behavior explains the output of uniq earlier.

What if we really want to remove all duplicates anywhere in the file? To do so, we have to place all the duplicates next to each other—and we have seen that sort will do exactly this! Here, the UNIX pipe comes in handy:

$ sort countries-1998.txt | uniq

-
Argentina
Australia
Austria
Bangladesh
Belgium
Bolivia, Plurinational State of
Brazil
Brunei Darussalam
... (lines omitted) ...
Spain
Sweden
Switzerland
Syrian Arab Republic
Taiwan, Province of China
Thailand
Turkey
United Kingdom
United States
Venezuela, Bolivarian Republic of

The uniq command also has a -c option, which gives a count of the number of times each line occurs in its input. The last command yielded the list of originating countries for the 1998 spam emails, but it did not tell us how many emails were sent from each country. uniq -c will do this for us:

$ sort countries-1998.txt | uniq -c
  65
  10 -
   5 Argentina
   4 Australia
   1 Austria
   1 Bangladesh
   1 Belgium
   1 Bolivia, Plurinational State of
   3 Brazil
   1 Brunei Darussalam
... (lines omitted) ...
   5 Spain
   4 Sweden
   4 Switzerland
   1 Syrian Arab Republic
  10 Taiwan, Province of China
   5 Thailand
   4 Turkey
  22 United Kingdom
 669 United States
   1 Venezuela, Bolivarian Republic of

The Top Spamming Countries

Now we are in a position to generate a list of the top countries which sent the most spam emails in a given year. Please devise a UNIX pipeline to generate the list of the “top 10 spamming countries” for the 1998 spam emails.

Hint: The last pipeline is already halfway there. You only need to pipe the uniq -c output through two more filters. Pick these from the tools we have already learned.

Solution

$ sort countries-1998.txt | uniq -c | sort -r -n | head -n 10

The Top Spamming Centers

Create a similar pipeline to determine the top 10 IP addresses that sent the most spam emails in 1998. Use ip-1998.txt as the input file.

To make this task slightly more challenging, you can start from 1998.dat instead of ip-1998.txt.
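
One possible pipeline, starting from 1998.dat (starting from ip-1998.txt instead simply drops the cut stage):

$ cut -d "|" -f 2 1998.dat | sort | uniq -c | sort -r -n | head -n 10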

Spam Sending Trend by the Years

To answer the third question asked at the beginning of this episode (“is there a trend observed across the years?”), one must repeat the analysis above for multiple years and then tabulate the results for the different years.
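
One way to avoid retyping the pipeline for each year is a shell loop, sketched below; shell loops are not covered in this episode, so treat this purely as a preview of what is possible:

$ for Y in 1998 1999 2000; do echo "== $Y =="; sort countries-$Y.txt | uniq -c | sort -r -n | head -n 5; done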

uniq also has other switches that can be useful at times: -u to print only unique lines (those without adjacent duplicates), and -d to print only duplicate lines. Please refer to the documentation (man page) for more details, and give them a try using our data files.
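
For example, the following prints one copy of each country name that appears more than once in the 1998 list (output not shown here):

$ sort countries-1998.txt | uniq -d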

Searching and Filtering

grep — Searching a File for Certain Content

The grep command is useful for searching and printing lines from a text file that match a given pattern. In this sense, grep is both a search tool as well as a filter.

grep uses a powerful pattern-matching language called regular expressions, which can match not only literal substrings but also a wide variety of patterns (e.g. digits, alphabetic letters, arbitrary characters, repetitions of certain types of characters or strings, etc.).

We can use grep to return certain lines of a file (or files) without looking for them with a GUI editor’s search tool. This can be very useful, particularly when we need to search for a word, phrase, or pattern in multiple files.

Please follow along:

$ grep Republic countries-1998.txt
Korea, Republic of
Korea, Republic of
Syrian Arab Republic
Venezuela, Bolivarian Republic of
Korea, Republic of
Korea, Republic of
Dominican Republic
Korea, Republic of
Korea, Republic of

Combined Exercises

  1. How do you find the unique country names that contain the word “Republic”? (One possible pipeline is sketched after this list.)

  2. How many spam emails were suspected to have originated in India? How about France?
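
A possible pipeline for the first question, reusing filters from earlier in this episode, followed by a rough count for the second question (note that a plain grep for India would also match any other country name containing that substring):

$ grep Republic countries-1998.txt | sort | uniq
$ grep India countries-1998.txt | wc -l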

EXERCISE: Try other phrases, such as: China, Japan, Indonesia, United, ia, Africa.

$ grep 206\.170 1998.dat
1998/03/891219236.5426.txt|206.170.68.60|US|United States
1998/05/895160496.11005.txt|206.170.31.182|US|United States
1998/05/895252660.20888.txt|206.170.31.182|US|United States
1998/05/896579625.31405.txt|206.170.185.101|US|United States

In this example we used grep to look for a specific IP address range prefix in the file 1998.dat. Please note the \ before the . in the grep command: it is used to escape the dot. Otherwise the . is treated as a wildcard and will match any character.

We can also use grep to search for a string of characters in many files using the command:

$ grep 206.170 *

We can specify that we only want to look for matches at the beginning or end of each line using ^ or $, respectively.

The below command looks for the string at the beginning of each line:

$ grep ^1998/03/891219 1998.dat
1998/03/891219128.5403.txt|206.175.103.56|US|United States
1998/03/891219139.5404.txt|205.184.187.47|US|United States
1998/03/891219144.5405.txt|205.139.129.162|US|United States
1998/03/891219148.5406.txt|208.17.113.108|US|United States
1998/03/891219152.5407.txt|207.159.82.7|US|United States
1998/03/891219156.5408.txt|207.105.189.121|US|United States
1998/03/891219203.5423.txt||Malformed IP address|
1998/03/891219210.5424.txt|207.115.33.229|US|United States
1998/03/891219215.5425.txt|209.152.84.95|US|United States
1998/03/891219236.5426.txt|206.170.68.60|US|United States
$ grep na$ 1998.dat

This command will look for the string na at the end of each line.

A very useful feature of grep is the ability to use it in conjunction with a pipe (|). We can pipe the output of one command into grep; in this way we can pull out only the information we are interested in.

Here is an example:

$ head -n 50 1998.dat | grep 206.212
1998/03/891020025.3222.txt|206.212.231.88|US|United States
1998/04/891608754.26624.txt|206.212.231.88|US|United States
1998/04/891661453.27625.txt|206.212.231.88|US|United States
1998/04/891661486.27627.txt|206.212.231.88|US|United States
1998/04/891665321.3293.txt|206.212.231.88|US|United States

This command does two things. First, it outputs the first 50 lines of the file 1998.dat. Then it pipes this output through grep and filters for lines containing 206.212 (where, again, . stands for any character).

Key Points

  • echo prints a message.

  • wc counts the number of lines, words, and bytes in a file.

  • head prints the first few lines of a text file.

  • tail prints the last few lines of a text file.

  • cut selects a particular column or columns of text data from a text file.

  • sort sorts lines of text.

  • uniq prints the unique lines of text.

  • grep filters lines of text matching a given text pattern.