Text Processing Tools & Pipeline
Overview
Teaching: 15 min
Exercises: 15 min

Questions
How do we process text-based information using UNIX tools?
How do we build a processing pipeline by combining UNIX tools?
Objectives
Learn basic UNIX tools for processing text data, such as filtering, sorting, and selecting data.
Learn how UNIX tools can be combined to make a pipeline.
In the previous episode,
we learned to use UNIX commands to manipulate files and directories.
However, UNIX also contains versatile tools
that allow us to process text-only data.
One such tool is cat:
it allows us to concatenate
the contents of one or more text files.
In this episode …
We will learn several tools that will make us productive using only the UNIX shell at our fingertips:
echo, wc, head, tail, cut, sort, uniq, grep.
We will also learn how to use output redirection and pipes to combine these basic tools into useful new ones.
Background: Processing Spam Emails with UNIX Tools
In this lesson, we will be processing a large number of spam emails. Spam and other types of unwanted email are not only an annoying problem in our digital lives: they can also present a threat to our data security by delivering malicious links and/or maliciously malformed data designed to gain unauthorized access to our devices.
Mr. Holmes from the Deep Threat Research Group is studying the demographic statistics of spam emails collected from all over the world. He wants to collect the originating IP addresses of these emails, which he knows can be gathered from the emails themselves. We will learn how to do this in a later episode. But for now, let us assume that the IP addresses have already been harvested, along with the countries associated with those IP addresses. Mr. Holmes has obtained massive tables in text format that look like this:
1998/03/890929468.24864.txt|204.31.253.89|US|United States
1998/03/890929472.24865.txt|153.37.75.113|CN|China
1998/03/890929475.24866.txt|153.37.88.4|CN|China
1998/03/890929479.24867.txt||Fail to get source IP|
1998/03/890929482.24868.txt|153.36.90.245|CN|China
1998/03/890929485.24869.txt|209.84.113.62|US|United States
1998/03/890929489.24870.txt|153.37.97.151|CN|China
1998/03/890929492.24871.txt|198.81.17.36|US|United States
1998/03/890929496.24872.txt|198.81.17.41|US|United States
1998/03/890929499.24873.txt|207.158.157.36|US|United States
...
1998/12/914438100.19914.txt|159.226.5.151|CN|China
1998/12/914561519.28497.txt|209.149.111.45|US|United States
1998/12/914690993.5712.txt|202.84.12.129|CN|China
1998/12/914945890.7710.txt|199.174.210.45|US|United States
1998/12/914949946.7826.txt|203.120.165.212|SG|Singapore
1998/12/914950939.7970.txt|198.54.223.1|ZA|South Africa
1998/12/914951833.7981.txt|204.177.236.127|US|United States
1998/12/915115558.12813.txt|206.175.96.141|US|United States
1998/12/915115559.12813.txt|206.175.96.141|US|United States
1998/12/915115560.12813.txt|206.175.101.93|US|United States
This table has four columns, which we shall give names:

filename – email’s file name
origin_ip – originating IP address
CC2 – two-letter country code
country – full country name
Each line forms a single record, and it corresponds to a particular spam email.
(The path of the file name contains the year and month the spam was received,
as well as a sequence of numbers as the email file name.)
The columns are separated by the vertical bar (|) character.
Investigative Questions
The results in the table above are not that insightful yet. Mr. Holmes wants to further analyze these to obtain some insight. He came up with the following questions.
1. For a given year Y (say, Y=1998), how many spam emails come from country X? Sort these by the number of emails per country to find the ten leading countries from which spam was sent out.

2. Are there “major spamming centers”, defined as IP addresses which come up most frequently in the table above?

3. With respect to the major contributing countries, is there a trend observed across the years? Say, if the U.S. turned out to be the top spam contributor in 1998, is it still number one in 2008?
To answer these questions, we have to drill deep into the data. Let us learn some nifty UNIX tools to help Mr. Holmes answer them!
Go to Your results Folder

Please go to the ~/CItraining/module-hpc/results directory, where you will be working throughout this episode. It contains the harvested IP addresses which are suspected to be the origins of the spam emails from years 1998 through 2000:

$ cd ~/CItraining/module-hpc/results
$ ls -l
-rw-r--r-- 1 USER_ID users  62003 Sep 12 13:29 1998.dat
-rw-r--r-- 1 USER_ID users  73854 Sep 12 14:15 1999.dat
-rw-r--r-- 1 USER_ID users 168175 Sep 12 14:15 2000.dat
-rw-r--r-- 1 USER_ID users  12580 Sep 12 13:29 countries-1998.txt
-rw-r--r-- 1 USER_ID users  15131 Sep 12 14:15 countries-1999.txt
-rw-r--r-- 1 USER_ID users  34792 Sep 12 14:15 countries-2000.txt
-rw-r--r-- 1 USER_ID users  14657 Sep 12 13:29 ip-1998.txt
-rw-r--r-- 1 USER_ID users  17979 Sep 12 14:15 ip-1999.txt
-rw-r--r-- 1 USER_ID users  40227 Sep 12 14:15 ip-2000.txt
-rwxr-xr-x 1 USER_ID users    270 Sep 12 14:14 make-1998.sh
-rwxr-xr-x 1 USER_ID users    270 Sep 12 14:14 make-1999.sh
-rwxr-xr-x 1 USER_ID users    341 Sep 12 14:15 make-2000.sh
For example, the
1998.dat
file contains the harvested IP addresses and the associated countries for the 1998 spams, as described earlier.
When you receive a new dataset, it is always good to ask some initial questions, even before doing any analysis:
- What is the size of the dataset?
- If the dataset is in tabular format, how many rows are there?
- What does the dataset look like?
We will answer these initial questions as we introduce the individual commands. Then we will make these commands work together to answer Mr. Holmes’ investigative questions posed earlier.
Printing Out a Message
The UNIX shell provides an easy way to output (print out) a message:
$ echo 'Hello, world!'
Hello, world!
(The message is enclosed in single quotes because it contains a whitespace character and an exclamation mark, which have special meaning to the shell.) This simple capability is very useful when we create a script to reuse a sequence of UNIX commands, which we will elaborate on in an upcoming episode.
wc — Counting Lines, Words, and Bytes
The wc
command counts the number of lines, words, and bytes in the given files
(from left to right, in that order).
Let’s try this now:
$ wc 1998.dat
1097 2052 62003 1998.dat
There are 1097 lines,
2052 words (separated by whitespace characters),
and 62003 bytes in 1998.dat.
Exercise

Find out the lines/words/bytes statistics for all the *.dat files in the results directory.
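A possible solution, using the shell’s wildcard expansion from the previous episode:

$ wc *.dat    # the shell expands *.dat to 1998.dat 1999.dat 2000.dat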
If we run wc -l instead of just wc,
the output shows only the number of lines per file:
$ wc -l 1998.dat 1999.dat 2000.dat
1097 1998.dat
1309 1999.dat
2872 2000.dat
5278 total
The last line shows the total number of lines of all the files.
This only shows up when more than one file is fed to wc.
Activity: Extra Options for wc

We already tried out wc with the -l option. Now let’s try some other options:

wc -c
wc -m
wc -w
wc -L

Can you explain what these flags mean? What are they short for?

Hint: Read the wc manual page or the help provided by wc --help.
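For reference, here is a sketch of these options as implemented in GNU coreutils (BSD variants may differ slightly; -L in particular is a GNU extension):

$ wc -c 1998.dat    # -c: count bytes
$ wc -m 1998.dat    # -m: count characters (differs from -c in multibyte locales)
$ wc -w 1998.dat    # -w: count whitespace-separated words
$ wc -L 1998.dat    # -L: report the length of the longest line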
Selecting Parts of Text Data
Frequently, Mr. Holmes wants to peek into parts of a text file to see what the data looks like, without printing the whole file onto the terminal. Imagine printing and reading through 20,000 lines!
head and tail — Output the Lines at the Top or Bottom of a File
The head command displays the first N lines of a text file.
The converse is the tail command: it displays the last N lines of a file.
By default, N is 10.
Let’s take a peek at the beginning of 1998.dat:
$ head 1998.dat
1998/03/890929468.24864.txt|204.31.253.89|US|United States
1998/03/890929472.24865.txt|153.37.75.113|CN|China
1998/03/890929475.24866.txt|153.37.88.4|CN|China
1998/03/890929479.24867.txt||Fail to get source IP|
1998/03/890929482.24868.txt|153.36.90.245|CN|China
1998/03/890929485.24869.txt|209.84.113.62|US|United States
1998/03/890929489.24870.txt|153.37.97.151|CN|China
1998/03/890929492.24871.txt|198.81.17.36|US|United States
1998/03/890929496.24872.txt|198.81.17.41|US|United States
1998/03/890929499.24873.txt|207.158.157.36|US|United States
Hmmm…the dataset begins with March 1998 emails.
Let’s try tail
to check out the end of the table:
$ tail 1998.dat
1998/12/914438100.19914.txt|159.226.5.151|CN|China
1998/12/914561519.28497.txt|209.149.111.45|US|United States
1998/12/914690993.5712.txt|202.84.12.129|CN|China
1998/12/914945890.7710.txt|199.174.210.45|US|United States
1998/12/914949946.7826.txt|203.120.165.212|SG|Singapore
1998/12/914950939.7970.txt|198.54.223.1|ZA|South Africa
1998/12/914951833.7981.txt|204.177.236.127|US|United States
1998/12/915115558.12813.txt|206.175.96.141|US|United States
1998/12/915115559.12813.txt|206.175.96.141|US|United States
1998/12/915115560.12813.txt|206.175.101.93|US|United States
The records in this file are in chronological order, so you can see the latest date stamps at the bottom of the file.
We can specify how many lines we would like to get with head and tail,
using the -n option.
Here is an example:
$ head -n 20 1998.dat
1998/03/890929468.24864.txt|204.31.253.89|US|United States
1998/03/890929472.24865.txt|153.37.75.113|CN|China
1998/03/890929475.24866.txt|153.37.88.4|CN|China
1998/03/890929479.24867.txt||Fail to get source IP|
1998/03/890929482.24868.txt|153.36.90.245|CN|China
1998/03/890929485.24869.txt|209.84.113.62|US|United States
1998/03/890929489.24870.txt|153.37.97.151|CN|China
1998/03/890929492.24871.txt|198.81.17.36|US|United States
1998/03/890929496.24872.txt|198.81.17.41|US|United States
1998/03/890929499.24873.txt|207.158.157.36|US|United States
1998/03/890929562.24883.txt|208.29.152.2|US|United States
1998/03/890929566.24884.txt|210.61.114.1|TW|Taiwan, Province of China
1998/03/890929569.24885.txt|206.175.229.130|US|United States
1998/03/890929572.24886.txt|207.115.33.46|US|United States
1998/03/890956849.27937.txt||Malformed IP address|
1998/03/891002827.28090.txt||Fail to get source IP|
1998/03/891020025.3222.txt|206.212.231.88|US|United States
1998/03/891020028.3223.txt|206.175.101.79|US|United States
1998/03/891020032.3224.txt|205.232.128.185|US|United States
1998/03/891020035.3225.txt|205.232.128.185|US|United States
EXERCISE: Do the same for tail
.
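A possible answer, mirroring the head example above:

$ tail -n 20 1998.dat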
cut — Extracting Sections from Text Lines
Suppose Mr. Holmes now just wants to extract the IP addresses from the table above.
This is easy to do with the cut command.
cut selects, or “cuts out”, certain sections of each line in the input file.
It works very well for tabular data like our 1998.dat.
By default, cut expects the items (also called fields or columns) on each line
to be separated by the Tab character.
A character used in this way is called a delimiter.
You can use the -d option to specify a custom delimiter.
We need to use the -f option to specify which column to pick.
The IP address lies in the second column of every line.
Let’s try this now:
$ cut -d "|" -f 2 2000.dat
63.17.146.248
63.38.73.109
63.38.73.109
200.28.31.2
... (lines omitted) ...

$ cut -d "|" -f 3 2000.dat
US
US
US
CL
... (lines omitted) ...
EXERCISE: Extract the countries from the table.
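A possible answer (the full country name is the fourth field; here using 2000.dat as in the examples above):

$ cut -d "|" -f 4 2000.dat
United States
United States
United States
Chile
... (lines omitted) ...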
Redirection, Filters, Pipes
The tools above are cool, but quite often we want to save the output to a file.
How can we do this?
In this subsection, we introduce output redirection
using the > and >> operators.
Then we will introduce the concept of the UNIX pipe, which
is a special type of redirection:
it connects the output of one program to the input of another program.
This is a very handy feature in UNIX that enables us to build complex pipelines
to process our data.
The > and >> Operators — Output Redirection
The > operator is used to redirect the output of a command to a file.
By default, a program’s output is printed to your terminal; that’s why
you can see it at all.
The syntax for the > operator is as follows:
COMMAND [ARGUMENTS]... > OUTPUT_FILE
Suppose we have this command:
$ echo Hello world.
Hello world.
To save this output to hello.txt, we add a few bits to the command line:
$ echo Hello world. > hello.txt
This second echo didn’t print anything to the terminal,
because we redirected its output to a file named hello.txt.
Check the (new) contents of hello.txt using the cat command.
The > operator does two things:

- It creates a file named hello.txt if it doesn’t exist; otherwise, it overwrites the existing output file.
- It replaces the contents of hello.txt with the output of the command on the left-hand side of the > operator.
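A quick demonstration of this overwriting behavior, using a throwaway file (scratch.txt is a hypothetical name of our own choosing, not part of the dataset):

$ echo first version > scratch.txt
$ echo second version > scratch.txt
$ cat scratch.txt
second version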
What if hello.txt already exists, and we do not want to delete its existing contents?
Rather, what if we want to append the command output to hello.txt?
In this situation, we need to use the >>
operator instead:
$ echo hello again >> hello.txt
$ cat hello.txt
Hello world.
hello again
Creating IP List and Country List

From the original 1998.dat data file:

- extract all the IP addresses into a file named ip-1998.txt;
- extract all the countries (full names, not two-letter codes) into a file named countries-1998.txt.

Solution

$ cut -d "|" -f 2 1998.dat > ip-1998.txt
$ cut -d "|" -f 4 1998.dat > countries-1998.txt
Do the same for years 1999 and 2000.
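Following the same pattern (a sketch, assuming the 1999 and 2000 tables use the same four-column layout):

$ cut -d "|" -f 2 1999.dat > ip-1999.txt
$ cut -d "|" -f 4 1999.dat > countries-1999.txt
$ cut -d "|" -f 2 2000.dat > ip-2000.txt
$ cut -d "|" -f 4 2000.dat > countries-2000.txt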
The | Operator — Pipes
What if we want to get the first 5 countries listed in 1998.dat?
In this case, we actually need to do two steps:
1. Extract the countries using the cut command.
2. Print the short list using head.
The pipe operator (|) can combine the two commands into one long command:
$ cut -d "|" -f 4 1998.dat | head -n 5
United States
China
China

China
(The blank fourth line corresponds to the record whose source IP could not be harvested.)
Do not confuse the first vertical bar (which is quoted, and therefore a literal string: the field delimiter) with the actual pipe operator (the second, unquoted vertical bar).
The pipe can be arbitrarily long: a chain of three or more commands is not unusual. Furthermore, the output from the last command can be redirected to a file:
$ cut -d "|" -f 4 1998.dat | head -n 5 > first-five.txt
UNIX Tools as Filters
Many UNIX tools share a common characteristic: they read input from one or more files and write output to the standard output. These include cat, head, tail, cut, as well as tools we will introduce shortly: sort, grep.

Did you notice that in the last example, the head command was used without specifying the input filename? In this case, head acts as a filter by reading from the standard input. (Without the pipe operator, standard input is read from the terminal, i.e., from the keyboard.) The head command still performs the very same action as with an input file, i.e. it prints the first N lines. Programs that (can) read input from standard input and print the processed output to standard output are often called filters.

In daily life, the term “filter” refers to a process or device by which unwanted elements are not allowed to pass through (think of an air filter or water filter). A UNIX filter may do more than just that standard “filtering” action. Some filters may instead transform the input into a different kind of output; sort, which we will learn shortly, can be considered an example of this. Still other filters may perform statistical or aggregating operations; wc is a good example.

UNIX filters, combined with the pipe operator, allow multiple tools to be chained to form a pipeline: the data flows from one tool to another, where each filter processes the data in a particular way. We will see and use this feature repeatedly soon.
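To see the filter behavior in isolation, compare these two invocations (a small sketch; both report the same line count, though the first also prints the file name):

$ wc -l 1998.dat          # wc reads from the named file
1097 1998.dat
$ cat 1998.dat | wc -l    # wc acts as a filter, reading from standard input
1097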
Sorting
The sort
command does what it says:
it reads text lines from one or more files,
sorts the lines, and prints the sorted data.
sort has many options to tweak how the lines are sorted.
Some notable examples:

- -r specifies sorting in descending order (by default, sort uses ascending order);
- -k determines which column(s) are used to determine the sort order;
- -n requests that values be sorted according to their numerical values instead of lexicographic order.

Please refer to the sort manual page for more details.
Let’s try some sort
commands:
$ sort ip-1998.txt
... (lots of blank lines) ...
000.000.000.000
10.200.50.100
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
... (lines omitted) ...
38.30.134.7
38.30.141.28
38.30.22.156
38.30.22.202
38.30.22.93
38.9.32.2
4.12.29.235
4.4.18.88
Notice Anything Interesting?

ip-1998.txt contains the harvested IP addresses from the 1998 spam emails. Do you notice anything interesting by sorting these IP addresses? Use less to examine the sorted IP addresses and discuss your observations with someone near you.

Solution

Here are some sample observations. You may notice additional ones.

- There are a lot of blank lines! These correspond to the failure to harvest any meaningful IP address from some emails. (This is due to a limitation in the current email analysis tool, which is beyond the scope of this training. Interested readers can find this analysis tool in the Spam_analyser directory.)
- Some IP addresses were identified as the origins of multiple spams. In the result shown above, we notice that 12.14.24.5 appears nine times in the sorted list. That is not the top sending IP address, by the way. Can you find the most used IP address?
- The sort command did not sort the IP addresses in the numerical order that we intuitively expect.
Determination of Ordering
By default, sort uses alphabetical order, also known as
lexicographic order,
to determine how to compare and order the strings
(i.e. the contents of the text lines).
In a nutshell, lexicographic ordering is the natural ordering of text
that we usually encounter in dictionaries or indexes.
Determining How Characters Are Sorted
In computers, characters are actually represented (encoded) as numbers. These numbers are called the ordinal values of the characters. The most widely used representations today are ASCII and Unicode. ASCII is an old standard in which only 128–256 different characters can be represented. The first 128 characters defined by ASCII remain a universal standard even today. Of notable interest:

character(s)           number representation in ASCII
tab                    9
new line (line feed)   10
white space            32
0–9                    48–57
A–Z                    65–90
a–z                    97–122

Please consult the ASCII character table to learn the encoding of characters in ASCII. The number representation above is used to determine the order of characters when sorting text. For A through Z, it represents the natural order as we know it (e.g. bargain is placed before begin). But it also means that lowercase characters are placed after the uppercase characters (Begin comes before bargain), and that numbers are placed before letters. Another consequence of lexicographic ordering is that the character sequence 100 will be placed before 9, because 1 is placed before 9; the numerical values do not matter.

This explains why 38.9.32.2 appears before 4.12.29.235, and 4.12.29.235 before 4.4.18.88, in the sort output above.
Changing Sort Order
The -n option alters sort’s behavior:
when encountering digits (e.g. 5, 74, 235),
that part of the text is converted into a numerical value,
which is then compared to determine the ordering of the data.
The -r
option will cause sort
to print the data in descending order.
Let’s observe how the output changes when using sort -n
:
$ sort -n ip-1998.txt
... (lots of blank lines) ...
000.000.000.000
4.12.29.235
4.4.18.88
10.200.50.100
12.14.24.5
12.14.24.5
12.14.24.5
12.14.24.5
... (lines omitted) ...
210.225.159.74
210.34.0.18
210.34.0.18
210.61.114.1
210.61.114.1
210.69.7.197
216.0.22.11
226.232.201.8
Also observe the change in the output when you use the -r
option.
uniq — Unique Line Matching
The sort
command does a nice job ordering IP addresses,
but we notice a lot of duplicates.
The uniq command is useful for removing duplicates:
it prints the lines of a text file with repeated lines collapsed.
It does so by detecting duplicate lines that are adjacent.
(This adjacency requirement is an important feature of uniq.)
Let’s try this now:
$ uniq countries-1998.txt
United States
China
China
United States
China
United States
Taiwan, Province of China
United States
United States
Taiwan, Province of China
United States
United States
Lebanon
Canada
United States
Thailand
United States
... (lines omitted) ...
If uniq
is supposed to print unique lines,
why were some countries still mentioned several times?
What Happened?

Compare the contents of the original countries-1998.txt file to the output of the uniq countries-1998.txt command above. Are they the same? If they are different, in what way?

Solution

It turns out that uniq only eliminates duplicates for identical lines that are located next to each other. So for text data that looks like this:

United States
China
China
China
United States
China
United States
United States
United States
United States
Taiwan, Province of China
United States
United States

uniq will delete the adjacent duplicates of China (lines 3–4), the run of duplicate United States lines (lines 8–10), and the final line (a duplicate of line 12). This behavior explains the output of uniq earlier.
What if we really want to remove all duplicates anywhere in the file?
To do so, we have to place all the duplicates next to each other—and
we have seen that sort
will do exactly this!
Here, the UNIX pipe comes in handy:
$ sort countries-1998.txt | uniq
-
Argentina
Australia
Austria
Bangladesh
Belgium
Bolivia, Plurinational State of
Brazil
Brunei Darussalam
... (lines omitted) ...
Spain
Sweden
Switzerland
Syrian Arab Republic
Taiwan, Province of China
Thailand
Turkey
United Kingdom
United States
Venezuela, Bolivarian Republic of
The uniq command also has a -c option, which gives a count of the
number of times a line occurs in its input.
The last command yielded the list of originating countries for the 1998 spam emails,
but it did not tell us how many emails were sent from each country.
uniq -c will do this for us:
$ sort countries-1998.txt | uniq -c
65
10 -
5 Argentina
4 Australia
1 Austria
1 Bangladesh
1 Belgium
1 Bolivia, Plurinational State of
3 Brazil
1 Brunei Darussalam
... (lines omitted) ...
5 Spain
4 Sweden
4 Switzerland
1 Syrian Arab Republic
10 Taiwan, Province of China
5 Thailand
4 Turkey
22 United Kingdom
669 United States
1 Venezuela, Bolivarian Republic of
The Top Spamming Countries
Now we are in a position to generate a list of the top countries that sent the most spam emails in a given year. Please devise a UNIX pipeline to generate the list of “top 10 spamming countries” for the 1998 spam emails.

Hint: The last pipeline is already halfway there. You only need to pipe the uniq -c output through two more filters. Pick those from what we have already learned so far.

Solution

$ sort countries-1998.txt | uniq -c | sort -r -n | head -n 10
The Top Spamming Centers

Create a similar pipeline to determine the top 10 IP addresses that sent the most spam emails in 1998. Use ip-1998.txt as the input file.

To make this task slightly more challenging, you can start from 1998.dat instead of ip-1998.txt. (A possible solution is sketched below.)
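A possible solution, mirroring the countries pipeline; the more challenging variant simply adds a cut stage in front:

# starting from the pre-extracted IP list
$ sort ip-1998.txt | uniq -c | sort -r -n | head -n 10

# starting from the raw table: extract the IP field first
$ cut -d "|" -f 2 1998.dat | sort | uniq -c | sort -r -n | head -n 10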
Spam Sending Trend by the Years
To answer the third question asked at the beginning of this episode (“is there a trend observed across the years?”), one must repeat the analysis above for multiple years, then tabulate the results for the different years. One way to do this is sketched below.
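One possible sketch uses a shell for loop, which we will meet more formally later (the file names assume the countries-YEAR.txt files created earlier):

$ for year in 1998 1999 2000; do
>     echo "== Top 10 spamming countries in $year =="
>     sort countries-$year.txt | uniq -c | sort -r -n | head -n 10
> done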
uniq also has other switches that can be useful at times:
-u prints only the lines that are not repeated (those without adjacent duplicates),
and -d prints one copy of each line that is repeated.
Please refer to the documentation (man page) for more details,
and give them a try using our data files.
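For example (a sketch; the sort stage is needed because uniq only considers adjacent lines):

$ sort countries-1998.txt | uniq -d    # countries that appear more than once
$ sort countries-1998.txt | uniq -u    # countries that appear exactly once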
Searching and Filtering
grep — Searching for Lines That Match a Pattern
The grep
command is useful for searching and printing lines from
a text file that match a given pattern.
In this sense, grep
is both a search tool and a filter.
grep uses a powerful pattern-matching language called regular expressions,
which can match not only literal substrings, but also a wide variety of patterns
(e.g. digits, alphabetical letters, arbitrary characters,
repetitions of certain types of characters or strings, etc.).
We can use grep to return certain lines of a file (or files)
without having to hunt for them with a GUI editor’s search tool.
This can be very useful,
particularly when we need to search for a word, phrase, or pattern in multiple files.
Please follow along:
$ grep Republic countries-1998.txt
Korea, Republic of
Korea, Republic of
Syrian Arab Republic
Venezuela, Bolivarian Republic of
Korea, Republic of
Korea, Republic of
Dominican Republic
Korea, Republic of
Korea, Republic of
Combined Exercises

- How do we find the unique country names that contain the word “Republic”?
- How many spam emails were suspected to have originated in India? How about France? (A possible approach is sketched below.)
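One possible approach, chaining grep with the filters from earlier (a sketch; note that a plain substring match like India would also catch any longer name containing it):

$ grep Republic countries-1998.txt | sort | uniq    # unique "Republic" country names
$ grep -c India countries-1998.txt                  # count of lines matching India
$ grep -c France countries-1998.txt                 # count of lines matching France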
EXERCISE: Try other phrases, such as: China, Japan, Indonesia, United, ia, Africa.
$ grep 206\.170 1998.dat
1998/03/891219236.5426.txt|206.170.68.60|US|United States
1998/05/895160496.11005.txt|206.170.31.182|US|United States
1998/05/895252660.20888.txt|206.170.31.182|US|United States
1998/05/896579625.31405.txt|206.170.185.101|US|United States
In this example we used grep to look for a specific IP address range prefix in the
file 1998.dat. Please note the \ before the . in the grep command. This is used to
escape the .; otherwise, the . is treated as a wildcard and will match any character.
We can also use grep to search for a string of characters in many files using the command:
$ grep 206.170 *
We can specify that we only want to look for matches at the beginning or end of each line using ^ or $, respectively.
The below command looks for the string at the beginning of each line:
$ grep ^1998/03/891219 1998.dat
1998/03/891219128.5403.txt|206.175.103.56|US|United States
1998/03/891219139.5404.txt|205.184.187.47|US|United States
1998/03/891219144.5405.txt|205.139.129.162|US|United States
1998/03/891219148.5406.txt|208.17.113.108|US|United States
1998/03/891219152.5407.txt|207.159.82.7|US|United States
1998/03/891219156.5408.txt|207.105.189.121|US|United States
1998/03/891219203.5423.txt||Malformed IP address|
1998/03/891219210.5424.txt|207.115.33.229|US|United States
1998/03/891219215.5425.txt|209.152.84.95|US|United States
1998/03/891219236.5426.txt|206.170.68.60|US|United States
$ grep na$ 1998.dat
This command will look for the string na
at the end of each line.
A very useful feature of grep is the ability to use it in conjunction with a pipe (|): we can pipe the output of one command into grep. In this way we can pull out only the information we are interested in.
Here is an example:
$ head -n 50 1998.dat | grep 206.212
1998/03/891020025.3222.txt|206.212.231.88|US|United States
1998/04/891608754.26624.txt|206.212.231.88|US|United States
1998/04/891661453.27625.txt|206.212.231.88|US|United States
1998/04/891661486.27627.txt|206.212.231.88|US|United States
1998/04/891665321.3293.txt|206.212.231.88|US|United States
This command does two things.
First, it outputs the first 50 lines of the file 1998.dat.
Then it pipes this output through grep and filters for lines
containing 206.212
(where, again, the unescaped . stands for any character).
Key Points

- echo prints a message.
- wc counts the number of lines, words, and bytes in a file.
- head prints the first few lines of a text file.
- tail prints the last few lines of a text file.
- cut selects a particular column or columns of text data from a text file.
- sort sorts lines of text.
- uniq prints the unique lines of text.
- grep filters lines of text matching a given text pattern.