Search and Sort

A very common operation is to search through a file for lines containing a particular string, or to search through a list for a particular item. Another common operation is to sort lines into some kind of order. Linux has commands which can perform both of these actions either from the command line or from within a pipeline.

grep

The first command to look at is a filter called grep. The name comes from the ed editor command g/re/p (globally search for a regular expression and print), which translated into English means that grep can be used to search for string patterns in its input. The output from grep consists of all the lines of input which contain the specified string pattern. For example, to find out whether or not a login for the user carey exists on the system, you just need to search for this name string in the password file, as follows:

	$ grep carey /etc/passwd
	carey:Yt1a4ffkG2rO2:501:500::/usr1/carey:/bin/bash

The fact that the grep command produced this line of output shows that, indeed, this user name exists on the system. Note, however, that grep was not only looking for the string in the user name field of each password file entry, but anywhere in the file. This means that just because grep finds the string you were looking for it doesn't necessarily imply that it was in the context you expected:

	$ grep public /etc/passwd
	uucp:*:10:14:uucp:/var/spool/uucppublic:

In this example, there is an output line from grep, even though there is no user name public registered on the system.

If you want grep to find only the user names in the password file, then you need to make sure that it only looks for the search pattern at the start of each line. The solution to this problem in grep is similar to the solution used in vi for its search string specifications - regular expressions.

The basic set of special characters used for regular expressions in grep is the same as those used in vi. There is also an extended set of special characters available which you can use in grep if you include the -E switch on the command line. The following table gives a list of most of the special character sequences available in grep and their meanings; the B or E in the middle column shows which features are basic and which are extended:

^	B	matches the start of the line;
$	B	matches the end of the line;
\<	B	matches the start of a word;
\>	B	matches the end of a word;
.	B	matches any single text character;
[str]	B	matches any single character in str;
[^str]	B	matches any single character not in str;
[a-b]	B	matches any character between a and b;
\	B	turns off any special meaning of the character following;
*	B	matches zero or more repeats of the previous item;
+	E	matches one or more repeats of the previous item;
?	E	matches zero or one repeats of the previous item;
{j}	E	matches exactly j repeats of the previous item;
{j,}	E	matches j or more repeats of the previous item;
{,k}	E	matches at most k repeats of the previous item;
{j,k}	E	matches j to k repeats of the previous item;
s|t	E	matches either item s or item t;
(exp)	E	treats exp as a single item.

Don't forget that there is one other major difference between using regular expressions in vi and grep. In the editor, any characters you type are seen and used only by the editor itself, whereas any parameters that grep sees have already passed through the shell. Since some of the special characters used by grep in regular expressions also have special meaning to the shell, you need to tell the shell to ignore them so that grep gets the chance to see them instead. This is done by quoting the characters you want the shell to ignore. If you need a refresher, the shell's quoting mechanisms were described in an earlier tutorial (see the bash tutorial).

Returning to the original problem of searching the password file just for login names, it is now an easy task to construct a grep command line to show that 'public' is not a valid login name:

	$ grep '^public' /etc/passwd

The ^ before the word 'public' forces grep only to look for the word at the start of each line. The whole search pattern is enclosed in single quotes (') so that the shell won't interfere with it, but will pass it (minus the quotes) onto grep. The fact that there is no output from grep shows that none of the lines in the password file starts with the word 'public'.
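As a small sketch of the difference the anchor makes, the following commands build a throwaway file in password-file format (the file name and both entries are invented for illustration) and count the matches with and without the ^. The trailing colon in the anchored pattern is an extra precaution that is not part of the example above: it stops a login such as publicity from matching as well.

```shell
# Build a scratch file in passwd format; both entries are made up.
tmp=$(mktemp -d)
cat > "$tmp/passwd.sample" <<'EOF'
uucp:*:10:14:uucp:/var/spool/uucppublic:
carey:x:501:500::/usr1/carey:/bin/bash
EOF

# Unanchored: counts lines containing "public" anywhere.
unanchored=$(grep -c 'public' "$tmp/passwd.sample")

# Anchored: counts only lines whose login name field is exactly "public".
# With no matches grep -c still prints 0 but exits non-zero, hence "|| true".
anchored=$(grep -c '^public:' "$tmp/passwd.sample" || true)

echo "unanchored matches: $unanchored"
echo "anchored matches:   $anchored"
rm -rf "$tmp"
```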

A slightly more complex example is to display lines from the password file where the login name has no password set on it. This is true if the second field on the line (between the first and second colon characters) is empty:

	$ grep -E '^[^:]+::' /etc/passwd
	sync::3:2::/:/bin/sync
	mtos::9876:9876:student login:/mtos/home:/bin/bash
	load::9876:9876:mount floppy:/:/mtos/bin/load
	unload::9876:9876:umount floppy:/:/mtos/bin/unload

You can see from the output that there are four such logins in this example, with login names: sync, mtos, load and unload.

In order to understand it fully, the search pattern ( '^[^:]+::' ) used in the example may need a little explanation. The first and last single-quote characters just tell the shell to ignore the string contents and pass it, unchanged, to grep. The search pattern itself needs to look for three things: the start of a line, followed by a login name, followed by a pair of colons which indicate an empty password field. The start of line is picked up by the initial ^ character. The empty password field is matched by the pair of colons (::) at the end of the string. That just leaves [^:]+ to match any login name. To understand this last part you need to think of a login name as a variable number of characters up to the first colon character, and that is just what [^:]+ means - one or more non-colon characters. The -E command line switch to grep is required in order to be able to use the + repetition operator from the extended feature set.
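The pattern can be tried out safely on a scratch file rather than on the real password file. The sketch below uses three invented entries and also shows one way of writing the same search in the basic syntax, where + is not available: repeating the bracket expression and following it with * has the same "one or more" effect.

```shell
# Scratch password-style file; the entries are illustrative only.
tmp=$(mktemp -d)
cat > "$tmp/passwd.sample" <<'EOF'
root:awmku76tr43d6:0:0::/root:/bin/sh
sync::3:2::/:/bin/sync
carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
EOF

# Extended syntax: + means one or more non-colon characters.
with_E=$(grep -E '^[^:]+::' "$tmp/passwd.sample")

# Basic syntax: one non-colon character followed by zero or more of them.
basic=$(grep '^[^:][^:]*::' "$tmp/passwd.sample")

echo "$with_E"
echo "$basic"
rm -rf "$tmp"
```

The root and carey lines also contain a pair of colons (their fifth field is empty), but they are not reported because [^:] cannot step over the first colon to reach them.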

If you use a grep command without specifying any files to search, then grep takes its input from the standard input device. In this mode grep can have its input piped to it from a previous command. Here is an example pipeline to list the names of any subdirectories under the root directory that ordinary users have write permission for:

	$ ls -l / | grep '^d.......w' 
	drwxrwxrwt 10 root root 1024 May 30 14:30 tmp

As you would have expected, this pipeline shows that ordinary users have write permission in the directory /tmp. It operates by using ls -l / to generate a long listing of all the files and directories contained in the root directory. The output lines from ls are then piped into grep, which is set up to search for those lines that begin (^) with a d (directories) and then have a set of permission flags which can have any values (...) for the owner, any values (...) for the group, any value (.) for the world read permission flag and which finally has the world write permission flag set to allow world write access to this directory.
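The same pipeline can be tried on a scratch directory whose permissions are known in advance. In this sketch the directory names shared and private are made up, and chmod is used to give one of them the same permissions as /tmp.

```shell
# Two scratch directories with known permissions.
tmp=$(mktemp -d)
mkdir "$tmp/shared" "$tmp/private"
chmod 1777 "$tmp/shared"    # world-writable with the sticky bit, like /tmp
chmod 755  "$tmp/private"   # writable by the owner only

# d, then seven permission characters of any value, then the world write flag.
world_writable=$(ls -l "$tmp" | grep '^d.......w')

echo "$world_writable"
rm -rf "$tmp"
```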

Some useful command line switches for use with grep are:

-E	use extended regular expressions for pattern matching;
-i	treat upper and lower case letters as the same;
-n	display line number within file for each output line;
-q	suppress all normal output, so only the exit status reports a match;
-s	suppress file read permission error messages;
-num	display the num lines before and after each match.
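A brief sketch of three of these switches in use, on a small invented word list. The -q form is typically placed in an if statement, since its only result is the exit status.

```shell
# Three-line scratch file to search.
tmp=$(mktemp -d)
printf '%s\n' 'Alpha' 'beta' 'ALPHA again' > "$tmp/words.txt"

# -i ignores case, -n prefixes each matching line with its line number.
numbered=$(grep -i -n 'alpha' "$tmp/words.txt")
echo "$numbered"

# -q prints nothing; the exit status alone says whether a match exists.
if grep -q 'beta' "$tmp/words.txt"; then
    found=yes
else
    found=no
fi
echo "beta present: $found"
rm -rf "$tmp"
```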

find

The primary action of the find command is to traverse the tree structured directory hierarchy from a given starting point right down through all the branches and on to the leaf nodes. For each of the files and subdirectories that find discovers on the way, it can be made to perform a set of actions which you program at the command line. The most common use of find is just to generate a list of all the files and directories below a given starting point in the directory hierarchy:

	$ cd
	$ find . -print 
	.
	./backup
	./backup/motd.bak
	./backup/passwd.bak
	./text
	./text/passwd
	./text/motd

This example shows a cd command that changes the current working directory to your home directory, followed by the find command. The two parameters passed into find are the dot, which tells find to start its search with the current directory, and the -print, which just tells find to display the names of all the files it finds. Since the file names are written to the standard output device by find, they can also be written to a pipe for use by other commands, as the following example demonstrates:

	$ find . -print | grep passwd
	./backup/passwd.bak
	./text/passwd

In fact, as you will soon see, this particular application can be performed directly by find without having to use grep at all.

The general format for the find command is:

	find pathnames -expressions

where pathnames can be any list of directory names which will each be searched recursively to generate the file names, and -expressions is any list of expressions which define conditions and actions to be performed on each of the files found by the search. If no pathnames are specified, then the default is the current directory, and if no -expressions are given, then the default expression -print is used:

	$ find
	.
	./backup
	./backup/motd.bak
	./backup/passwd.bak
	./text
	./text/passwd
	./text/motd

There are three kinds of expression that can be used with find: options, conditions and actions. Each of these types returns a value which is either true or false depending on which expression it is and on how its evaluation progressed. Whenever a find command has more than one expression specified, then logical operators are used between successive pairs of expressions to control their execution. Given the expressions e1 and e2:

e1 -a e2	evaluates e2 only if e1 is true;
e1 e2	same as e1 -a e2;
e1 -o e2	evaluates e2 only if e1 is false;
e1 , e2	evaluates both, in the order e1 then e2.

The value of a compound expression is the value returned by the last component expression that was actually evaluated. In compound expressions with more than two components, -a (whether written out or implied) binds more tightly than -o, and the comma operator binds least tightly of all, so that:

	e1 -o e2 -a e3

will evaluate as:

	e1 -o ( e2 -a e3 )

unless parentheses are added explicitly to change this, as in:

	( e1 -o e2 ) -a e3

Because parentheses are also special to the shell, they must be quoted on a real command line, for example by writing them as \( and \).

A logical not operator is also available, so that:

	! e1

is true when e1 is false, and vice versa.
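Operator precedence is easy to get wrong when -o is mixed with an action. In POSIX and GNU find the implied -a binds more tightly than -o, and the sketch below (using an invented scratch tree) shows the practical effect: without parentheses the -print is attached to the second condition only.

```shell
# Scratch tree: one directory, one .bak file, one other file.
tmp=$(mktemp -d)
mkdir "$tmp/backup"
touch "$tmp/backup/motd.bak" "$tmp/notes.txt"

# Parsed as: -name '*.bak' -o ( -type d -a -print )
# so only the directories are printed.
ungrouped=$(cd "$tmp" && find . -name '*.bak' -o -type d -print | LC_ALL=C sort)

# Quoted parentheses make -print apply to either kind of match.
grouped=$(cd "$tmp" && find . \( -name '*.bak' -o -type d \) -print | LC_ALL=C sort)

echo "ungrouped:"; echo "$ungrouped"
echo "grouped:";   echo "$grouped"
rm -rf "$tmp"
```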

The following details some of the expressions that are available with find, along with the type of each expression and the value it returns:

-mount: An option expression which prevents find from searching outside the boundaries of the current filesystem. The return value is always true.
-group grp: A condition expression which checks the current file to see if it has a group ID or group name that matches grp. Return value is true if file's group ID or group name matches grp or false otherwise.
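-name pattern: A condition expression which looks for a file name that matches pattern. The pattern is a shell-style wildcard pattern (using *, ? and []), not a regular expression, and it is compared with the last component of the pathname only. Wildcard characters need quoting to protect them from the shell. Returns true if the current file name matches pattern or false otherwise.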
-type t: A condition expression which checks to see if the current file is of type t, where t can be d for a directory, f for an ordinary file, l for a symbolic link, etc. Returns true if the current file is of type t or false otherwise.
-user usr: A condition expression which checks the current file to see if it has owner ID or owner name that matches usr. Return value is true if file's owner ID or owner name matches usr or false otherwise.
-exec cmd ;: An action expression which executes the command cmd. If you need to pass the current file name to cmd this is done with the notation {}. The semicolon (;) is used to mark the end of cmd and separate it from any further expressions that may follow; it must be quoted to protect it from the shell. The return value is true if cmd exits with a zero status or false otherwise.
-print: An action expression which just prints the current file name to the standard output device. Always returns true.

One obvious point to note about find is that it will not allow you to bypass the system's file and directory permission arrangements. Consequently, if you start find off in a place which has directories below it to which your access is denied, then find will generate an error message, sent to the standard error device, for each one. As the default device for standard error messages is the screen, this can make the real output you are looking for (assuming you are using the -print action) rather difficult to see:

	$ find / -name passwd -print
	find: /var/spool/cron: Permission denied
	find: /var/spool/atjobs: Permission denied 
	find: /var/spool/atspool: Permission denied 
	/usr/bin/passwd
	find: /usr/doc/ttysnoop: Permission denied 
	/etc/passwd
	find: /root: Permission denied

The simplest thing to do to suppress this unwanted output is to redirect the error output to a file. Ordinarily, this would mean that you would have an unwanted file to delete afterwards; however, there is one file you can use which is specially provided for you to send things to that you really want to throw away. The file's pathname is:

	/dev/null

This file is specially arranged so that any time you write something to it, the characters just get thrown away. It is also arranged so that when you try to read any characters from it you always get an end-of-file indication straight away with no other characters read.
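Both properties are easy to confirm from the command line. In this short sketch, wc -c counts the bytes it manages to read, and tr strips the padding that some versions of wc add around the number.

```shell
# Anything written to /dev/null is simply discarded.
echo "this text goes nowhere" > /dev/null

# Reading from it hits end-of-file at once, so no bytes are counted.
bytes=$(wc -c < /dev/null | tr -d ' ')
echo "bytes read from /dev/null: $bytes"
```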

Using this file to suppress the error output from find, the previous example becomes:

	$ find / -name passwd -print 2>/dev/null 
	/usr/bin/passwd
	/etc/passwd

Notice that this example is using two find expressions, with no operator between them (this arrangement defaults to using the -a operator). For this operator, the first expression is evaluated for each file name generated by find and only if the result is true will the second expression be evaluated. The first expression in this example is -name passwd, which looks for the file name passwd at the end of the pathname find is currently supplying. If a match is found, the expression returns the value true; otherwise it returns false. This return value determines whether or not to evaluate the second expression. If -name returns the value true then the second expression (-print) will display find's current pathname to the standard output device. The overall effect of this is to print a list of all the pathnames that find discovers with the file name passwd on the end.

In this way you can find a list of pathnames without piping to grep as in previous examples. There is a difference, however: the -name expression only matches against the complete file name on the end of a pathname, whereas grep can look for substrings:

	$ cd
	$ find . -name motd -print 2>/dev/null
	./text/motd
	$ find . -print 2>/dev/null | grep motd
	./text/motd
	./backup/motd.bak
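If partial matches are wanted from find itself, one option is to give -name a quoted wildcard pattern. The sketch below rebuilds the small directory tree from the examples in a scratch location (the location is an assumption made for illustration) and compares an exact name with a wildcard.

```shell
# Recreate the example tree under a scratch directory.
tmp=$(mktemp -d)
mkdir "$tmp/backup" "$tmp/text"
touch "$tmp/backup/motd.bak" "$tmp/backup/passwd.bak" \
      "$tmp/text/motd" "$tmp/text/passwd"

# Exact name: only the file called precisely "motd".
exact=$(cd "$tmp" && find . -name motd -print)

# Quoted wildcard: any file name beginning with "motd".
wild=$(cd "$tmp" && find . -name 'motd*' -print | LC_ALL=C sort)

echo "$exact"
echo "$wild"
rm -rf "$tmp"
```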

As a final, and more complex, example, let us see how we might generate a list of pathnames of all the ordinary files in a directory subtree whose contents include a particular string. One possible solution is to use find to generate the pathnames and then to use grep to see if the specified string appears in any of the files, printing the pathname of any files for which this is true. The only real problem is that grep normally displays any lines that it finds to the standard output as well as sending any error messages to the standard error device. One way round this problem is to redirect both the standard output and error output to /dev/null, but, in fact, grep provides an easier solution than this because it has command line switches built in to suppress these outputs for you:

	$ find /etc -type f -exec grep -q -s mycroft {} \; -print 
	/etc/HOSTNAME
	/etc/hosts
	/etc/lilo.conf

What this command line does is to use find to generate a list of all the pathnames under the directory /etc. It then evaluates the expression -type f which returns the value true if the current pathname is an ordinary file. The second expression (-exec grep -q -s mycroft {} \;) is thus evaluated for all the ordinary files, and searches each of them in turn for the string mycroft. Don't forget that the notation {} is replaced by the current pathname when grep is executed. The -q and -s options stop grep from producing its normal display, so that, effectively, this expression is just being used to generate a true or false value which decides whether or not to evaluate the third expression. The \ before the semi-colon at the end of the second expression is just to quote the semi-colon so the shell will ignore it and pass it to find unchanged. If the -exec grep expression returns true, then the -print expression will be evaluated which, as its action, will display the current pathname, as required.
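The technique can be rehearsed on a scratch directory instead of /etc. The sketch below uses three invented files and, for comparison, an alternative that is not used in the text above: grep's -l switch, which prints the names of matching files itself, combined with the {} + form of -exec, which passes many pathnames to a single grep.

```shell
# Scratch stand-in for /etc; file contents are made up.
tmp=$(mktemp -d)
mkdir "$tmp/etc"
echo "mycroft"           > "$tmp/etc/HOSTNAME"
echo "127.0.0.1 mycroft" > "$tmp/etc/hosts"
echo "nothing to see"    > "$tmp/etc/motd"

# One grep per file; find prints the pathname when grep succeeds.
via_print=$(find "$tmp/etc" -type f -exec grep -q -s mycroft {} \; -print | LC_ALL=C sort)

# Alternative: grep -l lists matching files, run on batches of pathnames.
via_l=$(find "$tmp/etc" -type f -exec grep -l -s mycroft {} + | LC_ALL=C sort)

echo "$via_print"
echo "$via_l"
rm -rf "$tmp"
```

The second form usually starts far fewer processes, at the cost of no longer being able to chain further find expressions on the result of each individual grep.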

sort

The sort command is a filter which can sort the lines of text in its input into the order you specify on the command line and then send the sorted lines to the standard output device. The sort command is very versatile in that it can treat the input lines as a collection of several fields and perform sorting operations using one or more of these fields as the sort key. You can also choose the character that sort will use as its field delimiter so that you may sort files of different types.

In the examples that follow we shall make use of the following small (and fictitious) password file, called pw.test:

	root:awmku76tr43d6:0:0::/root/:/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh

By default, sort uses whitespace characters (spaces, tabs, etc.) as its field separators. Therefore, the lines of pw.test, containing no whitespace characters, are each treated as a single field. If sort is run on this file, it will treat each line as though it is a single word and sort them into alphabetical order:

	$ sort pw.test
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	root:awmku76tr43d6:0:0::/root/:/bin/sh

The sort command will take its input from any files specified on the command line, or from the standard input device if there are none given.

Many command line switches are available with sort; the most useful are given in the following table:

-b	ignore leading space characters in sort keys;
-f	treat upper and lower case letters as the same;
-n	treat the sort key as numeric rather than text;
-r	sort from high to low rather than from low to high;
-o file	send output to file instead of standard output;
-t s	use s as field separator, instead of whitespace;
-k s1,s2	use fields s1 to s2 inclusive as the sort key.

Actually, the specification of the sort key is a little more complex than is obvious from the previous table. To start with, it is permitted to give repeated -k switches, so that any combination of fields can be specified. The sort command uses the fields in the order given on the command line, which need not be the same as the order of the fields in the file. Each of the field specifiers (s1 and s2) can be given as f.c, where f is the field number and c is the character position within the field. Both of these values start counting at one. Some examples might help to make it clearer:

-k3	The sort key is from the start of the third field to the end of the line.
-k3,5	The sort key here is made up of the third, fourth and fifth fields on the line.
-k4,4 -k1,2	The sort key is field 4 followed by field 1 followed by field 2.
-k3.3,3	The sort key is field 3, but omitting the first two characters in the field.
-k3.2,3.5	The sort key in this case is four characters long, starting at the second character of field 3.
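The end position of a key is inclusive, and leaving it out extends the key to the end of the line. A minimal sketch with a two-line invented file shows the difference; LC_ALL=C is set only to make the character ordering predictable.

```shell
# Two lines that tie on field 2 but differ in fields 1 and 3.
tmp=$(mktemp -d)
printf '%s\n' 'b:1:y' 'a:1:z' > "$tmp/keys.txt"

# Key is field 2 alone: both lines tie on "1", so sort falls back to
# comparing the whole lines and the line starting with "a" comes first.
field_only=$(LC_ALL=C sort -t: -k2,2 "$tmp/keys.txt")

# Key runs from field 2 to the end of the line: "1:y" sorts before "1:z".
to_end=$(LC_ALL=C sort -t: -k2 "$tmp/keys.txt")

echo "$field_only"
echo "$to_end"
rm -rf "$tmp"
```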

Returning to the pw.test file, if you want to do anything sophisticated with this file, the first thing to do is to change the field separator from the whitespace default to a colon, which will allow sort to access the individual fields in the file. The following example will sort the file on the UID contained in field 3:

	$ sort -t: -k3,3 pw.test
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash

By default, if two lines have the same value for the sort key then they will be ordered as though the whole lines were just single words to be sorted. You can see this in the second and third lines of the output in the last example, which both have the value 1500 in the key field.

Instead of this, you can choose to resolve the problem yourself by specifying a second (or more if needed) key field to use for the sort when the first key field cannot determine the order. If we apply this to the last example and specify that if field 3 cannot determine the order, then field 7 (the shell field) should be used next, we obtain the following result:

	$ sort -t: -k3,3 -k7 pw.test
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash

One small problem with both of the previous examples is that the file does not seem to have been sorted properly on the UID field, as the lines have been displayed in key field order 0, 1500, 1500, 500, 501. This is because sort is treating the content of the UID field as though it is words and not numbers, in which case sort has got it right as 0 comes before 1, which comes before 5. It's not what was required, though. The solution is to use the -n switch so that the field contents will be sorted into numeric order as required:

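	$ sort -t: -n -k3,3 -k7 pw.test
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash

and this seems to have solved the problem. Note, however, that a -n switch given on its own applies to every key, so field 7 is also compared as a number here; the two lines with UID 1500 therefore tie on both keys and fall back to whole-line order, which puts grex before mot.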
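A global -n applies to every key that has no ordering letters of its own. If the numeric comparison is wanted for the UID only, with the shell field still compared as text, one approach is to attach the n to the individual key. The sketch below recreates pw.test in a scratch directory and uses cut, which is not otherwise covered here, purely to show the resulting order of login names.

```shell
# Recreate the sample password file in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/pw.test" <<'EOF'
root:awmku76tr43d6:0:0::/root/:/bin/sh
pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
EOF

# Field 3 compared numerically, field 7 compared as text.
order=$(LC_ALL=C sort -t: -k3,3n -k7,7 "$tmp/pw.test" | cut -d: -f1)

echo "$order"
rm -rf "$tmp"
```

With the keys written this way the two UID 1500 lines are separated by their shells, so mot (/bin/bash) is listed before grex (/bin/sh).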

One final example of the use of sort shows how to use just a small part of a field as the sort key. For this example we shall sort the file on the first two characters of the password field, and also obtain the output displayed in reverse order:

	$ sort -t: -r -k2.1,2.2 pw.test
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	root:awmku76tr43d6:0:0::/root/:/bin/sh

All of the examples of sort that you have seen have resulted in a display on the screen. As this is just the standard output device being used, the output could, obviously, have been redirected to a pipe or a file. One word of warning is in order here: don't redirect the output from sort back into the file you are reading as the input, or the shell will empty your input file in the process of setting up the redirection, before sort has had a chance to read it (i.e. don't do this):

	$ sort somefile >somefile

If you need the output to go back into the same file then use the -o switch to sort to do it, as in the following example:

	$ sort -o somefile somefile
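Both behaviours can be confirmed on a scratch file without risking real data. In this sketch the file names fruit and fruit.copy are invented; the second command deliberately makes the mistake described above so that its effect can be measured.

```shell
# Scratch file plus a copy to sacrifice to the unsafe form.
tmp=$(mktemp -d)
printf '%s\n' pear apple mango > "$tmp/fruit"
cp "$tmp/fruit" "$tmp/fruit.copy"

# Safe: sort reads all of its input before it writes the -o file.
LC_ALL=C sort -o "$tmp/fruit" "$tmp/fruit"
sorted=$(cat "$tmp/fruit")

# Unsafe: the shell empties the file before sort starts reading it.
sort "$tmp/fruit.copy" > "$tmp/fruit.copy"
left=$(wc -c < "$tmp/fruit.copy" | tr -d ' ')

echo "$sorted"
echo "bytes left after redirecting onto the input: $left"
rm -rf "$tmp"
```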
