
Simple Data Manipulation

This section describes a set of simple commands for manipulating, in specific and useful ways, the contents of text and data files and the output from other commands.

cut and paste

Commands like sed and tr are very good at manipulating the rows (or records) in a file. What they don't do so well is to allow the manipulation of columns (specific fields in each record) within a file. The cut command provides a simple way to extract particular columns from the lines in its input:

	$ cut -d: -f1 pw.test 
	root
	pc
	carey
	mot
	grex

This command extracts field 1 (-f1) from each of the lines in the file pw.test, using the colon as the field delimiter (-d:). In other words, it extracts the login name field from our test password file. Notice that command line switches are not always consistent between commands, and these inconsistencies will come to light from time to time. Here in particular, cut uses -d to introduce a field delimiter, where the sort command we looked at earlier used -t for the same job - oops!
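To make the -d versus -t difference concrete, here is a minimal sketch that runs both commands over the same data. The file pw.sample and its three records are hypothetical stand-ins for pw.test (with the password field reduced to x), and the -k key syntax for sort is an assumption about how you would write the sort key today:

```shell
# Hypothetical sample data in the same colon-delimited layout as pw.test.
tmpdir=$(mktemp -d)
cat >"$tmpdir/pw.sample" <<'EOF'
root:x:0:0::/root/:/bin/sh
pc:x:500:50::/usr1/pc:/bin/bash
carey:x:501:50::/usr1/carey:/bin/bash
EOF

# cut: -d sets the field delimiter, -f selects the field.
logins=$(cut -d: -f1 "$tmpdir/pw.sample")

# sort: -t sets the field delimiter, -k selects the key field
# (here field 3, the user id, compared numerically in reverse order).
by_uid=$(sort -t: -k3,3nr "$tmpdir/pw.sample" | cut -d: -f1)

printf 'logins in file order:\n%s\n' "$logins"
printf 'logins by descending uid:\n%s\n' "$by_uid"

rm -r "$tmpdir"
```

Both commands are told about the colon, but each spells the switch its own way.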

The next example shows how to extract the login name and home directory information into two separate files (/tmp/p1 and /tmp/p6) and then use a paste command to build up a table of this information from the two files:

	$ cut -d: -f1 pw.test >/tmp/p1
	$ cut -d: -f6 pw.test >/tmp/p6
	$ paste /tmp/p1 /tmp/p6
	root	/root/
	pc	/usr1/pc
	carey	/usr1/carey
	mot	/usr1/mot
	grex	/usr1/mot
	$ rm /tmp/p1 /tmp/p6

When you use paste to join fields together like this, it automatically inserts a delimiting character between the fields, which by default is a Tab character (Ctrl-I). If you prefer, you can specify your own delimiting character (or even a separate character between each pair of fields) with the -d switch.
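As a rough illustration of the -d switch to paste, the following sketch joins three small files, first with the default Tab, then with a single delimiter, then with a list of delimiters. The file names (names, homes, shells) and their contents are made up for the example:

```shell
# Three hypothetical single-column files, one value per line.
tmpdir=$(mktemp -d)
printf 'root\npc\ncarey\n' >"$tmpdir/names"
printf '/root/\n/usr1/pc\n/usr1/carey\n' >"$tmpdir/homes"
printf '/bin/sh\n/bin/bash\n/bin/bash\n' >"$tmpdir/shells"

# Default: a Tab between corresponding lines.
tabbed=$(paste "$tmpdir/names" "$tmpdir/homes")

# One delimiter used for every join.
colons=$(paste -d: "$tmpdir/names" "$tmpdir/homes")

# A delimiter list: ':' between files 1 and 2, ',' between files 2 and 3.
mixed=$(paste -d':,' "$tmpdir/names" "$tmpdir/homes" "$tmpdir/shells")

printf '%s\n\n%s\n\n%s\n' "$tabbed" "$colons" "$mixed"

rm -r "$tmpdir"
```

The delimiter list is used in order, one character per join, so the last call produces lines such as root:/root/,/bin/sh.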

The previous examples of cut have only used fields delimited by particular characters, but it is also possible for the fields to be in fixed-byte positions within a record, and not to use field delimiters at all. In these cases you can specify byte positions with the -b command line switch. For both the -f and -b switches to cut, the field or byte positions can be given as either a single number or a set of numbers or ranges of numbers separated by commas. In both cases the numbers count from 1.
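The list and range notation is perhaps easiest to see on a small example. This sketch assumes a hypothetical fixed-width record (a 14-digit timestamp laid out as year, month, day, hour, minute, second) alongside one colon-delimited line in the pw.test layout:

```shell
# Hypothetical fixed-width record: YYYYMMDDhhmmss.
stamp='19950604172723'

year=$(printf '%s\n' "$stamp" | cut -b1-4)              # a single range
minute=$(printf '%s\n' "$stamp" | cut -b11-12)          # another range
month_hour=$(printf '%s\n' "$stamp" | cut -b5-6,9-10)   # two ranges, comma separated
time_part=$(printf '%s\n' "$stamp" | cut -b9-)          # open range: byte 9 to end of line

# The same notation works for fields: field 1 plus fields 6 to 7.
line='carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash'
fields=$(printf '%s\n' "$line" | cut -d: -f1,6-7)

printf '%s\n' "$year" "$minute" "$month_hour" "$time_part" "$fields"
```

Note that selected fields are written out with the delimiter between them, while selected bytes are simply run together.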

The next problem calls for a slightly more complex pipeline: find a way to extract the minute number from the date and time displayed by the date command. Used with no parameters, the date command writes the date and time to the standard output device as a fixed format string:

	$ date
	Sun Jun  4 17:27:23 BST 1995

This string can be treated as a number of fields separated by space characters. These fields are:

Sun         day name;
Jun         month name;
4           day number in month;
17:27:23    time of day (24 hr clock - hour:minute:second);
BST         time zone (British Summer Time);
1995        year number.

The only problem with using cut to extract fields from this string is that the day number within the month (4 in this example) has an extra space in front of it, to pad the field out to the correct width, if the number is less than 10. This will effectively count as an extra field when we tell cut that fields are space separated. A simple solution is to use tr to squeeze multiple spaces to a single space, as follows:

	$ date | tr -s ' '
	Sun Jun 4 17:27:23 BST 1995

The cut command can now be used with a space field delimiter to extract the time of day field (field four):

	$ date | tr -s ' ' | cut -d' ' -f4
	17:27:23

The output from this pipeline is now a string which itself consists of three colon delimited fields of which we want to extract the value in the middle field. Again, cut is the ideal solution, as follows:

	$ date | tr -s ' ' | cut -d' ' -f4 | cut -d: -f2
	27

A simpler solution to the same problem will still use cut but relies on noticing that all the fields in the output from date are of fixed size. Even when numeric fields are less than 10, so that they are only one digit instead of two, the sizes of the fields are maintained by padding them out. This allows for a solution which uses the -b switch to cut with a range of byte positions carefully calculated to coincide with the field of interest (bytes 15 to 16 in this example):

	$ date | cut -b15-16
	27
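Both approaches can be checked side by side. Because the default output of date depends on the system and its locale settings, this sketch works on a fixed copy of the string shown above rather than on live output from date; the variable names are only for illustration:

```shell
# A fixed copy of the date output used in this section (note the two
# spaces before the single-digit day number).
stamp='Sun Jun  4 17:27:23 BST 1995'

# Field approach: squeeze spaces, take field 4, then the middle colon field.
by_field=$(printf '%s\n' "$stamp" | tr -s ' ' | cut -d' ' -f4 | cut -d: -f2)

# Byte approach: the minute digits sit at bytes 15 and 16.
by_byte=$(printf '%s\n' "$stamp" | cut -b15-16)

# Without tr -s, the padding space creates an empty field 3,
# so field 4 is the day number rather than the time of day.
unsqueezed=$(printf '%s\n' "$stamp" | cut -d' ' -f4)

printf 'by field: %s\nby byte: %s\nfield 4 without tr: %s\n' \
    "$by_field" "$by_byte" "$unsqueezed"
```

The byte approach is shorter, but it only works while the string keeps exactly this layout, so it is worth checking the positions against the output of date on your own system.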

Comparing File Contents

There are several commands available for comparing the contents of two files for any differences. The simplest command is effectively used just to tell you whether or not two files are the same. This command is called cmp. All it does is to compare the two files given as parameters byte by byte until it either finds a difference or until it reaches the end of the files. The general form of the command is:

	cmp file1 file2

If either file1 or file2 is just a hyphen (-) then the contents for this file will be taken from the standard input device, which will be the keyboard unless input redirection is used. If the files differ, the output from the cmp command is the byte position and the line number of the first difference. If the files are identical, cmp displays nothing.
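In scripts, cmp is often used for its exit status rather than its message. The following is a small sketch along those lines; the files a, same and b are hypothetical, and the -s (silent) switch is used so that only the exit status matters:

```shell
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' >"$tmpdir/a"
printf 'alpha\nbeta\ngamma\n' >"$tmpdir/same"
printf 'alpha\nbeTa\ngamma\n' >"$tmpdir/b"

# cmp exits with status 0 when the files are identical, non-zero otherwise.
if cmp -s "$tmpdir/a" "$tmpdir/same"; then identical=yes; else identical=no; fi
if cmp -s "$tmpdir/a" "$tmpdir/b"; then differs=no; else differs=yes; fi

# A hyphen makes cmp read that side of the comparison from standard input.
if printf 'alpha\nbeta\ngamma\n' | cmp -s - "$tmpdir/a"; then
    stdin_match=yes
else
    stdin_match=no
fi

# Without -s, cmp reports where the first difference is (byte 9, line 2
# for these files); the exact wording of the message varies by system.
cmp "$tmpdir/a" "$tmpdir/b" || true

printf 'identical=%s differs=%s stdin_match=%s\n' \
    "$identical" "$differs" "$stdin_match"

rm -r "$tmpdir"
```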

If the lines in the two files to be compared are sorted into order then it is possible to use the comm command to do the comparison between them. The general form for the comm command is:

	comm options file1 file2

If no options are specified then the output from comm is displayed in three columns. Output lines beginning in column 1 are those lines that appear in file1 but not in file2. Column 2 contains the lines that appear in file2 but not in file1 and column 3 lists lines contained in both files.

As an example, let us sort the contents of pw.test into the file pw1.test:

	$ sort pw.test -o pw1.test
	$ cat pw1.test
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	grex:Cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	root:awmku76tr43d6:0:0::/root/:/bin/sh

Then, modify some of the lines in pw.test with sed and sort the result, sending the sorted output into the file pw2.test:

	$ sed /pc/,/carey/s/bash/sh/ <pw.test | sort >pw2.test
	$ cat pw2.test
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/sh
	grex:Cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/sh
	root:awmku76tr43d6:0:0::/root/:/bin/sh

Finally, use comm on the two sorted files to observe the result:

	$ comm pw1.test pw2.test
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
		carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/sh
			grex:Cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
			mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
		pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/sh
			root:awmku76tr43d6:0:0::/root/:/bin/sh

The main options you can specify are -1, -2 and -3, which suppress the output of the corresponding columns. For example, to see just the lines in pw1.test which do not occur in pw2.test you need to suppress the output of columns 2 and 3, as follows:

	$ comm -2 -3 pw1.test pw2.test
	carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash 
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash

Similarly, to see only the lines which appear in both files, you need to suppress the output from columns 1 and 2:

	$ comm -1 -2 pw1.test pw2.test
	grex:Cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash 
	root:awmku76tr43d6:0:0::/root/:/bin/sh
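The three useful combinations of these switches can be tried on a pair of much smaller files. This is only a sketch: the files one and two are hypothetical, and the switches are written in their combined form (-23 instead of -2 -3), which comm also accepts:

```shell
# Two small, already sorted, hypothetical input files.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' >"$tmpdir/one"
printf 'banana\ncherry\ndamson\n' >"$tmpdir/two"

only_one=$(comm -23 "$tmpdir/one" "$tmpdir/two")   # column 1 only: lines unique to one
only_two=$(comm -13 "$tmpdir/one" "$tmpdir/two")   # column 2 only: lines unique to two
both=$(comm -12 "$tmpdir/one" "$tmpdir/two")       # column 3 only: lines in both files

printf 'only in one: %s\nonly in two: %s\nin both:\n%s\n' \
    "$only_one" "$only_two" "$both"

rm -r "$tmpdir"
```

When a column is suppressed, the remaining columns lose the corresponding leading Tab, so a single surviving column is printed with no indentation.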

The other main command for comparing file contents is diff. This command performs a much more sophisticated comparison than cmp or comm: it works systematically through a pair of files, which need not be sorted in any way, and displays all the lines where the two files differ:

	$ diff pw1.test pw2.test
	1c1
	< carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	---
	> carey:esJ9ohd8HH89i:501:50::/usr1/carey:/bin/sh
	4c4
	< pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	---
	> pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/sh

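The lines in the output display which are preceded by a less than (<) symbol are the lines in the first input file which do not match the corresponding lines in the second file. Similarly, lines preceded by a greater than (>) symbol are those contained in the second input file, but not in the first. Each group of lines is introduced by a line such as 1c1, which gives the line numbers involved in each file and the kind of difference found: here, line 1 of the first file would have to be changed (c) to match line 1 of the second.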
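As well as changed lines, diff reports lines that have been added or deleted, and like cmp it sets an exit status that scripts can test. The following sketch uses two hypothetical files, old and new, to show a change and an addition together:

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' >"$tmpdir/old"
printf 'one\n2\nthree\nfour\n' >"$tmpdir/new"

# diff exits with status 0 if the files match and 1 if they differ.
if report=$(diff "$tmpdir/old" "$tmpdir/new"); then
    status=0
else
    status=$?
fi

# For these files the report is:
#   2c2      line 2 of old changed into line 2 of new
#   < two
#   ---
#   > 2
#   3a4      after line 3 of old, line 4 of new was added
#   > four
printf '%s\nexit status: %s\n' "$report" "$status"

rm -r "$tmpdir"
```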

uniq

Once a file has been manipulated, for example, cut and then ordered with sort, it is possible for there to be duplicate lines in the output file. These duplicate records can be removed from the file to leave only unique instances of each record with the command uniq:

	$ cat pw.test | cut -d: -f7 | sort | uniq 
	/bin/bash
	/bin/sh

This pipeline sends the contents of the file pw.test into the cut command, which extracts each user's shell name (i.e. field seven). This is then piped into sort, which ensures that any duplicate lines are next to each other in its output. Finally, this sorted list is piped into uniq, which replaces each run of adjacent identical lines with a single copy of that line.
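The reason for sorting first becomes clear if you try uniq on unsorted input. In this sketch the list of shell names is a hypothetical stand-in for the output of the cut command above, and the -c switch (which prefixes each line with a count) is shown as an optional extra:

```shell
# Hypothetical unsorted list of shell names, one per user.
shells='/bin/sh
/bin/bash
/bin/bash
/bin/sh
/bin/bash'

# uniq alone only merges duplicates that are next to each other.
unsorted=$(printf '%s\n' "$shells" | uniq)

# Sorting first brings all duplicates together, so each value survives once.
sorted=$(printf '%s\n' "$shells" | sort | uniq)

# uniq -c counts how many times each line occurred; sed strips the
# leading padding, whose width differs between systems.
counted=$(printf '%s\n' "$shells" | sort | uniq -c | sed 's/^ *//')

printf 'uniq only:\n%s\n\nsort | uniq:\n%s\n\nsort | uniq -c:\n%s\n' \
    "$unsorted" "$sorted" "$counted"
```

Without the sort, /bin/sh and /bin/bash each appear twice in the output, because their duplicates are not all adjacent.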

