Modifying Files

Several Linux commands let you modify text, either in a file or as it passes through a pipe. This section examines some of these commands and demonstrates their power.

sed

If you had a text file that you needed to edit, you could simply do it interactively at the keyboard with vi. If the text you wanted to edit only existed in a pipe between two processes, it would be a different proposition. This is where sed comes in. It is an editor specifically designed to take lines of text from the standard input and send them on to the standard output, performing editing actions on them as they pass through. To use sed, you specify the editing actions you want it to perform on the command line when you enter the command. Because sed only sees the text once as it passes by, there is a limit to what you can reasonably expect it to do. For instance, editing actions that need the editor to move backwards and forwards within the text are out of the question.

The standard action for sed is to read a line of input and pass it on to the output unchanged, repeating this until the end-of-file is reached. Any editing actions you introduce into this sequence are applied, where appropriate, to each line after it is read in. The result is then usually passed to the output instead of the original line. The editor also keeps a count of the number of lines passing by, so that your editing actions can refer to particular line numbers or ranges of line numbers in the text.

Usually, sed will take its text from the standard input, but it is also possible to take file input instead if the names of one or more files are specified on the command line. The general form of a sed command is:

	sed actions files

Here, actions is a list of one or more editing actions that you want sed to perform. If there is only one action it can just be entered on the command line as it is. If you want to perform more than one action with a particular sed command then there are two choices: either the actions can be introduced on the command line each preceded by a -e switch, or the list of actions can be written to a file and sed directed to consult the file with the -f file switch.
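
As a rough sketch of the two ways of supplying several actions, the following fragment runs the same pair of actions first with -e switches and then from a file read with -f. The sample input, the variable names and the temporary file created with mktemp are assumptions made for illustration only:

```shell
# Sample input invented for this sketch.
input='one
two
three
four'

# Two actions given on the command line, each introduced by -e.
with_e=$(printf '%s\n' "$input" | sed -e '1d' -e 's/t/T/')

# The same two actions, one per line, in a temporary file read with -f.
script=$(mktemp)
printf '%s\n' '1d' 's/t/T/' > "$script"
with_f=$(printf '%s\n' "$input" | sed -f "$script")
rm -f "$script"

printf '%s\n' "$with_e"
```

Both forms should give the same result; the file form tends to be easier to manage once the list of actions grows.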

Usually, you will need to specify a line or range of lines in the text to which each editing action is appropriate. This is done by giving zero, one or two line addresses before each action. A line address can be given either as a line number or as a regular expression, which then specifies a string pattern to search for on the line. The regular expressions available with sed are just the basic set available with grep. Go back to the section on Search and Sort if you want a reminder of the format of the basic regular expressions.

Editing actions with no line addresses are applied to every line in the input text. If one line address is given, then the action is just applied to lines which match the address specification. When two line addresses are supplied, the action is applied to a range of lines that starts with the first line that matches the first address and extends to the next line after that which matches the second address. Then, beginning with the next line after the selected range, sed starts looking for another matching range of lines, starting with the first address again.
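
The three addressing cases can be sketched as follows. The input text and the START and END marker lines are invented for this illustration:

```shell
# Sample input invented for this sketch: two START..END blocks.
input='a
START
b
END
c
START
d
END
e'

# No address: the action applies to every line.
every=$(printf '%s\n' "$input" | sed 's/^/> /')

# One address: only line 3 is deleted.
single=$(printf '%s\n' "$input" | sed '3d')

# Two addresses: each START..END range is deleted. After the first range
# ends, sed looks for the first address again and finds the second block.
ranges=$(printf '%s\n' "$input" | sed '/START/,/END/d')

printf '%s\n' "$ranges"
```

Only the lines a, c and e should survive the last command, because both ranges are selected in turn.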

Let us take a look at some examples of sed in action, using the data file pw.test, which was introduced when we looked at grep. The first example shows the delete action:

	$ cat pw.test | sed '4,$d'
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:asJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash

The cat command is used to list the pw.test file to its standard output. This output from cat is piped into sed, which has a single editing action ('4,$d'). It is necessary to quote this editing action so that it does not get tangled up with the shell. If you are ever in doubt about whether or not something like this needs quoting, just quote it anyway to be on the safe side. The action itself says: delete the range of lines from line 4 to the end of the file (the $ means last line in file).

The same effect could be achieved by getting sed to quit after processing line 3, with the q action, as in the next example:

	$ sed 3q pw.test
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:asJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash

Here, the pw.test file is given as a parameter to sed rather than being piped into it as in the previous example.

Sometimes, you will want to suppress the default action of sending all edited lines to the standard output. This can be done with the -n command line switch. Obviously, if you use the -n switch, you will then need to specify explicitly which lines are to be printed to sed's standard output. This is done with the p editing action. The following example demonstrates these things by just selecting and displaying two lines out of the middle of the file:

	$ sed -n 2,3p pw.test
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:asJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash

The next example is a little more complex, and uses the translate characters action (y). The basic idea behind character translation is that you specify two strings of equal length, then, on any lines that are selected by the line addressing, each of the characters in the first string is translated into the corresponding character in the second string. In the following example, all lines containing the word bash are selected and on those lines all the colons are translated to underscore (_) characters and all the zeros are translated to percent (%) symbols:

	$ sed /bash/y/:0/_%/ pw.test
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	pc_bdhd74hs9jh3h_5%%_5%__/usr1/pc_/bin/bash
	carey_asJ9ohd8HH89i_5%1_5%__/usr1/carey_/bin/bash
	mot_dhjd83kjdJS6D_15%%_6%__/usr1/mot_/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh

Inserting an exclamation mark (!) into an editing action after the line addressing applies the action to all the lines NOT selected by the line addressing, rather than to the lines that ARE selected:

	$ sed '/bash/!y/:0/_%/' pw.test
	root_awmku76tr43d6_%_%__/root/_/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	carey:asJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	grex_cj8AjoWE8h8fs_15%%_6%__/usr1/mot_/bin/sh
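
The ! modifier works with other actions too. As a small sketch, combining it with -n and p, or with d, gives two ways of filtering lines. The pw.test data is held in a shell variable here purely to keep the fragment self-contained, and the variable names are illustrative:

```shell
# The five lines of pw.test, held in a variable for this sketch.
pw='root:awmku76tr43d6:0:0::/root/:/bin/sh
pc:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
carey:asJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh'

# -n suppresses the default output; !p prints only lines NOT matching /bash/.
not_bash=$(printf '%s\n' "$pw" | sed -n '/bash/!p')

# !d deletes every line that does NOT match, leaving only the bash users.
only_bash=$(printf '%s\n' "$pw" | sed '/bash/!d')

printf '%s\n' "$not_bash"
```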

The other main editing command (s) gives you the ability to substitute one string, specified as a regular expression, for another. The general format for the substitute action is:

	s/expr/new/flags

where expr is the regular expression for which sed is to search, new is the text that will replace the regular expression and flags can be selected from the following list:

num: Normally, only the first occurrence of expr will be substituted, but if num is specified (a positive number, counting occurrences from 1) then that occurrence will be substituted instead.
g: If this flag is specified then every occurrence of expr will be replaced with new.
p: This flag causes the current line to be printed to the standard output if any substitution has taken place.
w file: This flag causes the current line to be written to the specified file if any substitution has taken place. The file is created before any input is processed, and any previous contents are lost.
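
A minimal sketch of the first three flags, using an invented input line so that the effect of each flag is easy to see:

```shell
# Sample line invented for this sketch.
line='a-b-c-d'

# No flag: only the first occurrence is replaced.
first=$(printf '%s\n' "$line" | sed 's/-/+/')      # a+b-c-d

# Numeric flag: only the third occurrence is replaced.
third=$(printf '%s\n' "$line" | sed 's/-/+/3')     # a-b-c+d

# g flag: every occurrence is replaced.
all=$(printf '%s\n' "$line" | sed 's/-/+/g')       # a+b+c+d

# p flag with -n: only lines where a substitution happened are printed.
changed=$(printf '%s\n' 'no dash here' "$line" | sed -n 's/-/+/p')   # a+b-c-d

printf '%s\n' "$first" "$third" "$all" "$changed"
```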

The next example uses the substitution action to modify the example password file so as to prevent all users between pc and mot inclusive from logging in. It does this by removing their encrypted passwords and substituting the string ' off ' instead. A list of all the modified logins is also written to the file banned:

	$ sed '/^p/,/^m/s/:.............:/: off :/w banned' pw.test
	root:awmku76tr43d6:0:0::/root/:/bin/sh
	pc: off :500:50::/usr1/pc:/bin/bash
	carey: off :501:50::/usr1/carey:/bin/bash
	mot: off :1500:60::/usr1/mot:/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh

Two quick notes should explain the operation of this example. The character ^ means the start of the line, so the range /^p/,/^m/ means 'from a line starting with a p to a line starting with an m'. The other thing to notice is that all standard encrypted passwords are 13 characters long, so the regular expression :.............: matches the first 13-character string enclosed between a pair of colons on the selected lines (i.e. the encrypted passwords).

A quick look in the file banned shows that it contains the expected lines, listing those users who have had their logins disabled:

	$ cat banned
	pc: off :500:50::/usr1/pc:/bin/bash
	carey: off :501:50::/usr1/carey:/bin/bash
	mot: off :1500:60::/usr1/mot:/bin/bash

The next example is interesting in that it introduces several new concepts. To give the example a context, suppose that new versions of both the shell programs run by users of this system (/bin/sh and /bin/bash) have become available. The new versions are stored in the directory /usr/local/bin and you want to modify the shell pathnames in the password file to get the users to try out the new shells:

	$ sed 's?/bin/.*sh$?/usr/local&?' pw.test
	root:awmku76tr43d6:0:0::/root/:/usr/local/bin/sh
	pc:bdhd74hs9jh3h:500:50::/usr1/pc:/usr/local/bin/bash
	carey:asJ9ohd8HH89i:501:50::/usr1/carey:/usr/local/bin/bash
	mot:dhjd83kjdJS6D:1500:60::/usr1/mot:/usr/local/bin/bash
	grex:cj8AjoWE8h8fs:1500:60::/usr1/mot:/usr/local/bin/sh

The first thing to notice is that the regular expression for which you need to search contains the character (/) normally used as the substitution delimiter. The sed program allows you to overcome this problem by choosing another delimiting character. This can be almost anything you like (? in this example). The next point is that a $ at the end of a regular expression matches the end of the line. Consequently, the expression /bin/.*sh$ will match either /bin/sh or /bin/bash at the end of a line. The final point is that an ampersand (&) appearing in the replacement string will itself be replaced by the text matched by the regular expression. This just puts the correct shell name after the string /usr/local.
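
To see why a different delimiter helps, here is a sketch that performs the same substitution twice, once with / as the delimiter and once with a comma, followed by a separate use of &. The input strings and variable names are made up for the illustration:

```shell
# A sample path, used only for this sketch.
path='/bin/bash'

# With / as the delimiter, every / in the expression must be escaped.
escaped=$(printf '%s\n' "$path" | sed 's/\/bin\/.*sh$/\/usr\/local&/')

# With another delimiter (here a comma) the escapes disappear.
plain=$(printf '%s\n' "$path" | sed 's,/bin/.*sh$,/usr/local&,')

# & stands for whatever the expression matched, so it can be reused freely.
wrapped=$(printf '%s\n' 'user carey' | sed 's/carey/[&]/')   # user [carey]

printf '%s\n' "$escaped" "$plain" "$wrapped"
```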

The final example shows a sed command containing multiple editing actions and shows that in these cases the editing actions are applied one at a time in a strictly left to right order:

	$ sed -e s/:/_:/ -e /carey_/s//cew_/ pw.test
	root_:awmku76tr43d6:0:0::/root/:/bin/sh
	pc_:bdhd74hs9jh3h:500:50::/usr1/pc:/bin/bash
	cew_:asJ9ohd8HH89i:501:50::/usr1/carey:/bin/bash
	mot_:dhjd83kjdJS6D:1500:60::/usr1/mot:/bin/bash
	grex_:cj8AjoWE8h8fs:1500:60::/usr1/mot:/bin/sh

The idea here is to append an underscore to the end of each login name and, additionally, to change the login name carey_ to cew_. These things are achieved with two editing actions, the first dealing with the underscore for all users, and the second dealing with the name change for carey. The only thing to notice here is that the substitution in the second editing action doesn't specify a regular expression. In cases like this, the last regular expression value encountered as part of a line address or previous substitute command will be used instead. This will be carey_ in this example, from the line address of the second editing action.
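
Both points can be checked on a smaller scale. The sketch below uses invented one-line inputs to show that swapping the order of two actions changes the result, and that an empty expression reuses the one from the line address:

```shell
# Order matters: the second action sees the result of the first.
forward=$(printf 'a\n' | sed -e 's/a/b/' -e 's/b/c/')    # c
backward=$(printf 'a\n' | sed -e 's/b/c/' -e 's/a/b/')   # b

# An empty expression in s//.../ reuses the last expression used,
# here the /carey_/ from the line address.
reused=$(printf '%s\n' 'carey_:x' 'mot_:y' | sed '/carey_/s//cew_/')

printf '%s\n' "$forward" "$backward" "$reused"
```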

Character Translation

The tr command performs a character translation function similar to the y editing action of sed, with a few extras on top. It is a very quick and simple program to use, with typical applications like translating upper case letters to lower case or deleting punctuation. The tr program is a true filter in that it can only read text from the standard input and send the translated version to the standard output. The basic form of the tr command is:

	tr str1 str2

where str1 and str2 are strings of characters. In general, the characters appearing in str1 will be translated into the corresponding characters in str2. An example of this is:

	tr ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz

Obviously, this will translate all upper case letters to their lower case equivalents. A rather simpler way to achieve the same thing is to specify a contiguous range of characters by just the first and last characters in the sequence, separated by a hyphen (-), as in:

	tr A-Z a-z
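
A few quick trials of the range notation, using invented input text. The last command is the well-known rot13 rotation, included only as a further illustration of how ranges in the two strings line up character by character:

```shell
# Upper case to lower case, and the reverse.
lower=$(printf 'Hello, World\n' | tr A-Z a-z)   # hello, world
upper=$(printf 'Hello, World\n' | tr a-z A-Z)   # HELLO, WORLD

# Several ranges can be combined in one string: rotate each letter 13 places.
rot13=$(printf 'Hello\n' | tr A-Za-z N-ZA-Mn-za-m)   # Uryyb

printf '%s\n' "$lower" "$upper" "$rot13"
```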

The implication of these examples is that the number of characters specified by the two string parameters needs to be the same. Obviously, there are two ways that this can be different. If str1 is shorter than str2 then the second string is truncated to the length of the first. If str2 is shorter than str1 then the second string is padded out to the length of the first using the rightmost character of str2. So that:

	tr ABC abcde

is the same as:

	tr ABC abc

And the other example is that:

	tr ABCDE abc

is the same as:

	tr ABCDE abccc

The first of these two examples is of little practical value, but the second can be put to good use, especially when combined with the action of the -c command line switch. The action of the -c switch is to replace the characters in str1 with a string containing a list of all the characters not in str1 (i.e. its complement). This allows you to do things like translate all non-alphabetic and non-numeric characters into space characters:

	tr -c A-Za-z0-9 ' '

This particular command line has the side effect that it can leave a lot of contiguous space characters in the output. What is required to solve this problem is some way to squeeze occurrences of multiple spaces down into single space characters. There is a tr command line switch which can perform this task: the -s switch. The function of the -s switch is to squeeze multiple occurrences of any characters that appear in str2 into single occurrences of those characters, as follows:

	tr -cs A-Za-z0-9 ' '

Applying this tr command to the file pw.test gives output like:

	$ tr -cs A-Za-z0-9 ' ' <pw.test
	root awmku76tr43d6 0 0 root bin sh pc bdhd74hs9jh3h 500
	50 usr1 pc bin bash carey asJ9ohd8HH89i 501 50 usr1 car
	ey bin bash mot dhjd83kjdJS6D 1500 60 usr1 mot bin bash
	grex cj8AjoWE8h8fs 1500 60 usr1 mot bin sh

If you just want to use tr to squeeze multiple occurrences of some characters into single occurrences, then this is done with a command of the form:

	tr -s str1

where only str1 is specified and the characters in it are used for the squeeze operation.
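
For instance, with some invented input, a minimal sketch of squeezing on its own:

```shell
# Runs of spaces collapse to a single space.
squeezed=$(printf 'a   b     c\n' | tr -s ' ')      # a b c

# Several characters can be listed; each run is squeezed separately,
# and characters not listed (c here) are left alone.
mixed=$(printf 'aaabbbccc\n' | tr -s 'ab')          # abccc

printf '%s\n' "$squeezed" "$mixed"
```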

Some special characters (such as a newline or escape for instance) can be difficult to enter into the translation strings, so a special notation exists for them, consisting of a backslash character followed by a code to specify which special character is required. The following table is a list of the main codes available:

\b	backspace character (Ctrl-h);
\n	newline character (Ctrl-j);
\r	carriage return character (Ctrl-m);
\t	tab character (Ctrl-i);
\\	backslash character;
\ooo	any character, given as three octal digits, e.g.:
\033	ASCII code for escape character (Ctrl-[).

These are the same special sequences that are used in programming in C when you want to use these characters in a literal string.
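
As a sketch of these codes in use, the first command below replaces the space in the earlier -cs example with \n, so that each word lands on a line of its own, and the second names the colon by its octal code (072 in ASCII). The input lines and variable names are chosen for illustration:

```shell
# One line of pw.test, used as sample input.
line='root:awmku76tr43d6:0:0::/root/:/bin/sh'

# Every run of non-alphanumeric characters becomes a single newline.
words=$(printf '%s\n' "$line" | tr -cs A-Za-z0-9 '\n')

# \072 is the octal code for a colon in ASCII.
spaced=$(printf 'a:b\n' | tr '\072' ' ')    # a b

printf '%s\n' "$words"
```

The quotes around the strings keep the shell from interpreting the backslash before tr sees it.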

One final command line switch available with tr is -d:

	tr -d str1

The purpose of this switch is to cause tr to delete characters. For this option only str1 is specified, which is used to provide the list of characters to be deleted.
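
A short sketch of deletion, again with invented input. The last command combines -d with the -c switch described earlier, so that everything except the listed characters is deleted:

```shell
# Delete commas and exclamation marks.
no_punct=$(printf 'Hello, World!\n' | tr -d ',!')      # Hello World

# Delete carriage returns, as found at the end of DOS-format lines.
unix=$(printf 'line one\r\n' | tr -d '\r')             # line one

# -c and -d together: delete everything except digits and the newline.
digits=$(printf 'tel: 555-0199\n' | tr -cd '0-9\n')    # 5550199

printf '%s\n' "$no_punct" "$unix" "$digits"
```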
