Sequential Files

An ordinary Linux file is just a set of data bytes stored sequentially one after another in the file. Linux imposes no kind of internal structure on the data so that, as far as Linux is concerned, the data is not split up into records or fields. If for a particular application some kind of internal data structure is required, then it is up to you as the programmer to impose that structure on the data for yourself. This makes the Linux interface to files particularly simple.

File Descriptors

In Linux, files are accessed via a file descriptor. Each process can use up to OPEN_MAX file descriptors at the same time. The value OPEN_MAX is accessed through the standard header file <limits.h> and for Linux it has a value of 256.

The definition of OPEN_MAX is actually in <linux/limits.h> but to make your programs portable you should not include this header file in your code directly, because it gets included automatically when you #include <limits.h> anyway.

By convention, file descriptors 0, 1 and 2 are already allocated and available for use when a process starts up. File descriptor 0 is used as the standard input device, file descriptor 1 is used as the standard output device and file descriptor 2 is used as the standard error output device.
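As a minimal sketch of these conventions, the fragment below sends ordinary output to descriptor 1 and a diagnostic to descriptor 2. The helper names say() and report() are invented for illustration. The write() call is covered later in this section, and STDOUT_FILENO and STDERR_FILENO are the symbolic names that <unistd.h> provides for 1 and 2.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Write a string to the given descriptor; returns what write() returns. */
int say(int fd, const char *msg)
{
	assert(msg != NULL);
	return (int)write(fd, msg, strlen(msg));
}

/* Normal output goes to descriptor 1, diagnostics go to descriptor 2. */
int report(void)
{
	int out = say(STDOUT_FILENO, "result: 42\n");
	int err = say(STDERR_FILENO, "warning: something looked odd\n");

	return (out == -1 || err == -1) ? -1 : 0;
}
```

Keeping the two streams apart means that a shell can redirect the results to a file while the warnings still reach the terminal.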

Each allocated file descriptor is associated with an open file description. The open file descriptions are just structures of information related to the files. This information includes an offset value which specifies where in the file the next access to it will take place, the file's access mode, which specifies whether the associated file can be used for input or output or both, and some other related flags.

The relationship between file descriptors and open file descriptions need not be one-to-one; it can also be many-to-one. This means that it is possible for several file descriptors, even ones belonging to separate processes, to point at the same open file description. The implication is that information stored in the open file description, such as the file offset, is then shared between all the file descriptors pointing to it.
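One way to watch this sharing happen is sketched below. It assumes two calls that have not been introduced yet: dup(), which creates a second descriptor for an existing open file description, and lseek(), which reports or moves the file offset. The function name shared_offset() and the scratch file it creates are purely illustrative.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns the file offset seen through a duplicate descriptor after
 * 3 bytes have been read through the original, or -1 on error.
 */
long shared_offset(const char *path)
{
	char buf[3];
	int fd, copy;
	long where = -1;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
	if (fd == -1)
		return -1;

	if (write(fd, "abcdef", 6) == 6 && lseek(fd, 0, SEEK_SET) == 0) {
		copy = dup(fd);	/* second descriptor, same description */
		if (copy != -1) {
			/* reading through fd moves the shared offset */
			if (read(fd, buf, sizeof(buf)) == 3)
				where = (long)lseek(copy, 0, SEEK_CUR);
			close(copy);
		}
	}

	close(fd);
	unlink(path);	/* remove the scratch file */
	return where;
}
```

The offset reported through the duplicate is 3, even though nothing was ever read through that descriptor, because both descriptors share one offset.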

In fact, as we shall see later, there is also some information stored in the file descriptor structure as well. This means that it is possible for two or more file descriptors to have different values for this information even if they all point to the same open file description.

open System Call

There are several ways of obtaining a file descriptor which will allow you access to a file; the most common is to use the open() system call:

	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>

	int open(const char *path, int flags);
	int open(const char *path, int flags, mode_t mode);

Notice that there are two different forms of the open() system call; the first takes two parameters and the second takes three. You only need the three-parameter version when the file you are trying to open might not exist and you want it to be created in that case. Otherwise the two-parameter version is enough.

The path parameter is just a pointer to the pathname of the file you wish to open. The flags parameter specifies how you want to open the file. This must include one of the three values:

O_RDONLY	open file for read-only access;
O_WRONLY	open file for write-only access;
O_RDWR	open file for read and write access.

In addition any combination of the following flag values may also be included using the bitwise-OR (|) operator:

O_CREAT	open the file and create it if it doesn't exist;
O_EXCL	force open() to fail if O_CREAT is set and file exists;
O_TRUNC	truncate the file to zero length on open();
O_APPEND	force write()'s to occur on the end of the file.

There are some other flags which can be specified but they are of no use on ordinary files, so we'll cover them later when they are more relevant. The symbolic names for all these flag values can be accessed via a #include <fcntl.h>.

If you use open() to create a file, then the mode parameter is used to specify the access permission bits for the file's owner, the file's group and all other system users. The required combination of the following symbolic constants (defined through <sys/stat.h>) should be created with the bitwise-OR operator:

S_IRUSR	read permission bit for file's owner;
S_IWUSR	write permission bit for file's owner;
S_IXUSR	execute permission bit for file's owner;
S_IRGRP	read permission bit for file's group;
S_IWGRP	write permission bit for file's group;
S_IXGRP	execute permission bit for file's group;
S_IROTH	read permission bit for other file users;
S_IWOTH	write permission bit for other file users;
S_IXOTH	execute permission bit for other file users.

Three useful bitwise-OR combinations are already defined as follows:

	S_IRWXU  defined as (S_IRUSR | S_IWUSR | S_IXUSR)
	S_IRWXG  defined as (S_IRGRP | S_IWGRP | S_IXGRP)
	S_IRWXO  defined as (S_IROTH | S_IWOTH | S_IXOTH)

In addition, it is also possible to specify that the set-uid bit and/or the set-gid bit should be set as well using the constant values:

	S_ISUID set the set-uid bit; 
	S_ISGID set the set-gid bit.

Don't forget, when a process creates a file on your behalf, the permission bits are modified by the value of the umask associated with the process. The process effectively uses the formula:

	mode & (~umask)

to determine the permissions with which the file will actually be created. This means that you should not be surprised if the permission bits you get on a file created by one of your programs do not match the mode flags you requested in the open() call. We'll look at how to do something about the umask value later if this security feature becomes a problem.
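The formula can be checked with a short experiment. The sketch below assumes the umask() and fstat() calls, neither of which has been covered yet, and the names effective_mode() and created_mode() are made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* The formula from the text: requested bits with the umask bits removed. */
mode_t effective_mode(mode_t mode, mode_t mask)
{
	return mode & ~mask;
}

/*
 * Create path with the requested mode under the given umask and return
 * the permission bits the file actually received, or (mode_t)-1 on error.
 */
mode_t created_mode(const char *path, mode_t mode, mode_t mask)
{
	struct stat st;
	mode_t old, result = (mode_t)-1;
	int fd;

	assert(path != NULL);

	old = umask(mask);	/* umask() returns the previous value */
	unlink(path);		/* make sure O_EXCL can succeed */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
	if (fd != -1) {
		if (fstat(fd, &st) == 0)
			result = st.st_mode & (S_IRWXU | S_IRWXG | S_IRWXO);
		close(fd);
		unlink(path);
	}
	umask(old);		/* put the original umask back */
	return result;
}
```

With a requested mode of 0666 (rw-rw-rw-) and a umask of 022, both functions give 0644 (rw-r--r--): the write bits for group and others are masked out.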

Examples of calls to open() using a range of parameter values might be:

	open("xfile", O_RDONLY);
	open("yfile", O_RDWR | O_TRUNC);
	open("zfile", O_WRONLY | O_CREAT | O_EXCL, S_IRWXU | S_IXGRP);

The first example just opens xfile for read-only access. An error will be returned by open() if xfile does not already exist. The second example opens yfile for reading and writing. The file must already exist and it will be truncated to zero length, thus losing any previous contents. The third example opens zfile for write-only access. The open() call expects to create the file and will return an error if the file already exists. Because zfile is being created, a set of permission bits needs to be specified, which in this case is rwx--x---.

Sometimes it can be a bit difficult to read an open() call, especially if the bitwise-OR operator is being used on lots of flags. In this case it is a good idea to #define a suitable symbolic constant at the top of your code. An example which defines the constant MODE755 to give the file permission bits rwxr-xr-x could be:

	#define MODE755 (S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH)

When open() detects an error it is indicated by a return value of -1. As there are lots of different conditions which can give rise to an error, an indication of which error occurred is given by the kernel assigning an appropriate value to the external integer variable errno. The main errors from calls to open() are:

ENOENT: You tried to open a file which did not exist without specifying O_CREAT.
EEXIST: You are using O_CREAT | O_EXCL to force the creation of a file which already exists.
EACCES: You do not have the requested access permissions to the file, or you do not have search permission on a directory in the file's pathname.

These symbols are accessed via <errno.h>.
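A hedged sketch of checking for these errors follows. The function name open_error() is hypothetical. The standard strerror() function, which turns an errno value into readable text, is not discussed in this section.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to open path read-only.  Returns 0 on success (the descriptor is
 * closed again straight away) or the errno value that open() reported.
 */
int open_error(const char *path)
{
	int fd, err;

	assert(path != NULL);

	if ((fd = open(path, O_RDONLY)) == -1) {
		err = errno;	/* save it: later calls may overwrite errno */
		fprintf(stderr, "open %s: %s\n", path, strerror(err));
		return err;
	}

	close(fd);
	return 0;
}
```

The value of errno is only meaningful immediately after a call has failed, which is why the sketch copies it into a local variable before calling anything else.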

If no errors occur then the return value from open() is a file descriptor which you will use in all subsequent operations on the file. The file descriptor itself is just a small non-negative integer which is actually the index into an array of pointers to open file descriptions. When a file descriptor is allocated, a search is made through this array to find the first free element. The consequence of this is that Linux will always allocate the lowest numbered free file descriptor when a file is opened - a fact you will see put to good use later when we look at the implementation of I/O redirection to files and pipes.
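The lowest-numbered rule is easy to observe. In the sketch below (the function name is invented, and /dev/null is used only because it can always be opened) a descriptor is freed and the very next open() hands the same number back.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open two descriptors, close the first, then open again.  Returns 1 if
 * the new descriptor reuses the number just freed, 0 if not, -1 on error.
 */
int lowest_fd_is_reused(void)
{
	int first, second, third, reused;

	first = open("/dev/null", O_RDONLY);
	second = open("/dev/null", O_RDONLY);
	if (first == -1 || second == -1) {
		if (first != -1)
			close(first);
		if (second != -1)
			close(second);
		return -1;
	}

	assert(first < second);	/* each open() took the lowest free slot */

	close(first);		/* frees the lower of the two slots */
	third = open("/dev/null", O_RDONLY);
	reused = (third == first);

	if (third != -1)
		close(third);
	close(second);
	return reused;
}
```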

Sometimes you may wish to check whether or not your process has access to a particular file before you try to open() it. This can be done with the access() system call. The general form of the call is:

	#include <unistd.h>

	int access(const char *pathname, int mode);

where pathname is the name of the file whose access you wish to test and mode is either F_OK or a bitwise-OR of one or more of the other values below, all of which are defined in <unistd.h>:

R_OK	check if calling process has read access;
W_OK	check if calling process has write access;
X_OK	check if calling process has execute access;
F_OK	check if specified file exists.
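A small sketch of access() in use is given below. The function name describe_access() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Print which kinds of access the calling process has to path.
 * Returns 0 if the file exists, -1 if it does not.
 */
int describe_access(const char *path)
{
	assert(path != NULL);

	if (access(path, F_OK) == -1) {
		printf("%s: does not exist\n", path);
		return -1;
	}

	/* access() returns 0 when the requested access is allowed */
	printf("%s: read %s, write %s, execute %s\n", path,
	       access(path, R_OK) == 0 ? "yes" : "no",
	       access(path, W_OK) == 0 ? "yes" : "no",
	       access(path, X_OK) == 0 ? "yes" : "no");
	return 0;
}
```

Bear in mind that the answer can be out of date by the time you call open(), since another process may change the file in between, so the open() call still needs its own error check.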

creat System Call

To maintain backward compatibility with earlier UNIX systems, Linux also provides an alternative system call for creating files, named creat().

	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>

	int creat(const char *path, mode_t mode);

In early versions of UNIX the open() system call only existed in the two-parameter flavor, and did not have the capability to open files if they did not already exist. File creation was then performed by the separate system call creat(). In Linux and all modern versions of UNIX the creat() system call is redundant. Indeed, the creat() call:

	fd = creat(file, mode);

is exactly equivalent to the modern open() call:

	fd = open(file, O_WRONLY | O_CREAT | O_TRUNC, mode);

read System Call

Once you have a file descriptor associated with an open file description then, as long as the file was opened with O_RDONLY or O_RDWR access permission, you can read bytes from the file with the read() system call:

	#include <sys/types.h>
	#include <unistd.h>

	int read(int fd, void *buf, size_t nbytes);

where fd is the file descriptor to the file you want to read, buf is a pointer to a block of memory where read() will place the bytes it takes from the file, and nbytes is a count of the number of bytes to copy from the file into buf.

The function prototype for the read() system call is given in <unistd.h> and the data type size_t is accessed through <sys/types.h>. It is currently defined to be unsigned int, which under Linux on a PC is a 32-bit value.

The read() operation begins at the current file position, which is given by the file offset value contained in the associated open file description, and continues from there until the required number of bytes has been read from the file or until the end-of-file is reached. It is your responsibility to ensure that the data buffer you specify to read() is big enough to hold the number of data bytes you request, because the kernel cannot check and will just blindly copy its data regardless.

By the end of the read() operation the file offset value will have been incremented beyond the data just copied, to a new position ready for the next read().

The return value from read() is -1 if an error occurred, 0 if the file offset value is at the end-of-file, or a count of the number of bytes copied from the file into the specified buffer. The count will normally be the same as the number of bytes you asked for, except when you ask for more bytes than there are left to read. In this case a number less than the number of bytes you asked for will be returned and then, if the file size doesn't change in between, the end-of-file (0) value will be returned on the next read().
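The three kinds of return value can be seen in a simple read loop. This is only a sketch: the function name count_bytes() and the CHUNK size of 512 are arbitrary choices made for the illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK 512

/* Count the bytes in a file by reading to end-of-file; -1 on error. */
long count_bytes(const char *path)
{
	char buf[CHUNK];
	long total = 0;
	int fd, n;

	assert(path != NULL);

	if ((fd = open(path, O_RDONLY)) == -1)
		return -1;

	/*
	 * n is CHUNK for each full read, smaller for a final partial
	 * read, and 0 once the offset has reached end-of-file.
	 */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;

	close(fd);
	return (n == -1) ? -1 : total;
}
```

For a 600-byte file the successive read() calls in this loop return 512, 88 and finally 0.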

Physical disk reads are always done in full disk data blocks, which are then stored in memory buffers. If a file is being read sequentially (i.e. consecutive disk blocks), then in order to improve system performance there are mechanisms in the kernel which can anticipate further data requirements and have the data ready using a read ahead technique.

write System Call

Data is written to a file with the write() system call. Superficially, it is easy to think of the write() system call as just the opposite of read(). It has the format:

        #include <sys/types.h>
        #include <unistd.h>

        int write(int fd, const void *buf, size_t nbytes);

A byte count of nbytes is written from the block of memory pointed to by buf to the file associated with the file descriptor fd.

The data is written to the file starting at the current file offset position and the offset will automatically be incremented by the number of bytes actually written, ready for the next write() call. If the file was opened with the O_APPEND flag set, then the file offset value will automatically be incremented to the end-of-file position before the write() takes place. Using the O_APPEND flag helps to prevent you from overwriting data already in the file and is useful when creating log files and audit trails.
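As an illustration of the log file case, here is a minimal sketch of an append routine. The name append_line() and the choice of owner-only permissions are assumptions made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Append one line to a log file, creating it if needed; 0 ok, -1 error. */
int append_line(const char *path, const char *text)
{
	size_t len;
	int fd, ok;

	assert(path != NULL && text != NULL);
	len = strlen(text);

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR);
	if (fd == -1)
		return -1;

	/*
	 * With O_APPEND the offset is moved to end-of-file before each
	 * write(), so earlier lines are never overwritten.
	 */
	ok = write(fd, text, len) == (ssize_t)len && write(fd, "\n", 1) == 1;

	close(fd);
	return ok ? 0 : -1;
}
```

Calling append_line() twice on the same file leaves both lines in the file, in the order they were written.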

Just as with read(), memory buffers are used by the kernel to cache the data between the process and the disk. However, there is a subtle difference here which is not immediately apparent. In the case of a read() call, if the data is not available in memory then the process must wait while a physical disk read takes place before the read() call can return. If any disk errors occur while reading the data then the return value from read() and the value of errno can reflect this fact.

In the case of write() the data is just cached in memory and then the write() call returns immediately. In the process, the kernel issues a promise that the data will eventually get onto the disk. Any subsequent reads of the data you just wrote will be taken from the buffer cache rather than from the disk. This means that user programs can't really tell that this trickery is going on, except that data reads and writes to memory are orders of magnitude faster than physical disk reads and writes.

So, where's the problem? The problem is that, despite the promises, the physical disk writes may never take place. If a disk error occurs when the data is written from the buffer cache to disk, then that data will be lost and, worse than that, there is no mechanism for letting the process that wrote the data know about the problem. Errors during physical reads are reported to the reading process by the return value of read(). This cannot be done with data writes as the write() call returns long before the data is actually transferred to the disk. Indeed, the writing process may even have terminated before the disk write error occurs. Disk hardware failures are not even the only source of problems. Anything which prevents a disk write taking place can cause trouble, including machine power failure, pressing the reset button, removing a mounted floppy disk without using umount, or accidentally switching the machine off.

Generally, modern hardware is so reliable that the performance improvement more than compensates for the slight risk of losing data. However, in situations where even this slight risk is unacceptable, alternative arrangements need to be made.

If you are using the Linux second extended filesystem within a modern Linux kernel then there is an extra flag that can be specified to the open() system call - O_SYNC:

        fd = open(file, O_WRONLY | O_SYNC);

The effect of this flag is to synchronize write() system calls on the appropriate file descriptor (fd here) with the corresponding physical disk writes. In effect, the write() call doesn't return until the physical disk write has taken place.

A word of warning - if you use this flag then your code may not be portable to other operating systems. In fact, it won't even be portable to files on other filesystem types within Linux. You need to know this because if you accidentally use O_SYNC with files on other filesystem types, you will not get any error indication; the system will just silently fail to perform the expected synchronization.

Outside of the data caching issue, the write() call is similar to read(), so that it is again your responsibility to ensure you do not try to do things like writing more bytes of data than are contained within your data buffer.

The return value from write() is -1 on error with errno set appropriately, or a count of the number of bytes actually written to the file. In the case of an ordinary file (though not some other file types), if write() fails to return a value which is the same as the number of bytes that were supposed to have been written, then this indicates that something odd happened (you exceeded your maximum file size limit, for example) and you should probably treat it as an error anyway.
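The advice about short counts might be wrapped up in a helper along the following lines. This is a sketch only, and the name write_exactly() is invented for the purpose.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write exactly n bytes or report failure.  Following the advice above,
 * a short count is treated as an error.  Returns 0 on success, -1 if not.
 */
int write_exactly(int fd, const void *buf, size_t n)
{
	ssize_t written;

	assert(buf != NULL);

	written = write(fd, buf, n);

	if (written == -1) {
		perror("write");	/* errno says what went wrong */
		return -1;
	}

	if ((size_t)written != n) {
		fprintf(stderr, "short write: %ld of %lu bytes\n",
			(long)written, (unsigned long)n);
		return -1;
	}

	return 0;
}
```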

close System Call

Open file descriptors are released for reuse with the close() system call:

        #include <unistd.h>

        int close(int fd);

Each time a file descriptor is allocated, a reference count in the associated open file description is incremented, so that each open file description knows how many file descriptors are associated with it. Each time close() is called on a file descriptor the associated open file description's reference count is decremented. Eventually, a call to close() will take the reference count to zero. In this case the close() call not only releases the file descriptor but also releases the open file description as well.
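The effect of the reference count can be sketched as follows, again assuming the dup() call (not yet introduced) as a way of getting two descriptors onto one open file description. The function name survives_close() is hypothetical, and /dev/zero is used simply as a file that can always be read.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if a duplicate descriptor is still readable after the
 * original has been closed, 0 if it is not, -1 on a setup error.
 */
int survives_close(void)
{
	char c;
	int fd, copy, ok;

	if ((fd = open("/dev/zero", O_RDONLY)) == -1)
		return -1;

	copy = dup(fd);		/* reference count is now 2 */
	if (copy == -1) {
		close(fd);
		return -1;
	}
	assert(copy != fd);	/* two descriptors, one description */

	close(fd);		/* count drops to 1: description survives */
	ok = (read(copy, &c, 1) == 1);

	close(copy);		/* count reaches 0: description released */
	return ok;
}
```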

The close() call returns zero on success and -1 on error, with errno holding the error number. For an ordinary file, the error you are most likely to meet is:

        EBADF    Parameter is not a valid open file descriptor

In fact, no damage is done by calling close() with a file descriptor that is not associated with an open file description and, consequently, the return value from close() is seldom checked for an error.
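To see the EBADF error for yourself, you can close the same descriptor twice, as in the sketch below. The function name is made up, and dup() (not yet covered) is used only to obtain a descriptor that is safe to throw away.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Close a valid descriptor twice.  Returns the errno value from the
 * second close(), 0 if it unexpectedly succeeded, -1 on a setup error.
 */
int second_close_error(void)
{
	int fd, rc;

	if ((fd = dup(STDOUT_FILENO)) == -1)
		return -1;

	rc = close(fd);		/* first close: fd is valid */
	assert(rc == 0);

	/* fd no longer refers to an open file description */
	if (close(fd) == -1)
		return errno;

	return 0;
}
```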

It is good programming practice explicitly to close() any files that you open() although any files that are still open when a process terminates will be closed automatically.

User Buffering

Just because the system call interface is the lowest level of access that ordinary application programs have to the operating system's services, you should not get the idea that using system calls directly is necessarily the most efficient way to do something - take data buffering as an example.

The following is a very simple function which takes pointers to two file names as parameters and copies the contents of the file named in the first parameter into the file named in the second parameter:

        #include <fcntl.h>
        #include <sys/stat.h>
        #include <unistd.h>

        #define NEWFILE (O_WRONLY | O_CREAT | O_TRUNC)
        #define MODE600 (S_IRUSR | S_IWUSR)
        #define SIZE 1

        void fatal(char *msg);          /* prints msg and exits */

        void filecopy(char *infile, char *outfile)
        {
                char buf[SIZE];
                int infd, outfd, count;

                if ((infd = open(infile, O_RDONLY))==-1)
                        fatal("opening infile");

                if ((outfd = open(outfile, NEWFILE, MODE600))==-1)
                        fatal("opening outfile");

                /* copy SIZE bytes at a time until read() reports end-of-file */
                while ((count = read(infd, buf, sizeof(buf)))>0)
                        if (write(outfd, buf, count)!=count)
                                fatal("writing data");

                if (count==-1)
                        fatal("reading data");

                close(infd);
                close(outfd);
        }

The user function fatal() is not listed here but all it needs to do is to output an error message containing its parameter string and then terminate execution of the process.
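For completeness, one possible fatal() is sketched here. It is an assumption about what such a function might look like, not the version used for the timings in this section, and the helper fatal_text() is invented so that the message formatting is kept separate from the exit.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "msg: reason" in out, where reason describes the errno value err. */
char *fatal_text(char *out, size_t outsize, const char *msg, int err)
{
	assert(out != NULL && msg != NULL);
	snprintf(out, outsize, "%s: %s", msg, strerror(err));
	return out;
}

/* Print the message on the standard error output and terminate. */
void fatal(char *msg)
{
	char text[256];
	int saved = errno;	/* errno from the call that just failed */

	fprintf(stderr, "%s\n", fatal_text(text, sizeof(text), msg, saved));
	exit(EXIT_FAILURE);
}
```

A call such as fatal("opening infile") made straight after a failed open() would print the string followed by the system's description of the error, then end the process with a failure status.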

You already know that the system uses large amounts of memory as a buffer cache for disk file reads and writes and that getting bytes to and from memory is a very fast operation. However, in the case of the filecopy() function, while it works, in that it performs the required copy, it isn't very efficient. There is quite a time overhead for running a system call and this function makes two system calls (one read() and one write()) for every byte copied.

Using the filecopy() function exactly as listed, I ran a few simple timing tests (using the time command) and discovered that on my system I could copy files at about 6.4 kilobytes per second.

Now it is quite a simple matter to reduce the number of system calls involved in the copy operation: you just increase the value of the SIZE constant. The effect of this is to copy the file in a smaller number of larger-sized chunks. A natural SIZE to choose would be something related to the size of a disk data block.

Repeating the experiments after changing the value of SIZE to 1024 (but making no other changes) increased the file copy rate to a staggering (well I was impressed anyway) 1.3 megabytes per second - a speed improvement of over 200 times!
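The size of the saving is easy to estimate by counting system calls. The sketch below assumes an ordinary file for which every read() except the last returns a full buffer. The function name copy_syscalls() is invented, and the open() and close() calls are left out of the count.

```c
#include <assert.h>

/*
 * Number of read() plus write() calls that filecopy() makes for a file
 * of filesize bytes when its buffer holds size bytes.
 */
long copy_syscalls(long filesize, long size)
{
	long chunks;

	assert(filesize >= 0 && size > 0);

	chunks = (filesize + size - 1) / size;	/* round up */

	/*
	 * One read() and one write() per chunk, plus the final read()
	 * that returns 0 to signal end-of-file.
	 */
	return 2 * chunks + 1;
}
```

For a one megabyte file (1048576 bytes) this gives 2097153 calls with SIZE set to 1 but only 2049 calls with SIZE set to 1024.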

Even taking the speed increase into account, it is sometimes just not convenient to read the data in such large blocks. In this case you can achieve the best of both worlds by doing the data buffering yourself, in user memory space, and taking your small chunks of data from there.

The following code example shows a small amendment to the filecopy() function so that it calls the functions my_read() and my_write() which perform exactly the same as the read() and write() system calls except that they use internal 1024 byte buffers to cut down on the number of times the system calls get used. Using write_close() to close the output file rather than the close() system call is necessary in order to flush any partial write_buf[] contents before closing the file:

        #include <fcntl.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>

        #define NEWFILE (O_WRONLY | O_CREAT | O_TRUNC)
        #define MODE600 (S_IRUSR | S_IWUSR)
        #define SIZE 1
        #define BUF_SIZE 1024

        void fatal(char *msg);          /* prints msg and exits */

        /* user buffering functions, defined below */
        int my_read(int fd, char *buf, size_t count);
        int my_write(int fd, char *buf, size_t count);
        int write_close(int fd);

        void filecopy(char *infile, char *outfile)
        {
                char buf[SIZE];
                int infd, outfd, count;

                if ((infd = open(infile, O_RDONLY))==-1)
                        fatal("opening infile");

                if ((outfd = open(outfile, NEWFILE, MODE600))==-1)
                        fatal("opening outfile");

                while ((count = my_read(infd, buf, sizeof(buf)))>0)
                        if (my_write(outfd, buf, count)!=count)
                                fatal("writing data");

                if (count==-1)
                        fatal("reading data");

                close(infd);
                write_close(outfd);
        }


        int my_read(int fd, char *buf, size_t count)
        {
                static char read_buf[BUF_SIZE];
                static int read_offset = BUF_SIZE;
                static int read_max = BUF_SIZE;
                int i;

                for (i = 0; i<count; ++i)
                {
                        /* buffer exhausted: refill it with one read() call */
                        if (read_offset==read_max)
                        {
                                read_offset = 0;
                                read_max = read(fd, read_buf, sizeof(read_buf));

                                if (read_max==-1)
                                {
                                        read_max = 0;   /* leave the buffer empty */
                                        return -1;
                                }

                                /* end-of-file: report the bytes copied so far */
                                if (read_max==0)
                                        return i;
                        }

                        *buf++ = read_buf[read_offset++];
                }

                return i;
        }


        static char write_buf[BUF_SIZE];
        static int write_offset = 0;

        int my_write(int fd, char *buf, size_t count)
        {
                int i, n;

                for (i = 0; i<count; ++i)
                {
                        write_buf[write_offset++] = *buf++;

                        if (write_offset==BUF_SIZE)
                        {
                                write_offset = 0;
                                n = write(fd, write_buf, sizeof(write_buf));

                                if (n!=BUF_SIZE) return -1;
                        }
                }

                return i;
        }


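        int write_close(int fd)
        {
                int n = write_offset, failed = 0;

                /* flush whatever is still waiting in write_buf[] */
                if (n>0 && write(fd, write_buf, n)!=n)
                        failed = 1;

                write_offset = 0;

                if (close(fd)==-1 || failed)
                        return -1;

                return 0;
        }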

This code is only intended to give you the flavor of user buffering because normally you wouldn't write your own - there is a perfectly good user buffering file I/O package in the C standard library (fopen(), fclose() etc.).
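For comparison, here is a hedged sketch of the same one-character-at-a-time copy written with the standard library package. The function name stdio_copy() is an assumption for this example. The getc() and putc() calls draw on the library's own internal buffers in much the same way that my_read() and my_write() draw on theirs.

```c
#include <assert.h>
#include <stdio.h>

/* Copy infile to outfile one character at a time; 0 ok, -1 on error. */
int stdio_copy(const char *infile, const char *outfile)
{
	FILE *in, *out;
	int c, failed;

	assert(infile != NULL && outfile != NULL);

	if ((in = fopen(infile, "rb")) == NULL)
		return -1;

	if ((out = fopen(outfile, "wb")) == NULL) {
		fclose(in);
		return -1;
	}

	/* each getc()/putc() normally touches only the user-space buffer */
	while ((c = getc(in)) != EOF)
		if (putc(c, out) == EOF)
			break;

	failed = ferror(in) || ferror(out);

	fclose(in);
	if (fclose(out) == EOF)	/* fclose() flushes the last partial buffer */
		failed = 1;

	return failed ? -1 : 0;
}
```

Here fclose() plays the part that write_close() played above: it writes out any partly filled buffer before the file is closed, and its return value is the last chance to hear about a failed write.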

Notice that in performing user buffering for writing to files, you encounter the same kind of problems over what to do about reporting errors that we encountered looking at the write() system call itself.

Testing the data transfer rate of the filecopy() function using one-byte reads and writes together with the user buffering functions gives 330 kilobytes per second. As you would expect, this is not as fast as direct large-buffer calls to read() and write() because of the overhead of calling the extra functions, but it is still more than 50 times faster than the direct one-byte read and write case.
