CP502: Building.

9.2. Building.

In building the table we put each item into a specific position. How do we do that? How do we determine which position for which item? How do we make certain that two items are not put into the same 'specific position'? If our symbol table only has room for 100 symbols and there are 200 possible items to be assigned a specific position, how can we possibly give a unique position to each? Therefore it has to be a specific position but not necessarily a unique position.


When an index is computed from a key, an attempt is made to

"smear" all the keys out evenly across the range of values.

 This is often done by cutting up parts of the key and

 recombining them to give an index. This cutting and recombining

 gives the technique its name - hashing.


The function we apply to the key is called a hashing function.

One immediate problem is that there may be millions of possible key values and perhaps only a hundred entries into the table.

This means that two different keys may take us to the same place = collision. We must determine a method for resolving collisions as well as the appropriate hashing function.

The first problem will be left to later (because the 2nd is so easy): To get an index in the appropriate range from a key;

(1)


       extract an integer value from the key somehow.

(2)


       take its remainder MOD the table size:

             INDEX is from 0 -> tablesize - 1 always.

       Therefore table must be an array from 0 -> tablesize-1.

To get a good spread of hash values, the tablesize should be a prime number, otherwise hash index will only depend upon part of key.

If key is numeric, extracting an integer is trivia! Obviously if the key is already in the range 1 to 10 then we could use that directly as an index!

If the key is alphanumeric, then the Unicode values of the characters are used.

e.g. Java allows us to get the Unicode integer value via:




(int)'A' = 65 (int)'a' = 97 (int)' ' = 32

Perhaps use the following as a Hash function:


 hash_value  ((int)key[1]*mult + (int)key[2]) % tablesize;

This is just using the Unicode values of the first two letters of the key to get an index into the table.

NOTE: If mult is a power of 2, then multiplication is just simple shifting of binary bits to the left (fast). If the characters were 8 bits, then if MULT = 256, the above corresponds to placing the first character next to the 2nd in a word and interpreting that as an integer.


i.e.     'A' and 'B'   01000001   ,   01000010

                    ->  0100000101000010

                    =   16706   as an integer.

       16706 MOD 13 = 1

Therefore our index for a key "AB" with table size 13 (prime number remember) would be 1. With a different tablesize or a different key the index would be different. This gives us our specific position.

If we hash another symbol: "BC" -> index 11.

Our above hashing function would also produce an index of 11 for "BP". This would result in a collision. This means that if we are building our table i.e. putting keys into it, then "BC" would go into index 11 and we would subsequently try to put "BP" in there as well! We should choose a hashing function which minimises collisions.