In building the table we put each item into a specific
position. How do we do that? How do we determine which position
for which item? How do we make certain that two items are not put
into the same 'specific position'? If our symbol table only has
room for 100 symbols and there are 200 possible items to be
assigned a specific position, how can we possibly give a unique
position to each? Therefore it has to be a specific position but
not necessarily a unique position.
When an index is computed from a key, an attempt is made to
"smear" all the keys out evenly across the range of values.
This is often done by cutting up parts of the key and
recombining them to give an index. This cutting and recombining
gives the technique its name - hashing.
The function we apply to the key is called a hashing function.
This means that two different keys may take us to the same place = collision. We must determine a method for resolving collisions as well as the appropriate hashing function.
The first problem will be left to later (because the 2nd is so easy): To get an index in the appropriate range from a key;
(1)
extract an integer value from the key somehow.
take its remainder MOD the table size:
INDEX is from 0 -> tablesize - 1 always.
Therefore table must be an array from 0 -> tablesize-1.
If key is numeric, extracting an integer is trivia! Obviously if the key is already in the range 1 to 10 then we could use that directly as an index!
If the key is alphanumeric, then the Unicode values of the characters are used.
e.g. Java allows us to get the Unicode integer value via:
(int)'A' = 65 (int)'a' = 97 (int)' ' = 32
Perhaps use the following as a Hash function:
hash_value((int)key[1]*mult + (int)key[2]) % tablesize;
NOTE: If mult is a power of 2, then multiplication is just
simple shifting of binary bits to the left (fast). If the
characters were 8 bits, then if MULT = 256, the above corresponds
to placing the first character next to the 2nd in a word and
interpreting that as an integer.
i.e. 'A' and 'B' 01000001 , 01000010
-> 0100000101000010
= 16706 as an integer.
16706 MOD 13 = 1
If we hash another symbol:
"BC" -> index 11.
Our above hashing function would also produce an index of 11 for "BP". This would result in a collision. This means that if we are building our table i.e. putting keys into it, then "BC" would go into index 11 and we would subsequently try to put "BP" in there as well! We should choose a hashing function which minimises collisions.