CP502: Searching and Collision Resolution.

9.3. Searching and Collision Resolution.

If we are next searching for "BP" we would hash to INDEX 11 and find "BC" there instead. Are we to assume that "BP" was never put into the table because it is not there? We cannot process our information directly if a collision results : we must resolve it.


 When building the table with a key of "BC" we would insert

 it at index 11. Later we would hash "BP" and try to insert it

 into 11 as well.


  The easiest method of collision resolution is to use the next

 free location after our hashed location.  If hash to I and it

 is full, linearly search for next free slot perhaps at I + 1.

 We must use in this case index = 12.

It is possible that the slot at I+1 is full, and so the next and the next,etc must be checked for a free slot. We must check these slots in a circular manner since the end of the table does not mean the end of the free slots: there may be free slots before index I.

When we now search for an item we must check the hash location plus subsequent elements in a circular manner until we find the item or a free slot. If we find a free slot then the item was not put into the table, else we find it. Unfortunately we are now back to a linear search!!!

As with any normal search, we must ensure that we do not search endlessly! What if no free slots are available?

Linear collision resolution has an important flaw: table entries tend to cluster together as the table fills up. This causes a long linear search (which is what we don't want). This is called a cluster. Since a bigger cluster is a bigger target, the odds are that if you hit a cluster, you will hit a big one. This adds to the building time due to collisions and to the search time due to linear search.

One solution is to not use linear resolution but a quadratic resolution solution :


   build/search for keys at H,H+1,H+4,H+9, ....H+(I-squared)

This reduces clustering considerably but doesn't help when searching the table nor when the table is full. Unfortunately, when the table is almost full, nothing can help no matter what type of collision resolution method is used. If a simple array is used, only so many symbols can be used before the table would be full.

Because of this a symbol table is normally constructed using hashing but the hash index produced IS NOT used to store the data but a pointer to a linked list containing the data.


An array is limited in size to what was set at compile time but a linked

list can grow : this is also our collision resolution.

When building the symbol table, each time a hash index is computed the entry at that part of the table is checked to see if it is null or not. i.e. pointing to something or nothing. If null, then create a record with all necessary information in it and adjust the table entry to point to it. Set the link field of new node to null. We have the first entry in our symbol table for that index.

If not null then another key has hashed to the same index and a linked list is already present. We will create a new node and add it to the linked list. Should we add it to the end of the linked list or to the beginning?

Create a new node, set its info field, set its link field to point to the first node and the pointer in the table to point to this new node. Add to the start of the linked list. Why?


  procedure fish

  var i;


  procedure fish2

  var i;

  begin


       i 

  end;

For a compiler we will meet the most local variable information before non-local or global variable information. This preserves our notion of scope of variables.


NOTE: (1)  Our symbol table can grow very large

      (2)  No entry is special case (as always)

      (3)  Collision is resolved but at the

           expense of linear search of linked list.

      (4)  Could use binary search tree structure

           instead of Linked List to speed up searching.

This method of hashing is called "chaining" as opposed to "open addressing" using arrays to hold information. Most compilers use chaining to build up their symbol tables.

9.3.1. Hashing vs linear vs binary searching Example

Suppose there are N symbols and the table has T slots. Then to find a symbol we need to perform N/(2*T) comparisons on average. Note that to reduce this number, simply increase T. (Trade storage space for speed).

1000 symbols : table size T = 337 initial pointers (337 is prime) -> 2-3 comparisons on average.

How many comparisons for linear search of 1000 items?

How many comparisons for binary search of 1000 (sorted) items?


Linear Search :

Binary Search :

Hashing Search: