As a programmer, I would say this is one of the best and most important techniques which I came across so far.
What is hashing?
To understand this, let us look at some of the definitions from the dictionary.
- Hash (noun) – a dish of diced or chopped vegetables, a jumble
- Hash (verb) – to chop into small pieces; make into hash
- Hashing (Computers) – a technique for locating data in a file by applying a transformation, usually arithmetic, to a key
In simple words, it is a technique to transform a bigger data into a small key (word or number) and then using this key to identify that data whenever it is required. Let us take an example to understand this.
A simple example is the phone book. When we want to call a person named Sada, what exactly we do, go to phone book click on “S” and then search for contact called Sada. Here the transformed key (chopping and taking first letter) is simply the first letter of the contact.
Little complex example is finding a duplicate web page over the internet. As per Google, there are more than 60 trillion pages and if we take a page and compare the full content with all other pages, then it might take months or even years for us to find a duplicate page. To simplify this, we can take the first letter of each paragraph and create a key (for this blog page it will be ATIAL so far). So we simply compare the key of each web page, whenever it matches, then we compare the full content of the page and see if it is a duplicate. Hope now you can understand the power of hashing.
How it is used in programming?
To understand the advantages of this, first let us look at Array and List data structures and see the performance of most important and common operations on these in programming.
- Adding/Inserting an element – this will take constant [O(1)] time as we can simply add at the end
- Searching an element – this will take linear time [O(n)] as we need to compare each element from start to end in worst case. If elements are ordered, then we can use binary search and find it in logarithmic time lg(n)
- Removing/Deleting an element – to delete, we need to find it which again, takes linear or logarithmic time depending on the ordered or not
How do hashing based data structures perform?
- Adding – in constant time [O(1)]
- Finding – in constant time [O(1)]
- Removing – in constant time [O(1)]
By the way, these are amortized time complexities and it is nothing but the average.
Amortized time = (The Sum of time taken by N operations) / (Number of operations(N))
Don’t you think the constant time for all of the operations is really amazing? Let us see how we can achieve this. Following are two basic principles behind these data structures.
- Hash the data using a best suited hashing function and create a key which should be a number
- Use an array (or tree) and store the data at that index. Sometimes this index or slot is also referred to as bucket. Why buckets? In some scenarios, we might end up storing (Chaining) more than one element at same index (slot)
- Choose a hashing algorithm which spreads out the records as uniformly as possible over the range of addresses available
Let us understand the above by looking at the basic concepts and how we will perform them on an example to store some random numbers.
Direct Address Tables
In this, we simply store the elements directly at the position of the input values. For input 13, we store it at index 13 (Array). We can represent this as hash function h(n) = n, where h is hashing function and n is input data.
Hash function – do nothing, just store the number at that index
Figure 1 – This diagram depicts the details on how input data is stored internally with direct addressing hashing function
- What should be the size of an array?
- If we just store 31, 19, 18, 27, 13, 356, 634, 925, 92 and 10,000, then the array size will be 10,000
- Only 10 slots are used and remaining 9,990 spaces (memory) is not used and wasted
The Division Method
Hash function: key = n % m where n is a number to be inserted and m is the size of an array – For example create an array with size 10, divide the number with 10 and take the reminder and store the number at that index. For number 634, the remainder is 4 (634 % 10). So store this number at 4th position.
Figure 2 – This diagram depicts the details on how input data is stored internally with hashing function h(n) = n % m, where n is the input data/key and m=10
- We can store all of the numbers 31, 19, 18, 27, 13, 356, 634, 925, 92 and 10,000 in very small array with just size 10 as the reminders of the above number are 1, 9, 8, 7, 3, 6, 4, 5, 2, 0
- Very little memory is wasted
- Simple to extend when more elements are added. If 11th element is added, then we double the array size to 20. This size increasing is called doubling as we doubled the size. There are many other sophisticated mechanisms and we can create our own if required
- What happens when the reminder hash function generates same key for some different elements? Ex: the hash function for numbers 13, 723 and 53 will be same key 3. This is called collision
If a hash function generates same key for two or more distinct elements, then it is called hash collision. This is practically unavoidable even with best and perfect hashing functions. Following are few workarounds to resolve this collision.
- Open Addressing
- Probing (linear, quadratic)
- Double hashing
Resolving Collisions by Chaining
- Rather than storing the elements directly in internal array, store the link to a linked list
- When the hash function generates the same hash key for two are more different elements, then insert the elements in linked list
Figure 3 – This diagram depicts the details on how input data is stored internally with hashing function h(n) = n % m, where n is the input data/key and m=10
Issues with Chaining
One of the issues with this chaining approach is if the hash function is not properly designed, and then there might be chance of generating the same key to lot of elements. For example, in our hash function h(n) = n % m and m= 10, what happens if we insert 1,21,31,41,51,61,71,81 and 91?
All of these elements will have same key 3 and these should be stored in linked list at index 3. In this case, again the basic operations (search, insert, remove) time cost will become linear O(n).
Figure 4 – This diagram depicts the details on how input data is stored internally with hashing function h(n) = n % m, where n is the input data 1, 21, 31, 41, 51, 61, 71, 81 and 91 and m=10
How can we resolve this? We have multiple ways here –
- Increasing the size of array. Say m=30, then the elements will be store at 1,21,1,11,21,1,11,21 and 1. This will not completely solve the issue and we end up wasting more space
- Use different hashing function which can distribute the elements across the array indexes. Something like h(n) = (n%11) % m. Why 11? Typically using a prime number (which is near the array size m) in modulo hashing functions will distribute the elements across the array. In this case, the elements will be stored at 1, 0, 9, 8, 7, 6, 5, 4 and 3. See how equally they are distributed now
- Use probing technique
Figure 5 – This diagram depicts the details on how input data is stored internally with hashing function h(n) = (n%11) % m, where n is the input data 1, 21, 31, 41, 51, 61, 71, 81 and 91 and m=10
Resolving Collisions by Probing
In this elements are stored in the array itself. When two are more elements are hashed to the same key, then we simply look for the next empty or available slot in the array from hashed slot. For example, if an element 13 is hashed to key 3 (n%m), then we check if the slot 3 is empty, if it empty then we will insert there and we are done. If it is not empty, we will look for next slot 4 (assuming the linear probing interval is 1) and continue this till we find any empty slot and insert this element.
- Linear probing – In this, the next probe location is calculated based on a fixed number. For example, if the probe interval is decided as 2 and the hash key is 3, we first check at 3, if it is not empty, then we will check at 5, 7, 9 (key + probe interval) and will continue till we insert the element
- Double hashing – In this, the next probe location is calculated by another hash function
- This is most important and useful technique
- Hashing is a technique for locating data in a file by applying a transformation, usually arithmetic, to a key
- Choose the appropriate hashing algorithm so that the elements/records will spread out as uniformly as possible over the range of addresses available