In the world of data storage and retrieval, efficiently organizing and accessing large amounts of data is crucial. One powerful technique that addresses these challenges is hashing. Hashing involves mapping large data values to unique and smaller keys within a hash table, enabling optimized operations such as storing, deleting, and searching.
Now, let us suppose that we have to make a record of all the names of candidates (Shown Above) that have taken admission in a particular college and store their data in an organized manner in a table for future purposes, now for easy identification of the candidates we need their unique identification which must be different for each and every student, the application id (Key
) is unique for every candidate and serves as the best medium to fetch the name of a particular student from the table no matter if the name of any two candidates are the same, which makes it easy for a user to search, insert and delete a particular entry of the candidate from the table.
So from the above example, we see that each and every name of a particular student is mapped into a table using a unique key which may or may not be the same for everyone (Depending upon the hash function), now whenever the user wants to get information about a particular student he/she may use the key to get it without searching for it in the whole table one by one.
Now for maintaining the desired record of students as mentioned above, we may proceed by the following methods:-
-
By creating a linked list: to store the names of the candidate at nodes which requires a worst-case time complexity of O(n) for searching the data in the list, O(n)
for insertion and deletion operations.
-
By creating an array: to store names of the candidate at the blocks requires a worst-case searching time of O(nlogn) when the data in the array is sorted and insertion-deletion operations become costly as all elements are to be moved while inserting or deleting the elements.
-
By creating a table: The table shown in the first figure is how you can actually create a hash table that stores the names of the candidate mapped with the key value for the search, Insertion, and deletion process to be done in a worst-case time complexity of O(1)
where application id serves for the unique key value for the names of the candidate. The only problem with this type of structure is the space required to store a large number of records. For example, if the application id is n digits, we need O(m * 10n)
space for the table where ‘m’ is the size of a pointer to record. Another problem is an integer variable in a programming language may not store n digits.
What Is Hashing..?
Hashing is defined as the technique of mapping a large data value with a unique and small value (key) in a hash table for optimal and quick storing, deleting, and searching operations. For example, in the above figure, we can search the names of the candidate in O(1) time by directly knowing their key value.
Terminologies Related To Hashing
The various terminologies related to hashing are:-
-
Hash Table: A hash table is an array of buckets or slots that stores and maps keys with the large data value for optimal and quick storing, deleting, and searching operations. The hash table basically uses a hash function to index the key values to the table.
-
Hash Function: A hash function is defined as the mathematical function to yield distinct values for the individual keys which is difficult to obtain in practice, within a limited range so as to limit the size of the hash table. The limited range maybe is under the total size of the hash table so defined, they are equal to the index values of the hash table. The advantages of using a hash function are quick and fast access to data values and even distribution of keys across the hash table. The hash function which maps all the keys to each different slot index is known as the perfect hash function for example a hash function may have a function that carries out the operation as key = (app id % 10)
which returns the value to the key for the given hash data to be mapped in the hash table.
-
Key: In hashing key is defined as a mapping index that stores the address of the data value, it is used to get the data value stored at a particular index of the hash table formed by the hash function.
Building Hash Functions:
The hash function formation depends upon the programmer, hashing function should be designed in such a way as to reduce the collision and distribute the keys evenly in the hash table along with a high load factor for a given set of keys.
Following are some common methods of obtaining a hash function:
-
Folding: Folding is a method for the formation of a hash function such that the key value is divided evenly into parts and any of the basic arithmetic operations are performed such as addition, subtraction, multiplication, or division to obtain a number where the last two digits of the number so obtained is taken and used to hash the given data value.
-
Truncation: Truncation is a method for the formation of a hash function where any random index for a key value is selected according to the algorithm provided and the following index is used to hash the given data value.
-
Modular Arithmetic: Modular arithmetic is a method for the formation of a hash function where the index so formed by the hash function makes use of the modulo arithmetic along with the size of the hash table such that H(X) = k mod l where k is an integer value and l is the size of the hash table.
Collision In Hashing:
Now, suppose the information given in the above table has to be stored in a limited memory table with less space, so for this, we hash all the application IDs of the candidates to a smaller value to store the data values in the less memory space.
Now, suppose we use a hash function for this operation which says or is defined as H(X) = Application no. of the candidate % 2, now from the above table we can see that the application numbers 22234 and 17240 yield the same index value from the hash function this case where a hash function yields the same index values for the two different data values is known as a collision.
This is the most common problem every programmer faces during the formation of a hash function. For generalizing purpose let X1, X2, ...Xn is the keys of the data values in the table and hence according to the hash function, we have H(X1) = H( X2 ) =... H(Xn) where is known as a synonym. A collision may also be defined as the conflict in the position of two data values of different key values in the hash table. The above diagram illustrates the collision appropriately where two different data values are to be stored at the same index.
Handling Of Collisions In Hashing
Collisions in hashing can be handled through the following methods:-
-
Linear Open Addressing: Now, let us take an example where we have a set of integers say S = {10, 20, 21, 15, 17, 4, 30, 40, 31}
now if we want these numbers to store in a hash table with a hash function defined as H(X) = X mod 10
. Now from the above table, we can see the places occupied by the integers in the hash table by the given hash function, taking a closer look at the table we can see that 40 is completely divisible by 10 is kept on slot two of bucket one this is because the previous three integers also are completely divisible by 10 so there is no slot left (which is often referred to as overflow condition), after the storage of the previous integers into the hash table, so we will keep traversing the table until and unless we find an index close to the result index which we obtained from the hash table i.e. 2. Similarly, if there were more elements in a set of integers then the elements are mapped according to the hash function and stored in the next index if the slots in the particular bucket is full and all other integers are stored according to the hash function so defined if there is no collision. Hence, linear open addressing is one of the best ways to resolve collision in hashing. The following operations are performed in a linear open-addressed hash table:
-
Searching: If there are 'm' number of slots in a hash table and 'n' is the number of keys to be inserted into the table then, Load factor = n / m < (1)
hence searching time is < 1 / (1 - load factor) and approx = 1 / (1 - load factor).
-
Inserting: The insertion operation in a hash table takes place by checking each and every closest empty slot or the bucket if the bucket at which the element has to be placed is occupied. If there are 'm' number of slots in a hash table and ‘n’ is the number of keys to be inserted into the table then, Load factor = n / m < (1), hence insertion time is < 1 / (1 - load factor) and approx = 1 / (1 - load factor).
-
Deletion: The deletion process in a hash table is very inefficient due to the reason that when any data value is deleted from the table then for searching for a data value when we traverse the table if there is an empty slot in the table then the searching operation stops and the data values after the empty slot are left out and we do not reach to a proper conclusion or the result. If there are ‘m’ number of slots in a hash table and ‘n’ is the number of keys to be inserted into the table then, Load factor = n / m < (1), hence deletion time is < 1 / (1 - load factor) and approx = 1 / (1 - load factor).
Different Types Of Collision Resolution Techniques With Open Addressing:
There are different methods to resolve the collision process namely:
-
Rehashing or Double Hashing: Rehashing as the name suggests is hashing the hashed value again and again till an empty slot is found in the hash table. Now, linear addressing is inefficient in terms of storing the data value, and the data gets stored in a very clumsy and gets clustered very closely in the table which reduces the efficiency of insertion, deletion, and searching operations, so for this issue to solve we have the rehashing method where the previous hashed value gets rehashed and is searched for the index in the table if the index is occupied by other data value then the new hashed value again gets hashed and so on until there is an empty slot of the index in the table.
-
Quadratic Probing: It is a method where the indexing is done on the basis of the quadratic hash function which increases proportionally to the hash value, the hash function used is H(X) = (n + k^2
) %
m provided If there are ‘m’ number of slots in a hash table and ‘n’ is the number of keys to be inserted into the table. During the insertion of the data value if the slot at which the data is to be stored is occupied then the next slot using the hash function is checked and so on. Quadratic probing does not necessarily reduce the clustering in the hash table despite the repeated hashing.
-
Random Probing: Random probing as the name suggests is the method for hashing where the random function generator is used to generate an integer value and helps in probing to find the vacant space in the hash table to store the data value. This method, however, has to generate the same sequence of integers.
Chaining
Since the storage of the data values in the hash table is non-dynamic and requires a size to be initialized it serves as a very inefficient storage space which causes overflow and many of the slots get wasted and There is a lot of memory waste with inefficient searching, insertion and deletion operations, so for maintaining efficient storage space for data values, a method called Chaining or Open Hashing is used where the data values are stored in the form of singly linked lists which is dynamic in nature and reduces the lost in space memory due to unoccupied slots in the hash table. Whereas it requires extra space for links to other nodes formed, below is the diagrammatic representation of hashing using the chaining Technique for the same example explained above.
The following operations are performed in a linear open-addressed hash table:
-
Searching: The searching operation performed on the hash table does not depend upon the size of the hash table and takes a best-case time complexity of O(1)
because all of the data values are not stored in one single bucket but rather are distributed in the table, the worst-case time complexity for searching is O(n)
where ‘n’ is the size of the chain, it is the case when all the data values in the table are stored under one bucket and the data value to be searched is at the last node of the chain. On average, the time complexity of the searching operation is better than the linear open-addressing method.
-
Insertion: The insertion operation in the hash table takes the same worst-case time complexity of O(n) where ‘n’ is the size of the chain which occurs when the data value is to be stored at the end of the single bucket chain and the best-case time complexity of O(1)
takes place when the data element has to be indexed just after the bucket.
-
Deletion: The deletion operation in the hash table takes the same worst-case time complexity of O(n) where ‘n’ is the size of the chain which occurs when the data value is to be deleted at the end of the single bucket chain and the best-case time complexity of O(1)
takes place when the data element has to be deleted just after the bucket by maintaining the links of the chain.