Is the modulo operation necessary in Rabin Karp? (7.12)


I’ve been struggling to trying to understand why the mod 997 in Rabin Karp was necessary. On a whim I removed the modulo operation and ran it through a randomly generated test case.

It seems to work just fine actually. Obviously it is overflowing quite a bit (I switched out the hashes to use unsigned integers), but the actual string matching appears to work well, even with uint8_t hashes.

A reply on StackOverflow ( suggested that overflow should be ok because for unsigned ints it is itself a well-defined modulo operation.

Would it be expected that there would be fewer false hash matches with a relatively small prime mod?



Hey Michael,

Thanks for your question, and it is a good question I just recently came across. From my understanding, it is fine to let it overflow since we can just view the range of int (or unsigned int) as a set of buckets for hash codes. I probably would say it is totally fine to let it overflow, and probably using the whole range of int will be less false positive of hash matches.

Please let me know what you think on this question, and I am looking forward your reply.



Yes, you are right, there is vastly fewer false hash collisions with larger/nonexistant modulos.

My old Intro to Algorithms book covers Rabin Karp, and seems to suggest that the primary reason is to keep all necessary computations in single-precision arithmetic.

The modulous q is typically chosen as a prime such that 10q just fits within a computer word, which allows all of the necessary computations to be performed with single-precision arithmetic

In general, with a d-ary alphabet, we choose q so that dq fits within a computer word

Looking at Wikipedia, C++'s integer overflow behaviour seems like an anomaly. Most other languages will throw exceptions on overflow, if they even support unsigned integers at all.



Overflow is never a good option even you know it is harmless in the case you are working. In this problem, I think it is fine to let the integer overflow or just use a relatively larger prime as the modulo (some prime that is close to 2^31 -1).



Hi Tsung,
On this same topic of rolling hashes for the Rabin-Karp algo, but within the context of C++:
isn’t it unsafe as per the standard to use signed integers for t_hash, s_hash and power_s, given that behavior is undefined (i.e. platform/implementation dependent) in case of overflows?
" When signed integer arithmetic operation overflows (the result does not fit in the result type), the behavior is undefined: it may wrap around according to the rules of the representation (typically 2’s complement), it may trap on some platforms or due to compiler options (e.g. -ftrapv in GCC and Clang), or may be completely optimized out by the compiler."

That being said, wouldn’t it be safer, more portable and standard-compliant to use an unsigned integer instead for the aforementioned variables, and rely on the default modulus 2^N (N=8,16,32,64)?
PS On macos 10.15.4, build with gnu g++ 9.2, the program works fine though.