For the last couple years I’ve been working on a program that generates a large number of essentially random ID strings (it’s actually a replicated document storage system that uses the hash of a file’s content as its ID, but the details don’t matter). Since IDs are independently generated there will always be some chance that two different files will just happen to have the same ID assigned — so how long do I need to make my ID string before that probability is small enough that I can sleep at night?
This is essentially the Birthday Paradox, just with bigger numbers and in a different form. For those who haven’t heard of it, the canonical form of the Birthday Paradox asks what the probability is that, out of a random group of 23 people, at least two in share the same birthday. (The “paradoxical” part is that the answer is just over 50%, much higher than most people’s intuition would suggest.) My question just turns that around and asks “how many random N-bit IDs have to be generated before there is a one in a million chance of any two of them being identical?”
Rejiggering the formulas given in Wikipedia, here’s the approximation:
n ≅ (-2 · S · ln(1 – P))1/2
- n is the number of entities required to reach the given probability
- P is the probability desired
- S is the size of the set of all possible entities
For example, the number of people you would need for a 50% chance that at least two of them have the same birthday is (-2 · 365 · ln(1/2))1/2, or between 22 and 23 people. As a more practical example, you would only need to generate 77,163 PGP keys before having a 50% chance of a collision between their 8-character short-form fingerprints.
As for my one-in-a-million chance, you’d need to randomly generate roughly 2(N – 19)/2 N-bit strings before having a one-in-a-million chance of a collision, which means I would need to randomly generate around 270 of my 160-bit ID strings before there would be a one-in-a-million chance of having a collision. I think I can sleep at night.