Two datasets. One is fake. You have AI.
Before you read anything else — look at the two datasets below, paste them into any AI assistant you trust, and ask it which one is more likely to be genuine. Save the answer.
Then read the five steps that follow. The story reveals itself in sequence. By Step 5 you will know whether your AI passed or failed, and why the answer matters far beyond this particular puzzle.
Dataset A

| First digit | Count |
|---|---|
| 1 | 15 |
| 2 | 14 |
| 3 | 14 |
| 4 | 13 |
| 5 | 12 |
| 6 | 12 |
| 7 | 10 |
| 8 | 11 |
| 9 | 9 |
Dataset B

| First digit | Count |
|---|---|
| 1 | 33 |
| 2 | 19 |
| 3 | 14 |
| 4 | 11 |
| 5 | 10 |
| 6 | 8 |
| 7 | 6 |
| 8 | 5 |
| 9 | 4 |
During a management audit in 2001 at a financial services company, I was reviewing complaints for the account creation process. The process owner, Dhanush, showed me the data confidently. 110 complaints for approximately 11,000 accounts opened. A 1% complaint rate. Clean, reasonable, documented.
At another location of the same company, my records showed complaint rates that were consistently and significantly higher. The processes were essentially the same. There was no reason this location should perform so differently.
Then I noticed the serial numbers. Each complaint had an auto-generated serial number starting from 1. The largest visible was 998. But there were only 110 complaints. Hundreds of serial numbers were missing.
When I asked why, Dhanush explained that many complaints had been wrongly categorised and moved to a queue managed by an overseas team. Plausible. Except when I looked at which serial numbers remained — the pattern told a different story entirely.
I arranged the 110 remaining complaint serial numbers by their first digit and counted how many started with 1, how many with 2, and so on. The result was Dataset B, the second table above.
If complaints had been randomly removed from a pool of serial numbers between 1 and 998, the remaining numbers should be roughly equally distributed across first digits. Serial numbers in that range are almost perfectly uniform across leading digits: 111 of them start with each of the digits 1 through 8, and 110 start with 9.
Dataset A shows that uniform pattern. Roughly 9 to 15 complaints per first digit, with no meaningful trend in either direction. That is what genuinely random removal looks like.
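That claim is easy to verify directly. A minimal Python tally over the whole pool; nothing here comes from the audit, it is just arithmetic:

```python
from collections import Counter

# Leading digit of every possible serial number in the pool 1..998.
counts = Counter(str(n)[0] for n in range(1, 999))

for digit in "123456789":
    print(digit, counts[digit])
# Prints 111 for each of the digits 1 through 8, and 110 for 9.
# Randomly deleting from this pool leaves the survivors roughly
# flat across first digits, which is exactly the shape of Dataset A.
```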
I confronted Dhanush with this. That is when Manish, his manager, joined us. They had prepared for this moment.
Manish smiled. He had done his homework.
“There is something called Benford’s Law,” he said, opening his laptop. “It shows that in naturally occurring datasets, the number 1 appears as the leading digit most often — about 30% of the time. The frequency declines as digits increase. Our data follows exactly this pattern.”
He showed two examples: the heights of the world’s tallest structures and the populations of 237 countries. Both followed the declining distribution. Both matched Dataset B precisely.
The law states that in many naturally occurring collections of numbers, the leading digit d occurs with probability log₁₀(1 + 1/d). Digit 1 appears roughly 30% of the time. Digit 9 appears less than 5% of the time. The pattern holds across a remarkable range of real-world datasets.
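Plugging n = 110 into that formula shows just how well the match was built. A short sketch; the dataset_b counts are copied from the second table above:

```python
import math

n = 110  # surviving complaints
dataset_b = [33, 19, 14, 11, 10, 8, 6, 5, 4]  # counts for digits 1..9

for d in range(1, 10):
    expected = n * math.log10(1 + 1 / d)
    print(f"digit {d}: Benford expects {expected:5.1f}, Dataset B has {dataset_b[d - 1]}")
# Expected counts: 33.1, 19.4, 13.7, 10.7, 8.7, 7.4, 6.4, 5.6, 5.0.
# Every digit in Dataset B sits within about two complaints of the law.
```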
Manish was confident. The law was real. The pattern in Dataset B matched it precisely. He had engineered that match deliberately — believing it would be his proof of innocence.
Benford’s Law has a condition that Manish and Dhanush had missed entirely.
The law applies to datasets whose values span multiple orders of magnitude, ranging from single digits to hundreds to thousands to millions. In such datasets the logarithms of the values are spread roughly evenly, and the declining leading-digit pattern follows from that logarithmic spread. Serial numbers capped at 998 are uniform over a bounded range; the law was never going to describe them.
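To make that condition concrete, here is a sketch contrasting the two regimes. The lognormal sample is my own stand-in for a naturally occurring, multi-magnitude dataset (think populations or building heights), not anything from the audit:

```python
import math
import random

random.seed(0)

def leading_digit(x: float) -> int:
    """First significant decimal digit of a positive number."""
    return int(x / 10 ** math.floor(math.log10(x)))

# Data spread across many orders of magnitude: Benford territory.
spread = [random.lognormvariate(6, 3) for _ in range(100_000)]

# Data confined to a bounded uniform range, like serial numbers 1..998.
bounded = [random.randint(1, 998) for _ in range(100_000)]

for name, values in (("multi-magnitude", spread), ("bounded uniform", bounded)):
    ones = sum(leading_digit(v) == 1 for v in values) / len(values)
    print(f"{name}: leading digit is 1 in {ones:.0%} of values")
# The multi-magnitude data lands near Benford's 30%.
# The bounded uniform data lands near 11%, one ninth, the shape of Dataset A.
```

The only difference between the two samples is how far their values range. That range, not any deeper property of the numbers, is what the law feeds on.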
I opened Excel and typed =RANDBETWEEN(1,998), copied it into 110 cells, and showed Manish the result. It looked like Dataset A — roughly equal counts across all first digits, no meaningful trend.
When I showed this, the smile faded. Manish understood immediately. Then, after a moment, he burst out laughing and patted Dhanush on the back. “But probably our data did not know this and followed it anyway.”
We reviewed the complete original dataset together. The deletions were confirmed. The finding stood.
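For anyone reading along without Excel, the same experiment replays in a few lines of Python. The seed is arbitrary; any seed gives the same flat shape:

```python
import random
from collections import Counter

random.seed(2001)

# 110 draws from RANDBETWEEN(1, 998), replayed.
sample = [random.randint(1, 998) for _ in range(110)]
tally = Counter(str(n)[0] for n in sample)

for digit in "123456789":
    print(digit, tally[digit])
# Counts scatter around 110/9, roughly a dozen per digit, with no
# systematic downward slope from 1 to 9. Nothing like Benford;
# everything like Dataset A.
```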
Now go back to the answer your AI gave you before you opened any of these steps.
Here is what most AI assistants conclude when shown these two datasets and asked which is more likely to be genuine: Dataset B, because it follows Benford’s Law.

AI knows Benford’s Law. It knows where the law typically applies. What it does not do, unless specifically prompted, is check whether serial numbers drawn from a bounded uniform range are the kind of data the law was designed for.
The answer AI gives is the answer Manish gave. Confident. Referenced. Supported by genuine mathematical principle. And wrong in a way that would pass any review that did not go one level deeper.
Manish laughed when he was caught. He understood the mistake the moment it was explained. Most professionals who encounter Benford’s Law learn that it detects fraud. Very few learn the condition under which it applies. That gap between knowing a tool exists and knowing when to use it is where most errors live — human and AI alike.
In 2001, catching this required one auditor with enough depth to ask the right question. Today, AI is being used at scale for data quality assessment, fraud detection, and audit support across thousands of organisations. The same error is now possible at the speed and scale of software.
AI did not create this problem. AI scales it.
Two datasets. One was fake. You had AI. What could go wrong? Now you know.
