Two Datasets. One Is Fake. You Have AI. What Could Go Wrong?

Category: Thinking Out Loud

Two datasets. One is fake. You have AI.

Before you read anything else — look at the two datasets below, paste them into any AI assistant you trust, and ask it which one is more likely to be genuine. Save the answer.

Then read the story below. It reveals itself in sequence. By the end you will know whether your AI passed or failed, and why the answer matters far beyond this particular puzzle.


Dataset A

First digit    Count
1              15
2              14
3              14
4              13
5              12
6              12
7              10
8              11
9               9

Dataset B

First digit    Count
1              33
2              19
3              14
4              11
5              10
6               8
7               6
8               5
9               4
Your challenge — before reading further
Both datasets show how 110 complaint serial numbers, ranging from 1 to 998, are distributed by their first digit. One dataset is genuine. One was carefully crafted to look genuine. Paste this context into your AI assistant and ask it the question below. Save what it says. Then read the story that follows.

“Both datasets show how 110 complaint serial numbers (ranging from 1 to 998) are distributed by their first digit. Which dataset is more likely to be genuine? Why?”

During a management audit in 2001 at a financial services company, I was reviewing complaints for the account creation process. The process owner, Dhanush, showed me the data confidently. 110 complaints for approximately 11,000 accounts opened. A 1% complaint rate. Clean, reasonable, documented.

At another location of the same company, my records showed complaint rates consistently and significantly higher. The processes were essentially the same. There was no reason this location should perform so differently.

Something was off. But the books looked clean.

Then I noticed the serial numbers. Each complaint had an auto-generated serial number starting from 1. The largest visible was 998. But there were only 110 complaints. Hundreds of serial numbers were missing.

When I asked why, Dhanush explained that many complaints had been wrongly categorised and moved to a queue managed by an overseas team. Plausible. Except when I looked at which serial numbers remained — the pattern told a different story entirely.

I arranged the 110 remaining complaint serial numbers by their first digit and counted how many started with 1, how many with 2, and so on. The result was Dataset B, shown above.

The count declined steadily from 33 at digit 1 down to 4 at digit 9. Too smooth. Too consistent. Too deliberate.

If complaints had been randomly removed from a pool of serial numbers between 1 and 998, the remaining numbers should be roughly equally distributed across first digits. Serial numbers in that range are uniformly distributed — there are roughly the same number of values starting with 1 as with 9.
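This is easy to verify directly. A short Python sketch (any language would do) tallies the leading digit of every possible serial number in the 1 to 998 range:

```python
from collections import Counter

# Tally the leading digit of every possible serial number from 1 to 998.
counts = Counter(int(str(n)[0]) for n in range(1, 999))
for digit in range(1, 10):
    print(digit, counts[digit])
```

Digits 1 through 8 each account for 111 of the 998 possible serials, and digit 9 for 110: as close to uniform as the range allows. A random sample drawn from this pool should inherit that flatness.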

Dataset A shows that uniform pattern. Roughly 9 to 15 complaints per first digit, with no meaningful trend in either direction. That is what genuinely random removal looks like.

Dataset A is the genuine data. Dataset B was crafted.

I confronted Dhanush with this. That is when Manish, his manager, joined us. They had prepared for this moment.

Manish smiled. He had done his homework.

“There is something called Benford’s Law,” he said, opening his laptop. “It shows that in naturally occurring datasets, the number 1 appears as the leading digit most often — about 30% of the time. The frequency declines as digits increase. Our data follows exactly this pattern.”

He showed two examples: the heights of the world’s tallest structures, and the populations of 237 countries. Both followed the declining distribution. Both matched Dataset B precisely.

Benford’s Law is real. It is well-documented. It is used by tax authorities, forensic accountants, and fraud investigators worldwide to detect manipulated data. Manish was not bluffing — he was citing a genuine and powerful principle.

The law states that in many naturally occurring collections of numbers, the leading digit d occurs with probability log₁₀(1 + 1/d). Digit 1 appears roughly 30% of the time. Digit 9 appears less than 5% of the time. The pattern holds across a remarkable range of real-world datasets.
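The formula is easy to tabulate. Here is a quick sketch of the Benford probabilities, together with the counts they predict for a sample of 110:

```python
import math

N = 110  # sample size from the story
for d in range(1, 10):
    p = math.log10(1 + 1 / d)  # Benford probability of leading digit d
    print(f"digit {d}: {p:6.1%}  expected count for N=110: {N * p:4.1f}")
```

The expected counts come out to roughly 33, 19, 14, 11, 9, 7, 6, 6, 5 — tracking Dataset B's 33, 19, 14, 11, 10, 8, 6, 5, 4 closely, which is exactly how carefully the match was engineered.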

Manish was confident. The law was real. The pattern in Dataset B matched it precisely. He had engineered that match deliberately — believing it would be his proof of innocence.

Benford’s Law has a condition that Manish and Dhanush had missed entirely.

The law applies to datasets where numbers span multiple orders of magnitude — where values range from single digits to hundreds to thousands to millions. In such datasets, the distribution of leading digits naturally follows Benford’s pattern because of the logarithmic relationship between scale and frequency.

Serial numbers between 1 and 998 do not span multiple orders of magnitude. They are uniformly distributed within a single range. Every first digit from 1 to 9 has roughly equal probability of appearing.

I opened Excel and typed =RANDBETWEEN(1,998), copied it into 110 cells, and showed Manish the result. It looked like Dataset A — roughly equal counts across all first digits, no meaningful trend.
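The same experiment takes a few lines in Python, for anyone who wants to repeat it outside Excel (the seed is fixed here only so the run is reproducible):

```python
import random
from collections import Counter

random.seed(2001)  # any seed works; fixed only for reproducibility
sample = [random.randint(1, 998) for _ in range(110)]  # 110 random serials
counts = Counter(int(str(n)[0]) for n in sample)
print({d: counts[d] for d in range(1, 10)})
```

Run it with a few different seeds: the counts bounce around 110/9 ≈ 12 per digit, with no systematic decline from digit 1 to digit 9.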

When I showed this, the smile faded. Manish understood immediately. Then, after a moment, he burst out laughing and patted Dhanush on the back. “But probably our data did not know this and followed it anyway.”

They had learned about a real law, applied it to the wrong type of data, and in doing so made their fraudulent dataset more suspicious — not less. The pattern they engineered was precisely the one that serial numbers in this range should never show.

We reviewed the complete original dataset together. The deletions were confirmed. The finding stood.

Now go back to the answer your AI gave you at the start.

Here is what most AI assistants conclude when shown these two datasets and asked which is more likely to be genuine:

Dataset B is more likely to be genuine. It follows Benford’s Law — a well-established mathematical principle that describes the frequency distribution of leading digits in naturally occurring datasets. The declining frequency from digit 1 to digit 9 is characteristic of authentic real-world data. Dataset A’s near-uniform distribution is more consistent with fabricated or randomly generated data.
That answer sounds methodologically sound. It is also wrong. And it is wrong for exactly the same reason Manish was wrong: it applies a real law without checking whether the conditions for that law are met in this specific data type.

AI knows Benford’s Law. It knows when it typically applies. What it does not do — unless specifically prompted — is verify whether serial numbers in a bounded uniform range qualify as the type of data the law was designed for.
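One way to force that verification is to test each dataset against the expectation that actually applies here: the near-uniform first-digit distribution of serials 1 to 998. A minimal chi-square sketch, using the counts from the tables above (no stats library needed):

```python
import math

# First-digit counts from the two tables above (digits 1..9).
dataset_a = [15, 14, 14, 13, 12, 12, 10, 11, 9]
dataset_b = [33, 19, 14, 11, 10, 8, 6, 5, 4]

def chi_square(observed, expected):
    """Simple chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 110
uniform = [n / 9] * 9  # the right model for serials drawn from 1..998
benford = [n * math.log10(1 + 1 / d) for d in range(1, 10)]

for name, obs in (("A", dataset_a), ("B", dataset_b)):
    print(name,
          "vs uniform:", round(chi_square(obs, uniform), 1),
          "vs Benford:", round(chi_square(obs, benford), 1))
```

Dataset A sits comfortably within what uniform sampling produces; Dataset B deviates wildly from uniform while hugging Benford, the reverse of what genuine serials in this range can do.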

The answer AI gives is the answer Manish gave. Confident. Referenced. Supported by genuine mathematical principle. And wrong in a way that would pass any review that did not go one level deeper.

This is not a criticism of AI. It is a description of how every powerful tool works when applied without checking the preconditions. The question was incomplete — and an incomplete question to a capable tool produces a complete-looking wrong answer.

Manish laughed when he was caught. He understood the mistake the moment it was explained. Most professionals who encounter Benford’s Law learn that it detects fraud. Very few learn the condition under which it applies. That gap between knowing a tool exists and knowing when to use it is where most errors live — human and AI alike.

In 2001, catching this required one auditor with enough depth to ask the right question. Today, AI is being used at scale for data quality assessment, fraud detection, and audit support across thousands of organisations. The same error is now possible at the speed and scale of software.

AI did not create this problem. AI scales it.

The most convincing lie is one built on a real pattern. The most dangerous AI output is one that is correct about everything — except whether it should have been applied at all.

Two datasets. One was fake. You had AI. What could go wrong? Now you know.

Share what your AI said
Which LLM did you use — and what did it conclude? Did it identify the genuine dataset correctly, or did it apply Benford’s Law without checking whether it applied? Leave your answer in the comments below. The pattern of responses will be revealing.
