# Today In Data Mining? Maybe?

Finance professor Jialan Wang won the Internet today with a beautiful note on Benford's law in US accounting data (for completeness of her victory see here, here, here, here, and here).

Here's the argument. Benford's Law is a statistical regularity that applies to many collections of numbers of differing orders of magnitude. As Wang writes:

> A second earth-shattering fact is that there are more numbers in the universe that begin with the digit 1 than 2, or 3, or 4, or 5, or 6, or 7, or 8, or 9. And more numbers that begin with 2 than 3, or 4, and so on. This relationship holds for the lengths of rivers, the populations of cities, molecular weights of chemicals, and any number of other categories.

The explanation generally seems linked with exponential growth, and the formula for the probability that a number's leading digit is d is P(d) = log10(1 + 1/d). So the probability of a number starting with a 1 is log10(2), or about 30%; the probability of it starting with a 9 is log10(1.11), or about 4.6%. Strong men have been driven mad peering into this abyss.
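For anyone who wants to check the arithmetic, here is a minimal sketch of that formula in Python. The function name `benford_probability` is just an illustrative choice, not anything from Wang's note.

```python
import math

def benford_probability(d: int) -> float:
    """Expected share of numbers whose first significant digit is d (1-9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Table of expected shares for every possible leading digit.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine shares add up to 1, because the logs telescope: log10(2/1) + log10(3/2) + ... + log10(10/9) = log10(10).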

Benford's law ought to hold for lots of kinds of financial data, particularly if you just take a big unsorted pile of stuff. So Wang took 50 years of various financial data (revenues, assets, and 41 other publicly reported categories) from 20,000 publicly reporting companies and just plotted the number of numbers that started with 1s, 2s, 3s ... etc. And it was a pretty good match to the Benford distribution:
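The counting step is simple enough to sketch. What follows is not Wang's code or data: the helper names are made up, and the input is a hypothetical series growing 7% per step, chosen only because steady exponential growth is the textbook case where the law shows up.

```python
import math
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a nonzero number, read off scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def digit_shares(values):
    """Share of values starting with each digit 1-9 (zeros are skipped)."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Hypothetical stand-in for a pile of reported figures: 400 steps of 7% growth.
sample = [100 * 1.07 ** n for n in range(400)]
shares = digit_shares(sample)

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {shares[d]:.3f}, Benford {expected:.3f}")
```

Swapping `sample` for a real column of revenues or assets would give the kind of bar chart Wang plots, assuming the figures are loaded as plain numbers.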

So far so good. Now the bad news: the relationship has been moving away from a Benford distribution over time.

That chart is the sum of squared deviations, so 0.01 means that the typical digit's share is off by about 3.3 percentage points from what Benford's law predicts (that's the root-mean-square gap: the square root of 0.01/9). You can tell evocative stories about various points on that graph. Wang's stories include:
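The conversion from a sum of squares to a per-digit gap is worth spelling out. This sketch assumes the deviations are measured on digit shares (fractions between 0 and 1), which is what the 3.3% reading implies; the function names are illustrative.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def sum_squared_deviation(observed):
    """Sum over digits 1-9 of (observed share - Benford share) squared."""
    return sum((o - b) ** 2 for o, b in zip(observed, BENFORD))

def typical_gap(ssd: float) -> float:
    """Root-mean-square gap per digit implied by a given sum of squares."""
    return math.sqrt(ssd / 9)

# A reading of 0.01 on the chart works out to roughly 0.033,
# i.e. about 3.3 percentage points per digit.
print(f"{typical_gap(0.01):.3f}")
```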

- Deviation in finance went up in 1981-1982, "coincident with two major deregulatory acts that sparked the beginnings of that other big mortgage debacle, the Savings and Loan Crisis." It peaked in 1988 and matched that level in 2008, corresponding with banking crises.

- Deviation in tech surged during the dotcom bubble.

- Deviation in tech and manufacturing did not decline around 1990, as it did in finance, "since neither industry experienced major fraud scandals during that period."

Her conclusion:

> While these time series don't prove anything decisively, deviations from Benford's law are compellingly correlated with known financial crises, bubbles, and fraud waves. And overall, the picture looks grim. Accounting data seem to be less and less related to the natural data-generating process that governs everything from rivers to molecules to cities. Since these data form the basis of most of our research in finance, Benford's law casts serious doubt on the reliability of our results. And it's just one more reason for investors to beware.

The research is clever, simple, alarming, and just really really cool, and everyone seems pretty convinced. And my statistical knowledge is ... so-so. But I'm still a little skeptical. Partly it's that the stories seem a little cherry-picked. (Why did deviations go up for every industry in the early 1980s - banking deregulation? Why do ups and downs in manufacturing track those in tech so closely when manufacturing lacked a lot of the IPO-boom, options-backdating incentives to manipulate earnings that tech arguably had?)

But mostly I worry that the explanation seems light on mechanism. Benford's law has been used to spot fraud in corporate expense accounts, as well as in Enron and Greece. The idea is that people manipulate numbers in ways that aren't natural. A $0.09 EPS gets pushed up to $0.10. Totally made-up profits might as well be round or amusing numbers. So if a company's numbers deviate from Benford's law, that could suggest the company is up to something suspicious.

But it seems like something that would wash out in aggregating 20,000 companies. A company with an expense account limit of $100 might see a lot of $99 receipts. A company with a $10 minimum for reimbursement might see a lot of $10.25 receipts. Greece might not want to admit to €300 billion in debt but will be cool with €299bn. Enron might prefer $0.10 to $0.09 EPS. Some of those choices will push up the numbers of 1s, 2s, ... etc.; some will push them down.

Aggregate 20,000 companies, *even if they're all committing fraud*, and the distortions ought to wash out as long as they all have slightly different accounting policies, achieved slightly different actual results, and are committing slightly different frauds. One company's earnings time series is either an artifact of nature or something manipulated by crooks; 20,000 earnings time series, faked or not, are just their own collection of naturally occurring numbers.
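Here's a toy version of that wash-out intuition, and only a toy: two hypothetical firms whose manipulations push in opposite directions by the same made-up amount (`SHIFT`). Real frauds wouldn't cancel this neatly, so treat it as an illustration of the mechanism, not evidence for it.

```python
import math

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def ssd(shares):
    """Sum of squared deviations of digit shares from Benford's law."""
    return sum((s - b) ** 2 for s, b in zip(shares, BENFORD))

def shift(shares, from_digit, to_digit, amount):
    """Move `amount` of share from one leading digit to another."""
    out = list(shares)
    out[from_digit - 1] -= amount
    out[to_digit - 1] += amount
    return out

SHIFT = 0.03  # made-up size of each manipulation, purely for illustration

# Firm A: a $100 expense limit turns some would-be $1xx receipts into $99.
firm_a = shift(BENFORD, from_digit=1, to_digit=9, amount=SHIFT)
# Firm B: $0.09 EPS gets nudged up to $0.10.
firm_b = shift(BENFORD, from_digit=9, to_digit=1, amount=SHIFT)

# Pool the two firms with equal weight.
pooled = [(a + b) / 2 for a, b in zip(firm_a, firm_b)]

print(f"firm A: {ssd(firm_a):.4f}")
print(f"firm B: {ssd(firm_b):.4f}")
print(f"pooled: {ssd(pooled):.4f}")
```

Each firm on its own looks off, but the pooled digits sit right back on the Benford curve. If every firm's manipulation pushed the same digits in the same direction, pooling would not help, which is exactly the open question.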

I don't know the literature here well - though what I've found is all about frauds perpetrated by individuals, or at least within one company with one expense policy - so, y'know, enlighten me if I'm missing something. Maybe the natural tendency of all manipulation is to increase 1s and 9s. And I don't have a better explanation for why the deviations from Benford's law are increasing. But I'm not yet ready to throw, um, every US public company in jail on the basis of these charts.

Benford's Law and the Decreasing Reliability of Accounting Data for US Firms [Studies in Everyday Life]