Benford's law in frameworks
Tagged with: [ Benford ] [ PHP ] [ Statistics ]
In a new talk I’m currently presenting at conferences and meetups, I talk - amongst other things - about Benford’s law. This law states that in naturally occurring numbers, the first digit will most often be a 1 (around 30% of the time), and that the frequency drops off logarithmically down to the digit 9, which leads only about 4.6% of the time. This might sound strange: why would a number that starts with 1 (like 1, 16, 152 or even 152533) be more common than 2, 25, 266, or the even less common 6, 63, 6474, etc.? And although there are some explanations, a definitive one still isn’t there.
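The percentages come from the standard statement of Benford’s law: the expected share of numbers with first digit d is log10(1 + 1/d). A minimal Python sketch of that formula (the function name `benford_probability` is made up for illustration):

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of numbers whose first digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


# Print the expected share for every possible first digit.
for digit in range(1, 10):
    print(f"{digit}: {benford_probability(digit):.1%}")
```

The nine shares add up to exactly 100%, because the sum of the logarithms collapses to log10(10).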
But Benford works! Think of something regarding numbers, and you’ll find that it most likely follows Benford’s law. For instance, the Wikipedia page uses the example of the heights of the 60 tallest structures in the world, and finds that they follow Benford’s law whether you measure the heights in feet or in meters. The same goes for the lengths of rivers, population counts of countries and, most likely, the sizes of national parks in the US (haven’t tried it though).
Even though there is no guarantee that something will actually follow Benford’s law, a lot of things do, and in fact, it can be used for things like fraud detection: in your taxes, in elections, but basically anything concerning numbers. With enough data, and thus with enough fraudulent data, there should be detectable deviations from the expected distribution, which can indicate fraud, or at least artificial modification of that data (so, one tip: if you are going to commit tax fraud, make sure you do it according to Benford’s law :p).
But anyway, I wanted to see Benford’s law in action for myself, so I’ve come up with a simple test:
Take a (PHP) framework, and count the number of lines in each PHP file in the framework. Since we are only interested in the first digits, when a PHP script has 153 lines, we place that file in the ‘1’ box; if we find a line count of 64, we place that in the ‘6’ box, etc. I’ve tried this with the Symfony framework I currently had open in another terminal, and slapped together a fairly simple command line that does the trick:
find . -name \*.php -exec wc -l {} \; | sort | cut -b 1 | uniq -c
If this looks too cryptic: don’t worry. What we do is find all the *.php files in the current directory and any subdirectories. On each file we find, we execute another command: “wc -l”. This returns the line count of that file. So this gives us a large list of numbers representing the line counts of all PHP files in the given directory (we don’t care about the actual filenames, just the line counts).
Next, we sort the numbers, so the list will be something like: 12 125 15 156 16 17 36 367 etc., where every number is on a separate line. Then, we cut away everything except the first character, since we aren’t really interested in the counts, but merely in the first digit of the counts. Now we have a large list of digits, starting with the 1’s and ending with the 9’s (since we sorted the counts first). Finally, we pipe that list into uniq -c, which counts how many 1’s there are, how many 2’s, how many 3’s, etc.
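If the pipeline is hard to follow, the same steps can be sketched in Python. This is not the command used for the results in this post, just a rough equivalent; the function name `first_digit_counts` is made up for illustration. One deliberate difference: files with zero lines are skipped, since 0 is not a leading digit in Benford’s law, whereas the shell pipeline would put them in a ‘0’ box.

```python
from collections import Counter
from pathlib import Path


def first_digit_counts(root: str) -> Counter:
    """Tally the first digit of the line count of every *.php file under root."""
    counts = Counter()
    for path in Path(root).rglob("*.php"):
        if not path.is_file():
            continue
        # Count newline characters, which is what `wc -l` does.
        line_count = path.read_bytes().count(b"\n")
        if line_count > 0:
            # Keep only the first digit, like `cut -b 1` in the pipeline.
            counts[str(line_count)[0]] += 1
    return counts
```

Calling `first_digit_counts(".")` from the root of a framework and printing the sorted items should give roughly the same table as `uniq -c` does.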
As it turns out, this is the result when running this command on Symfony 2.8:
1073 1
886 2
636 3
372 4
352 5
350 6
307 7
247 8
222 9
Even without the actual Benford numbers, you can already see that the counts scale down pretty smoothly: the digit 1 has the highest count, then 2, 3, etc., with 9 being the least common digit.
This is really interesting: it seems that Symfony2 follows Benford’s law, even though there are some minor differences in the numbers 1 and 4. The rest are basically spot on.
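To put numbers on that comparison, the counts from the Symfony 2.8 run can be set next to the shares Benford’s law predicts. A small sketch, where the counts are copied from the output above and the helper name `benford_share` is made up for illustration:

```python
import math

# Files per first digit, copied from the Symfony 2.8 output above.
observed = {1: 1073, 2: 886, 3: 636, 4: 372, 5: 352,
            6: 350, 7: 307, 8: 247, 9: 222}
total = sum(observed.values())


def benford_share(digit: int) -> float:
    """Share of numbers expected to start with `digit` under Benford's law."""
    return math.log10(1 + 1 / digit)


# Show the observed share next to the expected share for each digit.
print("digit  observed  expected")
for digit, count in observed.items():
    print(f"{digit:>5}  {count / total:8.1%}  {benford_share(digit):8.1%}")
```

Printing both columns side by side makes it easier to judge per digit how close the match really is.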
So the next thing to do is to check some other common frameworks, and the results were actually amazing:
In fact, pretty much every framework follows Benford’s law so closely that it’s very hard to actually see the original Benford line in the graph :-) The one deviating the most from the pattern seems to be Zend Framework 2. (No, this does not imply a fraudulent framework :p)
It’s fun to see that you can come up with something as completely and utterly irrelevant as line counts in frameworks and still find Benford’s law holding up. You could, for instance, use Benford’s law to figure out whether numbers are generated artificially (through a random number generator, for instance) or are “naturally” occurring numbers. When generated through a generator, the numbers should be evenly distributed, and when you are randomizing between 0 and 9999, the first digit should be evenly distributed as well (the graph should show a horizontal line instead of a downward-sloping one).
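The random number claim is easy to try out. Below is a rough simulation, assuming Python’s built-in `random` module as the generator; the sample size, the seed and the function name `first_digit_shares` are arbitrary choices for illustration. The number 0 is skipped because it has no first digit between 1 and 9.

```python
import random
from collections import Counter


def first_digit_shares(sample_size: int = 100_000, upper: int = 9999,
                       seed: int = 42) -> dict:
    """Share of each first digit among uniform random integers in 0..upper."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter()
    for _ in range(sample_size):
        number = rng.randint(0, upper)
        if number > 0:  # 0 has no leading digit between 1 and 9
            counts[int(str(number)[0])] += 1
    total = sum(counts.values())
    return {digit: counts[digit] / total for digit in range(1, 10)}


for digit, share in first_digit_shares().items():
    print(f"{digit}: {share:.1%}")
```

Each digit should come out close to 1/9 (about 11%), which is the horizontal line, instead of the sloping curve that Benford’s law predicts.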