benford_distance

benford_distance(values)

Description

Calculate the distance from Benford's Law for given values.

As described on Wikipedia, Benford's law is the observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time.

This function computes the Chi-square distance between the observed distribution of leading digits of values and the expected distribution according to Benford's Law.

The smaller the benford_distance, the more closely the values follow Benford's Law.
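To make the computation concrete, here is a minimal Python sketch of a Chi-square distance between observed leading-digit proportions and the Benford proportions log10(1 + 1/d). It is an illustration written outside BigQuery, not the deployed function's source: the names `leading_digit` and `benford_distance_sketch`, the use of proportions rather than raw counts, and the handling of zeros and signs are all assumptions.

```python
import math
from collections import Counter

# Expected share of each leading digit d under Benford's Law: log10(1 + 1/d).
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """Return the first non-zero digit of a finite number, or None for zero."""
    # Assumption: the sign is ignored and zero has no leading digit.
    digits = f"{abs(value):.15e}"  # scientific notation, e.g. '3.400...e-03'
    first = int(digits[0])
    return first if first != 0 else None


def benford_distance_sketch(values):
    """Chi-square distance between observed digit shares and Benford's shares."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        return None  # assumption: no usable values means no distance
    counts = Counter(digits)
    total = len(digits)
    # Sum over digits 1..9 of (observed - expected)^2 / expected.
    return sum(
        (counts.get(d, 0) / total - expected) ** 2 / expected
        for d, expected in BENFORD.items()
    )


print(round(benford_distance_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9]), 2))
```

Under these assumptions, the uniform input 1 through 9 gives a distance of about 0.40, and an input made only of 1s gives about 2.32.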

Read "The Mysterious Benford’s Law and it’s Connection with Fraud Detection" by Vihasharma to see some applications of this function.

Usage

Call or Deploy benford_distance?

Call benford_distance directly

The easiest way to use bigfunctions

The benford_distance function is deployed in 39 public datasets, one for each of the 39 BigQuery regions.
Anyone can call it. Just copy and paste the examples below into your BigQuery console. It just works!
(Use the dataset in the same region as your own datasets; otherwise you may get a "function not found" error.)

Public BigFunctions Datasets

Region	Dataset
`eu`	`bigfunctions.eu`
`us`	`bigfunctions.us`
`europe-west1`	`bigfunctions.europe_west1`
`asia-east1`	`bigfunctions.asia_east1`
...	...

Deploy benford_distance in your project

Why deploy?

You may prefer to deploy benford_distance in your own project to build and manage your own catalog of functions.
This is particularly useful if you want to create private functions (for example calling your internal APIs).
Get started by reading the framework page.

Deployment

The benford_distance function can be deployed with:

pip install bigfunctions
bigfun get benford_distance
bigfun deploy benford_distance

Examples

1. Uniformly distributed values do not follow Benford's Law

One query per region (EU, US, europe-west1):

select bigfunctions.eu.benford_distance([1, 2, 3, 4, 5, 6, 7, 8, 9])

select bigfunctions.us.benford_distance([1, 2, 3, 4, 5, 6, 7, 8, 9])

select bigfunctions.europe_west1.benford_distance([1, 2, 3, 4, 5, 6, 7, 8, 9])

+------------------+
| benford_distance |
+------------------+
| 0.4              |
+------------------+

2. Values with more small leading digits follow Benford's Law more closely, so the distance is lower

One query per region (EU, US, europe-west1):

select bigfunctions.eu.benford_distance([1, 1, 1, 2, 2, 3, 4, 5, 6])

select bigfunctions.us.benford_distance([1, 1, 1, 2, 2, 3, 4, 5, 6])

select bigfunctions.europe_west1.benford_distance([1, 1, 1, 2, 2, 3, 4, 5, 6])

+------------------+
| benford_distance |
+------------------+
| 0.2              |
+------------------+

3. Constant values follow Benford's Law less closely than uniformly distributed values, so the distance is higher

One query per region (EU, US, europe-west1):

select bigfunctions.eu.benford_distance([1, 1, 1, 1, 1, 1, 1, 1, 1])

select bigfunctions.us.benford_distance([1, 1, 1, 1, 1, 1, 1, 1, 1])

select bigfunctions.europe_west1.benford_distance([1, 1, 1, 1, 1, 1, 1, 1, 1])

+------------------+
| benford_distance |
+------------------+
| 2.3              |
+------------------+

4. Constant values with a high leading digit are even worse, so the distance is much higher

One query per region (EU, US, europe-west1):

select bigfunctions.eu.benford_distance([9, 9, 9, 9, 9, 9, 9, 9, 9])

select bigfunctions.us.benford_distance([9, 9, 9, 9, 9, 9, 9, 9, 9])

select bigfunctions.europe_west1.benford_distance([9, 9, 9, 9, 9, 9, 9, 9, 9])

+------------------+
| benford_distance |
+------------------+
| 20.7             |
+------------------+

Use cases

A common use case for the benford_distance function is fraud detection. Benford's Law states that in many naturally occurring datasets, the leading digit 1 appears with a probability of about 30%, followed by 2 at about 18%, and so on, with 9 being the least frequent leading digit. Datasets that deviate significantly from this distribution can be a red flag for manipulation or fabrication.

Here's a practical example in the context of financial transactions:

Scenario: You're an auditor examining a company's expense reports. You suspect some employees might be submitting fraudulent claims.

How to use benford_distance:

Apply the function: Call benford_distance on each employee's array of expense amounts; the function works on the leading digits of those values. A higher distance suggests a greater deviation from Benford's Law.
Investigate outliers: Focus your investigation on the employees with the highest benford_distance scores. Their expense amounts deviate most from Benford's Law and are the most likely to contain fabricated numbers.

Example BigQuery SQL:

SELECT
    employee_id,
    bigfunctions.us.benford_distance(array_agg(expense_amount)) AS benford_distance
FROM `your_project.your_dataset.expense_reports`
GROUP BY employee_id
ORDER BY benford_distance DESC;

This query groups expenses by employee and calculates the benford_distance for each employee's expense amounts. Ordering by benford_distance descending allows you to quickly identify employees with suspicious expense patterns.
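The same group-then-rank logic can be sketched in Python for readers who want to try it outside BigQuery. This is only an illustration: `chi_square_to_benford` is a hypothetical stand-in for benford_distance (a Chi-square distance on leading-digit proportions, which is an assumption about the deployed function), `rank_by_distance` is a hypothetical helper, and the sample rows are made up.

```python
import math
from collections import Counter, defaultdict


def chi_square_to_benford(amounts):
    """Hypothetical stand-in for benford_distance (not the BigQuery code).

    Expects a non-empty list of non-zero, finite numbers.
    """
    # Leading digit read from scientific notation, sign ignored.
    digits = [int(f"{abs(a):.15e}"[0]) for a in amounts]
    counts = Counter(digits)
    total = len(digits)
    distance = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)  # Benford share of digit d
        observed = counts.get(d, 0) / total
        distance += (observed - expected) ** 2 / expected
    return distance


def rank_by_distance(rows):
    """Group (employee_id, amount) rows and sort by distance, highest first."""
    by_employee = defaultdict(list)
    for employee_id, amount in rows:
        by_employee[employee_id].append(amount)
    scores = {}
    for employee_id, amounts in by_employee.items():
        usable = [a for a in amounts if a != 0]  # zero has no leading digit
        if usable:
            scores[employee_id] = chi_square_to_benford(usable)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows, for illustration only.
rows = [
    ("alice", 120.5), ("alice", 18.0), ("alice", 1430.0), ("alice", 260.0),
    ("bob", 95.0), ("bob", 99.0), ("bob", 90.0), ("bob", 98.5),
]
for employee_id, distance in rank_by_distance(rows):
    print(employee_id, round(distance, 2))
```

In this made-up sample, every amount for "bob" starts with 9, so that employee ranks first, mirroring the ORDER BY benford_distance DESC in the query.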

Other use cases:

Election fraud detection: Analyzing vote counts for adherence to Benford's Law.
Scientific data validation: Checking the integrity of experimental measurements.
Financial market analysis: Identifying potential market manipulation or anomalies in stock prices.
Accounting and auditing: Detecting inconsistencies or fabricated data in financial statements.

By measuring the deviation from Benford's Law, the benford_distance function provides a valuable tool for identifying potentially fraudulent or manipulated data in a variety of applications.

Need help or Found a bug?

Get help using benford_distance

The community can help! Join the conversation on Slack.

We also provide professional support.

Report a bug about benford_distance

If the function does not work as expected, please report a bug so that it can be improved, or open a discussion with the community on Slack.