benford_distance
Call or Deploy benford_distance?
✅ You can call this benford_distance bigfunction directly from your Google Cloud Project (no install required).
- This benford_distance function is deployed in the bigfunctions GCP project in 39 datasets, one for each of the 39 BigQuery regions. You need to use the dataset in the same region as your datasets (otherwise you may get a "function not found" error).
- The function is public, so it can be called by anyone. Just copy and paste the examples below into your BigQuery console. It just works!
- You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions. This is particularly useful if you want to create private functions (for example, calling your internal APIs). Discover the framework
Public BigFunctions Datasets:
Region | Dataset |
---|---|
eu | bigfunctions.eu |
us | bigfunctions.us |
europe-west1 | bigfunctions.europe_west1 |
asia-east1 | bigfunctions.asia_east1 |
... | ... |
Description
Signature
benford_distance(values)
Description
Calculate the distance from Benford's Law for the given values.
As described on Wikipedia, Benford's Law is the observation that in many real-life sets of numerical data, the leading digit is likely to be small. In sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time.
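The percentages quoted above come from the standard statement of Benford's Law, under which digit d leads with probability log10(1 + 1/d). The short Python sketch below tabulates that distribution; the variable name `benford_expected` is chosen here for illustration and is not part of BigFunctions.

```python
import math

# Expected share of each leading digit d under Benford's Law: log10(1 + 1/d).
benford_expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the distribution as percentages, one digit per line.
for digit, share in benford_expected.items():
    print(f"{digit}: {share:.1%}")
```

Digit 1 comes out at about 30.1% and digit 9 at about 4.6%, and the nine shares sum to 1.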
This function computes the Chi-square distance between the observed distribution of leading digits of values and the expected distribution according to Benford's Law.
The smaller the benford_distance, the more closely the values follow Benford's Law.
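To make the computation concrete, here is a minimal local sketch in Python. It assumes the distance is the sum over digits 1 to 9 of (observed share - expected share)² / expected share, with shares expressed as proportions. The deployed BigQuery function may differ in details such as rounding or how it treats zeros and negative numbers. The names `leading_digit` and `benford_distance_sketch` are hypothetical helpers, not part of BigFunctions.

```python
import math
from collections import Counter


def leading_digit(value):
    """Return the first significant digit of a number, or None for zero."""
    if value == 0:
        return None
    # Scientific notation puts the first significant digit in position 0.
    return int(f"{abs(value):.15e}"[0])


def benford_distance_sketch(values):
    """Chi-square distance between observed and Benford leading-digit shares.

    Assumption: shares are proportions (not counts) and zeros are ignored.
    Returns None when there is no non-zero value to analyse.
    """
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        return None
    counts = Counter(digits)
    total = len(digits)
    distance = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = counts.get(d, 0) / total
        distance += (observed - expected) ** 2 / expected
    return distance


for sample in ([1, 2, 3, 4, 5, 6, 7, 8, 9],
               [1, 1, 1, 2, 2, 3, 4, 5, 6],
               [1] * 9,
               [9] * 9):
    print(sample, round(benford_distance_sketch(sample), 2))
```

Under these assumptions the sketch gives roughly 0.40, 0.22, 2.32 and 20.85 for the four inputs used in the examples on this page. The first three agree with the documented results after rounding. The last one is slightly above the documented 20.7, which suggests the deployed function differs from this sketch in some detail.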
Read "The Mysterious Benford’s Law and it’s Connection with Fraud Detection" by Vihasharma to see some applications of this function.
Examples
1. Uniformly distributed values do not follow Benford's Law
select bigfunctions.eu.benford_distance([1, 2, 3, 4, 5, 6, 7, 8, 9])
select bigfunctions.us.benford_distance([1, 2, 3, 4, 5, 6, 7, 8, 9])
select bigfunctions.europe_west1.benford_distance([1, 2, 3, 4, 5, 6, 7, 8, 9])
+------------------+
| benford_distance |
+------------------+
| 0.4              |
+------------------+
2. Values with more small leading digits follow Benford's Law more closely, so the distance is lower
select bigfunctions.eu.benford_distance([1, 1, 1, 2, 2, 3, 4, 5, 6])
select bigfunctions.us.benford_distance([1, 1, 1, 2, 2, 3, 4, 5, 6])
select bigfunctions.europe_west1.benford_distance([1, 1, 1, 2, 2, 3, 4, 5, 6])
+------------------+
| benford_distance |
+------------------+
| 0.2              |
+------------------+
3. Constant values follow Benford's Law less closely than uniform values, so the distance is higher
select bigfunctions.eu.benford_distance([1, 1, 1, 1, 1, 1, 1, 1, 1])
select bigfunctions.us.benford_distance([1, 1, 1, 1, 1, 1, 1, 1, 1])
select bigfunctions.europe_west1.benford_distance([1, 1, 1, 1, 1, 1, 1, 1, 1])
+------------------+
| benford_distance |
+------------------+
| 2.3              |
+------------------+
4. Constant values with a high leading digit are worse still, so the distance is much higher
select bigfunctions.eu.benford_distance([9, 9, 9, 9, 9, 9, 9, 9, 9])
select bigfunctions.us.benford_distance([9, 9, 9, 9, 9, 9, 9, 9, 9])
select bigfunctions.europe_west1.benford_distance([9, 9, 9, 9, 9, 9, 9, 9, 9])
+------------------+
| benford_distance |
+------------------+
| 20.7             |
+------------------+
Use cases
A common use case for the benford_distance function is fraud detection. Benford's Law states that in many naturally occurring datasets, the leading digit 1 appears with a probability of about 30%, followed by 2 at about 18%, and so on, with 9 being the least frequent leading digit. Datasets that deviate significantly from this distribution can be a red flag for manipulation or fabrication.
Here's a practical example in the context of financial transactions:
Scenario: You're an auditor examining a company's expense reports. You suspect some employees might be submitting fraudulent claims.
How to use benford_distance:
- Apply the function: Use the benford_distance function on each employee's array of expense amounts. The function compares the distribution of their leading digits with Benford's Law, and a higher distance suggests a greater deviation from it.
- Investigate outliers: Focus your investigation on the employees with the highest benford_distance scores. Their expense reports are the most likely to contain fabricated numbers.
Example BigQuery SQL:
SELECT
employee_id,
bigfunctions.us.benford_distance(array_agg(expense_amount)) AS benford_distance
FROM `your_project.your_dataset.expense_reports`
GROUP BY employee_id
ORDER BY benford_distance DESC;
This query groups expenses by employee and calculates the benford_distance for each employee's expense amounts. Ordering by benford_distance descending allows you to quickly identify employees with suspicious expense patterns.
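The same group-then-rank logic can be tried locally before running the query. The Python sketch below uses made-up rows and a hypothetical `chi_square_distance` helper that approximates the function under the assumption that it compares leading-digit proportions with log10(1 + 1/d). It is an illustration of the workflow, not the BigFunctions implementation.

```python
import math
from collections import defaultdict

# Hypothetical (employee_id, expense_amount) rows; not real data.
rows = [
    ("alice", 120.50), ("alice", 18.00), ("alice", 240.10),
    ("alice", 1300.00), ("alice", 37.25), ("alice", 15.90),
    ("bob", 95.00), ("bob", 98.50), ("bob", 99.00),
    ("bob", 92.75), ("bob", 97.10), ("bob", 94.40),
]


def chi_square_distance(amounts):
    """Approximate distance from Benford's Law (assumes a non-zero amount exists)."""
    # First significant digit of each non-zero amount, via scientific notation.
    first_digits = [int(f"{abs(a):.15e}"[0]) for a in amounts if a != 0]
    distance = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = first_digits.count(d) / len(first_digits)
        distance += (observed - expected) ** 2 / expected
    return distance


# Equivalent of GROUP BY employee_id with array_agg(expense_amount).
amounts_by_employee = defaultdict(list)
for employee_id, amount in rows:
    amounts_by_employee[employee_id].append(amount)

# Equivalent of ORDER BY benford_distance DESC.
ranking = sorted(
    ((chi_square_distance(amounts), employee_id)
     for employee_id, amounts in amounts_by_employee.items()),
    reverse=True,
)

for distance, employee_id in ranking:
    print(f"{employee_id}: {distance:.2f}")
```

In this toy data every one of bob's amounts starts with 9, so bob ranks first with a much larger distance than alice, whose amounts mostly start with 1, 2 or 3.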
Other use cases:
- Election fraud detection: Analyzing vote counts for adherence to Benford's Law.
- Scientific data validation: Checking the integrity of experimental measurements.
- Financial market analysis: Identifying potential market manipulation or anomalies in stock prices.
- Accounting and auditing: Detecting inconsistencies or fabricated data in financial statements.
By measuring the deviation from Benford's Law, the benford_distance function provides a valuable tool for identifying potentially fraudulent or manipulated data in a variety of applications.
Spread the word
BigFunctions is fully open-source. Help make it a success by spreading the word!