deduplicate_rows
Call or Deploy deduplicate_rows
✅ You can call this deduplicate_rows bigfunction directly from your Google Cloud Project (no install required).
- This deduplicate_rows function is deployed in the bigfunctions GCP project, in 39 datasets, one for each of the 39 BigQuery regions. Use the dataset in the same region as your own datasets (otherwise you may get a "function not found" error).
- The function is public, so anyone can call it. Just copy and paste the examples below into your BigQuery console. It just works!
- You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions (read Getting Started). This is particularly useful if you want to create private functions, for example ones that call your internal APIs.
- For any questions or difficulties, please read Getting Started.
- Found a bug? Please raise an issue here
Public BigFunctions datasets are named as follows:
Region | Dataset |
---|---|
eu | bigfunctions.eu |
us | bigfunctions.us |
europe-west1 | bigfunctions.europe_west1 |
asia-east1 | bigfunctions.asia_east1 |
... | ... |
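The region-to-dataset pattern in the table above can be sketched as a small helper. This is only an illustration: it assumes that the naming rule visible in the listed rows (lowercase region name, hyphens replaced by underscores) also holds for the regions hidden behind `...`, and `dataset_for_region` is a hypothetical name, not part of BigFunctions.

```python
def dataset_for_region(region: str) -> str:
    """Return the public BigFunctions dataset name for a BigQuery region.

    Assumption: the dataset is the region name in lowercase, with hyphens
    replaced by underscores, inside the `bigfunctions` project.
    """
    cleaned = region.strip().lower().replace("-", "_")
    if not cleaned:
        raise ValueError("region must not be empty")
    return f"bigfunctions.{cleaned}"


if __name__ == "__main__":
    # Print the mapping for the four regions listed in the table.
    for region in ("eu", "us", "europe-west1", "asia-east1"):
        print(region, "->", dataset_for_region(region))
```

If a call fails with a "function not found" error, comparing the dataset prefix against the location of your own tables is a reasonable first check.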
Description
Signature
deduplicate_rows(query_or_table_or_view)
Description
Returns the deduplicated rows of query_or_table_or_view. The result is written to a temporary table named bigfunction_result.
Examples
1. Returns table with duplicate rows removed.
call bigfunctions.eu.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
+-----+-----+
| id1 | id2 |
+-----+-----+
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 4 | 3 |
| 6 | 3 |
| 7 | 3 |
| 8 | 9 |
| 9 | 9 |
+-----+-----+
2. When an incorrect table name is passed as the argument, the result contains the error message.
call bigfunctions.eu.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| f0_ |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Not found: Table my_project:my_dataset.my_tbl was not found in location US at [my_project:my_dataset.deduplicate_rows:2:13] |
+-------------------------------------------------------------------------------------------------------------------------------------------+
3. When a query is passed into the procedure.
call bigfunctions.eu.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
+------+
| data |
+------+
| 1 |
| 2 |
| 3 |
+------+
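To make the expected result of example 3 concrete without a BigQuery project, here is a local stand-in. It uses Python's built-in `sqlite3` and `SELECT DISTINCT`, which is an assumption about the semantics (rows that are identical in every column collapse into one). It is not the procedure's actual implementation, and `deduplicate_locally` is a hypothetical helper.

```python
import sqlite3


def deduplicate_locally(rows, columns):
    """Load rows into an in-memory SQLite table and return the distinct rows."""
    conn = sqlite3.connect(":memory:")
    try:
        column_list = ", ".join(f'"{c}"' for c in columns)
        placeholders = ", ".join("?" for _ in columns)
        conn.execute(f"CREATE TABLE src ({column_list})")
        conn.executemany(f"INSERT INTO src VALUES ({placeholders})", rows)
        # ORDER BY only makes the output deterministic for inspection.
        cursor = conn.execute(
            f"SELECT DISTINCT {column_list} FROM src ORDER BY {column_list}"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Same input as example 3: the value 1 appears twice.
    print(deduplicate_locally([(1,), (2,), (3,), (1,)], ["data"]))
    # prints [(1,), (2,), (3,)]
```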
Use cases
Let's say you have a table of customer transactions where accidental duplicates might occur. You want to analyze the data accurately, so you need to remove those duplicates.
Scenario:
Your table customer_transactions in dataset my_dataset in project my_project looks like this:
transaction_id | customer_id | amount | date |
---|---|---|---|
1 | 101 | 10.00 | 2024-03-08 |
2 | 102 | 25.50 | 2024-03-08 |
1 | 101 | 10.00 | 2024-03-08 |
4 | 103 | 50.00 | 2024-03-09 |
5 | 102 | 12.00 | 2024-03-09 |
1 | 101 | 10.00 | 2024-03-08 |
Use Case with deduplicate_rows:
You can use the deduplicate_rows function to remove the duplicate transactions:
CALL bigfunctions.us.deduplicate_rows("my_project.my_dataset.customer_transactions");
SELECT * FROM bigfunction_result;
This creates a temporary table bigfunction_result containing the deduplicated rows:
transaction_id | customer_id | amount | date |
---|---|---|---|
1 | 101 | 10.00 | 2024-03-08 |
2 | 102 | 25.50 | 2024-03-08 |
4 | 103 | 50.00 | 2024-03-09 |
5 | 102 | 12.00 | 2024-03-09 |
Benefits:
- Simplicity: Easily deduplicate rows without complex SQL queries.
- Efficiency: The deduplication runs inside BigQuery, so it scales to large tables without moving data elsewhere.
- Flexibility: Works with both tables and query results, allowing you to deduplicate data from various sources.
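Other Use Cases:
- Deduplicating product catalogs where the same product row was loaded more than once.
- Removing duplicate entries in user registration data.
- Cleaning up sensor data where the same reading was recorded several times for the same timestamp.
- Removing duplicate records from log files.
In all of these cases, the function only takes a table, view, or query as input, so expect it to remove rows that are identical in every column. Rows that differ in any value, such as slightly different descriptions, are kept.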
Remember to replace bigfunctions.us with the appropriate dataset for your BigQuery region. You can also create a new table from bigfunction_result if you want to store the deduplicated data permanently.
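The two steps described above (call the procedure, then copy bigfunction_result into a permanent table) can be sketched as a small script builder. This is a minimal illustration under stated assumptions: `build_dedup_script` and the destination table name are hypothetical, and the generated text is meant to be pasted into the BigQuery console or submitted as a single multi-statement script, so that the temporary table is still visible to the second statement.

```python
def build_dedup_script(dataset: str, source: str, destination: str) -> str:
    """Build a BigQuery script that deduplicates `source` and stores the result.

    dataset:     public BigFunctions dataset for your region, e.g. "bigfunctions.us"
    source:      fully qualified table or view name
    destination: fully qualified name of the table to create (hypothetical here)
    """
    for name in (source, destination):
        # Keep the sketch safe: refuse characters that would break the quoting below.
        if '"' in name or "`" in name:
            raise ValueError(f"unsupported character in {name!r}")
    return (
        f'call {dataset}.deduplicate_rows("{source}");\n'
        f"create or replace table `{destination}` as\n"
        f"select * from bigfunction_result;"
    )


if __name__ == "__main__":
    print(
        build_dedup_script(
            "bigfunctions.us",
            "my_project.my_dataset.customer_transactions",
            "my_project.my_dataset.customer_transactions_dedup",  # hypothetical name
        )
    )
```

Writing to a new destination table rather than overwriting the source keeps the original data available for comparison.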