deduplicate_rows¶
deduplicate_rows(query_or_table_or_view)
Description¶
Returns the deduplicated rows of query_or_table_or_view
Examples¶
Call or Deploy deduplicate_rows
?
Call deduplicate_rows
directly
The easiest way to use bigfunctions
deduplicate_rows
function is deployed in 39 public datasets for all of the 39 BigQuery regions.- It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
- (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)
Public BigFunctions Datasets
Region | Dataset |
---|---|
eu |
bigfunctions.eu |
us |
bigfunctions.us |
europe-west1 |
bigfunctions.europe_west1 |
asia-east1 |
bigfunctions.asia_east1 |
... | ... |
Deploy deduplicate_rows
in your project
Why deploy?
- You may prefer to deploy
deduplicate_rows
in your own project to build and manage your own catalog of functions. - This is particularly useful if you want to create private functions (for example calling your internal APIs).
- Get started by reading the framework page
Deployment
deduplicate_rows
function can be deployed with:
pip install bigfunctions
bigfun get deduplicate_rows
bigfun deploy deduplicate_rows
1. Returns table with duplicate rows removed.
call bigfunctions.eu.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
+-----+-----+
| id1 | id2 |
+-----+-----+
| 1 | 2 |
| 1 | 3 |
| 2 | 3 |
| 4 | 3 |
| 6 | 3 |
| 7 | 3 |
| 8 | 9 |
| 9 | 9 |
+-----+-----+
2. When incorrect table name is passed as arguments.
call bigfunctions.eu.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| f0_ |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Not found: Table my_project:my_dataset.my_tbl was not found in location US at [my_project:my_dataset.deduplicate_rows:2:13] |
+-------------------------------------------------------------------------------------------------------------------------------------------+
3. When a query is passed into the procedure.
call bigfunctions.eu.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
+------+
| data |
+------+
| 1 |
| 2 |
| 3 |
+------+
Need help or Found a bug using deduplicate_rows
?
Get help using deduplicate_rows
The community can help! Engage the conversation on Slack
We also provide professional suppport.
Report a bug about deduplicate_rows
If the function does not work as expected, please
- report a bug so that it can be improved.
- or open the discussion with the community on Slack.
We also provide professional suppport.
Use cases¶
Let's say you have a table of customer transactions where accidental duplicates might occur. You want to analyze the data accurately, so you need to remove those duplicates.
Scenario:
Your table customer_transactions
in dataset my_dataset
in project my_project
looks like this:
transaction_id | customer_id | amount | date |
---|---|---|---|
1 | 101 | 10.00 | 2024-03-08 |
2 | 102 | 25.50 | 2024-03-08 |
3 | 101 | 10.00 | 2024-03-08 |
4 | 103 | 50.00 | 2024-03-09 |
5 | 102 | 12.00 | 2024-03-09 |
6 | 101 | 10.00 | 2024-03-08 |
Use Case with deduplicate_rows
:
You can use the deduplicate_rows
function to remove the duplicate transactions:
CALL bigfunctions.us.deduplicate_rows("my_project.my_dataset.customer_transactions");
SELECT * FROM bigfunction_result;
This will create a temporary table bigfunction_result
containing the deduplicated rows:
transaction_id | customer_id | amount | date |
---|---|---|---|
1 | 101 | 10.00 | 2024-03-08 |
2 | 102 | 25.50 | 2024-03-08 |
4 | 103 | 50.00 | 2024-03-09 |
5 | 102 | 12.00 | 2024-03-09 |
Benefits:
- Simplicity: Easily deduplicate rows without complex SQL queries.
- Efficiency: Leverages BigQuery's processing power for fast deduplication, even on large tables.
- Flexibility: Works with both tables and query results, allowing you to deduplicate data from various sources.
Other Use Cases:
- Deduplicating product catalogs with slight variations in descriptions.
- Removing duplicate entries in user registration data.
- Cleaning up sensor data where multiple readings might be recorded for the same timestamp.
- Removing duplicate records from log files.
Remember to replace bigfunctions.us
with the appropriate dataset for your BigQuery region. You can also create a new table from the bigfunction_result
if you want to store the deduplicated data permanently.
Spread the word!¶
BigFunctions is fully open-source. Help make it a success by spreading the word!