Skip to content

deduplicate_rows

deduplicate_rows(query_or_table_or_view)

Description

Returns the deduplicated rows of query_or_table_or_view

Examples

Call or Deploy deduplicate_rows ?
Call deduplicate_rows directly

The easiest way to use bigfunctions

  • deduplicate_rows function is deployed in 39 public datasets for all of the 39 BigQuery regions.
  • It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
  • (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)

Public BigFunctions Datasets

Region Dataset
eu bigfunctions.eu
us bigfunctions.us
europe-west1 bigfunctions.europe_west1
asia-east1 bigfunctions.asia_east1
... ...
Deploy deduplicate_rows in your project

Why deploy?

  • You may prefer to deploy deduplicate_rows in your own project to build and manage your own catalog of functions.
  • This is particularly useful if you want to create private functions (for example calling your internal APIs).
  • Get started by reading the framework page

Deployment

deduplicate_rows function can be deployed with:

pip install bigfunctions
bigfun get deduplicate_rows
bigfun deploy deduplicate_rows

1. Returns table with duplicate rows removed.

call bigfunctions.eu.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("my_project.my_dataset.my_table");
select * from bigfunction_result;

+-----+-----+
| id1 | id2 |
+-----+-----+
| 1   | 2   |
| 1   | 3   |
| 2   | 3   |
| 4   | 3   |
| 6   | 3   |
| 7   | 3   |
| 8   | 9   |
| 9   | 9   |
+-----+-----+


2. When incorrect table name is passed as arguments.

call bigfunctions.eu.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("my_project.my_dataset.my_tbl");
select * from bigfunction_result;

+-------------------------------------------------------------------------------------------------------------------------------------------+
| f0_                                                                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Not found: Table my_project:my_dataset.my_tbl was not found in location US at [my_project:my_dataset.deduplicate_rows:2:13] |
+-------------------------------------------------------------------------------------------------------------------------------------------+


3. When a query is passed into the procedure.

call bigfunctions.eu.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
call bigfunctions.us.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;
call bigfunctions.europe_west1.deduplicate_rows("select data from unnest([1, 2, 3, 1]) data");
select * from bigfunction_result;

+------+
| data |
+------+
| 1    |
| 2    |
| 3    |
+------+


Need help or Found a bug using deduplicate_rows?
Get help using deduplicate_rows

The community can help! Engage the conversation on Slack

We also provide professional suppport.

Report a bug about deduplicate_rows

If the function does not work as expected, please

  • report a bug so that it can be improved.
  • or open the discussion with the community on Slack.

We also provide professional suppport.

Use cases

Let's say you have a table of customer transactions where accidental duplicates might occur. You want to analyze the data accurately, so you need to remove those duplicates.

Scenario:

Your table customer_transactions in dataset my_dataset in project my_project looks like this:

transaction_id customer_id amount date
1 101 10.00 2024-03-08
2 102 25.50 2024-03-08
3 101 10.00 2024-03-08
4 103 50.00 2024-03-09
5 102 12.00 2024-03-09
6 101 10.00 2024-03-08

Use Case with deduplicate_rows:

You can use the deduplicate_rows function to remove the duplicate transactions:

CALL bigfunctions.us.deduplicate_rows("my_project.my_dataset.customer_transactions");
SELECT * FROM bigfunction_result;

This will create a temporary table bigfunction_result containing the deduplicated rows:

transaction_id customer_id amount date
1 101 10.00 2024-03-08
2 102 25.50 2024-03-08
4 103 50.00 2024-03-09
5 102 12.00 2024-03-09

Benefits:

  • Simplicity: Easily deduplicate rows without complex SQL queries.
  • Efficiency: Leverages BigQuery's processing power for fast deduplication, even on large tables.
  • Flexibility: Works with both tables and query results, allowing you to deduplicate data from various sources.

Other Use Cases:

  • Deduplicating product catalogs with slight variations in descriptions.
  • Removing duplicate entries in user registration data.
  • Cleaning up sensor data where multiple readings might be recorded for the same timestamp.
  • Removing duplicate records from log files.

Remember to replace bigfunctions.us with the appropriate dataset for your BigQuery region. You can also create a new table from the bigfunction_result if you want to store the deduplicated data permanently.

Spread the word!

BigFunctions is fully open-source. Help make it a success by spreading the word!

Share on Add a on