deidentify

Call or Deploy deidentify?

✅ You can call this deidentify bigfunction directly from your Google Cloud Project (no install required).

  • This deidentify function is deployed in the bigfunctions GCP project in 39 datasets, one for each of the 39 BigQuery regions. Use the dataset located in the same region as your own datasets (otherwise you may get a function-not-found error).
  • The function is public, so anyone can call it. Just copy and paste the examples below into your BigQuery console. It just works!
  • You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions. This is particularly useful if you want to create private functions (for example, functions that call your internal APIs).

Public BigFunctions Datasets:

Region        Dataset
eu            bigfunctions.eu
us            bigfunctions.us
europe-west1  bigfunctions.europe_west1
asia-east1    bigfunctions.asia_east1
...           ...

Description

Signature

deidentify(text, info_types)

Description

Masks sensitive information of the types listed in info_types wherever it appears in text, using Cloud Data Loss Prevention.

Param Possible values (one value, or any combination of the following values separated by commas)
info_types ADVERTISING_ID, AGE, AUTH_TOKEN, AWS_CREDENTIALS, AZURE_AUTH_TOKEN, BASIC_AUTH_HEADER, CREDIT_CARD_NUMBER, CREDIT_CARD_TRACK_NUMBER, DATE, DATE_OF_BIRTH, DOMAIN_NAME, EMAIL_ADDRESS, ENCRYPTION_KEY, ETHNIC_GROUP, FEMALE_NAME, FIRST_NAME, GCP_API_KEY, GCP_CREDENTIALS, GENDER, GENERIC_ID, HTTP_COOKIE, IBAN_CODE, ICCID_NUMBER, ICD10_CODE, ICD9_CODE, IMEI_HARDWARE_ID, IMSI_ID, IP_ADDRESS, JSON_WEB_TOKEN, LAST_NAME, LOCATION, LOCATION_COORDINATES, MAC_ADDRESS, MAC_ADDRESS_LOCAL, MALE_NAME, MARITAL_STATUS, MEDICAL_RECORD_NUMBER, MEDICAL_TERM, OAUTH_CLIENT_SECRET, ORGANIZATION_NAME, PASSPORT, PASSWORD, PERSON_NAME, PHONE_NUMBER, SSL_CERTIFICATE, STORAGE_SIGNED_POLICY_DOCUMENT, STORAGE_SIGNED_URL, STREET_ADDRESS, SWIFT_CODE, TIME, URL, VAT_NUMBER, VEHICLE_IDENTIFICATION_NUMBER, WEAK_PASSWORD_HASH, XSRF_TOKEN
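
To illustrate passing a combination of info types, here is a minimal sketch that applies deidentify to several rows at once. The sample rows, the id and note column names, and the example.com address are hypothetical and exist only for illustration. The query uses the eu dataset; swap in the dataset that matches your region.

```sql
-- Hypothetical inline sample rows (not part of the original documentation)
WITH samples AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS id, 'Reach me at jane@example.com' AS note),
    STRUCT(2 AS id, 'Call 0123456789 after 5pm' AS note)
  ])
)
SELECT
  id,
  -- Request masking of two info types in a single comma-separated string
  bigfunctions.eu.deidentify(note, 'EMAIL_ADDRESS, PHONE_NUMBER') AS masked_note
FROM samples;
```

Only the info types named in the second argument are requested here, so other kinds of sensitive data in note would presumably be left untouched.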

Examples

1. String with email in it.

select bigfunctions.eu.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
select bigfunctions.us.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
select bigfunctions.europe_west1.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
+-----------------------------+
| masked_info                 |
+-----------------------------+
| My email is [EMAIL_ADDRESS] |
+-----------------------------+

2. String with phone number in it.

select bigfunctions.eu.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
select bigfunctions.us.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
select bigfunctions.europe_west1.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
+-----------------------------------+
| masked_info                       |
+-----------------------------------+
| My phone number is [PHONE_NUMBER] |
+-----------------------------------+

3. If info_types is null or empty, all built-in info types may be used

select bigfunctions.eu.deidentify("My email is shivam@google.co.in", null)
select bigfunctions.us.deidentify("My email is shivam@google.co.in", null)
select bigfunctions.europe_west1.deidentify("My email is shivam@google.co.in", null)
+------------------------------------------+
| masked_info                              |
+------------------------------------------+
| My email is [PERSON_NAME][EMAIL_ADDRESS] |
+------------------------------------------+

Need help using deidentify?

The community can help! Join the conversation on Slack.

For professional support, don't hesitate to chat with us.

Found a bug using deidentify?

If the function does not work as expected, please

  • report a bug so that it can be improved,
  • or open a discussion with the community on Slack.


Use cases

A customer support system stores chat transcripts including customer names, email addresses, phone numbers, and potentially credit card numbers if they make a purchase through the chat. Regulations like GDPR require protecting this sensitive information. The deidentify function can be used within BigQuery to anonymize this data for analysis or other purposes where the raw PII isn't required.

Scenario: A data analyst needs to analyze chat transcripts to understand common customer issues. They don't need the actual PII, just the context of the conversations.

Implementation:

  1. Data Storage: Chat transcripts are stored in a BigQuery table with columns like chat_id, customer_id, transcript. The transcript column contains the raw conversation text.

  2. De-identification Query: The analyst can use the deidentify function in a query to create an anonymized view of the data:

SELECT
    chat_id,
    customer_id,
    bigfunctions.us.deidentify(transcript, 'PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER,CREDIT_CARD_NUMBER') AS anonymized_transcript
FROM
    `project.dataset.chat_transcripts`;

This query replaces identifiable information within the transcript column with generic markers such as [PERSON_NAME] and [EMAIL_ADDRESS].

  3. Analysis: The analyst can then perform their analysis on the anonymized view, preserving customer privacy while still gaining insights from the conversation data. For example, they could use natural language processing to identify common themes or topics of discussion.
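
The steps above could be sketched as follows. This is one possible approach, not the only one: it stores the masked transcripts in a table so that later analysis queries do not need to call the function again, then runs a simple follow-up query on the result. The destination table name chat_transcripts_anonymized and the masked_email_count column are hypothetical.

```sql
-- Assumption: `project.dataset.chat_transcripts_anonymized` is a hypothetical
-- destination table; replace it with a table in your own project.
CREATE OR REPLACE TABLE `project.dataset.chat_transcripts_anonymized` AS
SELECT
    chat_id,
    customer_id,
    bigfunctions.us.deidentify(transcript, 'PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER,CREDIT_CARD_NUMBER') AS anonymized_transcript
FROM
    `project.dataset.chat_transcripts`;

-- Example analysis: count how many email markers appear in each chat
SELECT
    chat_id,
    ARRAY_LENGTH(REGEXP_EXTRACT_ALL(anonymized_transcript, r'\[EMAIL_ADDRESS\]')) AS masked_email_count
FROM
    `project.dataset.chat_transcripts_anonymized`
ORDER BY
    masked_email_count DESC;
```

Analysts can then be given access to the anonymized table only, while the raw chat_transcripts table stays restricted.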

Benefits:

  • Compliance: Helps meet data privacy regulations such as GDPR by masking sensitive information.
  • Simplified Analysis: Enables analysis without risking exposure of PII.
  • Flexibility: Allows specifying the types of information to mask, providing granular control over the de-identification process.
  • Data Utility: Preserves the context and content of the conversations, allowing for meaningful analysis even after removing PII.

Spread the word

BigFunctions is fully open-source. Help make it a success by spreading the word!
