deidentify
Call or Deploy deidentify?
✅ You can call this deidentify bigfunction directly from your Google Cloud Project (no install required).
- This deidentify function is deployed in the bigfunctions GCP project in 39 datasets, one for each of the 39 BigQuery regions. Use the dataset in the same region as your own datasets (otherwise you may get a "function not found" error).
- The function is public, so anyone can call it. Copy and paste the examples below into your BigQuery console.
- You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions. This is particularly useful if you want to create private functions (for example, functions that call your internal APIs). Discover the framework
Public BigFunctions Datasets:
Region | Dataset |
---|---|
eu | bigfunctions.eu |
us | bigfunctions.us |
europe-west1 | bigfunctions.europe_west1 |
asia-east1 | bigfunctions.asia_east1 |
... | ... |
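The rows listed above suggest that each dataset name is the region name with hyphens replaced by underscores. A minimal Python sketch of that mapping follows. The helper name `dataset_for_region` is hypothetical, and applying the rule to regions not shown in the table is an assumption rather than something this page confirms.

```python
def dataset_for_region(region: str) -> str:
    """Return the public BigFunctions dataset for a BigQuery region.

    Assumption: the naming rule seen in the listed rows (hyphens become
    underscores) also holds for the regions elided from the table.
    """
    normalized = region.strip().lower().replace("-", "_")
    return f"bigfunctions.{normalized}"


if __name__ == "__main__":
    for region in ("eu", "us", "europe-west1", "asia-east1"):
        print(region, "->", dataset_for_region(region))
```

If your own datasets live in `europe-west1`, for example, this gives `bigfunctions.europe_west1`, which matches the table.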
Description
Signature
deidentify(text, info_types)
Description
Masks sensitive information of type info_types in text using Cloud Data Loss Prevention.
Param | Possible values (one value, or any combination of the following values separated by commas) |
---|---|
info_types | ADVERTISING_ID, AGE, AUTH_TOKEN, AWS_CREDENTIALS, AZURE_AUTH_TOKEN, BASIC_AUTH_HEADER, CREDIT_CARD_NUMBER, CREDIT_CARD_TRACK_NUMBER, DATE, DATE_OF_BIRTH, DOMAIN_NAME, EMAIL_ADDRESS, ENCRYPTION_KEY, ETHNIC_GROUP, FEMALE_NAME, FIRST_NAME, GCP_API_KEY, GCP_CREDENTIALS, GENDER, GENERIC_ID, HTTP_COOKIE, IBAN_CODE, ICCID_NUMBER, ICD10_CODE, ICD9_CODE, IMEI_HARDWARE_ID, IMSI_ID, IP_ADDRESS, JSON_WEB_TOKEN, LAST_NAME, LOCATION, LOCATION_COORDINATES, MAC_ADDRESS, MAC_ADDRESS_LOCAL, MALE_NAME, MARITAL_STATUS, MEDICAL_RECORD_NUMBER, MEDICAL_TERM, OAUTH_CLIENT_SECRET, ORGANIZATION_NAME, PASSPORT, PASSWORD, PERSON_NAME, PHONE_NUMBER, SSL_CERTIFICATE, STORAGE_SIGNED_POLICY_DOCUMENT, STORAGE_SIGNED_URL, STREET_ADDRESS, SWIFT_CODE, TIME, URL, VAT_NUMBER, VEHICLE_IDENTIFICATION_NUMBER, WEAK_PASSWORD_HASH, XSRF_TOKEN |
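Because info_types is passed as a single comma-separated string, a typo in one name is easy to miss. The Python sketch below shows one way to validate names against the list above before building the argument. The helper `build_info_types` is hypothetical, and uppercasing the input is a choice made here to match the documented spelling; it is not a statement about how the function itself treats case.

```python
# Info type names copied from the table above.
KNOWN_INFO_TYPES = frozenset("""
ADVERTISING_ID AGE AUTH_TOKEN AWS_CREDENTIALS AZURE_AUTH_TOKEN
BASIC_AUTH_HEADER CREDIT_CARD_NUMBER CREDIT_CARD_TRACK_NUMBER DATE
DATE_OF_BIRTH DOMAIN_NAME EMAIL_ADDRESS ENCRYPTION_KEY ETHNIC_GROUP
FEMALE_NAME FIRST_NAME GCP_API_KEY GCP_CREDENTIALS GENDER GENERIC_ID
HTTP_COOKIE IBAN_CODE ICCID_NUMBER ICD10_CODE ICD9_CODE IMEI_HARDWARE_ID
IMSI_ID IP_ADDRESS JSON_WEB_TOKEN LAST_NAME LOCATION LOCATION_COORDINATES
MAC_ADDRESS MAC_ADDRESS_LOCAL MALE_NAME MARITAL_STATUS
MEDICAL_RECORD_NUMBER MEDICAL_TERM OAUTH_CLIENT_SECRET ORGANIZATION_NAME
PASSPORT PASSWORD PERSON_NAME PHONE_NUMBER SSL_CERTIFICATE
STORAGE_SIGNED_POLICY_DOCUMENT STORAGE_SIGNED_URL STREET_ADDRESS
SWIFT_CODE TIME URL VAT_NUMBER VEHICLE_IDENTIFICATION_NUMBER
WEAK_PASSWORD_HASH XSRF_TOKEN
""".split())


def build_info_types(names):
    """Validate info type names and join them into one comma-separated string."""
    cleaned = []
    for name in names:
        key = name.strip().upper()
        if key not in KNOWN_INFO_TYPES:
            raise ValueError(f"unknown info type: {name!r}")
        if key not in cleaned:  # drop repeats, keep first-seen order
            cleaned.append(key)
    return ",".join(cleaned)


if __name__ == "__main__":
    print(build_info_types(["phone_number", " EMAIL_ADDRESS "]))
```

The resulting string can be passed as the second argument of deidentify.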
Examples
1. String with email in it.
select bigfunctions.eu.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
select bigfunctions.us.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
select bigfunctions.europe_west1.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
+-----------------------------+
| masked_info                 |
+-----------------------------+
| My email is [EMAIL_ADDRESS] |
+-----------------------------+
2. String with phone number in it.
select bigfunctions.eu.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
select bigfunctions.us.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
select bigfunctions.europe_west1.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
+-----------------------------------+
| masked_info                       |
+-----------------------------------+
| My phone number is [PHONE_NUMBER] |
+-----------------------------------+
3. If info_types is null or empty, all built-in info types may be used.
select bigfunctions.eu.deidentify("My email is shivam@google.co.in", null)
select bigfunctions.us.deidentify("My email is shivam@google.co.in", null)
select bigfunctions.europe_west1.deidentify("My email is shivam@google.co.in", null)
+------------------------------------------+
| masked_info                              |
+------------------------------------------+
| My email is [PERSON_NAME][EMAIL_ADDRESS] |
+------------------------------------------+
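To make the output format in these examples concrete, here is a purely local Python illustration of replacing a detected value with a bracketed info type marker. It is not how Cloud Data Loss Prevention detects anything: the two regular expressions and the name `mask_locally` are simplified assumptions for this sketch and cover only the inputs used in examples 1 and 2.

```python
import re

# Toy detectors, assumed for illustration only. Real detection is done by
# Cloud Data Loss Prevention and is far more thorough than these patterns.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}


def mask_locally(text, info_types):
    """Replace each match of a requested toy detector with [INFO_TYPE]."""
    for name in info_types:
        key = name.strip().upper()
        pattern = PATTERNS.get(key)
        if pattern is not None:  # info types without a toy detector are skipped
            text = pattern.sub(f"[{key}]", text)
    return text


if __name__ == "__main__":
    types = ["PHONE_NUMBER", "EMAIL_ADDRESS"]
    print(mask_locally("My email is shivam@google.co.in", types))
    print(mask_locally("My phone number is 0123456789", types))
```

The sketch only mirrors the shape of the masked_info results shown above; use the deidentify function itself for any real data.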
Need help using deidentify?
The community can help! Join the conversation on Slack.
For professional support, don't hesitate to chat with us.
Found a bug using deidentify?
If the function does not work as expected, please
- report a bug so that it can be improved,
- or open a discussion with the community on Slack.
For professional support, don't hesitate to chat with us.
Use cases
A customer support system stores chat transcripts including customer names, email addresses, phone numbers, and potentially credit card numbers if customers make a purchase through the chat. Regulations like GDPR require protecting this sensitive information. The deidentify function can be used within BigQuery to anonymize this data for analysis or other purposes where the raw PII isn't required.
Scenario: A data analyst needs to analyze chat transcripts to understand common customer issues. They don't need the actual PII, just the context of the conversations.
Implementation:
- Data Storage: Chat transcripts are stored in a BigQuery table with columns like chat_id, customer_id, and transcript. The transcript column contains the raw conversation text.
- De-identification Query: The analyst can use the deidentify function in a query to create an anonymized view of the data:
SELECT
  chat_id,
  customer_id,
  bigfunctions.us.deidentify(transcript, 'PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER,CREDIT_CARD_NUMBER') AS anonymized_transcript
FROM
  `project.dataset.chat_transcripts`;
This query replaces identifiable information within the transcript column with generic markers like [PERSON_NAME], [EMAIL_ADDRESS], etc.
- Analysis: The analyst can then perform their analysis on the anonymized view, preserving customer privacy while still gaining insights from the conversation data. For example, they could use natural language processing to identify common themes or topics of discussion.
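The de-identification query in the steps above can also be generated from a script, which helps when the same masking has to run in several regions or with different info types. The Python sketch below only builds the SQL text and does not call BigQuery. The function name `build_deidentify_query` is hypothetical, and the table and column names are the ones from this scenario.

```python
def build_deidentify_query(table, region="us",
                           info_types=("PERSON_NAME", "EMAIL_ADDRESS",
                                       "PHONE_NUMBER", "CREDIT_CARD_NUMBER")):
    """Build the SELECT statement from the scenario for a region and info types.

    The values are interpolated directly into the SQL text, so they should
    come from trusted configuration, never from end-user input.
    """
    # Assumption: dataset name is the region with hyphens turned into underscores.
    dataset = "bigfunctions." + region.strip().lower().replace("-", "_")
    types_arg = ",".join(info_types)
    return (
        "SELECT\n"
        "  chat_id,\n"
        "  customer_id,\n"
        f"  {dataset}.deidentify(transcript, '{types_arg}') AS anonymized_transcript\n"
        "FROM\n"
        f"  `{table}`"
    )


if __name__ == "__main__":
    print(build_deidentify_query("project.dataset.chat_transcripts"))
```

The printed statement can be pasted into the BigQuery console or passed to whichever BigQuery client you already use.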
Benefits:
- Compliance: Helps meet data privacy regulations such as GDPR by masking sensitive information.
- Simplified Analysis: Enables analysis without risking exposure of PII.
- Flexibility: Allows specifying the types of information to mask, providing granular control over the de-identification process.
- Data Utility: Preserves the context and content of the conversations, allowing for meaningful analysis even after removing PII.
Spread the word
BigFunctions is fully open-source. Help make it a success by spreading the word!