deidentify
deidentify(text, info_types)
Description
Masks sensitive information of the given info_types in text using Cloud Data Loss Prevention.
Param | Possible values (one value or any combination of the following, separated by commas) |
---|---|
info_types | ADVERTISING_ID, AGE, AUTH_TOKEN, AWS_CREDENTIALS, AZURE_AUTH_TOKEN, BASIC_AUTH_HEADER, CREDIT_CARD_NUMBER, CREDIT_CARD_TRACK_NUMBER, DATE, DATE_OF_BIRTH, DOMAIN_NAME, EMAIL_ADDRESS, ENCRYPTION_KEY, ETHNIC_GROUP, FEMALE_NAME, FIRST_NAME, GCP_API_KEY, GCP_CREDENTIALS, GENDER, GENERIC_ID, HTTP_COOKIE, IBAN_CODE, ICCID_NUMBER, ICD10_CODE, ICD9_CODE, IMEI_HARDWARE_ID, IMSI_ID, IP_ADDRESS, JSON_WEB_TOKEN, LAST_NAME, LOCATION, LOCATION_COORDINATES, MAC_ADDRESS, MAC_ADDRESS_LOCAL, MALE_NAME, MARITAL_STATUS, MEDICAL_RECORD_NUMBER, MEDICAL_TERM, OAUTH_CLIENT_SECRET, ORGANIZATION_NAME, PASSPORT, PASSWORD, PERSON_NAME, PHONE_NUMBER, SSL_CERTIFICATE, STORAGE_SIGNED_POLICY_DOCUMENT, STORAGE_SIGNED_URL, STREET_ADDRESS, SWIFT_CODE, TIME, URL, VAT_NUMBER, VEHICLE_IDENTIFICATION_NUMBER, WEAK_PASSWORD_HASH, XSRF_TOKEN |
Usage
Call or Deploy deidentify?
Call deidentify directly
The easiest way to use bigfunctions
- The deidentify function is deployed in 39 public datasets, one for each of the 39 BigQuery regions.
- It can be called by anyone. Just copy and paste the examples below into your BigQuery console. It just works!
- (You need to use the dataset in the same region as your own datasets; otherwise you may get a "function not found" error.)
Public BigFunctions Datasets
Region | Dataset |
---|---|
eu | bigfunctions.eu |
us | bigfunctions.us |
europe-west1 | bigfunctions.europe_west1 |
asia-east1 | bigfunctions.asia_east1 |
... | ... |
Deploy deidentify in your project
Why deploy?
- You may prefer to deploy deidentify in your own project to build and manage your own catalog of functions.
- This is particularly useful if you want to create private functions (for example, functions that call your internal APIs).
- Get started by reading the framework page.
Deployment
The deidentify function can be deployed with:
pip install bigfunctions
bigfun get deidentify
bigfun deploy deidentify
Examples
1. String with email in it.
select bigfunctions.eu.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
select bigfunctions.us.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
select bigfunctions.europe_west1.deidentify("My email is shivam@google.co.in", "PHONE_NUMBER, EMAIL_ADDRESS")
+-----------------------------+
| masked_info                 |
+-----------------------------+
| My email is [EMAIL_ADDRESS] |
+-----------------------------+
2. String with phone number in it.
select bigfunctions.eu.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
select bigfunctions.us.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
select bigfunctions.europe_west1.deidentify("My phone number is 0123456789", "PHONE_NUMBER, email_address")
+-----------------------------------+
| masked_info                       |
+-----------------------------------+
| My phone number is [PHONE_NUMBER] |
+-----------------------------------+
3. If info_types is null or empty, all built-in info types may be used.
select bigfunctions.eu.deidentify("My email is shivam@google.co.in", null)
select bigfunctions.us.deidentify("My email is shivam@google.co.in", null)
select bigfunctions.europe_west1.deidentify("My email is shivam@google.co.in", null)
+------------------------------------------+
| masked_info                              |
+------------------------------------------+
| My email is [PERSON_NAME][EMAIL_ADDRESS] |
+------------------------------------------+
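The examples above each mask a single literal string. As a minimal sketch of masking several values in one query, the statement below feeds an inline array of the same sample strings through the function. The column aliases text and masked_info are chosen for illustration, and bigfunctions.eu assumes the query runs in the EU multi-region; swap in the dataset that matches your region.

```sql
-- Sketch only: apply deidentify to several sample strings at once.
-- The inline array stands in for a real table column.
SELECT
  text,
  bigfunctions.eu.deidentify(text, 'PHONE_NUMBER,EMAIL_ADDRESS') AS masked_info
FROM
  UNNEST([
    'My email is shivam@google.co.in',
    'My phone number is 0123456789'
  ]) AS text;
```

The same pattern applies to a table: replace the UNNEST clause with the table name and text with the column to mask.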
Use cases
A customer support system stores chat transcripts including customer names, email addresses, phone numbers, and potentially credit card numbers if they make a purchase through the chat. Regulations like GDPR require protecting this sensitive information. The deidentify function can be used within BigQuery to anonymize this data for analysis or other purposes where the raw PII isn't required.
Scenario: A data analyst needs to analyze chat transcripts to understand common customer issues. They don't need the actual PII, just the context of the conversations.
Implementation:
- Data Storage: Chat transcripts are stored in a BigQuery table with columns like chat_id, customer_id, and transcript. The transcript column contains the raw conversation text.
- De-identification Query: The analyst can use the deidentify function in a query to create an anonymized view of the data:
SELECT
chat_id,
customer_id,
bigfunctions.us.deidentify(transcript, 'PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER,CREDIT_CARD_NUMBER') AS anonymized_transcript
FROM
`project.dataset.chat_transcripts`;
This query replaces identifiable information within the transcript column with generic markers like [PERSON_NAME], [EMAIL_ADDRESS], etc.
- Analysis: The analyst can then perform their analysis on the anonymized view, preserving customer privacy while still gaining insights from the conversation data. For example, they could use natural language processing to identify common themes or topics of discussion.
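The steps above can be sketched as a single statement that stores the anonymized result for later analysis. This is an illustrative sketch, not part of the function's documented usage: the target name project.dataset.chat_transcripts_anonymized is hypothetical, and bigfunctions.us assumes the source dataset lives in the US multi-region.

```sql
-- Sketch only: materialize an anonymized copy of the chat transcripts.
-- `project.dataset.chat_transcripts_anonymized` is a hypothetical target name.
-- Use the bigfunctions dataset that matches the region of the source table.
CREATE OR REPLACE TABLE `project.dataset.chat_transcripts_anonymized` AS
SELECT
  chat_id,
  customer_id,
  bigfunctions.us.deidentify(
    transcript,
    'PERSON_NAME,EMAIL_ADDRESS,PHONE_NUMBER,CREDIT_CARD_NUMBER'
  ) AS anonymized_transcript
FROM
  `project.dataset.chat_transcripts`;
```

Writing the result to a table means the masking runs once, whereas a logical view would re-evaluate the function each time it is queried. Note that customer_id is carried over unchanged, as in the query above; drop or hash it if it counts as identifying in your context.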
Benefits:
- Compliance: Helps meet data privacy regulations by masking sensitive information.
- Simplified Analysis: Enables analysis without risking exposure of PII.
- Flexibility: Allows specifying the types of information to mask, providing granular control over the de-identification process.
- Data Utility: Preserves the context and content of the conversations, allowing for meaningful analysis even after removing PII.
Need help or found a bug?
Get help using deidentify
The community can help! Join the conversation on Slack.
We also provide professional support.
Report a bug about deidentify
If the function does not work as expected, please
- report a bug so that it can be improved,
- or open a discussion with the community on Slack.
We also provide professional support.