Skip to content

bigfunctions > connected_components

connected_components

Call or Deploy connected_components ?

✅ You can call this connected_components bigfunction directly from your Google Cloud Project (no install required).

  • This connected_components function is deployed in bigfunctions GCP project in 39 datasets for all of the 39 BigQuery regions. You need to use the dataset in the same region as your datasets (otherwise you may have a function not found error).
  • Function is public, so it can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
  • You may prefer to deploy the BigFunction in your own project if you want to build and manage your own catalog of functions. This is particularly useful if you want to create private functions (for example calling your internal APIs). Discover the framework

Public BigFunctions Datasets:

Region Dataset
eu bigfunctions.eu
us bigfunctions.us
europe-west1 bigfunctions.europe_west1
asia-east1 bigfunctions.asia_east1
... ...

Description

Signature

connected_components(fully_qualified_table)

Description

Compute the connected components of a non-directed graph.

Given a table with two columns of the same type STRING or INTEGER representing the edges of a graph, this computes a new temporary table called bigfunction_result containing two columns of the same type named node_id and connected_component_id.

This is an implementation of the Alternating Algorithm (large-star, small-star) described in the 2014 paper "Connected Components in MapReduce and Beyond" written by {rkiveris, silviol, mirrokni, rvibhor, sergeiv} @google.com

PERFORMANCE AND COST CONSIDERATIONS

  • This algorithm has been proved to converge in O(log(n)²) and is conjectured to converge in O(log(n)), where n is the number of nodes in the graph. It was the most performant known distributed connected component algorithm last time I checked (in 2017).
  • This implementation persists temporary results at each iteration loop: for the BigQuery pricing, you should be expecting it to cost the equivalent of 15 to 30 scans on your input table. Since the input table has only two columns, this should be reasonable, and we recommend using INTEGER columns rather than STRING when possible.
  • If your graph contains nodes with a very high number of neighbors, the algorithm may crash. It is recommended to apply a pre-filtering on your nodes and remove nodes with a pathologically high cardinality. You should also monitor actively the number of nodes filtered this way and their cardinality, as this could help you detect a data quality deterioration in your input graph. If the input graph contains duplicate edges, they will be automatically removed by the algorithm.

ISOLATED NODES: If you want to have isolated nodes (nodes that have no neighbors) in the resulting graph, there is two possible ways to achieve this:

  • Add self-loops edges to all your nodes in your input graph (it also works if you add edges between all the graph nodes and a fictitious node with id NULL)
  • Only add edges between distinct nodes to your input, and perform a join between your input graph and the algorithm's output to find all the nodes that have disappeared. These will be the isolated nodes. This second method requires a little more work but it should also be cheaper.

Examples

Identify the two connected components of a graph which has 6 nodes and is represented by the edges below:

+---------+-----+
| node1 | node2 |
+-------+-------+
|   1   |   2   |
|   2   |   3   |
|   3   |   4   |
|   5   |   6   |
+-------+-------+

call bigfunctions.eu.connected_components('bigfunctions.eu.sample_graph');
select * from bigfunction_result;
call bigfunctions.us.connected_components('bigfunctions.us.sample_graph');
select * from bigfunction_result;
call bigfunctions.europe_west1.connected_components('bigfunctions.europe_west1.sample_graph');
select * from bigfunction_result;

+---------+------------------------+
| node_id | connected_component_id |
+---------+------------------------+
|    1    |           1            |
|    2    |           1            |
|    3    |           1            |
|    4    |           1            |
|    5    |           5            |
|    6    |           5            |
+---------+------------------------+


Need help using connected_components?

The community can help! Engage the conversation on Slack

For professional suppport, don't hesitate to chat with us.

Found a bug using connected_components?

If the function does not work as expected, please

  • report a bug so that it can be improved.
  • or open the discussion with the community on Slack.

For professional suppport, don't hesitate to chat with us.

Use cases

Use Case: Social Network Analysis

Imagine you have a social network platform, and you store user connections (friendships) in a BigQuery table. Each row in the table represents a connection between two users.

Table Schema:

Table: social_connections
Columns:
  user1: STRING
  user2: STRING

Data Example:

user1 | user2
-------+-------
alice  | bob
bob    | charlie
david  | eve
eve    | frank
george |

Goal: You want to identify distinct groups of connected users (i.e., friend circles or communities). Users within a group are directly or indirectly connected to each other, while users in different groups have no connection path between them. George is an isolated user with no connections.

Solution using connected_components:

-- Call the connected_components function using your dataset's region (e.g., us)
CALL bigfunctions.us.connected_components('your_project.your_dataset.social_connections');

-- Query the resulting table to see the connected components
SELECT * FROM bigfunction_result;

Expected Output:

node_id | connected_component_id
---------+------------------------
alice    | alice
bob      | alice
charlie  | alice
david    | david
eve      | david
frank    | david
george   | george

Interpretation:

  • Users alice, bob, and charlie belong to the same connected component (friend circle).
  • Users david, eve, and frank form another connected component.
  • George is in his own isolated component because he has no connections in the original social_connections table. Note how self-loops are created, which is necessary for the function to work correctly. This can be changed to NULL after the call, if desired.

Further Analysis:

You can now use the connected_component_id to perform further analysis on these groups, such as:

  • Calculate the size of each community.
  • Analyze the characteristics of users within each community (e.g., demographics, interests).
  • Identify influential users within each community.
  • Target marketing campaigns to specific communities.

Other Use Cases:

  • Network Infrastructure Analysis: Identify connected devices in a network.
  • Recommendation Systems: Recommend products or content based on connected users' preferences.
  • Fraud Detection: Detect groups of users engaging in suspicious activities.
  • Biological Networks: Analyze protein-protein interaction networks or gene regulatory networks. This could reveal functional modules or pathways.

This example clearly demonstrates how the connected_components BigQuery function can be applied to a real-world scenario to gain valuable insights from connected data. Remember to adjust the dataset region in the function call according to where your data is stored.

Spread the word

BigFunctions is fully open-source. Help make it a success by spreading the word!

Share on Add a on