Skip to content

connected_components

connected_components(fully_qualified_table)

Description

Compute the connected components of a non-directed graph.

Given a table with two columns of the same type STRING or INTEGER representing the edges of a graph, this computes a new temporary table called bigfunction_result containing two columns of the same type named node_id and connected_component_id.

This is an implementation of the Alternating Algorithm (large-star, small-star) described in the 2014 paper "Connected Components in MapReduce and Beyond" written by {rkiveris, silviol, mirrokni, rvibhor, sergeiv} @google.com

PERFORMANCE AND COST CONSIDERATIONS

  • This algorithm has been proved to converge in O(log(n)²) and is conjectured to converge in O(log(n)), where n is the number of nodes in the graph. It was the most performant known distributed connected component algorithm last time I checked (in 2017).
  • This implementation persists temporary results at each iteration loop: for the BigQuery pricing, you should be expecting it to cost the equivalent of 15 to 30 scans on your input table. Since the input table has only two columns, this should be reasonable, and we recommend using INTEGER columns rather than STRING when possible.
  • If your graph contains nodes with a very high number of neighbors, the algorithm may crash. It is recommended to apply a pre-filtering on your nodes and remove nodes with a pathologically high cardinality. You should also monitor actively the number of nodes filtered this way and their cardinality, as this could help you detect a data quality deterioration in your input graph. If the input graph contains duplicate edges, they will be automatically removed by the algorithm.

ISOLATED NODES: If you want to have isolated nodes (nodes that have no neighbors) in the resulting graph, there is two possible ways to achieve this:

  • Add self-loops edges to all your nodes in your input graph (it also works if you add edges between all the graph nodes and a fictitious node with id NULL)
  • Only add edges between distinct nodes to your input, and perform a join between your input graph and the algorithm's output to find all the nodes that have disappeared. These will be the isolated nodes. This second method requires a little more work but it should also be cheaper.

Usage

Call or Deploy connected_components ?
Call connected_components directly

The easiest way to use bigfunctions

  • connected_components function is deployed in 39 public datasets for all of the 39 BigQuery regions.
  • It can be called by anyone. Just copy / paste examples below in your BigQuery console. It just works!
  • (You need to use the dataset in the same region as your datasets otherwise you may have a function not found error)

Public BigFunctions Datasets

Region Dataset
eu bigfunctions.eu
us bigfunctions.us
europe-west1 bigfunctions.europe_west1
asia-east1 bigfunctions.asia_east1
... ...
Deploy connected_components in your project

Why deploy?

  • You may prefer to deploy connected_components in your own project to build and manage your own catalog of functions.
  • This is particularly useful if you want to create private functions (for example calling your internal APIs).
  • Get started by reading the framework page

Deployment

connected_components function can be deployed with:

pip install bigfunctions
bigfun get connected_components
bigfun deploy connected_components

Examples

Identify the two connected components of a graph which has 6 nodes and is represented by the edges below:

+---------+-----+
| node1 | node2 |
+-------+-------+
|   1   |   2   |
|   2   |   3   |
|   3   |   4   |
|   5   |   6   |
+-------+-------+

call bigfunctions.eu.connected_components("bigfunctions.eu.sample_graph");
select * from bigfunction_result;
call bigfunctions.us.connected_components("bigfunctions.us.sample_graph");
select * from bigfunction_result;
call bigfunctions.europe_west1.connected_components("bigfunctions.europe_west1.sample_graph");
select * from bigfunction_result;

+---------+------------------------+
| node_id | connected_component_id |
+---------+------------------------+
|    1    |           1            |
|    2    |           1            |
|    3    |           1            |
|    4    |           1            |
|    5    |           5            |
|    6    |           5            |
+---------+------------------------+


Use cases

Use Case: Social Network Analysis

Imagine you have a social network platform, and you store user connections (friendships) in a BigQuery table. Each row in the table represents a connection between two users.

Table Schema:

Table: social_connections
Columns:
  user1: STRING
  user2: STRING

Data Example:

user1 | user2
-------+-------
alice  | bob
bob    | charlie
david  | eve
eve    | frank
george |

Goal: You want to identify distinct groups of connected users (i.e., friend circles or communities). Users within a group are directly or indirectly connected to each other, while users in different groups have no connection path between them. George is an isolated user with no connections.

Solution using connected_components:

-- Call the connected_components function using your dataset's region (e.g., us)
CALL bigfunctions.us.connected_components('your_project.your_dataset.social_connections');

-- Query the resulting table to see the connected components
SELECT * FROM bigfunction_result;

Expected Output:

node_id | connected_component_id
---------+------------------------
alice    | alice
bob      | alice
charlie  | alice
david    | david
eve      | david
frank    | david
george   | george

Interpretation:

  • Users alice, bob, and charlie belong to the same connected component (friend circle).
  • Users david, eve, and frank form another connected component.
  • George is in his own isolated component because he has no connections in the original social_connections table. Note how self-loops are created, which is necessary for the function to work correctly. This can be changed to NULL after the call, if desired.

Further Analysis:

You can now use the connected_component_id to perform further analysis on these groups, such as:

  • Calculate the size of each community.
  • Analyze the characteristics of users within each community (e.g., demographics, interests).
  • Identify influential users within each community.
  • Target marketing campaigns to specific communities.

Other Use Cases:

  • Network Infrastructure Analysis: Identify connected devices in a network.
  • Recommendation Systems: Recommend products or content based on connected users' preferences.
  • Fraud Detection: Detect groups of users engaging in suspicious activities.
  • Biological Networks: Analyze protein-protein interaction networks or gene regulatory networks. This could reveal functional modules or pathways.

This example clearly demonstrates how the connected_components BigQuery function can be applied to a real-world scenario to gain valuable insights from connected data. Remember to adjust the dataset region in the function call according to where your data is stored.


Need help or Found a bug?
Get help using connected_components

The community can help! Engage the conversation on Slack

We also provide professional suppport.

Report a bug about connected_components

If the function does not work as expected, please

  • report a bug so that it can be improved.
  • or open the discussion with the community on Slack.

We also provide professional suppport.


Show your ❤ by adding a ⭐ on