Build a custom data-catalog in minutes



🔍 1. What is CatalogBuilder?

  • CatalogBuilder is a simple tool to generate & deploy a documentation website for your data assets.
  • It enables anyone at your company to quickly find the trusted data they are looking for.


💡 2. Why CatalogBuilder?

There are many open-source projects (Amundsen, OpenMetadata, DataHub, Metacat, Atlas) for building such a catalog in-house. But because they offer many advanced features, they are hard to deploy and manage if you are not a technical expert, and they can be even harder to customize.

dbt docs is great for generating a documentation website on top of your dbt assets, but:

  • it focuses on dbt only (while you may be interested in other sources and metadata)
  • it is very hard to customize (unless you are an Angular expert)
  • it can be slow.


👉 CatalogBuilder aims to offer a lightweight alternative for generating a documentation website on top of your data assets. It focuses on read-only data discovery and:

  1. ✔️ can be easily customized and deployed by people with little technical background
  2. ✔️ can therefore handle the very specific needs of your company
  3. ✔️ is fast and lightweight
  4. ✔️ is built on top of the widely used mkdocs-material Python library, which many projects (such as FastAPI) rely on to publish their documentation.


💥 3. Getting Started with catalog CLI

catalog is the CLI (command-line interface) of CatalogBuilder. It is used to generate, show, and deploy the documentation.

3.1 Install catalog CLI 🛠️

pip install catalog-builder

3.2 Create your first documentation configuration 👨‍💻

catalog download bigquery_public_data

To get started, let's download an example catalog configuration from the GitHub repo and play with it. The above command downloads the catalogs/bigquery_public_data folder to your machine.

You will find in the folder:

  • assets file: a file containing the list of the assets you want to put in your documentation. It can be a Parquet file named assets.parquet or a JSON Lines file named assets.jsonl. Each asset in the file must have the following fields:
      • asset_type: for example, table.
      • documentation_path: the path of the asset page in the generated documentation. For example, dataset_name/table_name.
      • data: a dict of attributes used to generate the documentation. For example, {"name": "foo"}.
  • generate_assets_file.py: the Python script used to (re)generate the assets file.
  • requirements.txt: the Python requirements needed by generate_assets_file.py.
  • templates: a folder that includes one Jinja-templated markdown file for each asset_type. These templates are used to generate a markdown documentation file for each asset.
  • mkdocs.yml: the configuration file used by MkDocs to build the documentation website from the generated markdown files.
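
As a minimal sketch of the assets file format described above, the following writes a small assets.jsonl. Only the three top-level fields (asset_type, documentation_path, data) come from the description above; the table names, the description attribute, and the write_assets_jsonl helper are made up for illustration and are not part of CatalogBuilder.

```python
import json
from pathlib import Path

# Hypothetical assets: names and attributes are illustrative only.
assets = [
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/table_name",
        "data": {"name": "table_name", "description": "An example table."},
    },
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/other_table",
        "data": {"name": "other_table", "description": "Another example table."},
    },
]

REQUIRED_FIELDS = {"asset_type", "documentation_path", "data"}


def write_assets_jsonl(assets, path):
    """Write one JSON object per line, checking the required fields first."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for asset in assets:
            missing = REQUIRED_FIELDS - asset.keys()
            if missing:
                raise ValueError(f"asset is missing fields: {sorted(missing)}")
            f.write(json.dumps(asset) + "\n")
    return path
```

Calling write_assets_jsonl(assets, "assets.jsonl") would produce a file with one asset per line, which is the JSON Lines layout mentioned above.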

3.3 Build your catalog website 👾

catalog build bigquery_public_data

This command does two things:

  1. For each asset in the assets file, the Jinja template matching its asset_type is rendered with the asset data to generate a markdown file, which is written into catalogs/bigquery_public_data/docs/ at documentation_path.
  2. MkDocs then builds the documentation website from the markdown files into catalogs/bigquery_public_data/site (using the mkdocs.yml configuration file).
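
Step 1 can be pictured with the following sketch. It is not CatalogBuilder's implementation: it uses the standard library's string.Template as a stand-in for Jinja so the example runs without dependencies, and the template text, the .md suffix, and the render_assets helper are assumptions for illustration.

```python
from pathlib import Path
from string import Template

# One template per asset_type (a stand-in for the Jinja templates folder).
templates = {
    "table": Template("# $name\n\n$description\n"),
}

# Hypothetical asset, in the shape expected by the assets file.
assets = [
    {
        "asset_type": "table",
        "documentation_path": "dataset_name/table_name",
        "data": {"name": "table_name", "description": "An example table."},
    },
]


def render_assets(assets, templates, docs_dir):
    """Render each asset with the template of its asset_type into docs_dir."""
    written = []
    for asset in assets:
        template = templates[asset["asset_type"]]
        markdown = template.substitute(asset["data"])
        # Assumption: the page is written as <documentation_path>.md
        target = Path(docs_dir) / (asset["documentation_path"] + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

Under these assumptions, render_assets(assets, templates, "docs") would create docs/dataset_name/table_name.md, which MkDocs could then pick up in step 2.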

3.4 Run your catalog website locally ⚡

catalog serve bigquery_public_data

You can now see the generated documentation website at http://localhost:8000.

3.5 Deploy the documentation website! 🚀

A. To deploy on GitHub pages:

catalog gh-deploy bigquery_public_data

MkDocs will deploy the site to GitHub Pages (this only works if you run the command from inside a GitHub repository).

B. To deploy elsewhere:

You can follow the deployment instructions from MkDocs.


💎 4. Generate your dbt documentation

To generate a documentation website for your own dbt project, do the following:

  1. Change directory to your dbt project directory
  2. Download the catalogs/dbt documentation example by running catalog download dbt.
  3. Run dbt docs generate to produce target/manifest.json and target/catalog.json.
  4. Generate the assets file by running python catalogs/dbt/generate_assets_file.py. The script will parse target/manifest.json and target/catalog.json to generate the assets file in the expected format.
  5. Run catalog serve dbt to build the website and show it locally.
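
To give an idea of what step 4 involves, here is a hedged sketch of turning the two dbt artifacts into asset records. It is not the actual generate_assets_file.py: the build_assets and load_json helpers, the choice of "model" as asset_type, and the schema/name layout of documentation_path are assumptions. It relies only on the nodes, resource_type, name, schema, description, and columns keys of the dbt artifacts.

```python
import json
from pathlib import Path


def load_json(path):
    """Read a JSON artifact such as target/manifest.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def build_assets(manifest, catalog):
    """Combine dbt manifest and catalog node entries into asset records."""
    assets = []
    catalog_nodes = catalog.get("nodes", {})
    for unique_id, node in manifest.get("nodes", {}).items():
        # Keep models only; tests, seeds, etc. are skipped in this sketch.
        if node.get("resource_type") != "model":
            continue
        # Column types come from catalog.json, descriptions from manifest.json.
        catalog_columns = catalog_nodes.get(unique_id, {}).get("columns", {})
        manifest_columns = node.get("columns", {})
        columns = [
            {
                "name": name,
                "type": column.get("type"),
                "description": manifest_columns.get(name, {}).get("description", ""),
            }
            for name, column in catalog_columns.items()
        ]
        assets.append(
            {
                "asset_type": "model",  # assumption: one template named after it
                "documentation_path": f"{node['schema']}/{node['name']}",
                "data": {
                    "name": node["name"],
                    "description": node.get("description", ""),
                    "columns": columns,
                },
            }
        )
    return assets
```

A call such as build_assets(load_json("target/manifest.json"), load_json("target/catalog.json")) would return a list of records that could then be written to the assets file.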


Keep in touch 🧑‍💻

Join our Slack to ask questions, get help getting started, report a bug, suggest improvements, or simply have a chat 🙂.


👋 Contribute

Any contribution is more than welcome 🤗!

  • Add a ⭐ on the repo to show your support
  • Join our Slack and talk with us
  • Open an issue to report a bug or suggest improvements
  • Open a PR!