I’m designing a RAW ==> BRONZE ingestion pattern in BigQuery:
- RAW layer: source CSV/Parquet files in GCS
- BRONZE layer: native BigQuery table for improved performance vs an external table (columnar storage, partitioning, clustering)
I want each record in my BRONZE table to include metadata about its origin, for example:
- Source file URI or filename
- Ingestion timestamp
I'm considering:
- bq load the new files (either once a day or with a cloud function when a new file arrives)
- BQ Transfer Service with Incremental mode
My questions:
Is it possible to load data into BigQuery with those approaches while also adding the metadata columns I want?
What’s the simplest, most performant pattern to load GCS files into a native BigQuery table while adding metadata columns in one workflow?
If using bq load, should I load into a staging table then INSERT...SELECT into the final table, using literal columns to add the metadata?
Is there any BigQuery-native feature (e.g., external table pseudo‑columns, ingestion-time partition pseudo‑columns) that can eliminate extra steps?
1 Answer
My idea on this is:
bq load is the simplest and fastest way to load, and you can add the metadata after loading via a staging table. The BigQuery Data Transfer Service is aimed at scheduled batch transfers, and it can be combined with the same post-load step for the metadata columns.
- Load files from GCS into BigQuery using bq load.
- Manually add ingestion_time in a post-load SQL step.
- Note that --projection_fields only applies to Datastore/Firestore export loads, so it won't capture filenames for CSV/Parquet; the filename has to be supplied by whatever triggers the load (e.g., your Cloud Function).

bq load is like copying data from a folder (GCS) into a table (BigQuery): you may need to add the filename and timestamp afterwards with a simple query.
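The staging-then-INSERT flow can be sketched in GoogleSQL using the LOAD DATA statement (the SQL equivalent of bq load). The project, dataset, table, column names, and GCS paths below are all hypothetical:

```sql
-- 1) Load the raw files into a staging table (hypothetical names/paths):
LOAD DATA OVERWRITE myproject.staging.events_stg
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/raw/events/*.parquet']
);

-- 2) Copy into BRONZE, attaching the metadata columns as literals/functions:
INSERT INTO myproject.bronze.events (event_id, payload, source_file, ingestion_time)
SELECT
  event_id,
  payload,
  'gs://my-bucket/raw/events/batch_001.parquet' AS source_file,  -- supplied by the orchestrator, e.g. the Cloud Function
  CURRENT_TIMESTAMP() AS ingestion_time
FROM myproject.staging.events_stg;
```

The orchestrating code (Cloud Function or scheduler) knows which file it just loaded, so it can inject the URI literal into step 2.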
I agree with using bq load into a staging table and then INSERT ... SELECT into the final table, because BigQuery does not automatically capture the filename during a load. The staging table acts like a sorting area where you can attach labels such as the file name or a timestamp before the data reaches BRONZE.
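One way to avoid supplying the filename by hand is to stage through an external table instead, since external tables over GCS do expose a _FILE_NAME pseudo-column. A sketch, with all table names and paths hypothetical:

```sql
-- External table reads the RAW files in place (schema inferred from Parquet):
CREATE EXTERNAL TABLE myproject.staging.events_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/raw/events/*.parquet']
);

-- Copy into BRONZE with the filename captured from the pseudo-column:
INSERT INTO myproject.bronze.events (event_id, payload, source_file, ingestion_time)
SELECT
  event_id,
  payload,
  _FILE_NAME AS source_file,           -- full GCS URI of the file each row came from
  CURRENT_TIMESTAMP() AS ingestion_time
FROM myproject.staging.events_ext;
```

The trade-off is that the SELECT scans the files in GCS rather than native storage, but for a one-time hop into BRONZE that is usually acceptable.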
BigQuery has a built-in _PARTITIONTIME pseudo-column (available on ingestion-time partitioned tables) that records when data was loaded, which can eliminate the extra step for the timestamp. Since BigQuery doesn't automatically store the filename in a native table, you still need to add the filename yourself in a query after loading.
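A minimal sketch of the ingestion-time route (names hypothetical). Note that _PARTITIONTIME has day granularity by default, so it is coarser than a CURRENT_TIMESTAMP() column:

```sql
-- BRONZE table partitioned by ingestion time; no explicit timestamp column needed:
CREATE TABLE myproject.bronze.events_by_ingest (
  event_id STRING,
  payload  STRING
)
PARTITION BY _PARTITIONTIME;

-- Reading the load time back out (also prunes partitions in the WHERE clause):
SELECT event_id, _PARTITIONTIME AS ingestion_day
FROM myproject.bronze.events_by_ingest
WHERE _PARTITIONTIME = TIMESTAMP('2024-01-15');
```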