I’m designing a RAW ==> BRONZE ingestion pattern in BigQuery:
- RAW layer: source CSV/Parquet files in GCS
- BRONZE layer: native BigQuery table for improved performance vs an external table (columnar storage, partitioning, clustering)
I want each record in my BRONZE table to include metadata about its origin, for example:
- Source file URI or filename
- Ingestion timestamp
I'm considering:
- bq load the new files (either once a day or with a cloud function when a new file arrives)
- BQ Transfer Service with Incremental mode
My questions:
Is it possible to load data into BigQuery with those approaches while also adding the metadata columns I want?
What’s the simplest, most performant pattern to load GCS files into a native BigQuery table while adding metadata columns in one workflow?
If using bq load, should I load into a staging table then INSERT...SELECT into the final table, using literal columns to add the metadata?
Is there any BigQuery-native feature (e.g., external table pseudo‑columns, ingestion-time partition pseudo‑columns) that can eliminate extra steps?
1 Answer
My idea on this is:
bq load is the simplest and fastest way to load, and you can add the metadata after loading via a staging table. The BigQuery Data Transfer Service is aimed at scheduled batch transfers, and it can be combined with the same post-load step for the metadata columns.
- Load files from GCS into BigQuery using bq load.
- Manually add ingestion_time in a post-load SQL step.
- Note that --projection_fields only applies to Datastore/Firestore export loads, so it won't capture filenames for CSV/Parquet; the filename has to be supplied by whatever triggers the load (e.g., your Cloud Function).

bq load is like copying data from a folder (GCS) into a table (BigQuery): you may need to add the filename and timestamp afterwards with a simple query.
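The staging-then-INSERT flow can be sketched in GoogleSQL using the LOAD DATA statement (the SQL equivalent of bq load). The project, dataset, table, column names, and GCS paths below are all hypothetical:

```sql
-- 1) Load the raw files into a staging table (hypothetical names/paths):
LOAD DATA OVERWRITE myproject.staging.events_stg
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/raw/events/*.parquet']
);

-- 2) Copy into BRONZE, attaching the metadata columns as literals/functions:
INSERT INTO myproject.bronze.events (event_id, payload, source_file, ingestion_time)
SELECT
  event_id,
  payload,
  'gs://my-bucket/raw/events/batch_001.parquet' AS source_file,  -- supplied by the orchestrator, e.g. the Cloud Function
  CURRENT_TIMESTAMP() AS ingestion_time
FROM myproject.staging.events_stg;
```

The orchestrating code (Cloud Function or scheduler) knows which file it just loaded, so it can inject the URI literal into step 2.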
I agree with using bq load into a staging table and then INSERT ... SELECT into the final table, because BigQuery does not automatically capture the filename during a load. The staging table acts like a sorting area where you can attach labels such as the file name or a timestamp before the data reaches BRONZE.
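One way to avoid supplying the filename by hand is to stage through an external table instead, since external tables over GCS do expose a _FILE_NAME pseudo-column. A sketch, with all table names and paths hypothetical:

```sql
-- External table reads the RAW files in place (schema inferred from Parquet):
CREATE EXTERNAL TABLE myproject.staging.events_ext
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/raw/events/*.parquet']
);

-- Copy into BRONZE with the filename captured from the pseudo-column:
INSERT INTO myproject.bronze.events (event_id, payload, source_file, ingestion_time)
SELECT
  event_id,
  payload,
  _FILE_NAME AS source_file,           -- full GCS URI of the file each row came from
  CURRENT_TIMESTAMP() AS ingestion_time
FROM myproject.staging.events_ext;
```

The trade-off is that the SELECT scans the files in GCS rather than native storage, but for a one-time hop into BRONZE that is usually acceptable.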
BigQuery has a built-in _PARTITIONTIME pseudo-column (available on ingestion-time partitioned tables) that records when data was loaded, which can eliminate the extra step for the timestamp. Since BigQuery doesn't automatically store the filename in a native table, you still need to add the filename yourself in a query after loading.
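A minimal sketch of the ingestion-time route (names hypothetical). Note that _PARTITIONTIME has day granularity by default, so it is coarser than a CURRENT_TIMESTAMP() column:

```sql
-- BRONZE table partitioned by ingestion time; no explicit timestamp column needed:
CREATE TABLE myproject.bronze.events_by_ingest (
  event_id STRING,
  payload  STRING
)
PARTITION BY _PARTITIONTIME;

-- Reading the load time back out (also prunes partitions in the WHERE clause):
SELECT event_id, _PARTITIONTIME AS ingestion_day
FROM myproject.bronze.events_by_ingest
WHERE _PARTITIONTIME = TIMESTAMP('2024-01-15');
```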