I am running DBT core on a PostgreSQL database.
I have a few incremental models which always select more or less the same rows (events) from a source_table
:
model_1.sql
: Get the type of event based on a table with many-to-1 relationship
SELECT source_table.event_id
, source_table.event_time
, location_type
, source_table.ingestion_time as ingestion_time
FROM source_table
INNER JOIN locations
ON locations.id = source_table.location_id
{% if is_incremental() %}
WHERE source_table.ingestion_time > (SELECT MAX(ingestion_time) from {{ this }})
{% endif %}
model_2.sql
: Get all transactions from a 1-to-many relationship to get the sum
SELECT source_table.event_id
, SUM(transactions.amount) as total_revenue
, any_value(source_table.ingestion_time) as ingestion_time
FROM source_table
INNER JOIN transactions
ON transactions.event_id = source_table.event_id
{% if is_incremental() %}
WHERE source_table.ingestion_time > (SELECT MAX(ingestion_time) from {{ this }})
{% endif %}
As you can see, the is_incremental
conditions on both tables are nearly identical, only the reference table changes.
Since both models are being run at the same time, they are selecting almost the same rows from source_table
, thus running the same scan twice, which is quite inefficient (for the purpose of this question, assume I cannot change the source_table
as to make scanning more efficient)
Hence my idea of building a new_events
table (with a table
materialization, i.e it is rebuilt for every dbt run) from the source_table
, and then build the subsequent views over it.
This would go like this:
new_events.sql
:
SELECT * FROM source_table
WHERE source_table.ingestion_table > LEAST(
(SELECT MAX(ingestion_time) FROM {{ ref("model_1") }})
, (SELECT MAX(ingestion_time) FROM {{ ref("model_2" }})
)
model_1.sql
:
SELECT new_events.event_id
, new_events.event_time
, locations.location_type
, new_events.ingestion_time as ingestion_time
{% if is_incremental() %}
FROM {{ ref("new_events") }} new_events
{% else %} -- select from full table if no incrementation
FROM source_table new_events
{% endif %}
INNER JOIN locations
ON locations.id = new_events.location_id
{% if is_incremental() %}
WHERE new_events.ingestion_time > (SELECT MAX(ingestion_time) from {{ this }})
{% endif %}
Obviously, DBT does not like this as it introduces circular dependencies between the tables, and I cannot run the models presently.
Is there a way to circumvent this and build a new_events
table ? Or another way to reach my goal (i.e scan the new_events
only once ?)