最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

sql - Introduce circular dependencies in DBT to select rows to update - Stack Overflow

programmeradmin0浏览0评论

I am running DBT core on a PostgreSQL database.

I have a few incremental models which always select more or less the same rows (events) from a source_table:

model_1.sql: Get the type of event based on a table with many-to-1 relationship

SELECT source_table.event_id
     , source_table.event_time
     , location_type
     , source_table.ingestion_time as ingestion_time
FROM source_table
INNER JOIN locations
    ON locations.id = source_table.location_id
{% if is_incremental() %}
WHERE source_table.ingestion_time > (SELECT MAX(ingestion_time) from {{ this }})
{% endif %}

model_2.sql: Get all transactions from a 1-to-many relationship to get the sum

SELECT source_table.event_id
     , SUM(transactions.amount) as total_revenue
     , any_value(source_table.ingestion_time) as ingestion_time
FROM source_table
INNER JOIN transactions
   ON transactions.event_id = source_table.event_id
{% if is_incremental() %}
WHERE source_table.ingestion_time > (SELECT MAX(ingestion_time) from {{ this }})
{% endif %}

As you can see, the is_incremental conditions on both tables are nearly identical, only the reference table changes.

Since both models are being run at the same time, they are selecting almost the same rows from source_table, thus running the same scan twice, which is quite inefficient (for the purpose of this question, assume I cannot change the source_table as to make scanning more efficient)

Hence my idea of building a new_events table (with a table materialization, i.e it is rebuilt for every dbt run) from the source_table, and then build the subsequent views over it.

This would go like this:

new_events.sql:

SELECT * FROM source_table
    WHERE source_table.ingestion_table > LEAST(
        (SELECT MAX(ingestion_time) FROM {{ ref("model_1") }})
        , (SELECT MAX(ingestion_time) FROM {{ ref("model_2" }})
)

model_1.sql:

SELECT new_events.event_id
     , new_events.event_time
     , locations.location_type
     , new_events.ingestion_time as ingestion_time
{% if is_incremental() %}
FROM {{ ref("new_events") }} new_events
{% else %} -- select from full table if no incrementation
FROM source_table new_events
{% endif %}
INNER JOIN locations
    ON locations.id = new_events.location_id
{% if is_incremental() %}
WHERE new_events.ingestion_time > (SELECT MAX(ingestion_time) from {{ this }})
{% endif %}

Obviously, DBT does not like this as it introduces circular dependencies between the tables, and I cannot run the models presently.

Is there a way to circumvent this and build a new_events table ? Or another way to reach my goal (i.e scan the new_events only once ?)

发布评论

评论列表(0)

  1. 暂无评论