最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

python - Why does duplicating a Django filter condition change PostgreSQL’s query plan estimates? - Stack Overflow

programmeradmin2浏览0评论

I'm working on a Django project with PostgreSQL and have implemented a custom filter to select "active" records. The filter is defined as follows:

def get_custom_filter(qs):
    return qs.filter(
        Q(
            Q(timestamp_field__lte=timezone.now(), related_obj__delay__isnull=True)
            | Q(
                related_obj__delay__lte=timezone.now(),
                type_obj__flag=False,
                timestamp_field__isnull=False,
            )
            | Q(type_obj__flag=True, timestamp_field__lte=timezone.now())
        )
    )

I then create a custom filter backend:

class CustomFilterBackend(filters.BaseFilterBackend):
    def filter_queryset(self, request, qs, view):
        return get_custom_filter(qs)

In my view, I include this filter in the filter_backends list. Here’s what happens:

When I include CustomFilterBackend only once, the PostgreSQL query plan shows that after filtering the estimated number of rows for sorting/uniquing is around 7,200 rows. For example, the WHERE clause appears as:

WHERE ("tenant_table"."tenant_id" IN (1)
       AND (("main_table"."delay_field" IS NULL
             AND "primary_table"."timestamp_field" <= TIMESTAMP)
            OR (NOT "type_table"."flag"
                AND "main_table"."delay_field" <= TIMESTAMP
                AND "primary_table"."timestamp_field" IS NOT NULL)
            OR ("type_table"."flag"
                AND "primary_table"."timestamp_field" <= TIMESTAMP))
       AND NOT ( ... ))

When I duplicate the filter (include it twice), the plan shows an estimated row count of about 2,400—roughly one third of the original estimate. The WHERE clause now contains two copies of the active filter condition:

WHERE ("tenant_table"."tenant_id" IN (1)
       AND (("main_table"."delay_field" IS NULL
             AND "primary_table"."timestamp_field" <= TIMESTAMP)
            OR (NOT "type_table"."flag"
                AND "main_table"."delay_field" <= TIMESTAMP
                AND "primary_table"."timestamp_field" IS NOT NULL)
            OR ("type_table"."flag"
                AND "primary_table"."timestamp_field" <= TIMESTAMP))
       AND (("main_table"."delay_field" IS NULL
             AND "primary_table"."timestamp_field" <= TIMESTAMP)
            OR (NOT "type_table"."flag"
                AND "main_table"."delay_field" <= TIMESTAMP
                AND "primary_table"."timestamp_field" IS NOT NULL)
            OR ("type_table"."flag"
                AND "primary_table"."timestamp_field" <= TIMESTAMP))
       AND NOT ( ... ))

Logically, repeating the same condition (i.e. X AND X) should not change the result set. However, PostgreSQL’s planner multiplies the selectivity when it sees the condition twice—if one application of the filter is estimated to pass 50% of rows, duplicating it would estimate 0.5 × 0.5 = 25% of rows. This drastically reduces the number of rows for later operations (such as sorting and uniquing) and leads to a lower overall cost estimate.

My questions are:

  1. Is it correct that PostgreSQL multiplies selectivity when the same condition appears twice in the WHERE clause?
  2. How does PostgreSQL calculate selectivity in these cases?
  3. Is there a recommended way to force the optimizer to obtain the same lower row estimate without duplicating the filter condition in the code?

与本文相关的文章

发布评论

评论列表(0)

  1. 暂无评论