最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

django - How to update large amount of records in chunks? - Stack Overflow

programmeradmin1浏览0评论

I have models:

class Partner(models.Model):
    sap_code = models.CharField(max_length=64, null=True, blank=True, verbose_name='sap id', default=uuid4)
    # another fields


class DeficienciesAct(models.Model):
    partner = models.ForeignKey('partners.Partner', null=True, on_delete=models.CASCADE)
    sap_code = models.CharField(max_length=64, null=True)
    # another fields

Data:

  • Partner:
id sap_code
1 123
2 124
... ...

I have models:

class Partner(models.Model):
    sap_code = models.CharField(max_length=64, null=True, blank=True, verbose_name='sap id', default=uuid4)
    # another fields


class DeficienciesAct(models.Model):
    partner = models.ForeignKey('partners.Partner', null=True, on_delete=models.CASCADE)
    sap_code = models.CharField(max_length=64, null=True)
    # another fields

Data:

  • Partner:
id sap_code
1 123
2 124
... ...
  • DeficienciesAct:
id partner_id sap_code
1 null 123
2 null 333
... ... ...
500000 null 421

Let's imagine that in DeficienciesAct there are 500 000 records. Not all DeficienciesAct.sap_code can be in Partner.sap_code.

So I need to update field DeficienciesAct.partner_id by DeficienciesAct.sap_code. DeficienciesAct.sap_code is equal to Partner.sap_code. For that I developed a script:

DeficienciesAct.objects.filter(
        partner__isnull=True, sap_code__in=Subquery(Partner.objects.values_list("sap_code"))
    ).annotate(
        new_partner_id=Subquery(
            Partner.objects.filter(
                sap_code=OuterRef('sap_code')
            ).values('id')[:1]
        )
    ).update(partner_id=F("new_partner_id"))

Well that works. But I'm afraid that huge amount of records can affect a database (postgres).

Is there any way to do the task in chunks?

Please don't offer change models/tables.

Share Improve this question edited Jan 30 at 15:41 Альберт Александров asked Jan 30 at 14:59 Альберт АлександровАльберт Александров 8752 gold badges20 silver badges44 bronze badges 6
  • If you worry about speed of PostgreSQL, use pg-bouncer – Mehdi Commented Jan 30 at 15:06
  • add a db_index=True on sap_code. – willeM_ Van Onsem Commented Jan 30 at 15:07
  • Is sap_code a unique field on the Partner? – willeM_ Van Onsem Commented Jan 30 at 15:11
  • @willeM_VanOnsem, nope – Альберт Александров Commented Jan 30 at 16:02
  • @АльбертАлександров: but then Partner.objects.filter(sap_code=...).values('id') can return multiple Partners, so then you link "just one" of the relevant Partners. That is a bit troublesome, since depending on which one is picked first, we get a different linking. – willeM_ Van Onsem Commented Jan 30 at 16:05
 |  Show 1 more comment

2 Answers 2

Reset to default 1

Add an index on the sap_code field with db_index=True [Django-doc], like:

class Partner(models.Model):
    sap_code = models.CharField(
        max_length=64,
        null=True,
        blank=True,
        verbose_name='sap id',
        default=uuid4,
        db_index=True,
    )
    # another fields


class DeficienciesAct(models.Model):
    partner = models.ForeignKey(
        'partners.Partner', null=True, on_delete=models.CASCADE
    )
    sap_code = models.CharField(max_length=64, null=True)

this will boost lookups for a given sap_code.

But if sap_code is a unique field for a ForeignKey, we can just use that as value for the ForeignKey, with:

class Partner(models.Model):
    sap_code = models.CharField(
        max_length=64,
        verbose_name='sap id',
        default=uuid4,
        db_index=True,
        unique=True,
    )


class DeficienciesAct(models.Model):
    partner = models.ForeignKey(
        'partners.Partner', null=True, on_delete=models.CASCADE
    )
    sap_code = models.CharField(max_length=64, null=True)
    new_partner = models.ForeignKey(
        'partners.Partner', null=True, to_field='sap_code'
    )

We can then work with:

DeficienciesAct.objects.update(new_partner_id=F('sap_code'))

and eventually thus get rid of the "old" partner field, and rename new_parter to Partner and get rid of the sap_code field in DeficienciesAct. This because Django then uses the sap_code field to refer to a Partner.

That worked:

batch_size = 1000
deficiency_acts_for_update = DeficienciesAct.objects.filter(
    partner__isnull=True, sap_code__in=Subquery(Partner.objects.values_list("sap_code"))
).order_by('id')
while True:
    if not deficiency_acts_for_update.count():
        break
    deficiency_acts_for_update.filter(
        id__in=Subquery(deficiency_acts_for_update.values_list("id")[:batch_size])
    ).annotate(
        _partner_id=Subquery(
            Partner.objects.filter(
                sap_code=OuterRef('sap_code')
            ).values('id')[:1]
        )
    ).update(partner_id=F("_partner_id"))
发布评论

评论列表(0)

  1. 暂无评论