I need to update a column in a PostGIS database.
I want to do that with this:
UPDATE
cj_geometry
SET
groundgeometry = CASE WHEN NOT ST_IsValid(groundgeometry) THEN ST_Multi(
ST_CollectionExtract(
ST_MakeValid(groundgeometry),
3
)
) ELSE ST_MakeValid(groundgeometry) END
It's around 65,000,000 rows.
I let it run for an hour or so, then realized that there was no space left on the device, so I quit the query and restarted the database. It did restart, but it's now super big, without new data. So two questions:
- How do I get it back to its original size?
- Why does this use so much disk space? How can I avoid this?
3 Answers
Postgres does not release storage back to the OS by default. Databases tend only to grow, so there's not much call to release storage back. Postgres keeps this storage around so that future inserts and updates can simply reuse the space that points to dead tuples. This is faster than asking the OS to extend the size of the file.
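If you want to gauge how much of that retained space is currently dead, the standard statistics views give a rough idea (cj_geometry is the table from the question; the tuple counts are approximate):
-- approximate live vs dead tuple counts, plus total on-disk size
SELECT n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size('cj_geometry')) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'cj_geometry';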
Reclaiming Disk Space
To reclaim assigned storage you can do a full vacuum. This essentially transfers the table to a new file, compacting the backing-file to its smallest possible size. This requires that you have enough space to store the old backing-file and the new backing-file simultaneously. So if you are very low on disk space, then this may not be an option.
VACUUM (FULL) mytable;
If you cannot do a full vacuum because of limited disk space, then your only options are to dump and reload the table, or to transfer the table to and from a tablespace on a different disk (mentioned in more detail below). Of the two, dump and reload is perhaps the simpler option. You need to dump/export the table to a different host or partition, truncate/drop the table, and then load/import the table again. This may require you to (temporarily) drop any foreign key constraints that point to your table. As such, to keep database integrity, you must stop and prevent any other users from accessing the database whilst you are performing this process. Once you have restored all the dropped constraints you can re-enable access to the database.
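As a sketch of the dump-and-reload route done purely in SQL, you could use COPY; this assumes a superuser (or a role in pg_write_server_files and pg_read_server_files) and a hypothetical mount point with enough free space:
-- export the table to a disk with free space (path is hypothetical)
COPY cj_geometry TO '/mnt/backup/cj_geometry.tsv';
-- drop all rows and release the table's storage in one step
-- (fails if other tables still have foreign keys pointing here)
TRUNCATE cj_geometry;
-- reload the data
COPY cj_geometry FROM '/mnt/backup/cj_geometry.tsv';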
Tablespaces
If you do not have enough space on the current disk that hosts your postgres data directory, but you do have a separate disk on your host that does have enough space, then you can use it to trim the table without the data ever leaving the postgres server. This approach has the benefit of postgres being able to maintain its ACID guarantees without any manual assistance from you. To do this you need to:
- Mount the disk
- Initialise a data directory on the mount point (owned by the postgres user)
- Create a tablespace that points to this directory
- ALTER TABLE to use this new tablespace. This will transfer the table to your other disk, compacting it as it goes.
- Transfer the table back to the original tablespace
- Drop the tablespace, and unmount the disk.
Creating the Tablespace
Mount the disk and create a directory on the mount point that is owned by postgres. Then make the directory available as a tablespace in postgres:
CREATE TABLESPACE recoveryspace LOCATION '/path/to/mount/postgresql/data';
Compacting the Table
This is a big operation, and access to the table will be blocked whilst it is ongoing.
-- send table to disk with free space.
-- this compacts the table, and releases the space on your primary disk
ALTER TABLE mytable SET TABLESPACE recoveryspace;
-- send the table back to your primary tablespace/disk
ALTER TABLE mytable SET TABLESPACE pg_default;
Delete the Tablespace
Once you have transferred the table back to its original tablespace, you can drop the tablespace and then unmount the disk.
DROP TABLESPACE recoveryspace;
How to Avoid Using Too Much Disk Space
To avoid this problem in future you can do incremental updates. This reduces the total number of reachable tuples at any given time, and so reduces the maximum size of the backing-file. That is, when you update every row in a table, postgres must be able to access every before and after row at the point you commit the transaction. This means that the backing file will have to be twice its normal size to accommodate all these rows.
If you only update 1 million rows or so at a time, then the total size of the backing-file only needs to be large enough to accommodate 66 million tuples (65m + 1m). When you update the next batch, there will be 1 million tuples that are no longer reachable, and postgres can safely reuse this space. You will likely have to tell postgres that it should recycle these unreachable tuples, though. For this technique to work, there must be no long-running transactions that keep the old rows reachable.
BEGIN;
UPDATE mytable SET mycol = newval WHERE <criteria>;
COMMIT;
VACUUM; -- recycles unreachable tuples, but retains file storage
<criteria> should be able to exclude previously updated rows. A simple criterion is to use a unique column or primary key on the table, e.g.
UPDATE mytable
SET mycol = f(x)
WHERE id IN (
    SELECT id FROM mytable WHERE 0 <= id AND id < 1000000
);
-- and then on the next update
-- WHERE 1000000 <= id AND id < 2000000
You can run this update in a loop (committing and vacuuming between each update), changing the ids each time, until all rows have been successfully updated.
This requires you to keep track of which ids have been updated. If you can inspect a row and tell whether it needs updating, then you do not need to keep track of the id column between queries. Instead, use that check in the where clause when selecting which rows to update.
UPDATE mytable
SET mycol = f(x)
WHERE id in (
SELECT id FROM mytable
-- exclude rows that have been successfully updated
WHERE is_invalid(mycol)
LIMIT 1000000
);
If there is no unique identifier on the table then you can use the system column ctid to help you limit which rows are selected for update. This requires you to be able to tell whether a row needs updating or not. If you cannot, then simply add a temporary boolean column (with no default) and set it to true when updated. At the end of the updates you can then drop the column. This will lead to some bloat in the table, but it will disappear over time as the rows get updated.
UPDATE
mytable
SET
mycol = f(x)
WHERE
ctid IN (
SELECT ctid FROM mytable
WHERE is_invalid(mycol)
LIMIT 1000000
);
Using a temporary column
ALTER TABLE mytable ADD COLUMN updated boolean;
BEGIN;
UPDATE
mytable
SET
mycol = f(x), updated = true
WHERE
ctid IN (
SELECT ctid FROM mytable
WHERE updated IS NULL
LIMIT 1000000
);
COMMIT;
VACUUM;
-- repeat update, commit and vacuum until done
-- clean up table
ALTER TABLE mytable DROP COLUMN updated;
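Between iterations you can check how many rows are still waiting, using the flag column from the sketch above:
-- rows not yet updated (the updated column is from the example above)
SELECT count(*) AS remaining FROM mytable WHERE updated IS NULL;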
To shrink the table, end all long-running transactions, close all prepared transactions, drop all stale replication slots and run
VACUUM (FULL) table_name;
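To find those blockers before vacuuming, the standard system views are enough; for example:
-- long-running transactions; old ones keep dead rows reachable
SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
-- prepared transactions that were never committed or rolled back
SELECT gid, prepared, owner FROM pg_prepared_xacts;
-- replication slots; stale ones can also hold back cleanup
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;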
The cause for the problem is that you are updating all rows of the table in a single statement and consequently in a single transaction. That can easily double the size of the table.
- How do I get it back to its original size?

With VACUUM FULL cj_geometry;, as already pointed out by @Laurenz Albe. Since you're dealing with geometries, you can actually go below its original size, if you alter table..alter column to ditch the default pglz used for transparent compression and go with a slightly more efficient lz4:
alter table cj_geometry
alter column groundgeometry
set compression lz4;
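Note that lz4 column compression requires PostgreSQL 14 or newer, and set compression only applies to newly written values, so existing rows keep pglz until they are rewritten. You can check what the stored values actually use:
-- which compression method the stored geometries use
-- (NULL means the value was small enough to be stored uncompressed)
SELECT pg_column_compression(groundgeometry) AS method, count(*)
FROM cj_geometry
GROUP BY 1;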
If you maintain your database as a read-only data source, you can dump and re-build, as long as you can afford the downtime. Both vacuum full and the db re-upload cause all indexes to be rebuilt.
- Why does this use so much disk space?

Postgres doesn't do in-place updates, if that's what you expected. In its MVCC model, an update is really a combination of insert and delete - the old version of the row stays in place and gets marked for deletion, and the new one is written separately, elsewhere. The old one is kept for as long as there's at least one transaction that started working on that table before the deletion of that row. After its commit/rollback, when vacuum comes, it'll reclaim the old row's space, enabling new inserts and updates on the table to re-use it.
That's why you can regularly see vacuum running on its own, as part of autovacuum, which is enabled by default.
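You can check when vacuum last visited the table, manually or automatically:
-- last manual and automatic vacuum times for the table (approximate stats)
SELECT last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'cj_geometry';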
- How can I avoid this?

Your update re-writes the entire table, so it'll also re-write all its indexes. If you drop them, the whole thing will temporarily shrink in size, and the update will actually run faster, free from having to also reflect each change in each index. You can save their definitions first:
create temp table index_definitions as select indexdef
from pg_indexes where schemaname='public' and tablename='cj_geometry';
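Dropping them can then follow the same pattern; a sketch assuming none of these indexes back a primary key or unique constraint (those would have to be dropped through their constraints instead):
do $p$ declare indexname_ text;
begin for indexname_ in select indexname from pg_indexes
                        where schemaname='public' and tablename='cj_geometry' loop
  execute format('DROP INDEX public.%I', indexname_);
end loop;
end $p$;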
And after you're done you can rebuild them - that's what would happen on its own anyways:
do $p$ declare indexdef_ text;
begin for indexdef_ in select indexdef from index_definitions loop
execute indexdef_;
end loop;
drop table index_definitions;
end $p$;
As hinted by @Dunes, you can also try to split your update in batches, vacuuming in between:
UPDATE cj_geometry
SET groundgeometry = CASE WHEN NOT ST_IsValid(groundgeometry)
                          THEN ST_Multi(
                                 ST_CollectionExtract(
                                   ST_MakeValid(groundgeometry),
                                   3
                                 )
                               )
                          ELSE ST_MakeValid(groundgeometry)
                     END
WHERE id IN (SELECT id FROM cj_geometry
             ORDER BY id
             LIMIT 1e7
             OFFSET 1e7);
VACUUM cj_geometry;
Change your limit and offset accordingly - this example shows the 2nd batch of 10 million rows. If you lack a regular primary key column, there's also a hidden system column, ctid, you can use to identify a row throughout the statement. If you can afford a maintenance window and your table is modifiable only by you, you can try that instead.
Note that VACUUM has to run in its own transaction with nothing else in it, so the above example assumes you're running these in autocommit mode.
Also, that will only prevent this one table's footprint from increasing by forcing the table to recycle the space on its pages more often - plain vacuum reclaims space internally, only for that table to re-use. Unlike lifting indexes and changing compression, this won't release any space to the system, only prevent the table from requesting and taking up more than it needs to re-write one batch of rows.