I need to update a column in a PostGIS database.
I want to do that with this:
UPDATE
cj_geometry
SET
groundgeometry = CASE WHEN NOT ST_IsValid(groundgeometry) THEN ST_Multi(
ST_CollectionExtract(
ST_MakeValid(groundgeometry),
3
)
) ELSE ST_MakeValid(groundgeometry) END
It's around 65,000,000 rows.
I let it run for an hour or so, then realized that there was no space left on the device, so I quit the query and restarted the database. It did restart, but it's now super big, without new data. So two questions:
- How do I get it back to its original size?
- Why does this use so much disk space? How can I avoid this?
3 Answers
Postgres does not release storage back to the OS by default. Databases tend only to grow, so there's not much call to release storage back. Postgres keeps this storage around so that future inserts and updates can simply reuse the space that points to dead tuples. This is faster than asking the OS to extend the size of the file.
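If you want to gauge how much of that retained space is currently dead, the standard statistics views give a rough idea (cj_geometry is the table from the question; the tuple counts are approximate):
-- approximate live vs dead tuple counts, plus total on-disk size
SELECT n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size('cj_geometry')) AS total_size
FROM pg_stat_user_tables
WHERE relname = 'cj_geometry';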
Reclaiming Disk Space
To reclaim assigned storage you can do a full vacuum. This essentially transfers the table to a new file, compacting the backing-file to its smallest possible size. This requires that you have enough space to store the old backing-file and the new backing-file simultaneously. So if you are very low on disk space, then this may not be an option.
VACUUM (FULL) mytable;
If you cannot do a full vacuum because of limited disk space, then your only options are to dump and reload the table, or to transfer the table to and from a tablespace on a different disk (mentioned in more detail below). Of the two, dump and reload is perhaps the simpler option. You need to dump/export the table to a different host or partition, truncate/drop the table, and then load/import the table again. This may require you to (temporarily) drop any foreign key constraints that point to your table. As such, to keep database integrity, you must stop and prevent any other users from accessing the database whilst you are performing this process. Once you have restored all the dropped constraints you can re-enable access to the database.
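As a sketch of the dump-and-reload route done purely in SQL, you could use COPY; this assumes a superuser (or a role in pg_write_server_files and pg_read_server_files) and a hypothetical mount point with enough free space:
-- export the table to a disk with free space (path is hypothetical)
COPY cj_geometry TO '/mnt/backup/cj_geometry.tsv';
-- drop all rows and release the table's storage in one step
-- (fails if other tables still have foreign keys pointing here)
TRUNCATE cj_geometry;
-- reload the data
COPY cj_geometry FROM '/mnt/backup/cj_geometry.tsv';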
Tablespaces
If you do not have enough space on the current disk that hosts your postgres data directory, but you do have a separate disk on your host that does have enough space, then you can use it to trim the table without the data ever leaving the postgres server. This approach has the benefit of postgres being able to maintain its ACID guarantees without any manual assistance from you. To do this you need to:
- Mount the disk
- Initialise a data directory on the mount point (owned by the postgres user)
- Create a tablespace that points to this directory
- ALTER TABLE to use this new tablespace. This will transfer the table to your other disk, compacting it as it goes.
- Transfer the table back to the original tablespace
- Drop the tablespace, and unmount the disk.
Creating the Tablespace
Mount the disk and create a directory on the mount point that is owned by postgres. Then make the directory available as a tablespace in postgres:
CREATE TABLESPACE recoveryspace LOCATION '/path/to/mount/postgresql/data';
Compacting the Table
This is a big operation, and access to the table will be blocked whilst it is ongoing.
-- send table to disk with free space.
-- this compacts the table, and releases the space on your primary disk
ALTER TABLE mytable SET TABLESPACE recoveryspace;
-- send the table back to your primary tablespace/disk
ALTER TABLE mytable SET TABLESPACE pg_default;
Delete the Tablespace
Once you have transferred the table back to its original tablespace, you can drop the tablespace and then unmount the disk.
DROP TABLESPACE recoveryspace;
How to Avoid Using Too Much Disk Space
To avoid this problem in future you can do incremental updates. This reduces the total number of reachable tuples at any given time, and so reduces the maximum size of the backing-file. That is, when you update every row in a table, postgres must be able to access every before and after row at the point you commit the transaction. This means that the backing file will have to be twice its normal size to accommodate all these rows.
If you only update 1 million rows or so at a time, then the total size of the backing-file only needs to be large enough to accommodate 66 million tuples (65m + 1m). When you update the next batch, there will be 1 million tuples that are no longer reachable, and postgres can safely reuse this space. You will likely have to tell postgres that it should recycle these unreachable tuples, though. For this technique to work, there must be no long-running transactions that keep the old rows reachable.
BEGIN;
UPDATE mytable SET mycol = newval WHERE <criteria>;
COMMIT;
VACUUM; -- recycles unreachable tuples, but retains file storage
<criteria> should be able to exclude previously updated rows. A simple criterion is to use a unique column or primary key on the table, e.g.
UPDATE mytable
SET mycol = f(x)
WHERE id IN (
    SELECT id FROM mytable WHERE 0 <= id AND id < 1000000
);
-- and then on the next update
-- WHERE 1000000 <= id AND id < 2000000
You can run this update in a loop (committing and vacuuming between each update), changing the ids each time, until all rows have been successfully updated.
This requires you to keep track of which ids have been updated. If you can inspect a row and tell whether it needs updating, then you do not need to keep track of the id column between queries. Instead, use that check in the where clause when selecting which rows to update.
UPDATE mytable
SET mycol = f(x)
WHERE id in (
SELECT id FROM mytable
-- exclude rows that have been successfully updated
WHERE is_invalid(mycol)
LIMIT 1000000
);
If there is no unique identifier on the table then you can use the system column ctid to help you limit which rows are selected for update. This requires you to be able to tell whether a row needs updating or not. If you cannot, then simply add a temporary boolean column (with no default) and set it to true when updated. At the end of the updates you can then drop the column. This will lead to some bloat in the table, but it will disappear over time as the rows get updated.
UPDATE
mytable
SET
mycol = f(x)
WHERE
ctid IN (
SELECT ctid FROM mytable
WHERE is_invalid(mycol)
LIMIT 1000000
);
Using a temporary column
ALTER TABLE mytable ADD COLUMN updated boolean;
BEGIN;
UPDATE
mytable
SET
mycol = f(x), updated = true
WHERE
ctid IN (
SELECT ctid FROM mytable
WHERE updated IS NULL
LIMIT 1000000
);
COMMIT;
VACUUM;
-- repeat update, commit and vacuum until done
-- clean up table
ALTER TABLE mytable DROP COLUMN updated;
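Between iterations you can check how many rows are still waiting, using the flag column from the sketch above:
-- rows not yet updated (the updated column is from the example above)
SELECT count(*) AS remaining FROM mytable WHERE updated IS NULL;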
To shrink the table, end all long-running transactions, close all prepared transactions, drop all stale replication slots and run
VACUUM (FULL) table_name;
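To find those blockers before vacuuming, the standard system views are enough; for example:
-- long-running transactions; old ones keep dead rows reachable
SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
-- prepared transactions that were never committed or rolled back
SELECT gid, prepared, owner FROM pg_prepared_xacts;
-- replication slots; stale ones can also hold back cleanup
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;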
The cause for the problem is that you are updating all rows of the table in a single statement and consequently in a single transaction. That can easily double the size of the table.
- How do I get it back to its original size?

With VACUUM FULL cj_geometry;, as already pointed out by @Laurenz Albe. Since you're dealing with geometries, you can actually go below its original size, if you alter table..alter column to ditch the default pglz used for transparent compression and go with a slightly more efficient lz4:
alter table cj_geometry
alter column groundgeometry
set compression lz4;
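Note that lz4 column compression requires PostgreSQL 14 or newer, and set compression only applies to newly written values, so existing rows keep pglz until they are rewritten. You can check what the stored values actually use:
-- which compression method the stored geometries use
-- (NULL means the value was small enough to be stored uncompressed)
SELECT pg_column_compression(groundgeometry) AS method, count(*)
FROM cj_geometry
GROUP BY 1;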
If you maintain your database as a read-only data source, you can dump and re-build, as long as you can afford the downtime. Both vacuum full and the db re-upload cause all indexes to be rebuilt.
- Why does this use so much disk space?

Postgres doesn't do in-place updates, if that's what you expected. In its MVCC model, an update is really a combination of insert and delete - the old version of the row stays in place and gets marked for deletion, and the new one is written separately, elsewhere. The old one is kept for as long as there's at least one transaction that started working on that table before the deletion of that row. After its commit/rollback, when vacuum comes, it'll reclaim the old row's space, enabling new inserts and updates on the table to re-use it.
That's why you can regularly see vacuum running on its own, as part of autovacuum, which is enabled by default.
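You can check when vacuum last visited the table, manually or automatically:
-- last manual and automatic vacuum times for the table (approximate stats)
SELECT last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'cj_geometry';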
- How can I avoid this?

Your update re-writes the entire table, so it'll also re-write all its indexes. If you drop them, the whole thing will temporarily shrink in size, and the update will actually run faster, free from having to also reflect each change in each index. You can save their definitions first:
create temp table index_definitions as select indexdef
from pg_indexes where schemaname='public' and tablename='cj_geometry';
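Dropping them can then follow the same pattern; a sketch assuming none of these indexes back a primary key or unique constraint (those would have to be dropped through their constraints instead):
do $p$ declare indexname_ text;
begin for indexname_ in select indexname from pg_indexes
                        where schemaname='public' and tablename='cj_geometry' loop
  execute format('DROP INDEX public.%I', indexname_);
end loop;
end $p$;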
And after you're done you can rebuild them - that's what would happen on its own anyways:
do $p$ declare indexdef_ text;
begin for indexdef_ in select indexdef from index_definitions loop
execute indexdef_;
end loop;
drop table index_definitions;
end $p$;
As hinted by @Dunes, you can also try to split your update in batches, vacuuming in between:
UPDATE cj_geometry
SET groundgeometry = CASE WHEN NOT ST_IsValid(groundgeometry)
                          THEN ST_Multi(
                                 ST_CollectionExtract(
                                   ST_MakeValid(groundgeometry),
                                   3
                                 )
                               )
                          ELSE ST_MakeValid(groundgeometry)
                     END
WHERE id IN (SELECT id FROM cj_geometry
             ORDER BY id
             LIMIT 1e7
             OFFSET 1e7);
VACUUM cj_geometry;
Change your limit and offset accordingly - this example shows the 2nd batch of 10 million rows. If you lack a regular primary key column, there's also a hidden system column, ctid, you can use to identify a row throughout the statement. If you can afford a maintenance window and your table is modifiable only by you, you can try that instead.
Note that VACUUM has to run in its own transaction with nothing else in it, so the above example assumes you're running these in autocommit mode.
Also, that will only prevent this one table's footprint from increasing by forcing the table to recycle the space on its pages more often - plain vacuum reclaims space internally, only for that table to re-use. Unlike lifting indexes and changing compression, this won't release any space to the system, only prevent the table from requesting and taking up more than it needs to re-write one batch of rows.