UserID | First | Middle | Last | Type | CreatedAt |
---|---|---|---|---|---|
123 | John | Henry | Doe | Mage | 03-28-2025 |
UserID | First | Middle | Last | Type | CreatedAt |
---|---|---|---|---|---|
123 | John | Henry | Doe | Mage | 03-28-2025 |
Let's say I have the following table above. I would like to create index to help speed up my queries.
All queries would like along the lines of this:
Select *
from users
where Type = 'SomeType'
and First = 'SomeName1'
Order by CreatedAt DESC;
Select *
from users
where Type = 'SomeType'
and Middle = 'SomeName2'
Order by CreatedAt DESC;
Select *
from users
where Type = 'SomeType'
and Last = 'SomeName3'
Order by CreatedAt DESC;
How would I index the columns to make the queries efficient? Would CreatedAt
be the first of the indexed column?
I'm thinking
CREATE INDEX idx_users on users(CreatedAt, Type, First, Middle, Last)
CreatedAt
and type
would always be used, whereas first, middle, and last varies.
3 Answers
Reset to default 3The index you proposed might be unhelpful:
CREATE INDEX idx_users on users(CreatedAt, Type, First, Middle, Last)
The fact that you're selecting and ordering by CreatedAt
doesn't mean it's needed as an index key. In my test on 400k samples, none of the three queries you showed ended up using it. They all ended up running a sequential scan that took above 150ms
.
If you simply remove createdat
from the index, they're happy to use it for a Bitmap Index Scan taking below 5ms
. You can add more tweaks depending on how much input latency you're ready to tolerate and how much space you're willing to spend on accommodating the indexes.
Here's a comparison of storage space they take up on 200k rows, with query times in ms averaged on 45 calls, plus the index size and how much more space you need for it, in % compared to the base table size:
demo at db<>fiddle
variant | query1 | query2 | query3 | indexes size | %size |
---|---|---|---|---|---|
(Type,First,Middle,Last) include(createdat,userid); |
0.159 | 0.148 | 0.148 | 35 MB | 104 |
(Type,First,CreatedAt) INCLUDE(Middle,Last,UserID); (Type,Middle,CreatedAt) INCLUDE(First,Last,UserID); (Type,Last,CreatedAt) INCLUDE (First,Middle,UserID); charlieface |
0.172 | 0.165 | 0.161 | 108 MB | 325 |
(Type,First,Middle,Last); |
0.197 | 0.187 | 0.182 | 33 MB | 98 |
(Type,first); (Type,middle); (Type,last); |
0.196 | 0.191 | 0.179 | 4912 kB | 14 |
(first,type,createdat desc); (middle,type,createdat desc); (last,type,createdat desc); Thorsten Kettner |
0.214 | 0.210 | 0.202 | 53 MB | 161 |
(Type) include(First,Middle,Last, UserID,CreatedAt); |
2.712 | 2.689 | 2.399 | 34 MB | 103 |
(Type); |
5.098 | 5.024 | 4.627 | 1456 kB | 4 |
(CreatedAt,Type,First,Middle,Last); |
77.073 | 76.153 | 76.333 | 32 MB | 97 |
Indexing is about selectivity. If there was an index, and a query selected 50% of the rows in the table for instance, then it would be a bad idea for the DBMS to use the index, as this would be a lot of work compared to just reading all table rows sequentially.
There are two columns in each where clause. Which of the two is more selective? Let's look at the first query: I would expect there to be more distinct first names then types, so the first name would be the first column in the index and the type only be the second one. So, an index for the first query would look like this:
create index idx_1 on users (first, type);
As you are using equality comparison only (Type = ... , First = ...) you will hit one position in the index where you will find all matches. This allows us to add the CreatedAt column to the index, so all those matches would already be sorted:
create index idx_1 on users (first, type, createdat desc);
For the other two queries you'd have accordingly:
create index idx_2 on users (middle, type, createdat desc);
create index idx_3 on users (last, type, createdat desc);
So, you need three different indexes for the three queries.
If you only wanted to provide a single index for all queries instead, then you'd index the type, because this is the column in the where clause all three queries have in common. You could add the other columns, so as to provide them before the table is read, but I doubt the DBMS would use the index, anyway, because I don't assume the type to be selective enough. One query might benefit, namely the one where the second indexed column would match the column in the where clause, but that's it. So, providing one index only isn't a promising option. You must provide three indexes, one per query.
That index is not useful, as CreatedAt
is the first column, which means all the other columns cannot be seeked.
Furthermore:
- As it doesn't fully "cover" the query, which uses
select *
. You would need all other columns in aninclude
. Middle
andLast
are not helping you by being in the index key, as they are afterFirst
. So this index (even if it was covering) will only help for the first query.
So you need three separate indexes.
(Type, First, CreatedAt) INCLUDE (Middle, Last, UserID)
(Type, Middle, CreatedAt) INCLUDE (First, Last, UserID)
(Type, Last, CreatedAt) INCLUDE (First, Middle, UserID)
You can see in the fiddle that this gets you an index-only scan in all three queries.
db<>fiddle
CREATE INDEX idx_users on users(Type, First, Middle, Last)
; orCREATE INDEX idx_users on users(Type, First, Middle, Last)include(createdat,userid);
if it fits and you're after index-only scans. Ifcreatedat
anduserid
aren't inwhere
, they don't need to be index keys, only an optionallyinclude
d payload. If the type you're searching for is constant, you can make that a partial index by removing it from the keys and adding awhere(type='sometype')
to the definition. You might be interested in index types that support pattern matching (gin_trgm_ops
) for the names. – Zegarek Commented Mar 28 at 6:50