I am working with a dataset that tracks folder movements within conversations. I have a folder_movement table that contains conversation_id, updated_at, and movement details.
I created a folder_cum table where I extracted conversation_history_id_fp (starting point) and conversation_history_id_endp (ending point) for each interaction (1 Interaction = 1 Entrie and 1 exit from CE triage). Now, I need to count how many folder movements exist between conversation_history_id_fp
and conversation_history_id_endp
, while maintaining order by conversation_id
and updated_at
.
The challenge is that I cannot group by conversation_id, since a single conversation can have multiple interactions with different folder movements. Instead, I need to count the rows that fall between the two extracted IDs in the folder_movement
table, and this table must be ordered by conversation_id
and updated_at
.
WITH
DATA AS (
SELECT
conversation_history_id,
conversation_id,
convo.created_at AT time zone 'cst' AS updated_at,
convo.user_id,
convo.conversation_folder_name AS folder_to,
LAG(convo.conversation_folder_name, 1) OVER (
ORDER BY
convo.conversation_id, convo.created_at ASC
) folder_from,
convo.status,
LAG(convo.status, 1) OVER (
ORDER BY
convo.conversation_id,convo.created_at ASC
) previous_status,
convo.snooze_until,
convo.snooze_note
FROM conversation_history convo
ORDER BY
conversation_id,updated_at DESC
--limit 1000
),
folder_movement as (
SELECT
conversation_history_id,
conversation_id,
updated_at,
folder_from,
folder_to,
user_id
FROM DATA
WHERE folder_to <> folder_from
),
folder_cum as (
SELECT
conversation_history_id as conversation_history_id_fp, -- start point
conversation_id,
updated_at,
folder_from,
folder_to,
user_id,
LEAD(conversation_id, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) conversation_id_check,
LEAD(updated_at, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) close_at,
LEAD(user_id, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) closed_by,
LEAD(folder_from, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) last_interaction_folder,
LEAD(conversation_history_id, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) conversation_history_id_endp
FROM folder_movement
WHERE (folder_from = '/CE triage/') or
(folder_to = '/CE triage/')
order by updated_at desc
),
interactions_number as ( -- help here please --
SELECT
fp.conversation_history_id_fp,
endp.conversation_history_id_endp,
COUNT(folder_movement.conversation_history_id) as mov_count
FROM folder_movement
LEFT JOIN folder_cum fp on folder_movement.conversation_history_id = fp.conversation_history_id_fp
LEFT JOIN folder_cum endp on folder_movement.conversation_history_id = endp.conversation_history_id_endp
GROUP BY fp.conversation_history_id_fp, endp.conversation_history_id_endp
),
prueba as (
SELECT
folder_cum.conversation_id,
folder_from,
updated_at as open_at,
case
when user_id = 5782129474346356288 then true --software id
else false
end as bot, --True = chatbot movement
folder_to as first_interaction_folder,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null --If it is different, --it means the interaction is not closed yet, so it should be NULL
else mov_count end as mov_count,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null
else close_at end as close_at,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null
else closed_by end as closed_by,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null
else last_interaction_folder end as last_interaction_folder
FROM folder_cum
LEFT JOIN interactions_number on folder_cum.conversation_history_id_fp = interactions_number.conversation_history_id_fp
where (folder_from = '/CE triage/')
and (updated_at is not null)
)
select
conversation_id,
folder_from,
open_at,
bot,
first_interaction_folder,
mov_count,
close_at,
bright_user.email as closed_by,
last_interaction_folder
from prueba
left join bright_user on bright_user.user_id = prueba.closed_by
--where conversation_id = '4442463178442263265' --test
ORDER BY open_at desc
Here’s a simplified version of my tables:
folder_movement (Extracted from DATA)
conversation_history_id | conversation_id | updated_at | folder_from | folder_to |
---|---|---|---|---|
01 | A1 | 2024-01-03 10:49:00 | Folder 2 | CE Triage |
02 | A1 | 2024-01-03 10:42:00 | Folder 1 | Folder 2 |
03 | A1 | 2024-01-03 10:41:00 | CE Triage | Folder 1 |
08 | A1 | 2023-01-03 13:45:00 | Folder 4 | CE Triage |
09 | A1 | 2023-01-03 13:41:00 | CE Triage | Folder 4 |
04 | A2 | 2024-01-02 12:47:00 | Folder 1 | CE Triage |
05 | A2 | 2024-01-02 12:45:00 | Folder 3 | Folder 1 |
06 | A2 | 2024-01-02 12:42:00 | CE Triage | Folder 3 |
07 | A3 | 2024-01-02 12:52:00 | CE Triage | Folder 3 |
I am working with a dataset that tracks folder movements within conversations. I have a folder_movement table that contains conversation_id, updated_at, and movement details.
I created a folder_cum table where I extracted conversation_history_id_fp (starting point) and conversation_history_id_endp (ending point) for each interaction (1 Interaction = 1 Entrie and 1 exit from CE triage). Now, I need to count how many folder movements exist between conversation_history_id_fp
and conversation_history_id_endp
, while maintaining order by conversation_id
and updated_at
.
The challenge is that I cannot group by conversation_id, since a single conversation can have multiple interactions with different folder movements. Instead, I need to count the rows that fall between the two extracted IDs in the folder_movement
table, and this table must be ordered by conversation_id
and updated_at
.
WITH
DATA AS (
SELECT
conversation_history_id,
conversation_id,
convo.created_at AT time zone 'cst' AS updated_at,
convo.user_id,
convo.conversation_folder_name AS folder_to,
LAG(convo.conversation_folder_name, 1) OVER (
ORDER BY
convo.conversation_id, convo.created_at ASC
) folder_from,
convo.status,
LAG(convo.status, 1) OVER (
ORDER BY
convo.conversation_id,convo.created_at ASC
) previous_status,
convo.snooze_until,
convo.snooze_note
FROM conversation_history convo
ORDER BY
conversation_id,updated_at DESC
--limit 1000
),
folder_movement as (
SELECT
conversation_history_id,
conversation_id,
updated_at,
folder_from,
folder_to,
user_id
FROM DATA
WHERE folder_to <> folder_from
),
folder_cum as (
SELECT
conversation_history_id as conversation_history_id_fp, -- start point
conversation_id,
updated_at,
folder_from,
folder_to,
user_id,
LEAD(conversation_id, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) conversation_id_check,
LEAD(updated_at, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) close_at,
LEAD(user_id, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) closed_by,
LEAD(folder_from, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) last_interaction_folder,
LEAD(conversation_history_id, 1) OVER (
ORDER BY
conversation_id, updated_at asc
) conversation_history_id_endp
FROM folder_movement
WHERE (folder_from = '/CE triage/') or
(folder_to = '/CE triage/')
order by updated_at desc
),
interactions_number as ( -- help here please --
SELECT
fp.conversation_history_id_fp,
endp.conversation_history_id_endp,
COUNT(folder_movement.conversation_history_id) as mov_count
FROM folder_movement
LEFT JOIN folder_cum fp on folder_movement.conversation_history_id = fp.conversation_history_id_fp
LEFT JOIN folder_cum endp on folder_movement.conversation_history_id = endp.conversation_history_id_endp
GROUP BY fp.conversation_history_id_fp, endp.conversation_history_id_endp
),
prueba as (
SELECT
folder_cum.conversation_id,
folder_from,
updated_at as open_at,
case
when user_id = 5782129474346356288 then true --software id
else false
end as bot, --True = chatbot movement
folder_to as first_interaction_folder,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null --If it is different, --it means the interaction is not closed yet, so it should be NULL
else mov_count end as mov_count,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null
else close_at end as close_at,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null
else closed_by end as closed_by,
CASE
WHEN folder_cum.conversation_id <> conversation_id_check then null
else last_interaction_folder end as last_interaction_folder
FROM folder_cum
LEFT JOIN interactions_number on folder_cum.conversation_history_id_fp = interactions_number.conversation_history_id_fp
where (folder_from = '/CE triage/')
and (updated_at is not null)
)
select
conversation_id,
folder_from,
open_at,
bot,
first_interaction_folder,
mov_count,
close_at,
bright_user.email as closed_by,
last_interaction_folder
from prueba
left join bright_user on bright_user.user_id = prueba.closed_by
--where conversation_id = '4442463178442263265' --test
ORDER BY open_at desc
Here’s a simplified version of my tables:
folder_movement (Extracted from DATA)
conversation_history_id | conversation_id | updated_at | folder_from | folder_to |
---|---|---|---|---|
01 | A1 | 2024-01-03 10:49:00 | Folder 2 | CE Triage |
02 | A1 | 2024-01-03 10:42:00 | Folder 1 | Folder 2 |
03 | A1 | 2024-01-03 10:41:00 | CE Triage | Folder 1 |
08 | A1 | 2023-01-03 13:45:00 | Folder 4 | CE Triage |
09 | A1 | 2023-01-03 13:41:00 | CE Triage | Folder 4 |
04 | A2 | 2024-01-02 12:47:00 | Folder 1 | CE Triage |
05 | A2 | 2024-01-02 12:45:00 | Folder 3 | Folder 1 |
06 | A2 | 2024-01-02 12:42:00 | CE Triage | Folder 3 |
07 | A3 | 2024-01-02 12:52:00 | CE Triage | Folder 3 |
interactions_number (Expected)
conversation_history_id_endp | conversation_history_id_fp | conversation_id | mov_count |
---|---|---|---|
03 | 01 | A1 | 3 |
09 | 08 | A1 | 2 |
06 | 04 | A2 | 3 |
07 | 06 | A3 | 2 |
(Conversation ID A3, since the interaction is not finalized (folder_to = 'CE Triage'
), will share the conversation_history_id
from the immediately preceding row. This does not matter because, later in the code, the prueba table filters out unfinished interactions. Check the comments in first case in prueba)
I am expecting to get a count of all folder movements that occurred between conversation_history_id_fp and conversation_history_id_endp for each interaction, while ensuring the order is preserved (conversation_id and updated_at)
Share Improve this question edited Feb 8 at 13:29 Guillaume Outters 1,1608 silver badges20 bronze badges asked Feb 6 at 22:53 TreeCatTreeCat 112 bronze badges New contributor TreeCat is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct. 1- In case conversation A1 gets 2 subsequent roadtrips (enters Triage, exits from Triage, reenters Triage, exits Triage, but without an A2 inbetween): do you expect 1 interaction, or 2? In other words: is it the roundtrip to Triage that defines an interaction, or the interruption of a conversation by another one? – Guillaume Outters Commented Feb 7 at 14:13
1 Answer
Reset to default 0Creating interactions_number
You did a majority of the hard work in your folder_cum
table, because:
- your criteria ("either from or to Triage") limited
folder_cum
to contain only 2 entries (the start and end of the interaction) - you used window functions so that the start of interaction now knows data from the end of interaction
So, now that the start of interaction has all necessary data, you can work with it only, which makes 1 row per (start of) conversation as long as you can choose the right one (to this effect we refilter on folder_from = '/CE triage/'
, which helps us distinguish the start from the end), and thus you can GROUP BY
on its fields.
Then classically JOIN
to folder_movement
, and COUNT
to get your per-interaction count:
select
f.conversation_history_id_fp,
f.conversation_history_id_endp,
f.conversation_id,
count(1) mov_count
from folder_cum f
left join folder_movement m on m.conversation_history_id between conversation_history_id_endp and conversation_history_id_fp
where f.folder_from = '/CE triage/'
group by 1, 2, 3
Correcting folder_cum
Beware though, that your window functions in folder_cum
should use a PARTITION BY
to get the desired effect.
By using ORDER BY conversation_id, updated_at
only, you risk interleaved conversations to get the wrong data (A1 start A2 start A1 end A2 end: A2 start will get its LEAD
in A1 end instead of A2 end; or A3 getting A1 data as you noticed).
In the other hand, by using PARTITION BY conversation_id ORDER BY updated_at
, you ensure that subsequent entries will always be related to the same conversation, and that the ending, unclosed conversation will get NULL
instead of unrelated data.
All in all
With those two changes, you get the expected result as seen in SQLFiddle.
You even get a correct A3 conversation, ending in NULL
and having 1 interaction only.
And note that I user the shorthand WINDOW same_conv AS
to avoid repeating the window definition for each one of its uses (and avoid copy-paste mistakes).