I need to remove a specific node from an XML column in PostgreSQL only if it is not encoded, while keeping the encoded version.
XML:
<titles>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
</titles>
UPDATE main.table
SET mc = REGEXP_REPLACE(
        mc::text,
        E'<title>Évaluation du gestionnaire</title>\\s*',  -- Match exact decoded title with optional whitespace
        '',
        'g'
    )::xml
WHERE id = 133
  AND mc::text LIKE '%<title>Évaluation du gestionnaire</title>%';
The query above removes the encoded one from the XML.
How can I remove only the unencoded Évaluation du gestionnaire title while ensuring the encoded <title>Évaluation du gestionnaire</title> remains untouched?
1 Answer
In the case of two consecutive `<title>` elements, this regex pattern matches if one of the `<title>` elements includes a Unicode encoded character, as signified by `&#(\d{2,4}|x\w{2,4});`, AND the ADJACENT `<title>` element does not contain a Unicode encoded character. It deletes the `<title>` element that does not contain a Unicode encoded character.
REGEX PATTERN (PCRE2 Flavor)(Flags: g):
(?:(?=&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>(?:\s*<title>[^<]*<\/title>))([^<]*<\/title>\s*)<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<)[^<]*<\/title>\s*)|(?:(?:<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<))[^<]*<\/title>\s*(<title>[^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>))
replacement_string = '$1$2'
Regex Demo: https://regex101/r/MzpCIC/4 (10 matches, 1,484 steps)
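As a rough way to check the pattern outside regex101, here is a minimal Python sketch. It assumes the encoded title uses the decimal reference `&#201;` for É (the page does not show which form the original data used), and the helper name `drop_unencoded_duplicates` is made up for illustration. Python's `re` module is not PostgreSQL's regex engine, so this only exercises the pattern's logic, not the SQL call.

```python
import re

# The answer's pattern, copied verbatim and split across lines for readability.
PATTERN = re.compile(
    r"(?:(?=&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>(?:\s*<title>[^<]*<\/title>))"
    r"([^<]*<\/title>\s*)<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<)[^<]*<\/title>\s*)"
    r"|(?:(?:<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<))[^<]*<\/title>\s*"
    r"(<title>[^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>))"
)


def drop_unencoded_duplicates(xml_text):
    """Apply the pattern globally; Python writes the $1$2 replacement as \\1\\2."""
    return PATTERN.sub(r"\1\2", xml_text)


# Assumed sample data: the encoded title first, then the unencoded one.
encoded_first = (
    "<titles>\n"
    "<title>&#201;valuation du gestionnaire</title>\n"
    "<title>Évaluation du gestionnaire</title>\n"
    "</titles>"
)
# The same two titles in the opposite order.
unencoded_first = (
    "<titles>\n"
    "<title>Évaluation du gestionnaire</title>\n"
    "<title>&#201;valuation du gestionnaire</title>\n"
    "</titles>"
)

print(drop_unencoded_duplicates(encoded_first))
print(drop_unencoded_duplicates(unencoded_first))
```

In both orders only the encoded `<title>` should remain. The sketch relies on Python 3.5 or later, where a capture group that did not take part in the match is substituted as an empty string.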
COMMENTS
- I.e., only if one of the `<title>` elements has an encoded character and the other, immediately ADJACENT (only whitespace in between) `<title>` element does not, then the `<title>` element that did not have an encoded character is replaced with an empty string "", i.e. it is deleted. Please note that the actual replacement string that makes this happen is `$1$2`, not an empty string.
- Please note that the `<title>` element without the encoded character that will be deleted can be either BEFORE or AFTER the `<title>` element that has the encoded character (the latter element is not changed).
- This regex pattern with the replacement pattern should work whether you replace all occurrences or apply it to a specific record.
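-- To replace all occurrences (text_string, regex_pattern and replacement_string are placeholders;
-- in PostgreSQL, backreferences in the replacement are written \1\2 rather than $1$2)
SELECT REGEXP_REPLACE(text_string, regex_pattern, replacement_string, 'g');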
- (HTML/XML character codes: numeric character references start with `&#` and end with `;`. They allow you to display characters by their Unicode number.)
- Match Option 1 OR Option 2: `(?:...)|(?:...)`. If there is a match, remove the `<title>` element that did not have the encoding.
- Option 1 matches if there is a `<title>` element containing an encoded character, immediately followed by a `<title>` element that does not contain an encoded character.
- Option 2 matches if there is a `<title>` element that does not contain an encoded character AND is immediately followed by a `<title>` element that contains an encoded character.
- Unicode encoded character pattern for decimal and hexadecimal forms: `&#(\d{2,4}|x\w{2,4});`
- The replacement_string is `$1$2`. `$1` is capture group #1 from matching Option 1. `$2` is capture group #2 from matching Option 2. If there is no match for Option 1, `$1` returns None. In the same way, if there is no match for Option 2, `$2` returns None. Because only one option can match at a time, `$1$2` returns only the part of the string that was captured by the matching option, while the other capture group variable always returns None (empty string).
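The `$1$2` behaviour described above can be seen in isolation with a toy pattern of the same shape: two alternatives, each with its own capture group. This is only an illustrative sketch; the `keep`/`drop` strings and the `collapse` helper are invented for the example and are not part of the answer's pattern. Python spells the replacement `\1\2`.

```python
import re

# Toy pattern shaped like the answer's: Option 1 captures group 1, Option 2 captures group 2.
TOY = re.compile(r"(keep)-drop|drop-(keep)")


def collapse(text):
    """Replace each match with group 1 followed by group 2."""
    return TOY.sub(r"\1\2", text)


sample = "keep-drop, drop-keep"

# For every match exactly one of the two groups is set; the other is None.
groups_per_match = [m.groups() for m in TOY.finditer(sample)]
print(groups_per_match)

# In the substitution the unset group contributes an empty string (Python 3.5+).
print(collapse(sample))
```

Whichever alternative matches, the replacement keeps only the captured part, which is why a single replacement string can serve both options.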
REGEX PATTERN NOTES:
- `(?:` Begin non-capturing group 1 for Option 1, the case where the `<title>` element that includes the encoding is before the `<title>` element without the encoding.
- `(?=` Begin positive lookahead `(?=...)`. This lookahead checks that the first `<title>` element contains an encoded character before pursuing Option 1 further. It does not consume any characters.
- `&#(?:\d{2,4}|x\w{2,4});` Matches a decimal or hexadecimal XML Unicode character reference, `&#nnnn;` or `&#xhhhh;`.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`, 0 or more times.
- `<\/title>` Matches literal `</title>`.
- `(?:` Begin non-capturing group 2 `(?:...)`.
- `\s*` Matches any whitespace character `\s` 0 or more times (`*`).
- `<title>` Matches literal `<title>`.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<\/title>` Matches literal `</title>`.
- `)` End non-capturing group 2.
- `)` End positive lookahead.
- `(` Begin CAPTURING GROUP 1, referred to with `$1` in the replacement_string.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<\/title>` Matches literal `</title>`.
- `\s*` Matches any whitespace character `\s` 0 or more times (`*`).
- `)` End CAPTURING GROUP 1 (`$1`).
- `<title>` Matches literal `<title>`.
- `(?!` Begin negative lookahead. Makes sure there are no Unicode encoded characters inside the current `<title>` element.
- `[^<]*?` Negated character class `[^...]` with a lazy quantifier. Matches as few characters that are not a literal `<` as possible.
- `&#(?:\d{2,4}|x\w{2,4});` Matches a decimal or hexadecimal XML Unicode character reference, `&#nnnn;` or `&#xhhhh;`.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<` Matches literal `<`.
- `)` End negative lookahead.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<\/title>` Matches literal `</title>`.
- `\s*` Matches any whitespace character `\s` 0 or more times (`*`).
- `)` End non-capturing group 1.
- `|` Alternation. The pipe `|` is read as OR.
- `(?:` Begin non-capturing group 3 for Option 2.
- `(?:` Begin non-capturing group 4.
- `<title>` Matches literal `<title>`.
- `(?!` Negative lookahead `(?!...)` to make sure there are NO encoded characters contained in the first `<title>` element.
- `[^<]*?` Negated character class `[^...]` with a lazy quantifier. Matches as few characters that are not a literal `<` as possible.
- `&#(?:\d{2,4}|x\w{2,4});` Matches a decimal or hexadecimal XML Unicode character reference, `&#nnnn;` or `&#xhhhh;`.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<` Matches literal `<`.
- `)` End negative lookahead.
- `)` End non-capturing group 4.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<\/title>` Matches literal `</title>`.
- `\s*` Matches any whitespace character `\s` 0 or more times (`*`).
- `(` Begin CAPTURING GROUP 2 for Option 2, referred to with `$2` in the replacement_string.
- `<title>` Matches literal `<title>`.
- `[^<]*?` Negated character class `[^...]` with a lazy quantifier. Matches as few characters that are not a literal `<` as possible.
- `&#(?:\d{2,4}|x\w{2,4});` Matches a decimal or hexadecimal XML Unicode character reference, `&#nnnn;` or `&#xhhhh;`.
- `[^<]*` Negated character class `[^...]`. Matches any character that is not a literal `<`.
- `<\/title>` Matches literal `</title>`.
- `)` End CAPTURING GROUP 2 (`$2`).
- `)` End non-capturing group 3.
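The walkthrough above can also be read as a pattern assembled from a handful of repeated pieces. The following Python sketch rebuilds it that way; the constant names (`ENTITY`, `NO_ENTITY_AHEAD`, `OPTION_1`, and so on) are labels chosen for this illustration only, and the `&#201;` sample is an assumed encoding of É.

```python
import re

# Building blocks named after the notes above (the names are illustrative).
ENTITY = r"&#(?:\d{2,4}|x\w{2,4});"                # decimal or hexadecimal reference
TEXT = r"[^<]*"                                    # anything up to the next '<'
OPEN = r"<title>"
CLOSE = r"<\/title>"
NO_ENTITY_AHEAD = rf"(?![^<]*?{ENTITY}[^<]*<)"     # negative lookahead: no reference in this element

# Option 1: encoded title first; group 1 keeps the rest of it, the plain title is dropped.
OPTION_1 = (
    rf"(?:(?={ENTITY}{TEXT}{CLOSE}(?:\s*{OPEN}{TEXT}{CLOSE}))"
    rf"({TEXT}{CLOSE}\s*){OPEN}{NO_ENTITY_AHEAD}{TEXT}{CLOSE}\s*)"
)

# Option 2: plain title first; group 2 keeps the encoded title that follows it.
OPTION_2 = (
    rf"(?:(?:{OPEN}{NO_ENTITY_AHEAD}){TEXT}{CLOSE}\s*"
    rf"({OPEN}[^<]*?{ENTITY}{TEXT}{CLOSE}))"
)

PATTERN_TEXT = OPTION_1 + "|" + OPTION_2
PATTERN = re.compile(PATTERN_TEXT)

sample = (
    "<title>Évaluation du gestionnaire</title>\n"
    "<title>&#201;valuation du gestionnaire</title>"
)
print(PATTERN.sub(r"\1\2", sample))
```

Assembling the pattern this way is only a readability aid; the resulting string is intended to be identical to the one-line pattern given in the answer.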
TEST STRING:
<titles>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
</titles>
// LAST PART
<titles>
<title>Evaluation du Égestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du Égestionnaire</title>
<title>Évaluation du Égestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Evaluation du Égestionnaire</title>
<title>Evaluation du Égestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Evaluation du Égestionnaire</title>
</titles>
RESULT:
<titles>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
</titles>
<titles>
<title>Évaluation du gestionnaire</title>
</titles>
// LAST PART
<titles>
<title>Evaluation du Égestionnaire</title>
<title>Évaluation du gestionnaire</title>
<title>Évaluation du Égestionnaire</title>
<title>Évaluation du Égestionnaire</title>
<title>Evaluation du Égestionnaire</title>
<title>Evaluation du Égestionnaire</title>
<title>Evaluation du Égestionnaire</title>
</titles>
[...] xml data type in the first place rather than text then you wouldn't have had this problem. Also É doesn't need to be encoded in XML, it's perfectly valid without encoding. – Charlieface, Mar 6 at 14:22
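To illustrate the comment's point that É is valid XML with or without encoding, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The `&#201;` reference and the `title_texts` helper are assumptions made for the example. It shows only that an XML parser decodes the reference, so both titles carry the same text once parsed; it does not address how the column is stored in PostgreSQL.

```python
import xml.etree.ElementTree as ET


def title_texts(xml_text):
    """Parse the document and return the decoded text of each <title> child."""
    root = ET.fromstring(xml_text)
    return [title.text for title in root.findall("title")]


# Assumed sample: one title written with a numeric reference, one written literally.
doc = (
    "<titles>"
    "<title>&#201;valuation du gestionnaire</title>"
    "<title>Évaluation du gestionnaire</title>"
    "</titles>"
)

texts = title_texts(doc)
print(texts)
print(texts[0] == texts[1])
```

Because the two forms are equivalent to a parser, a text-level regex is the only layer at which "encoded" and "unencoded" titles can be told apart.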