sql - PostgreSQL: Remove Specific XML Node While Preserving Encoded Equivalent

I need to remove a specific node from an XML column in PostgreSQL only if it is not encoded, while keeping the encoded version.

XML:

<titles>
  <title>&#201;valuation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
</titles>

UPDATE main.table
SET mc = REGEXP_REPLACE(
    mc ::text, 
    E'<title>Évaluation du gestionnaire</title>\\s*',  -- Match exact decoded title with optional space
    '', 
    'g'
)::xml
WHERE id=133
AND mc::text LIKE '%<title>Évaluation du gestionnaire</title>%';

The given query removes the encoded one from XML.

How can I remove only <title>Évaluation du gestionnaire</title> while ensuring that <title>&#201;valuation du gestionnaire</title> remains untouched?

asked Mar 6 at 13:31 by Midhun Mundayadan; edited Mar 11 at 17:55 by Basheer Jarrah

If you had stored this data as the xml data type in the first place rather than text, you wouldn't have had this problem. Also, É doesn't need to be encoded in XML; it's perfectly valid without encoding. – Charlieface Commented Mar 6 at 14:22
"The given query removes the encoded one from XML": no, it's not doing that, see dbfiddle.uk/krmA-PyS – Luuk Commented Mar 6 at 18:11

1 Answer

For two consecutive <title> elements, this regex pattern matches when one of the <title> elements contains an encoded character (a numeric character reference, recognized by &#(\d{2,4}|x\w{2,4});) AND the ADJACENT <title> element does not. The replacement then deletes the <title> element that did not contain an encoded character.

REGEX PATTERN (PCRE2 Flavor)(Flags: g):

(?:(?=&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>(?:\s*<title>[^<]*<\/title>))([^<]*<\/title>\s*)<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<)[^<]*<\/title>\s*)|(?:(?:<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<))[^<]*<\/title>\s*(<title>[^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>))

replacement_string = '$1$2'

Regex Demo: https://regex101/r/MzpCIC/4 (10 matches, 1,484 steps)

COMMENTS

That is, only when one of the <title> elements has an encoded character and the immediately ADJACENT <title> element (only whitespace in between) does not, the <title> element without the encoded character is deleted. Note that the replacement string that makes this happen is $1$2, not an empty string: it puts back the encoded element that was part of the match. The <title> element that gets deleted can come either BEFORE or AFTER the <title> element that has the encoded character; the encoded element itself is left unchanged.
This regex pattern, together with the replacement string, should work whether you replace all occurrences or apply it to a specific record.

-- To replace all occurrences (PostgreSQL writes the $1$2 replacement as \1\2)
SELECT REGEXP_REPLACE(text_string, regex_pattern, '\1\2', 'g');
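
As a concrete sketch of that call, the statement below previews the replacement on the first TEST STRING without touching any table. It assumes PostgreSQL with standard_conforming_strings = on (the default), writes the replacement as \1\2 rather than $1$2, and splits the pattern into four concatenated pieces purely for readability. The pattern was written and tested for PCRE2, so its behavior on PostgreSQL's own regex engine is something to verify on a copy of the data, not a guarantee.

```sql
-- Preview only: nothing is updated.
-- The names p, s, pattern, sample and cleaned are illustrative.
WITH p AS (
    SELECT
        -- Option 1: encoded <title> followed by a non-encoded <title>
        '(?:(?=&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>(?:\s*<title>[^<]*<\/title>))'
        || '([^<]*<\/title>\s*)<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<)[^<]*<\/title>\s*)'
        || '|'
        -- Option 2: non-encoded <title> followed by an encoded <title>
        || '(?:(?:<title>(?![^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<))[^<]*<\/title>\s*'
        || '(<title>[^<]*?&#(?:\d{2,4}|x\w{2,4});[^<]*<\/title>))' AS pattern
),
s AS (
    -- First TEST STRING from this answer, written as an escape string
    SELECT E'<titles>\n  <title>&#201;valuation du gestionnaire</title>\n  <title>Évaluation du gestionnaire</title>\n</titles>' AS sample
)
SELECT regexp_replace(s.sample, p.pattern, '\1\2', 'g') AS cleaned
FROM p CROSS JOIN s;
```

If the preview looks right, the same pattern and the '\1\2' replacement could take the place of the REGEXP_REPLACE arguments in the question's UPDATE, keeping the mc::text input and the ::xml cast.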

(HTML/XML character codes: numeric character references start with &# and end with ;. They let you write a character by its Unicode code point number.)
The pattern matches Option 1 OR Option 2: (?:...)|(?:...).
On a match, the <title> element that did not have the encoding is removed.
Option 1 matches if there is a <title> element containing an encoded character, immediately followed by a <title> element that does not contain an encoded character.
Option 2 matches if there is a <title> element that does not contain an encoded character AND is immediately followed by a <title> element that contains an encoded character.
Encoded character pattern for the decimal and hexadecimal forms: &#(\d{2,4}|x\w{2,4});
The replacement_string is $1$2. $1 is capture group #1, from Option 1. $2 is capture group #2, from Option 2. If Option 1 does not match, $1 is empty; likewise, if Option 2 does not match, $2 is empty. Because only one option can match at a time, $1$2 returns only the text captured by the option that matched, while the other capture group contributes an empty string.
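
To make the numeric character references and the detection sub-pattern above concrete, here is a small PostgreSQL sketch. It assumes a UTF8 database encoding (so that chr and ascii work on Unicode code points); the column aliases and the inline VALUES list are illustrative only.

```sql
-- Assumes a UTF8 database: 201 decimal = C9 hexadecimal = code point of É.
SELECT chr(201)        AS from_decimal,  -- character written as &#201;
       chr(x'C9'::int) AS from_hex,      -- character written as &#xC9;
       ascii('É')      AS code_point;

-- Which titles contain a numeric character reference according to the
-- sub-pattern used throughout the answer?
SELECT t,
       t ~ '&#(?:\d{2,4}|x\w{2,4});' AS has_char_ref
FROM (VALUES
    ('<title>&#201;valuation du gestionnaire</title>'),
    ('<title>&#xC9;valuation du gestionnaire</title>'),
    ('<title>Évaluation du gestionnaire</title>')
) AS v(t);
```

In a UTF8 database the first query should return É, É and 201, and the second should flag the first two titles and not the third. Note that \w{2,4} is looser than a strict hexadecimal check, and that \d{2,4} skips references with a single digit or with more than four digits.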

REGEX PATTERN NOTES:

(?: Begin non-capturing group 1 for Option 1: the case where the <title> element that includes the encoding comes before the <title> element without the encoding.
(?= Begin positive lookahead (?=...). This lookahead will check to make sure that the first <title> element contains an encoded character before pursuing Option 1 further. Will not consume any characters.
&#(?:\d{2,4}|x\w{2,4}); Matches a decimal or a hexadecimal XML numeric character reference, &#nnnn; or &#xhhhh;.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
<\/title> Match literal <\/title>.
(?: Begin non-capturing group 2 (?:...).
\s* Match any whitespace character \s 0 or more times (*).
<title> Match literal <title>.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
<\/title> Match literal <\/title>.
) End non-capturing group 2.
) End positive lookahead.
( Begin CAPTURING GROUP 1 Referred to with $1 in the replacement_string.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
<\/title> Match literal <\/title>.
\s* Match any whitespace character \s 0 or more times (*).
) End CAPTURING GROUP 1 ($1).
<title> Match literal <title>.
(?! Begin negative lookahead. Make sure there are no encoded characters inside the current <title> element.
[^<]*? Negated character class [^...] with a lazy quantifier (*?). Matches as few characters that are not a literal < as possible.
&#(?:\d{2,4}|x\w{2,4}); Matches a decimal or a hexadecimal XML numeric character reference, &#nnnn; or &#xhhhh;.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
< Matches literal <.
) End negative lookahead.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
<\/title> Match literal <\/title>.
\s* Match any whitespace character \s 0 or more times (*).
) End non-capture group 1.
| Alternation. The pipe | is read as OR.
(?: Begin non-capturing group 3 for Option 2.
(?: Begin non-capturing group 4.
<title> Match literal <title>.
(?! Negative lookahead (?!...) to make sure there are NO encoded characters contained in the first <title> element.
[^<]*? Negated character class [^...] with a lazy quantifier (*?). Matches as few characters that are not a literal < as possible.
&#(?:\d{2,4}|x\w{2,4}); Matches a decimal or a hexadecimal XML numeric character reference, &#nnnn; or &#xhhhh;.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
< Matches literal <.
) End negative lookahead.
) End non-capturing group 4.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
<\/title> Match literal <\/title>.
\s* Match any whitespace character \s 0 or more times (*).
( Begin CAPTURE GROUP 2 for Option 2. Referred to with $2 in the replacement_string.
<title> Match literal <title>.
[^<]*? Negated character class [^...] with a lazy quantifier (*?). Matches as few characters that are not a literal < as possible.
&#(?:\d{2,4}|x\w{2,4}); Matches a decimal or a hexadecimal XML numeric character reference, &#nnnn; or &#xhhhh;.
[^<]* Negated character class [^...]. Matches any character that is not a literal <.
<\/title> Match literal <\/title>.
) End CAPTURING GROUP 2 ($2).
) End non-capturing group 3.

TEST STRING:

<titles>
  <title>&#201;valuation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
</titles>

<titles>
  <title>Évaluation du gestionnaire</title>
  <title>&#201;valuation du gestionnaire</title>
</titles>

<titles>
  <title>Évaluation du gestionnaire</title>
  <title>&#201;valuation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
  <title>&#201;valuation du gestionnaire</title>
  <title>&#201;valuation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
</titles>

<titles>
  <title>&#201;valuation du gestionnaire</title>
</titles>

<titles>
  <title>Évaluation du gestionnaire</title>
</titles>

// LAST PART 
<titles>
  <title>Evaluation du &#201;gestionnaire</title> 
  <title>Évaluation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
  <title>&#201;valuation du &#201;gestionnaire</title>
  <title>&#201;valuation du &#201;gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
  <title>Evaluation du &#201;gestionnaire</title>
  <title>Evaluation du &#201;gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
  <title>Évaluation du gestionnaire</title>
  <title>Evaluation du &#201;gestionnaire</title>
</titles>

RESULT:

<titles>
  <title>&#201;valuation du gestionnaire</title>
  </titles>

<titles>
  <title>&#201;valuation du gestionnaire</title>
</titles>

<titles>
  <title>&#201;valuation du gestionnaire</title>
  <title>&#201;valuation du gestionnaire</title>
  <title>&#201;valuation du gestionnaire</title>
  </titles>

<titles>
  <title>&#201;valuation du gestionnaire</title>
</titles>

<titles>
  <title>Évaluation du gestionnaire</title>
</titles>

// LAST PART 
<titles>
  <title>Evaluation du &#201;gestionnaire</title> 
  <title>Évaluation du gestionnaire</title>
  <title>&#201;valuation du &#201;gestionnaire</title>
  <title>&#201;valuation du &#201;gestionnaire</title>
  <title>Evaluation du &#201;gestionnaire</title>
  <title>Evaluation du &#201;gestionnaire</title>
  <title>Evaluation du &#201;gestionnaire</title>
</titles>
