I want to select the contents between any two headings.
I have already created this regex, which doesn't really select what I need. Currently, it selects the heading along with the paragraph, but not the content after the last heading.
Current Regex: /^<h.*?(?:>)(.*?)(?=<\h)/gms
Given String:
<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.
<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically.
But its a nice place to be at.
another paragraph betwen these headings
<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of
people with his kindness and his contents on the internet.
Expected Result:
[
'Stack overflow is a great community for developers to seek help and connect the beautiful
experience.',
'Quoora is good but doesn't provide any benefits to the person who's helping others economically.
But it\'s a nice place to be at.
another paragraph betwen these headings',
'One of the best guy to learn react with. He also has helped a lot of
people with his kindness and his contents on the internet.'
]
asked Oct 17, 2020 at 6:05 by rakesh shrestha
- 2 Avoid regex for HTML parsing – anubhava Commented Oct 17, 2020 at 6:26
- 2 last one in Expected Result is not between headings tag, – Abishek Kumar Commented Oct 17, 2020 at 6:36
- @anubhava why is that? can you please elaborate? – rakesh shrestha Commented Oct 18, 2020 at 6:00
- please check: stackoverflow.com/questions/590747/… – anubhava Commented Oct 18, 2020 at 6:02
5 Answers
If you want to get the matches without capturing:
/(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs
See proof
const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.
<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically.
But its a nice place to be at.
another paragraph betwen these headings
<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of
people with his kindness and his contents on the internet.`;
const regex = /(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs;
console.log(text.match(regex));
If you need a more efficient regex, use capturing:
const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.
<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically.
But its a nice place to be at.
another paragraph betwen these headings
<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of
people with his kindness and his contents on the internet.`;
const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
console.log(Array.from(text.matchAll(regex), x => x[1].trim()));
The second regex explanation:
  <            '<'
  \/           '/'
  h            'h'
  \d+          digits (0-9), 1 or more times (matching the most amount possible)
  >            '>'
  \s*          whitespace (\n, \r, \t, \f, and " "), 0 or more times (matching the most amount possible)
  (            group and capture to \1:
    [^<]*        any character except '<', 0 or more times (matching the most amount possible)
    (?:          group, but do not capture (0 or more times, matching the least amount possible):
      <            '<'
      (?!          look ahead to see if there is not:
        h            'h'
        \d           a digit (0-9)
      )            end of look-ahead
      [^<]*        any character except '<', 0 or more times (matching the most amount possible)
    )*?          end of grouping
  )            end of \1
  \s*          whitespace (\n, \r, \t, \f, and " "), 0 or more times (matching the most amount possible)
  (?:          group, but do not capture:
    <h           '<h'
    \d           a digit (0-9)
   |            OR
    $            the end of the string (the regex has no m flag)
  )            end of grouping
REGEX : /(<.+\/h[.-d]>)/gm
This selects every heading tag together with the text inside it. Use the match as a boolean test on each piece of the string: if it is true, discard that piece; if it is false, keep it. What remains is the content you need.
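The discard-or-keep idea can be sketched roughly as below. This is only an illustration, not the answerer's code: the function name contentBetweenHeadings and the stricter pattern headingLine are assumptions, and it presumes every heading sits on a line of its own, as in the question's sample.

```javascript
// Hypothetical helper: test each line against a heading pattern,
// discard the lines that match and group the lines that do not.
function contentBetweenHeadings(html) {
  // Assumed pattern: a whole line that is a single <hN>...</hN> element
  const headingLine = /^\s*<h\d[^>]*>.*<\/h\d>\s*$/;
  const groups = [];
  let current = null;
  for (const line of html.split("\n")) {
    if (headingLine.test(line)) {
      // A heading starts a new group and is itself discarded
      current = [];
      groups.push(current);
    } else if (current) {
      // A non-heading line belongs to the most recent heading
      current.push(line);
    }
  }
  return groups.map(lines => lines.join("\n").trim());
}
```

Text that appears before the first heading is ignored, and a heading with nothing after it yields an empty string.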
If you want to stay away from regex for parsing HTML, you could use nextSibling. Note that there are different kinds of nodes. I grab all the nodes here, including text nodes, as I thought this is what you want. This can be tweaked to look only for element nodes, though.
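const op = []
// Take the first two headings in document order as the start and end markers
const [h1, h2] = document.querySelectorAll("h1,h2")
let next = h1.nextSibling
// Walk the siblings (text nodes included) until the second heading is reached
while (next && next !== h2) {
  op.push(next.textContent)
  next = next.nextSibling
}
console.log(op)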
<h1>start</h1>
The quick brown fox jumps over the lazy dog
<p> some paragraph as well </p>
<div> something <strong> nested <code>works</code> too </strong> :) </div>
<h2>next</h2>
more content we are not interested in...
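The snippet above stops at the second heading. To get one entry per heading, as the question asks, the same sibling walk could be generalized along these lines. This is a sketch under assumptions: collectSections is a hypothetical name, and it relies only on the childNodes, nodeName and textContent properties of each node, expecting the headings to be direct children of the node passed in.

```javascript
// Hypothetical helper: group the text of sibling nodes under the heading
// that precedes them. Works on any object exposing an iterable childNodes.
function collectSections(root) {
  // HTML elements report an upper-case nodeName such as "H2"
  const isHeading = node => /^H[1-6]$/.test(node.nodeName);
  const sections = [];
  let current = null;
  for (const node of root.childNodes) {
    if (isHeading(node)) {
      // Each heading opens a new section; its own text is not collected
      current = [];
      sections.push(current);
    } else if (current) {
      // Text nodes and elements alike contribute their textContent
      current.push(node.textContent);
    }
  }
  return sections.map(parts => parts.join("").trim());
}
```

In a browser this could be called as collectSections(document.body), or on the body of a document produced by new DOMParser().parseFromString(html, "text/html") when the input is a string.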
It will be less complex if you can select the headings themselves (instead of trying to select the text between the headings) and remove them from the whole string, keeping just the content between them. You can select only the headings with the expression:
/(<h.*(?:>))/gm
You can find it in action here (just the selection of headings with regex; the deleting part will have to be handled in the code).
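The deleting part might look something like the sketch below. It is not the answerer's code: splitOnHeadings is a hypothetical name, and the pattern is swapped for a stricter, non-capturing one, because String.prototype.split puts captured groups into its result, and because <h.* would also match tags such as <hr> or <html>.

```javascript
// Hypothetical helper: remove the headings by splitting the string on them,
// which leaves only the content that sat between the headings.
function splitOnHeadings(html) {
  // Assumed pattern: one <hN ...>...</hN> element on a single line, no capture group
  const heading = /<h\d[^>]*>.*?<\/h\d>/g;
  return html
    .split(heading)
    .map(part => part.trim())
    .filter(part => part.length > 0); // drop empty pieces, e.g. before the first heading
}
```

Unlike the line-based approaches, this keeps any text that precedes the first heading as an entry of its own.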
There are some incredible answers here, especially the DOM one, but if you need to pass a string then you might consider mine too.
Just pass the string and it returns the required array.
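function GetContentBetweenHtags(HtmlString){
  // Content that sits between a closing heading tag and the next opening one
  const Regex = /<\/h\w>(.*?)<h\w>/msg
  // Content after the last closing heading tag (word characters, whitespace and dots only)
  const AfterTagRegex = /<\/h\w>([\s\w\.]*)$/
  const EndMatch = HtmlString.match(AfterTagRegex)
  let result, resultArr = []
  while((result = Regex.exec(HtmlString)) != null){
    resultArr.push(result[1].trim())
  }
  // match() returns null when nothing matches, so check for null rather than length
  if(EndMatch !== null){
    resultArr.push(EndMatch[1].trim())
  }
  return resultArr
}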