javascript - Regex: How can I select all the contents between two headings?

I want to select the contents between any two headings.

I have already written this regex, but it doesn't really select what I need. Currently, it selects the heading along with the paragraph, but not the last heading.

Current Regex: /^<h.*?(?:>)(.*?)(?=<\h)/gms

Given String:

<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.

Expected Result:

[
    'Stack overflow is a great community for developers to seek help and connect the beautiful 
    experience.',

    'Quoora is good but doesn't provide any benefits to the person who's helping others economically. 
    But it\'s a nice place to be at.
    another paragraph betwen these headings',

   'One of the best guy to learn react with. He also has helped a lot of 
    people with his kindness and his contents on the internet.'

]

asked Oct 17, 2020 at 6:05 by rakesh shrestha

Avoid regex for HTML parsing. – anubhava Commented Oct 17, 2020 at 6:26
The last one in Expected Result is not between heading tags. – Abishek Kumar Commented Oct 17, 2020 at 6:36
@anubhava Why is that? Can you please elaborate? – rakesh shrestha Commented Oct 18, 2020 at 6:00
Please check: stackoverflow.com/questions/590747/… – anubhava Commented Oct 18, 2020 at 6:02

5 Answers

If you want to get the matches without capturing:

/(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /(?<=<\/h\d+>\s*)\S.*?(?=\s*<h\d|$)/gs;
console.log(text.match(regex));

If you need a more efficient regex, use a capturing group:

const text = `<h2>What is lorem impsum</h2>
Stack overflow is a great community for developers to seek help and connect the beautiful experience.

<h3>What is quora?</h3>
Quoora is good but doesn\'t provide any benefits to the person who\'s helping others economically. 
But its a nice place to be at.
another paragraph betwen these headings

<h3>Who is Kent C Dodds</h3>
One of the best guy to learn react with. He also has helped a lot of 
people with his kindness and his contents on the internet.`;
const regex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;
console.log(Array.from(text.matchAll(regex), x => x[1].trim()));

The second regex explanation:

--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  h                        'h'
--------------------------------------------------------------------------------
  \d+                      digits (0-9) (1 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      <                        '<'
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        h                        'h'
--------------------------------------------------------------------------------
        \d                       digits (0-9)
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
      [^<]*                    any character except: '<' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    <h                       '<h'
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    $                        the end of the string (without the m flag,
                             JavaScript's $ does not match before a
                             trailing \n)
--------------------------------------------------------------------------------
  )                        end of grouping
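
To illustrate why the pattern accepts a '<' that is not followed by 'h' and a digit, here is a minimal sketch. The sample string with an inline <b> tag is made up for this illustration and is not part of the question; the regex is the second one from this answer.

```javascript
// Hypothetical input: the first section contains an inline <b> tag.
const sample = `<h2>Title</h2>
Some <b>bold</b> text

<h3>Next</h3>
Last part`;

// Same pattern as the capturing regex above.
const sectionRegex = /<\/h\d+>\s*([^<]*(?:<(?!h\d)[^<]*)*?)\s*(?:<h\d|$)/g;

// Group 1 holds the text after each closing heading tag.
const sections = Array.from(sample.matchAll(sectionRegex), m => m[1].trim());
console.log(sections);
// → [ 'Some <b>bold</b> text', 'Last part' ]
```

The inline tag stays inside the captured section because the negative lookahead only rejects a '<' that opens a heading.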

Regex: /(<.+\/h[.-d]>)/gm

This matches every heading tag together with the text inside it. Test each line against it: if the test is true, discard the line; if it is false, keep it. What remains is the content you need.
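
The test-and-discard idea could look roughly like the sketch below. It is only one possible reading of this answer: the shortened sample text, the variable names, and the grouping of kept lines into one entry per heading are assumptions. The g flag is dropped from the pattern because `RegExp.prototype.test` is stateful on a global regex.

```javascript
// Shortened, made-up sample in the same shape as the question's string.
const text = `<h2>First</h2>
line one

<h3>Second</h3>
line two
line three`;

// The answer's pattern, without the g flag so that test() keeps no state.
const isHeading = /<.+\/h[.-d]>/;

const groups = [];
for (const line of text.split("\n")) {
  if (isHeading.test(line)) {
    // Test is true: discard the heading line and start a new group.
    groups.push([]);
  } else if (groups.length > 0 && line.trim() !== "") {
    // Test is false: keep the non-empty line in the current group.
    groups[groups.length - 1].push(line.trim());
  }
}

const result = groups.map(g => g.join("\n"));
console.log(result);
// → [ 'line one', 'line two\nline three' ]
```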

If you want to stay away from regex for parsing HTML, you could use nextSibling. Note that there are different kinds of nodes. Here I grab all the nodes, including text nodes, as I thought this is what you want. This can be tweaked to look only for element nodes, though.

const op = []

const [h1, h2] = document.querySelectorAll("h1,h2")

let next = h1.nextSibling

while (next && next !== h2) {
  op.push(next.textContent)
  next = next.nextSibling
}

console.log(op)

<h1>start</h1>

The quick brown fox jumps over the lazy dog

<p> some paragraph as well </p>

<div> something <strong> nested <code>works</code> too </strong> :) </div>

<h2>next</h2>

more content we are not interested in...

It will be less complex if you select the headings themselves (instead of trying to select the text between them) and remove them from the whole string, keeping just the content between them. You can select only the headings with the expression:

/(<h.*(?:>))/gm

This expression only selects the headings; the deleting part has to be handled in the code.
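
The deleting part might be handled as in the sketch below. It uses `split` instead of an outright removal so that the section boundaries survive, and it swaps in a stricter, non-capturing heading pattern (a capturing group would make `split` include the headings in its result). Both choices, along with the sample text and names, are assumptions and not part of the original answer.

```javascript
// Shortened, made-up sample in the same shape as the question's string.
const input = `<h2>First</h2>
line one

<h3>Second</h3>
line two`;

// Assumed stricter pattern: an <h1>..<h6> element on its own, no capture group.
const headingRegex = /<h[1-6][^>]*>.*?<\/h[1-6]>/s;

const parts = input
  .split(headingRegex)      // cut the string at every heading
  .map(s => s.trim())       // drop surrounding whitespace
  .filter(s => s !== "");   // drop the empty piece before the first heading

console.log(parts);
// → [ 'line one', 'line two' ]
```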

There are some incredible answers here, especially the DOM one, but if you need to pass in a string, you might consider mine too.

Just pass the string, and it returns the required array:

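function GetContentBetweenHtags(HtmlString){
  // Content between a closing heading tag and the next opening heading tag
  const Regex = /<\/h\w>(.*?)<h\w>/msg
  // Content after the last closing heading tag, up to the end of the string
  // ([^<]* also accepts punctuation, which [\s\w\.]* rejected)
  const AfterTagRegex = /<\/h\w>([^<]*)$/
  const EndMatch = HtmlString.match(AfterTagRegex)
  let result, resultArr = []
  while((result = Regex.exec(HtmlString)) !== null){
    resultArr.push(result[1].trim())
  }
  // match() returns null when nothing matches, so check for null, not length
  if(EndMatch !== null){
    resultArr.push(EndMatch[1].trim())
  }
  return resultArr
}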
