JavaScript Regex: Match text NOT part of an HTML tag - Stack Overflow

I would really like a regex that runs in node.js (so no jQuery DOM handling etc., because the tags can be nested differently) and matches all the text that is NOT an HTML tag or part of one, in separate groups.

E.g. I'd like to match "5", "ELT.", "SPR", "&nbsp;", "pio", "Unterricht", "&nbsp;", "&nbsp;" and "pio" from this string:

<tr class='list even'>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">5</span>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">ELT.</span></b>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">SPR</span></b>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <strike><span style="color: #010101">pio</span></strike>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">Unterricht</span>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">pio</span></b>
    </td>
</tr>

I can guarantee that there will be no ">" characters inside the tags.

The solution I found was (?<=^|>)[^><]+?(?=<|$), but that won't work in node.js (probably because of the lookaheads? It says "Invalid group").

Any suggestions? (And yes, I really think that regex is the right way to go, because the HTML may be nested in other ways and the content always has the same order, since it's a table.)

asked Sep 24, 2011 at 17:02 by iStefo, edited Sep 24, 2011 at 17:28 by Bakudan
  • I love linking to this: stackoverflow.com/questions/1732348/… – NimChimpsky Commented Sep 24, 2011 at 17:05
  • Is this what you are looking for? stackoverflow.com/questions/822452/… – Rusty Fausak Commented Sep 24, 2011 at 17:05
  • You cannot use regular expressions to parse HTML (this is the point of the link @NimChimpsky gave you), because HTML is not a regular language. Any attempt to use regular expressions, solely, to parse HTML will fail. You have no choice but to actually parse the HTML. – T.J. Crowder Commented Sep 24, 2011 at 17:08
  • @rfausak: No, because the OP has said clearly they're not running in a browser. – T.J. Crowder Commented Sep 24, 2011 at 17:08
  • If you want to match something based on the surrounding context and don't have lookarounds available, then ... no. – Howard Commented Sep 24, 2011 at 17:09
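
A note on the "Invalid group" error from the question: the part an older JavaScript engine rejects is most likely the lookbehind `(?<=...)`, not the lookahead, since lookbehind only entered the language with ES2018. A capture group can stand in for it. The following is a minimal sketch, not the asker's final code; the function name `textOutsideTags` and the whitespace trimming are assumptions added for illustration, and it relies on the stated guarantee that no ">" appears inside a tag.

```javascript
// Hypothetical helper: collects every run of text that follows the start of
// the string or a ">" and contains no "<" or ">". No lookbehind is needed.
function textOutsideTags(html) {
  const pattern = /(?:^|>)([^<>]+)/g;
  const results = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    // Assumption: whitespace-only runs (indentation between tags) are unwanted.
    const text = match[1].trim();
    if (text.length > 0) {
      results.push(text);
    }
  }
  return results;
}

const sample =
  '<td>\n  <b><span style="color: #010101">ELT.</span></b>\n</td><td>&nbsp;</td>';
console.log(textOutsideTags(sample));
```

Entities such as `&nbsp;` come back as literal text; decoding them would be a separate step.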

2 Answers

Try 'yourhtml'.replace(/(<[^>]*>)/g,' ')

'<tr class="list even"><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">5</span></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">ELT.</span></b></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">SPR</span></b></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><strike><span style="color: #010101">pio</span></strike></td><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">Unterricht</span></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">pio</span></b></td></tr>'.replace(/(<[^>]*>)/g,' ')

This gives you the text you want to match, delimited by spaces, which you can then split on whitespace.
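
The replace-then-split idea can be sketched as below. The function name `stripTagsAndSplit` is an assumption for illustration, and splitting on `/\s+/` rather than a single space is a small change so that indentation and newlines in the source do not produce empty entries.

```javascript
// Hypothetical helper: replace every tag with a space, then split the
// remaining text on runs of whitespace.
function stripTagsAndSplit(html) {
  const withoutTags = html.replace(/<[^>]*>/g, ' ');
  return withoutTags
    .trim()
    .split(/\s+/)
    // An input made only of tags leaves one empty string behind; drop it.
    .filter((piece) => piece.length > 0);
}

console.log(stripTagsAndSplit('<td><b>ELT.</b></td><td>&nbsp;</td>'));
```

One caveat with this approach: a cell whose text itself contains a space would be split into several pieces.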

Maybe you can split directly using the tags themselves:

html.split(/<.*?>/)

Afterwards you have to remove the empty strings from the result.
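
A minimal sketch of the split-and-filter steps, with two assumptions that go beyond the answer: the helper name `splitOnTags` is made up, and the pattern is `<[^>]*>` instead of `<.*?>`, because `.` does not match a newline and a tag broken across lines would otherwise be missed.

```javascript
// Hypothetical helper: split on the tags themselves, then drop the empty and
// whitespace-only strings that appear between adjacent tags.
function splitOnTags(html) {
  return html
    .split(/<[^>]*>/)
    .map((piece) => piece.trim())
    .filter((piece) => piece.length > 0);
}

console.log(splitOnTags('<td>\n  <span>Unterricht A</span>\n</td><td>&nbsp;</td>'));
```

Unlike splitting on spaces, this keeps a cell's text in one piece even when it contains spaces.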
