JavaScript Regex: Match text NOT part of an HTML tag

I would like a regex that runs in Node.js (so no jQuery DOM handling etc., because the tags can be nested differently) and matches all the text that is NOT an HTML tag or part of one, as separate groups.

E.g. I'd like to match "5", "ELT.", "SPR", "&nbsp;", "pio", "Unterricht", "&nbsp;", "&nbsp;" and "pio" from this string:

<tr class='list even'>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">5</span>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">ELT.</span></b>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">SPR</span></b>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <strike><span style="color: #010101">pio</span></strike>
    </td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <span style="color: #010101">Unterricht</span>
    </td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" style="background-color: #FFFFFF" >&nbsp;</td>
    <td class="list" align="center" style="background-color: #FFFFFF" >
        <b><span style="color: #010101">pio</span></b>
    </td>
</tr>

I can guarantee that there will be no ">" characters inside the tags.

The solution I found was (?<=^|>)[^><]+?(?=<|$), but that won't work in Node.js (probably because of the lookaheads? It says "Invalid group").

Any suggestions? (And yes, I really think that a regex is the right way to go, because the HTML may be nested in other ways while the content always has the same order, since it's a table.)

asked Sep 24, 2011 at 17:02 by iStefo; edited Sep 24, 2011 at 17:28 by Bakudan

I love linking to this stackoverflow.com/questions/1732348/… – NimChimpsky Commented Sep 24, 2011 at 17:05
Is this what you are looking for? stackoverflow.com/questions/822452/… – Rusty Fausak Commented Sep 24, 2011 at 17:05
You cannot use regular expressions to parse HTML (this is the point of the link @NimChimpsky gave you), because HTML is not a regular language. Any attempt to use regular expressions, solely, to parse HTML will fail. You have no choice but to actually parse the HTML. – T.J. Crowder Commented Sep 24, 2011 at 17:08
@rfausak: No, because the OP has said clearly they're not running in a browser. – T.J. Crowder Commented Sep 24, 2011 at 17:08
If you want to match something based on the context around it and don't have lookarounds available, then ... no. – Howard Commented Sep 24, 2011 at 17:09

2 Answers

Try 'yourhtml'.replace(/(<[^>]*>)/g,' ')

'<tr class="list even"><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">5</span></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">ELT.</span></b></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">SPR</span></b></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><strike><span style="color: #010101">pio</span></strike></td><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">Unterricht</span></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">pio</span></b></td></tr>'.replace(/(<[^>]*>)/g,' ')

This gives you the text you want to match, delimited by spaces (which you can then split on).
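
A minimal sketch of this approach, assuming a shortened, hypothetical version of the question's row as input (the `html`, `spaced`, and `parts` names are illustrative only). Adjacent tags leave runs of spaces behind, so the sketch splits on whitespace runs and drops the empty strings. Note that this would also break apart any cell text that itself contains whitespace.

```javascript
// Hypothetical sample input, shortened from the question's table row.
const html =
  '<tr><td><span>5</span></td><td><b><span>ELT.</span></b></td><td>&nbsp;</td></tr>';

// Replace every tag with a single space, as suggested above.
const spaced = html.replace(/<[^>]*>/g, ' ');

// Consecutive tags produce runs of spaces, so split on whitespace runs
// and discard the empty strings left at the start and end.
const parts = spaced.split(/\s+/).filter(Boolean);

console.log(parts); // [ '5', 'ELT.', '&nbsp;' ]
```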

Maybe you can split directly using the tags themselves:

html.split(/<.*?>/)

Afterwards you have to remove the empty strings from the result.
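
As a rough sketch of the split approach, assuming a shortened, hypothetical copy of the question's markup: it uses `[^>]*` instead of `.*?` so that a tag spanning a line break would still be treated as one delimiter (`.` does not match newlines by default), and it trims each piece before discarding the empty ones. This relies on the question's guarantee that no ">" appears inside a tag.

```javascript
// Hypothetical sample input, shortened from the question's table row.
const html = [
  '<tr class="list even">',
  '    <td class="list" align="center">',
  '        <span style="color: #010101">5</span>',
  '    </td>',
  '    <td class="list">&nbsp;</td>',
  '    <td class="list" align="center">',
  '        <b><span style="color: #010101">pio</span></b>',
  '    </td>',
  '</tr>',
].join('\n');

const texts = html
  .split(/<[^>]*>/)                     // use each tag as a delimiter
  .map((piece) => piece.trim())         // strip indentation and newlines
  .filter((piece) => piece.length > 0); // drop pieces that were only whitespace

console.log(texts); // [ '5', '&nbsp;', 'pio' ]
```

Unlike splitting on spaces, this keeps cell text that contains spaces in one piece, since only the tags act as separators.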
