I would like a regex that runs in node.js (so no jQuery DOM handling etc., because the tags can be nested differently) and matches all the text that is NOT an HTML tag, or part of one, into separate groups.
E.g. I'd like to match "5", "ELT.", "SPR", " ", "pio", "Unterricht", " ", " " and "pio" from this string:
<tr class='list even'>
<td class="list" align="center" style="background-color: #FFFFFF" >
<span style="color: #010101">5</span>
</td>
<td class="list" align="center" style="background-color: #FFFFFF" >
<b><span style="color: #010101">ELT.</span></b>
</td>
<td class="list" align="center" style="background-color: #FFFFFF" >
<b><span style="color: #010101">SPR</span></b>
</td>
<td class="list" style="background-color: #FFFFFF" > </td>
<td class="list" align="center" style="background-color: #FFFFFF" >
<strike><span style="color: #010101">pio</span></strike>
</td>
<td class="list" align="center" style="background-color: #FFFFFF" >
<span style="color: #010101">Unterricht</span>
</td>
<td class="list" style="background-color: #FFFFFF" > </td>
<td class="list" style="background-color: #FFFFFF" > </td>
<td class="list" align="center" style="background-color: #FFFFFF" >
<b><span style="color: #010101">pio</span></b>
</td>
</tr>
I can guarantee that there will be no ">" characters inside the tags.
The solution I found was (?<=^|>)[^><]+?(?=<|$), but that doesn't work in node.js (probably because of the lookaheads? It says "Invalid group").
Any suggestions? (And yes, I really think that regex is the right way to go, because the HTML may be nested in other ways and the content always has the same order since it's a table.)
Asked Sep 24, 2011 at 17:02 by iStefo; edited Sep 24, 2011 at 17:28 by Bakudan.
- I love linking to this stackoverflow./questions/1732348/… – NimChimpsky Commented Sep 24, 2011 at 17:05
- Is this what you are looking for? stackoverflow./questions/822452/… – Rusty Fausak Commented Sep 24, 2011 at 17:05
- You cannot use regular expressions to parse HTML (this is the point of the link @NimChimpsky gave you), because HTML is not a regular language. Any attempt to use regular expressions, solely, to parse HTML will fail. You have no choice but to actually parse the HTML. – T.J. Crowder Commented Sep 24, 2011 at 17:08
- @rfausak: No, because the OP has said clearly they're not running in a browser. – T.J. Crowder Commented Sep 24, 2011 at 17:08
- If you want to match something based on the context around and don't have lookarounds available, then ... no. – Howard Commented Sep 24, 2011 at 17:09
2 Answers
Try 'yourhtml'.replace(/(<[^>]*>)/g,' '). For example:
'<tr class="list even"><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">5</span></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">ELT.</span></b></td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">SPR</span></b></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><strike><span style="color: #010101">pio</span></strike></td><td class="list" align="center" style="background-color: #FFFFFF" ><span style="color: #010101">Unterricht</span></td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" style="background-color: #FFFFFF" > </td><td class="list" align="center" style="background-color: #FFFFFF" ><b><span style="color: #010101">pio</span></b></td></tr>'.replace(/(<[^>]*>)/g,' ')
This gives you the text you want to match, delimited by spaces, which you can then split on spaces.
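To make the replace-then-split idea concrete, here is a minimal sketch. The `html` string is a hypothetical, shortened version of the table row from the question, and the names `spaced` and `words` are only for illustration. It assumes, as the question states, that no ">" appears inside a tag.

```javascript
// Hypothetical sample input, shortened from the question's table row.
const html =
  '<tr class="list even">' +
  '<td class="list"><span style="color: #010101">5</span></td>' +
  '<td class="list"><b><span style="color: #010101">ELT.</span></b></td>' +
  '<td class="list" > </td>' +
  '</tr>';

// Replace every tag with a single space.
const spaced = html.replace(/<[^>]*>/g, ' ');

// Split on runs of whitespace and drop the empty strings at both ends.
const words = spaced.split(/\s+/).filter(Boolean);

console.log(words); // [ '5', 'ELT.' ]
```

Note that the cell containing only a space disappears here, because it cannot be told apart from the spaces that replaced the tags. If those " " cells matter, this approach is probably not enough on its own.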
Maybe you can split directly using the tags themselves:
html.split(/<.*?>/)
Afterwards you have to remove the empty strings from the result.
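A minimal sketch of the split-and-filter step might look like this. The `html` string is a hypothetical, shortened version of the row from the question, written on one line with no newlines between tags; `parts` and `cells` are illustrative names.

```javascript
// Hypothetical sample input, shortened from the question's table row.
const html =
  '<tr class="list even">' +
  '<td class="list"><span style="color: #010101">5</span></td>' +
  '<td class="list"><b><span style="color: #010101">ELT.</span></b></td>' +
  '<td class="list" > </td>' +
  '</tr>';

// Split at every tag; adjacent tags leave empty strings between them.
const parts = html.split(/<.*?>/);

// Remove only the empty strings, so a cell holding a single space survives.
const cells = parts.filter((s) => s !== '');

console.log(cells); // [ '5', 'ELT.', ' ' ]
```

If the real input has line breaks between tags, as in the question's listing, those line breaks would show up as extra whitespace-only entries, so some additional filtering would likely be needed.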