php - Search and Replace Words in HTML - Stack Overflow

What I'm trying to do is make a 'jargon buster'. Basically, I have some HTML and some glossary terms in a database. When the person clicks on the jargon buster, it replaces the words in the text with a nice tooltip (wztooltip) that shows them the meanings.

I've been working hard on this one and have been looking closely at this question: "Regex / DOMDocument - match and replace text not in a link".

It seems like the answer lies in the simple_html_dom library, but I'm having trouble getting it to work. Obviously, any words that are already linked shouldn't be touched. Here is a stripped-down version of what I've got.

$html = str_get_html($article['content']);

$query_glossary = "SELECT word,glossary_term_id,info FROM glossary_terms WHERE status = 1  ORDER BY LENGTH(word) DESC";
$result_glossary = mysql_query_run($query_glossary);

while($glossary = mysql_fetch_array($result_glossary)) {
    $glossary_link = SITEURL.'/glossary/term/'.string_to_url($glossary['word']).'-'.$glossary['glossary_term_id'];
    if(strlen($glossary['info'])>400) {
        $glossary_info = substr(strip_tags($glossary['info']),0,350).' ...<br /> <a href="'.$glossary_link.'">Read More</a>';
    }
    else {
        $glossary_info = $glossary['info'];
    }
    $glossary_tip = 'href="javascript:;" onmouseout="UnTip();" class="article_jargon_highligher" onmouseover="'.tooltip_javascript('<a href="'.$glossary_link.'">'.$glossary['word'].'</a>',$glossary_info,400,1,0,1).'"';
    $glossary_word = $glossary['word'];
    $glossary_word = preg_quote($glossary_word,'/');

    //once done we can replace the words with a nice tip    
    foreach ($html->find('text') as $element) {
        if (!in_array($element->parent()->tag,array())) {
            //problems: case and grammar aren't taken into account
            $element->innertext = str_ireplace(''.$glossary['word'].' ',' <a '.$glossary_tip.' >'.$glossary['word'].'</a> ', $element->innertext);

           //$element->innertext = str_ireplace(''.$glossary['word'].',',' <a '.$glossary_tip.'>'.$glossary['word'].'</a> ', $element->innertext);
           //$element->innertext = preg_replace ("/\s(".$glossary_word.")\s/ise","nothing(' <a'.'$glossary_tip.'>'.'$1'.'</a> ')" , $element->innertext);
          // $element->innertext = str_replace('__glossary_tip_replace__',$glossary_tip, $element->innertext);
        }
    }
}
$article['content'] = $html->save();

Asked Jun 29, 2011 at 12:14 by Richard Housham. Edited May 23, 2017 at 11:48 by Community Bot.
  • I'm a colleague. The real problem is that we are having trouble getting the code to match only individual words and not words inside words (i.e. APS in perhaps). These words are within HTML as well, so that needs considering. – David Commented Jul 1, 2011 at 14:21
  • It's surely just a case of writing a powerful enough regex, probably using whitespace and punctuation to detect word boundaries, although I won't embarrass myself by trying. +1 – shanethehat Commented Jul 1, 2011 at 14:52
  • Do you want a JS solution or a PHP solution, because you used both tags? – Gerben Commented Jul 1, 2011 at 21:05
  • Hi, I wrote a Wikimedia extension a while back that does something similar. Depending on your approach, it's very easy to end up with an inefficient solution. It might help you to take a look: github.com/bcoughlan/Extension-Lingo/blob/master/Lingo.php – bcoughlan Commented Jul 8, 2011 at 2:23

3 Answers

Use the negated word-character class \W, which matches any character other than a letter, digit or underscore, in your regex pattern. Because this alone would still fail at the boundaries of the text blob, you need to test for those conditions as well. Thus, using the word 'term' as the text you are searching for:

(^term$)|(^term\W)|(\Wterm\W)|(\Wterm$)

The first alternative matches when term is the entire contents of the blob, the second when it's the first word, the third when it's contained within the blob, and the last when it's the last word.

If you want to treat any other characters as word characters (say, a hyphen), you would need to replace the \W with [^\w\-].

Hope this helps. There are probably optimizations that can be performed as well, but this should at least be a good starting point.
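To make the four alternatives concrete, here is a minimal JavaScript sketch (the thread is tagged for both PHP and JS). It folds them into the single pattern (^|\W)term(\W|$) and captures the boundary characters so the replacement can put them back. The function name markWholeWord and the <b> wrapper are made up for illustration, and the sketch assumes the term contains only letters, digits or underscores.

```javascript
// Hypothetical helper: wraps whole-word occurrences of `term` in <b>...</b>.
// Assumes `term` contains only word characters, so it needs no escaping.
function markWholeWord(text, term) {
  // (^|\W) and (\W|$) fold the four alternatives into one pattern.
  // The boundary characters are captured so they can be restored.
  const pattern = new RegExp('(^|\\W)(' + term + ')(\\W|$)', 'gi');
  return text.replace(pattern, '$1<b>$2</b>$3');
}

console.log(markWholeWord('Perhaps APS, or aps.', 'APS'));
// prints: Perhaps <b>APS</b>, or <b>aps</b>.
```

One limitation of this style of pattern: \W consumes the separator, so two occurrences separated by a single character (for example 'aps aps') only get the first one replaced, because the shared space is already used up by the first match. Zero-width assertions do not have this limitation.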

Assuming all your glossary "words" consist of standard "word" characters (i.e. [A-Za-z0-9_]), a simple word boundary assertion can be placed before and after the word in the regex pattern. Try replacing the pertinent statement with this:

$element->innertext = preg_replace(
    '/\b'. $glossary_word .'\b/i',
    '<a '. $glossary_tip .' >'. $glossary['word'] .'</a>',
    $element->innertext);

This assumes that $glossary_word has been run through preg_quote (which your code does).
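For anyone who wants the JS side of this, a rough equivalent of the \b approach might look like the sketch below. escapeRegExp is a hand-written stand-in for preg_quote, and linkTerm and the class="jargon" attribute are hypothetical names used only for the example.

```javascript
// Hypothetical stand-in for PHP's preg_quote: escapes regex metacharacters.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wraps whole-word, case-insensitive matches in <a>.
function linkTerm(text, term, attrs) {
  const pattern = new RegExp('\\b' + escapeRegExp(term) + '\\b', 'gi');
  // A replacer function sidesteps "$" handling in attrs and reuses the
  // matched text, so the casing found in the article is kept.
  return text.replace(pattern, (match) => '<a ' + attrs + '>' + match + '</a>');
}

console.log(linkTerm('Perhaps APS helps; aps aps.', 'aps', 'class="jargon"'));
// prints: Perhaps <a class="jargon">APS</a> helps; <a class="jargon">aps</a> <a class="jargon">aps</a>.
```

Note one difference from the PHP statement above: that replacement inserts $glossary['word'], so the linked text takes the glossary's casing, whereas this sketch keeps whatever casing appeared in the text. Because \b is zero-width, adjacent occurrences are all matched.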

However, if the glossary words may contain other non-standard word characters (such as a '-' dash), a more complex regex can be formulated which incorporates lookahead and lookbehind to ensure that only whole words are matched. For example:

$re_pattern = "/         # Match a glossary whole word.
    (?<=[\s'\"]|^)       # Word preceded by whitespace, quote or BOS.
    {$glossary_word}     # Word to be matched.
    (?=[\s'\".?!,;:]|$)  # Word followed by ws, quote, punct or EOS.
    /ix";
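A JavaScript sketch of the same lookbehind/lookahead idea, for comparison. JavaScript has no x (extended) flag, so the pattern is written compactly; lookbehind needs an ES2018-capable engine. escapeRegExp, linkHyphenatedTerm and the class="jargon" attribute are hypothetical names for the example.

```javascript
// Hypothetical stand-in for PHP's preg_quote: escapes regex metacharacters.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: links a term only when it stands alone as a word,
// even if the term itself contains a hyphen.
function linkHyphenatedTerm(text, term, attrs) {
  const pattern = new RegExp(
    '(?<=[\\s\'"]|^)' +          // preceded by whitespace, a quote, or start
    escapeRegExp(term) +         // the term itself, escaped
    '(?=[\\s\'".?!,;:]|$)',      // followed by ws, quote, punctuation, or end
    'gi');
  return text.replace(pattern, (match) => '<a ' + attrs + '>' + match + '</a>');
}

console.log(linkHyphenatedTerm('An add-on, not a pre-add-on.', 'add-on', 'class="jargon"'));
// prints: An <a class="jargon">add-on</a>, not a pre-add-on.
```

With plain \b the second occurrence would also match, because the hyphen in "pre-add-on" counts as a word boundary. One thing to watch in the PHP version: under the x modifier, unescaped whitespace in the pattern is ignored, so a multi-word glossary term interpolated into it would probably need its spaces escaped.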

I had this problem in JS when getting individual words. It actually works REALLY well for me. :)

What I did was the following (you can translate it from JS to PHP):

var words = document.body.innerHTML;

// FIRST PASS

// remove scripts
words = words.replace(/<script[\s\S]*?>[\s\S]*?<\/script>/gi, '');
// remove CSS
words = words.replace(/<style[\s\S]*?>[\s\S]*?<\/style>/gi, '');
// remove comments
words = words.replace(/<!--[\s\S]*?-->/g, '');
// remove html character entities
words = words.replace(/&.*?;/g, ' ');
// remove all HTML
words = words.replace(/<[\s\S]*?>/g, '');

// SECOND PASS

// remove all newlines
words = words.replace(/\n/g, ' ');
// replace multiple spaces with 1 space
words = words.replace(/\s{2,}/g, ' ');

// split each word
words = words.split(/[^a-z-']+/gi);