javascript - Finding comments in HTML

I have an HTML file and within it there may be JavaScript, PHP and all this stuff people may or may not put into their HTML file.

I want to extract all comments from this HTML file.

I can point out two problems in doing this:

1. What is a comment in one language may not be a comment in another.
2. In JavaScript, the remainder of a line is commented out with the // marker. But URLs also contain //, so if I simply replace // and the rest of the line with nothing, I may well remove parts of URLs.

So this is not a trivial problem.
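
To make the second problem concrete, here is a minimal sketch of the naive substitution described above. The function name and the example.com URL are made up for illustration.

```javascript
// Naive approach: delete "//" and everything after it on each line.
// This is the substitution the question warns about, not a recommended fix.
function stripLineCommentsNaively(source) {
  return source.replace(/\/\/.*$/gm, "");
}

const sample = [
  'var home = "http://example.com/index.html"; // real comment',
  "var n = 1; // another comment",
].join("\n");

// The second line is cleaned up as intended, but the first line is cut
// right after "http:", which also removes the closing quote.
console.log(stripLineCommentsNaively(sample));
```

The regular expression cannot tell whether the two slashes sit inside a string, so it removes the rest of the URL along with the real comment.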

Is there an existing solution for this available anywhere?

Has anybody already done this?

asked Oct 19, 2012 at 10:25 by john-jones; edited Apr 15, 2022 at 5:28 by brian d foy

You are right that this is not trivial. In order to reliably remove comments, you need to fully parse the file (PHP, HTML, and JavaScript). I suggest working in PHP if possible; while I like Perl better, PHP's tools to work on itself are better than Perl tools to work on PHP. Here is something to get you started: stackoverflow.com/questions/503871/…. Then you just need to find HTML and JavaScript parsers in PHP to do likewise for those portions of the file. – dan1111 Commented Oct 19, 2012 at 10:41
Why would you have PHP in your HTML file? If you just have CSS, JavaScript and HTML, then google "HTML Minifier" for products which can remove comments and whitespace, and generally "slim down" your pages. – RB. Commented Oct 19, 2012 at 10:43
@RB, the HTML to parse may at some point not even be mine. – john-jones Commented Oct 19, 2012 at 10:56
Your point #2 is precisely why I always use /// in my comments. Just a random point, but I have come across this problem before and it changed my commenting habits forever ;) What are your reasons for needing this ability? And by "extract", do you mean to keep comments or discard them? – Pebbl Commented Oct 19, 2012 at 11:02
Well, I intend to discard them, but not being bound to doing that with them would make for a more modular solution. – john-jones Commented Oct 19, 2012 at 11:07

4 Answers

Problem 2: Isn't every URL quoted, as either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes to know whether it's part of a real URL or the start of a comment.
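
A minimal sketch of that idea, assuming the input is plain JavaScript whose only string syntax is '...' and "...". The function name is hypothetical. Regex literals, template literals and /* */ blocks are deliberately not handled, so this is a starting point rather than a reliable comment finder.

```javascript
// Returns the text of each "//" line comment found outside string literals.
// Assumption: only single- and double-quoted strings occur in the input;
// regex literals, template literals and /* */ blocks are NOT handled.
function findLineComments(js) {
  const comments = [];
  let quote = null; // quote character of the string we are inside, or null
  for (let i = 0; i < js.length; i++) {
    const ch = js[i];
    if (quote !== null) {
      if (ch === "\\") {
        i++; // skip the escaped character
      } else if (ch === quote || ch === "\n") {
        quote = null; // string closed (an unescaped newline also ends it)
      }
    } else if (ch === '"' || ch === "'") {
      quote = ch; // entering a string literal
    } else if (ch === "/" && js[i + 1] === "/") {
      let end = js.indexOf("\n", i);
      if (end === -1) end = js.length;
      comments.push(js.slice(i, end));
      i = end; // resume scanning after the end of this line
    }
  }
  return comments;
}

console.log(findLineComments('var u = "http://example.com"; // real comment'));
```

Tracking the quote state is what keeps the // inside the URL from being reported. An unquoted URL, for example one sitting in HTML text outside any script, would still be misread.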

Look into parser generators like ANTLR, which has grammars for many languages, and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.

Consider

Problem 3, a comment in a language is not always a comment in a language.

<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>

Problem 4, a comment embedded in a language may not obviously be a comment.

<button onclick="&#47;&#47; this is a comment//&#10;notAComment()">

Problem 5, what is a comment may depend on how the browser is configured.

<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->

I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.

https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

It seems from your wording that you are considering an approach based on regular expressions. That is painful to do on the whole file; try using tools to highlight or discard the interesting or uninteresting text, then work on what is left in your sieve according to your keep/discard criteria. Have a look at HTML::Tree and HTML::TreeBuilder; they could be very useful for dealing with the HTML markup.

I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--", "www" and "http" as you move forward and either skip or delete those segments.

The start/end indices will have to be identified properly, which is a challenge, but you will have full control.
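
As a rough sketch of such a scan, the following walks the text once and records the start/end indices of each <!-- ... --> block. The names are hypothetical, and the list of elements to skip (script, style, textarea, title) is an assumption chosen so that the <textarea> case mentioned earlier is not reported. It ignores <noscript>, conditional comments, PHP sections and comments inside the JavaScript itself.

```javascript
// Matches the opening tag name of elements whose content is text, not markup.
// Assumption: treating only these four elements this way is good enough here.
const RAW_TEXT_OPEN = /<(script|style|textarea|title)(?=[\s\/>])/iy;

// Returns [{ start, end, text }] for each <!-- ... --> outside those elements.
function findHtmlComments(html) {
  const found = [];
  let i = 0;
  while (i < html.length) {
    if (html.startsWith("<!--", i)) {
      const close = html.indexOf("-->", i + 4);
      const end = close === -1 ? html.length : close + 3;
      found.push({ start: i, end: end, text: html.slice(i, end) });
      i = end;
      continue;
    }
    RAW_TEXT_OPEN.lastIndex = i; // sticky flag: match only at position i
    const open = RAW_TEXT_OPEN.exec(html);
    if (open) {
      // Jump to the matching end tag so "<!--" inside is never inspected.
      const closeRe = new RegExp("</" + open[1] + "(?=[\\s/>])", "gi");
      closeRe.lastIndex = open.index + open[0].length;
      const close = closeRe.exec(html);
      i = close ? close.index + close[0].length : html.length;
      continue;
    }
    i++;
  }
  return found;
}

const page = "<textarea><!-- not a comment --></textarea><p>x</p><!-- real -->";
console.log(findHtmlComments(page));
```

Because every hit carries its indices, the caller can either collect the comment text or splice the ranges out of the original string, which keeps the keep/discard decision separate from the scanning.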

There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.
