regex - Split string into words in javascript - Stack Overflow

At the moment I am working on text that is broken into floating columns to display it in a magazine-like way.

I asked in a previous question how to split the text into sentences and it works like a charm:

sentences = text.replace(/\.\s+/g,'.|').replace(/\?\s/g,'?|').replace(/\!\s/g,'!|').split("|");

Now I want to go a step further and split it into words. But I also have some elements in it that should not be split, like subheadlines.

An example text would be:

A wonderful serenity has taken possession of my entire soul. <strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.

My desired result would look like the following:

Array [
    "A",
    "wonderful",
    "serenity",
    "has",
    "taken",
    "possession",
    "of",
    "my",
    "entire",
    "soul.",
    "<strong>This is a subheadline</strong>",
    "<br>",
    "<br>",
    "I",
    "am",
    "alone,",
    "and",
    "feel",
    "the",
    "charm",
    "of",
    "existence",
    "in",
    "this",
    "spot."
]

When I split at all whitespace I do get the words, but the "<br>" is not added as a separate array entry. I also don't want to split the subheadline and its markup.

The reason I want to do this is that I add sequence after sequence to a p-tag, and when its height gets bigger than the surrounding element I remove the last added sequence and create a new floating p-tag. When I split the text into sentences, I saw that the break-up was not fine-grained enough to ensure a good reading flow.
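
The fill loop described above could be sketched roughly as follows. This is only an illustration: `fillColumns` and the `fits` callback are hypothetical names that do not appear in the original code, and `fits` stands in for the real measurement of the p-tag's height against the surrounding element.

```javascript
// Hypothetical sketch: distributes pieces (words or whole HTML chunks)
// over columns. `fits(pieces)` is an assumed callback that reports whether
// the given pieces still fit into one column; in a browser it would render
// them into the p-tag and compare its height with the surrounding element.
function fillColumns(pieces, fits) {
  var columns = [];
  var current = [];

  for (var i = 0; i < pieces.length; i++) {
    current.push(pieces[i]);
    if (!fits(current) && current.length > 1) {
      // Overflow: take the last piece back and start a new column with it.
      current.pop();
      columns.push(current);
      current = [pieces[i]];
    }
  }
  if (current.length) columns.push(current);
  return columns;
}

// Usage with a stand-in rule instead of a height check:
// at most 20 characters per column.
var demo = fillColumns(
  ['A', 'wonderful', 'serenity', 'has', 'taken'],
  function (column) { return column.join(' ').length <= 20; }
);
```

A piece that does not fit even on its own is left alone in its column, so the loop cannot produce empty columns.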

An example of what I am trying to achieve can be seen here

If you need any further information, I will be glad to give it to you.

Thanks in advance,

Tobias

EDIT

The string could contain more HTML tags in the future. Is there a way to leave everything between these tags untouched?

EDIT 2

I created a jsfiddle: http://jsfiddle/m9r9q/1/

EDIT 3

Would it be a good idea to remove all HTML tags together with the text they enclose and replace them with placeholders? Then split the string into words and put the untouched HTML back wherever a placeholder is reached? What would be the regex to extract all HTML tags?
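
One way the placeholder idea might look in practice is sketched below. `splitWithPlaceholders` is a hypothetical name, and the regular expression is a deliberately simple assumption: it treats an opening tag plus the nearest closing tag of the same name as one unit, so same-name nesting and attribute values containing `>` are not handled.

```javascript
// Hypothetical sketch of the placeholder approach.
function splitWithPlaceholders(text) {
  var stored = [];
  // First alternative: <name ...> ... </name> (nearest matching close).
  // Second alternative: any single tag such as <br>.
  var elementRe = /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>[\s\S]*?<\/\1\s*>|<[^>]+>/g;

  // Swap each piece of markup for a NUL-delimited index, padded with
  // spaces so that it always becomes its own entry after splitting.
  var masked = text.replace(elementRe, function (html) {
    stored.push(html);
    return ' \u0000' + (stored.length - 1) + '\u0000 ';
  });

  return masked.trim().split(/\s+/).filter(Boolean).map(function (piece) {
    var m = /^\u0000(\d+)\u0000$/.exec(piece);
    // Restore the untouched markup when a placeholder is reached.
    return m ? stored[Number(m[1])] : piece;
  });
}
```

The NUL character is used as the delimiter on the assumption that it never occurs in the input text.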

asked Sep 20, 2013 at 23:20 by Tobias Golbs; edited May 23, 2017 at 12:34 by Community Bot
  • Can you put together a jsfiddle of the situation? – Jake Commented Sep 20, 2013 at 23:25
  • @Jake: Did you see my example? And if not, does that help you to understand what I want to achieve? But nevertheless I will create a jsfiddle :) – Tobias Golbs Commented Sep 20, 2013 at 23:27
  • Did see the example, it's just that we can't modify that code :) – Jake Commented Sep 20, 2013 at 23:27
  • I might be missing something here, but why not use CSS? caniuse.com/#search=column (admittedly IE is the main non-conforming browser). – user2417483 Commented Sep 20, 2013 at 23:30
  • @Jeff: Please consider that CSS columns are not an option for this example. The application needs to be as backwards compatible as possible! – Tobias Golbs Commented Sep 20, 2013 at 23:34

2 Answers

"Although I want to try to extract the HTML parts and add them afterwards untouched"

Forget about that, and about my previous post. I just realized that it is much better to use the browser's built-in engine to operate on HTML code.

You can just use this:

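var text = 'A wonderful serenity has taken possession of my entire soul. <strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.';

// Let the browser parse the markup inside a detached element.
var elem = document.createElement('div');
elem.innerHTML = text;

var array = [];

for (var i = 0, childs = elem.childNodes; i < childs.length; i++) {
  if (childs[i].nodeType === 3 /* document.TEXT_NODE */) {
    // Text node: split into words; skip whitespace-only nodes so that
    // no empty strings end up in the result.
    var words = childs[i].nodeValue.trim();
    if (words) array = array.concat(words.split(/\s+/));
  } else {
    // Any other top-level node is kept whole, markup included.
    array.push(childs[i].outerHTML);
  }
}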

This time it DOES support nested tags, and it also supports all possible syntax without hard-coded exceptions for tags that have no closing tag :)

As I stated before in a comment, you shouldn't do this. But if you insist, here is a possible answer:

var text = 'A wonderful serenity has taken possession of my entire soul. <strong>This is a subheadline</strong><br><br>I am alone, and feel the charm of existence in this spot.';

var array = [],
  tagOpened = false,
  stringBuilder = [];

text.replace(/(<([^\s>]*)[^>]*>|\b[^\s<]*)\s*/g, function(all, word, tag) {
  if (tag) {
    var closing = tag[0] == '/';
    if (closing) {
      stringBuilder.push(all);
      word = stringBuilder.join('');
      stringBuilder = [];
      tagOpened = false;
    } else {
      tagOpened = tag.toLowerCase() != 'br';
    }
  }
  if (tagOpened) {
    stringBuilder.push(all);
  } else {
    array.push(word);
  }
  return '';
});

if (stringBuilder.length) array.push(stringBuilder.join(''));

It doesn't support nested tags. You can add this functionality by implementing a stack for your opened tags.
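
A minimal sketch of that stack idea, for illustration only: `splitKeepingTags` and `VOID_TAGS` are hypothetical names, the list of tags without a closing tag is an assumption, and the stack only tracks nesting depth (it does not check that closing tags match their opening tags). Attribute values containing `>` and HTML comments are not handled.

```javascript
// Assumption: these tags never have a closing tag.
var VOID_TAGS = ['br', 'hr', 'img', 'input', 'wbr'];

function splitKeepingTags(text) {
  var result = [];
  var openTags = []; // stack of currently open tag names
  var buffer = '';   // markup collected while inside an element
  // A token is a tag, a run of non-space text, or a run of whitespace.
  var tokenRe = /<(\/?)([a-zA-Z][^\s\/>]*)[^>]*>|[^<\s]+|\s+/g;
  var match;

  while ((match = tokenRe.exec(text)) !== null) {
    var token = match[0];
    var tagName = match[2] ? match[2].toLowerCase() : null;

    if (tagName === null) {
      // Text or whitespace: keep verbatim inside an element,
      // otherwise emit words and drop the whitespace between them.
      if (openTags.length) buffer += token;
      else if (/\S/.test(token)) result.push(token);
      continue;
    }

    var isClosing = match[1] === '/';
    var isVoid = VOID_TAGS.indexOf(tagName) !== -1 || /\/>$/.test(token);

    if (isVoid) {
      if (openTags.length) buffer += token;
      else result.push(token);
    } else if (isClosing) {
      buffer += token;
      openTags.pop();
      // The outermost element just closed: emit it as one entry.
      if (!openTags.length) { result.push(buffer); buffer = ''; }
    } else {
      openTags.push(tagName);
      buffer += token;
    }
  }
  // Unclosed markup at the end is emitted as-is.
  if (buffer) result.push(buffer);
  return result;
}
```

Called with something like 'a <strong>b <em>c</em> d</strong> e', the nested `<em>` stays inside the single `<strong>` entry, because an entry is only emitted once the stack is empty again.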
