
javascript - Regex to match MediaWiki template and its parameters - Stack Overflow


I'm writing a simple JavaScript script to add a specific parameter to a specific template in the article that is currently being edited.

Wikipedia templates are structured in the following format:

 {{Template name|unnamed parameter|named parameter=some value|another parameter=[[target article|article name]]|parameter={{another template|another template's parameter}}}}

A template can also span multiple lines, for example:

{{Template 
|name=John
|surname=Smith
|pob=[[London|London, UK]]
}}

For further reference, please have a look at http://en.wikipedia/wiki/Help:Template

First, I'd like to match the entire template. I came up with a partial solution:

document.editform.wpTextbox1.value.match(/\{\{template name((.|\n)*?)\}\}$/gmis)

However, the problem is that it only matches the text from the opening braces up to the closing braces of the first nested template (as in the first example).

In addition, I'd like to fetch the template's parameters as an array, in a specific order: Array( value of parameter pob, value of parameter name, value of parameter surname, value of parameter pod (empty in this case, because it is unset) )

I'd use that to clean up the non-standard formatting in some articles and to add some new parameters.

Thank you!

asked Jun 30, 2011 at 8:50 by smihael; edited Jul 2, 2011 at 20:07
  • 1 It appears that Wikipedia templates are not a regular language and, as such, regular expressions aren't really the correct tool to parse them with. You might be better off looking for a parser in another language and porting it to JavaScript code. – Andy E Commented Jun 30, 2011 at 9:07
  • Hope you don't mind that I've added a regex tag, so that those who are good at regular expressions in javascript will notice this question. Also, I think the title is a bit hard to understand: I suggest using something like "Regular expression to match MediaWiki template inclusion syntax" (since Wikipedia uses MediaWiki engine). – Anton Strogonoff Commented Jun 30, 2011 at 9:08
  • I'm pretty sure one could parse parameters with regex. I've also found another similar question (link), partially solved with regex. But it's not all I need. – smihael Commented Jun 30, 2011 at 12:56
  • True, most regex implementations can parse (or match) far more than regular languages, but it's often not a good idea because it results in a horrific regex which is incomprehensible to most people and therefore a nightmare to maintain. – Bart Kiers Commented Jun 30, 2011 at 13:15
  • So what would you suggest? However, there is still a limitation to JavaScript, because Wikipedia's installation of MediaWiki doesn't support other userscript languages. – smihael Commented Jul 2, 2011 at 13:47

1 Answer


Write a simple parser.

Solving this kind of problem with a regexp is not the right approach. It's the same as matching brackets, which is difficult to do with a regexp. Regexps are generally not suitable for nested expressions.
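A quick way to see the problem (a minimal sketch; the sample string and variable names are made up for illustration): a lazy pattern stops at the first `}}` it meets, which closes the nested template, while a greedy pattern runs on to the last `}}` in the text, which may belong to an unrelated template.

```javascript
// Hypothetical sample: a template with a nested template, followed by a second template.
var text = '{{Infobox|a={{nested|x}}|b=2}} and {{Other|c=3}}';

// Lazy quantifier: stops at the first "}}", i.e. the end of the nested template.
var lazy = text.match(/\{\{Infobox[\s\S]*?\}\}/)[0];

// Greedy quantifier: runs to the last "}}" in the text, swallowing the second template.
var greedy = text.match(/\{\{Infobox[\s\S]*\}\}/)[0];

console.log(lazy);   // {{Infobox|a={{nested|x}}
console.log(greedy); // {{Infobox|a={{nested|x}}|b=2}} and {{Other|c=3}}
```

Neither result is the template itself, because a plain pattern has no way to count how many `{{` are still open.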

Try something like this:

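var parts = src.split(/(\{\{|\}\})/);
var depth = 0; // current template nesting depth
for (var i = 0; i < parts.length; i++) {
  if (parts[i] == '{{') {
    depth++; // starting new (sub) template
  } else if (parts[i] == '}}') {
    depth--; // ending (sub) template
  } else {
    // content (inside a template when depth > 0, otherwise outside)
  }
}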

This is just pseudocode, as I'm in a rush right now; I'll update it to working code later...
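For the first half of the question (matching the entire template), the same split can be extended with a depth counter. The following is only a sketch under some assumptions: `findTemplate` and its parameters are hypothetical names, the template name is compared case-insensitively, and triple braces such as `{{{1}}}` are not handled.

```javascript
// Returns the full text of the first template called `name` in `src`,
// including any nested templates, or null if it is missing or unclosed.
function findTemplate(src, name) {
  var parts = src.split(/(\{\{|\}\})/);
  var depth = 0;      // nesting depth inside the template being collected
  var collected = ''; // text of the template gathered so far
  var wanted = name.toLowerCase();

  for (var i = 0; i < parts.length; i++) {
    var part = parts[i];
    if (depth === 0) {
      // Not collecting yet: look for "{{" followed by the wanted name.
      if (part === '{{' && i + 1 < parts.length) {
        var head = parts[i + 1].split('|')[0].replace(/^\s+|\s+$/g, '').toLowerCase();
        if (head === wanted) {
          depth = 1;
          collected = part;
        }
      }
    } else {
      collected += part;
      if (part === '{{') depth++;
      else if (part === '}}' && --depth === 0) return collected;
    }
  }
  return null; // not found, or braces are unbalanced
}

var sample = 'intro {{Template name|x={{nested|y}}|z=1}} outro {{Other|a=b}}';
console.log(findTemplate(sample, 'template name'));
// {{Template name|x={{nested|y}}|z=1}}
```

Unlike the regex in the question, this returns only once the `}}` that brings the depth back to zero is reached, so nested templates stay inside the result.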

UPDATE (9th August 2011)

var NO_TPL = 0, // outside any tpl - ignoring...
    IN_TPL = 1, // inside tpl
    IN_LIST = 3; // inside list of arguments

function parseWiki(src) {
  var tokens = src.split(/(\{\{|\}\}|\||=|\[\[|\]\])/),
      i = -1, end = tokens.length - 1,
      token, next, state = NO_TPL,
      work = [], workChain = [], stateChain = [];

  function trim(value) {
    return value.replace(/^\s*/, '').replace(/\s*$/, '');
  }

  // get next non empty token
  function getNext(next) {
    while (!next && i < end) next = trim(tokens[++i]);
    return next;
  }

  // go into tpl / list of arguments
  function goDown(newState, newWork, newWorkKey) {
    stateChain.push(state);
    workChain.push(work);

    if (newWorkKey) {
      work[newWorkKey] = newWork;
    } else {
      work.push(newWork);
    }

    work = newWork;
    state = newState;
  }

  // jump up from tpl / list of arguments
  function goUp() {
    work = workChain.pop();
    state = stateChain.pop();
  }

  // state machine
  while ((token = getNext())) {
    switch(state) {

      case IN_TPL:
        switch(token) {
          case '}}': goUp(); break;
          case '|': break;
          default:
            next = getNext();
            if (next != '=') throw new Error('invalid: expected "=" after "' + token + '"');
            next = getNext();
            if (next == '[[') {
              goDown(IN_LIST, [], token);
            } else if (next == '{{') {
              goDown(IN_TPL, {id: getNext()}, token);
            } else {
              work[token] = next;
            }
        }
        break;

      case IN_LIST:
        switch(token) {
          case ']]': goUp(); break;
          case '|': break;
          default: work.push(token);
        }
        break;

      case NO_TPL:
        if (token == '{{') {
          next = getNext();
          goDown(IN_TPL, {id: next});
        }
        break;
    }
  }

  return work;
}

UNIT TESTS

describe('wikiTpl', function() {
  it('should do empty tpl', function() {
    expect(parseWiki('{{name}}'))
      .toEqual([{id: 'name'}]);
  });

  it('should ignore text outside from tpl', function() {
    expect(parseWiki(' abc {{name}} x y'))
      .toEqual([{id: 'name'}]);
  });

  it('should do simple param', function() {
    expect(parseWiki('{{tpl | p1= 2}}'))
      .toEqual([{id: 'tpl', p1: '2'}]);
  });

  it('should do list of arguments', function() {
    expect(parseWiki('{{name | a= [[1|two]]}}'))
      .toEqual([{id: 'name', a: ['1', 'two']}]);
  });

  it('should do param after list', function() {
    expect(parseWiki('{{name | a= [[1|two|3]] | p2= true}}'))
      .toEqual([{id: 'name', a: ['1', 'two', '3'], p2: 'true'}]);
  });

  it('should do more tpls', function() {
    expect(parseWiki('{{first | a= [[1|two|3]] }} odd test {{second | b= 2}}'))
      .toEqual([{id: 'first', a: ['1', 'two', '3']}, {id: 'second', b: '2'}]);
  });

  it('should allow nested tpl', function() {
    expect(parseWiki('{{name | a= {{nested | p1= 1}} }}'))
      .toEqual([{id: 'name', a: {id: 'nested', p1: '1'}}]);
  });
});

Note: I'm using Jasmine's syntax for these unit tests. You can easily run them using AngularJS, which comes with a whole testing environment; check it out at http://angularjs
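To get the parameters as an array in a fixed order, as the question asks, the parsed object can be read out key by key. This is a rough sketch rather than part of the parser: `paramsInOrder` is a hypothetical helper, and the object shape (`id` for the template name, one property per named parameter) is assumed from the unit tests above.

```javascript
// Returns the values of the parameters listed in `order`, in that order,
// using '' for any parameter the template does not set.
function paramsInOrder(tpl, order) {
  var values = [];
  for (var i = 0; i < order.length; i++) {
    var key = order[i];
    values.push(Object.prototype.hasOwnProperty.call(tpl, key) ? tpl[key] : '');
  }
  return values;
}

// Assumed result of parsing the multi-line example from the question.
var tpl = {id: 'Template', name: 'John', surname: 'Smith', pob: ['London', 'London, UK']};

var ordered = paramsInOrder(tpl, ['pob', 'name', 'surname', 'pod']);
console.log(ordered);
// pob is the list ['London', 'London, UK'], then 'John', 'Smith', and '' for the unset pod
```

Using `hasOwnProperty` rather than a truthiness check keeps a parameter that is set to an empty or falsy value distinct from properties inherited from the prototype.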
