javascript - Use RegEx to parse a string with complicated delimiting

This is a RegEx question.

Thanks for any help, and please be patient, as RegEx is definitely not my strength!

Entirely as background...my reason for asking is that I want to use RegEx to parse strings similar to SVG path data segments. I’ve looked for previous answers that parse both the segments and their segment-attributes, but found nothing that does the latter properly.

Here are some example strings like the ones I need to parse:

M-11.11,-22
L.33-44  
ac55         66 
h77  
M88 .99  
Z

I need to have the strings parsed into arrays like this:

["M", -11.11, -22]
["L", .33, -44]
["ac", 55, 66]
["h", 77]
["M", 88, .99]
["Z"]

So far I found this code on this answer: Parsing SVG "path" elements with C# - are there libraries out there to do this? The post is C#, but the regex was useful in javascript:

var argsRX = /[\s,]|(?=-)/; 
var args = segment.split(argsRX);

Here's what I get:

 [ "M", -11.11, -22, <empty> ]
 [ "L.33", -44, <empty>, <empty> ]
 [ "ac55", <empty>, <empty>, <empty>, 66, <empty> ]
 [ "h77", <empty>, <empty> ]
 [ "M88", .99, <empty>, <empty> ]
 [ "Z", <empty> ]

Problems when using this regex:

An unwanted empty array element is being put at the end of each string's array.
If multiple spaces are delimiters, an unwanted empty array element is being created for each extra space.
If a number immediately follows the opening letters, that number is being attached to the letters, but should be a separate array element.
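
The first two problems can be reproduced on single segments. This is a minimal sketch that uses the same argsRX as above; the two input strings are taken from the test data, and the results are strings, since split never converts to numbers:

```javascript
var argsRX = /[\s,]|(?=-)/;

// A trailing space is itself a delimiter, so split emits one empty
// string after it.
var first = "M-11.11,-22 ".split(argsRX);
console.log(first);

// Each space in a run of spaces is a separate delimiter, so every
// extra space adds another empty string. The letters and the first
// number stay glued together because nothing separates them.
var second = "ac55    66 ".split(argsRX);
console.log(second);
```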

Here are more complete definitions of incoming strings:

Each string starts with 1 or more letters (mixed case).
Next are zero or more numbers.
The numbers might have minus signs (always preceding).
The numbers might have a decimal point anywhere in the number (except the end).
Possible delimiters are: comma, space, spaces, the minus sign.
A comma with space(s) in front or back is also a possible delimiter.
Even though minus signs are delimiters, they must also remain with their number.
A number might immediately follow the opening letters (no space) and that number should be separate.

Here is test code I've been using:

<!doctype html>
<html>
<head>
<link rel="stylesheet" type="text/css" media="all" href="css/reset.css" /> <!-- reset css -->
<script type="text/javascript" src="http://code.jquery./jquery.min.js"></script>

<style>
    body{ background-color: ivory; }
</style>

<script>
    $(function(){


var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z" 

// separate pathData into segments
var segmentRX = /[a-z]+[^a-z]*/ig;
var segments = pathData.match(segmentRX);

for(var i=0;i<segments.length;i++){
    var segment=segments[i];
    //console.log(segment);

    var argsRX = /[\s,]|(?=-)/; 
    var args = segment.split(argsRX);
    for(var j=0;j<args.length;j++){
        var arg=args[j];
        console.log(arg.length+": "+arg);
    }

}

    }); // end $(function(){});
</script>

</head>

<body>
</body>
</html>

asked Jun 10, 2013 at 6:08 by markE, edited May 23, 2017 at 10:27

Is ["M", 88 .99] supposed to be ["M", 88, .99]? – Robert McKee Commented Jun 10, 2013 at 6:25
OOPS, a typo actually! I meant to type an array with 3 elements: "M", 88, and .99 -- sorry. – markE Commented Jun 10, 2013 at 6:26

5 Answers

I had to perform very similar parsing of data for reporting live results at the nation's largest track meet. http://ksathletics./2013/statetf/liveresults.js Although there was a lot of both client and server-side code involved, the principles are the same. In fact, the kind of data was practically identical.

I suggest that you do not use one "jumbo" regular expression, but rather one expression which separates data pieces and another which breaks each data piece into its main identifier and the following values. This solves the problem of various delimiters by allowing the second-level regular expression to match the definition of data values rather than having to distinguish delimiters. (This also is more efficient than putting all of the logic into a single regular expression.)

This is a solution tested to work on the input you gave.

<script>
var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";

function parseData(pathData) {
    /* first split the data into pieces: letters plus everything up to the next letters */
    var pieces = pathData.match(/([a-z]+[-.,\d ]*)/gi) || [], i;
    /* now parse each piece into its own array: the letters, then each number */
    for (i = 0; i < pieces.length; i++)
        pieces[i] = pieces[i].match(/([a-z]+|-?[.\d]*\d)/gi);
    return pieces;
}

var pathPieces = parseData(pathData);
document.write(pathPieces.join('<br />'));
console.log(pathPieces);
</script>

http://dropoff.us/private/1370846040-1-test-path-data.html

Update: The results are exactly equivalent to the specified output you want. One thought that came to mind, however, was whether you also want or need type conversion from strings to numbers. Do you need that as well? I'm just thinking of the next step beyond parsing the data.
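
On the type-conversion point, here is a minimal sketch of one way to do it after parsing. The helper name toTypedSegments is hypothetical and not part of the code above. It assumes each piece has the letters first and only numeric strings after them:

```javascript
// Hypothetical helper: keeps the leading letters as a string and
// converts every following token to a number with parseFloat.
function toTypedSegments(pieces) {
    return pieces.map(function (piece) {
        return piece.map(function (token, index) {
            return index === 0 ? token : parseFloat(token);
        });
    });
}

var typed = toTypedSegments([
    ["M", "-11.11", "-22"],
    ["L", ".33", "-44"],
    ["Z"]
]);
console.log(typed);
```

parseFloat accepts a leading minus sign and a leading decimal point, so tokens such as "-22" and ".33" convert without extra handling.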

^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?

Explanation

^                # start of string
([a-z]+)         # one or more letters, match into group 1
(?:              # non-capturing group
  (-?\d*\.?\d+)  #   first number (optional sign & decimal point, digits)
  [^\d\n\r.-]*   #   delimiting characters (anything but these)
  (-?\d*\.?\d+)? #   second number
)?               # end non-capturing group, make optional

Use with "case insensitive" flag.

http://rubular./r/EyUNmoONJ7
https://regex101./r/gTczcD/1
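
A minimal sketch of applying this expression to one segment at a time. It is written with the decimal point escaped (\.) so that it matches only a literal dot. The helper name parseSegment and the sample segments are assumptions for illustration:

```javascript
// Case-insensitive; the decimal point is escaped so it matches only ".".
var segmentPattern = /^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?/i;

// Hypothetical helper: returns [letters, number1?, number2?] for one
// segment, or null if the segment does not start with letters.
function parseSegment(segment) {
    var m = segmentPattern.exec(segment);
    if (!m) return null;
    // m[0] is the whole match; drop it, then drop groups that did not take part
    return m.slice(1).filter(function (group) {
        return group !== undefined;
    });
}

var sampleSegments = ["M-11.11,-22", "L.33-44  ", "ac55    66 ", "h77  ", "M88 .99  ", "Z"];
var parsedSegments = sampleSegments.map(parseSegment);
console.log(parsedSegments);
```

The expression captures at most two numbers, so a segment with more arguments would need a different approach.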

function parsePathData(pathData)
{
    var tokenizer = /([a-z]+)|([+-]?(?:\d+\.?\d*|\.\d+))/gi,
        match,
        current,
        commands = [];

    tokenizer.lastIndex = 0;
    while (match = tokenizer.exec(pathData))
    {
        if (match[1])
        {
            if (current) commands.push(current);
            current = [ match[1] ];
        }
        else
        {
            if (!current) current = [];
            current.push(match[2]);
        }
    }
    if (current) commands.push(current);
    return commands;
}

var pathData = "M-11.11,-22 L.33-44  ac55    66 h77  M88 .99  Z";
var commands = parsePathData(pathData);
console.log(commands);

Output:

[ [ "M", "-11.11", "-22" ],
  [ "L", ".33", "-44" ],
  [ "ac", "55", "66" ],
  [ "h", "77" ],
  [ "M", "88", ".99" ],
  [ "Z" ] ]

Your "pattern" consists of one or more letters, followed by a decimal number, followed by another, delimited by either a comma or whitespace.

Regex: /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i

You can try with this pattern:

/([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i

(a bit long, but it seems to work)

Note that the last number can be in the capture group \3 or \4
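
A minimal sketch of reading the result, including the group 3 or group 4 case. The sketch uses the i flag so that upper-case letters match. The helper name parseWithGroups is hypothetical:

```javascript
var groupPattern = /([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i;

// Hypothetical helper: collects the letters and whichever number
// groups took part in the match.
function parseWithGroups(segment) {
    var m = groupPattern.exec(segment);
    if (!m) return null;
    var result = [m[1]];
    if (m[2] !== undefined) result.push(m[2]);
    // Group 3 is filled when a minus sign acts as the delimiter;
    // group 4 is filled after a comma or whitespace.
    if (m[3] !== undefined) result.push(m[3]);
    if (m[4] !== undefined) result.push(m[4]);
    return result;
}

console.log(parseWithGroups("L.33-44  "));   // second number comes from group 3
console.log(parseWithGroups("M-11.11,-22")); // second number comes from group 4
```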
