This is a RegEx question.
Thanks for any help, and please be patient, as RegEx is definitely not my strength!
As background: I want to use RegEx to parse strings similar to SVG path data segments. I’ve looked for previous answers that parse both the segments and their segment attributes, but found none that handle the attributes properly.
Here are some example strings like the ones I need to parse:
M-11.11,-22
L.33-44
ac55 66
h77
M88 .99
Z
I need to have the strings parsed into arrays like this:
["M", -11.11, -22]
["L", .33, -44]
["ac", 55, 66]
["h", 77]
["M", 88, .99]
["Z"]
So far I found this code in this answer: Parsing SVG "path" elements with C# - are there libraries out there to do this? The post is about C#, but the regex was useful in JavaScript:
var argsRX = /[\s,]|(?=-)/;
var args = segment.split(argsRX);
Here's what I get:
[ "M", -11.11, -22, <empty> ]
[ "L.33", -44, <empty>, <empty> ]
[ "ac55", <empty>, <empty>, <empty>, 66, <empty> ]
[ "h77", <empty>, <empty> ]
[ "M88", .99, <empty>, <empty> ]
[ "Z", <empty> ]
Problems when using this regex:
- An unwanted empty array element is being put at the end of each string's array.
- If multiple spaces are delimiters, an unwanted empty array element is being created for each extra space.
- If a number immediately follows the opening letters, that number is being attached to the letters, but should be a separate array element.
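The three problems can be reproduced in isolation. This is a minimal sketch; the input strings (including the trailing and doubled spaces) are assumptions chosen to trigger each case, and the variable names are only illustrative:

```javascript
var argsRX = /[\s,]|(?=-)/;

// A trailing space after the last number yields a trailing empty string.
var trailing = "M-11.11,-22 ".split(argsRX);

// Two consecutive spaces yield an empty string between them.
var doubled = "ac55  66".split(argsRX);

// Nothing separates the letter from the digits, so they stay together.
var attached = "h77".split(argsRX);

console.log(trailing, doubled, attached);
```

Filtering out empty strings afterwards would likely address the first two problems, but not the third, since split() never sees a delimiter between the letters and the first number.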
Here are more complete definitions of incoming strings:
- Each string starts with 1 or more letters (mixed case).
- Next are zero or more numbers.
- The numbers might have minus signs (always preceding).
- The numbers might have a decimal point anywhere in the number (except the end).
- Possible delimiters are: comma, space, spaces, the minus sign.
- A comma with space(s) in front or back is also a possible delimiter.
- Even though minus signs are delimiters, they must also remain with their number.
- A number might immediately follow the opening letters (no space) and that number should be separate.
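The rules above can also be written down as a small table of test cases, which makes it easier to check any candidate regex. This is only a sketch: the names `cases` and `failingInputs` are hypothetical, and the expected values are kept as strings (an assumption, since a regex produces strings rather than numbers):

```javascript
// Each entry pairs an input string with the tokens it should produce.
var cases = [
  ["M-11.11,-22", ["M", "-11.11", "-22"]],
  ["L.33-44",     ["L", ".33", "-44"]],
  ["ac55 66",     ["ac", "55", "66"]],
  ["h77",         ["h", "77"]],
  ["M88 .99",     ["M", "88", ".99"]],
  ["Z",           ["Z"]]
];

// Hypothetical helper: returns the inputs for which parse() gives the wrong tokens.
function failingInputs(parse) {
  return cases.filter(function (c) {
    return JSON.stringify(parse(c[0])) !== JSON.stringify(c[1]);
  }).map(function (c) {
    return c[0];
  });
}

// Check the split() approach from above against the table.
console.log(failingInputs(function (s) {
  return s.split(/[\s,]|(?=-)/);
}));
```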
Here is test code I've been using:
<!doctype html>
<html>
<head>
<link rel="stylesheet" type="text/css" media="all" href="css/reset.css" /> <!-- reset css -->
<script type="text/javascript" src="http://code.jquery./jquery.min.js"></script>
<style>
    body{ background-color: ivory; }
</style>
<script>
$(function(){
    var pathData = "M-11.11,-22 L.33-44 ac55 66 h77 M88 .99 Z";
    // separate pathData into segments
    var segmentRX = /[a-z]+[^a-z]*/ig;
    var segments = pathData.match(segmentRX);
    for(var i=0;i<segments.length;i++){
        var segment=segments[i];
        //console.log(segment);
        var argsRX = /[\s,]|(?=-)/;
        var args = segment.split(argsRX);
        for(var j=0;j<args.length;j++){
            var arg=args[j];
            console.log(arg.length+": "+arg);
        }
    }
}); // end $(function(){});
</script>
</head>
<body>
</body>
</html>
asked Jun 10, 2013 at 6:08 by markE; edited May 23, 2017 at 10:27
2 Comments
- Is ["M", 88 .99] supposed to be ["M", 88, .99]? – Robert McKee, commented Jun 10, 2013 at 6:25
- OOPS, a typo actually! I meant to type an array with 3 elements: "M", 88, and .99 -- sorry. – markE, commented Jun 10, 2013 at 6:26
5 Answers
I had to perform very similar parsing of data for reporting live results at the nation's largest track meet. http://ksathletics./2013/statetf/liveresults.js Although there was a lot of both client and server-side code involved, the principles are the same. In fact, the kind of data was practically identical.
I suggest that you do not use one "jumbo" regular expression, but rather one expression which separates data pieces and another which breaks each data piece into its main identifier and the following values. This solves the problem of various delimiters by allowing the second-level regular expression to match the definition of data values rather than having to distinguish delimiters. (This also is more efficient than putting all of the logic into a single regular expression.)
This is a solution tested to work on the input you gave.
<script>
var pathData = "M-11.11,-22 L.33-44 ac55 66 h77 M88 .99 Z";
function parseData(pathData) {
    // first pass: split the path data into pieces (letters plus trailing values)
    var pieces = pathData.match(/([a-z]+[-.,\d ]*)/gi), i;
    // second pass: break each piece into its letters and its numbers
    for (i=0; i<pieces.length; i++)
        pieces[i] = pieces[i].match(/([a-z]+|-?[.\d]*\d)/gi);
    return pieces;
}
var pathPieces = parseData(pathData);
document.write(pathPieces.join('<br />'));
console.log(pathPieces);
</script>
http://dropoff.us/private/1370846040-1-test-path-data.html
Update: The results are exactly equivalent to the specified output you want. One thought that came to mind, however, was whether you also want or need type conversion from strings to numbers. Do you need that as well? I'm just thinking of the next step beyond parsing the data.
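If type conversion is wanted, it could be done as a separate pass over the parsed pieces. The following is a minimal sketch, not part of the tested solution above; `toNumbers` is a hypothetical helper name, and it assumes every token after the first one in each piece is a valid number string:

```javascript
// Hypothetical helper: keeps the leading letters as a string and
// converts every following token to a number with parseFloat.
function toNumbers(pieces) {
  return pieces.map(function (piece) {
    return piece.map(function (token, index) {
      return index === 0 ? token : parseFloat(token);
    });
  });
}

var converted = toNumbers([
  ["M", "-11.11", "-22"],
  ["L", ".33", "-44"],
  ["Z"]
]);
console.log(converted);
```

Keeping the conversion separate from the regular expressions means the parsing step can stay purely string-based.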
^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?
Explanation
^                 # start of string
([a-z]+)          # one or more letters, captured in group 1
(?:               # non-capturing group
  (-?\d*\.?\d+)   # first number (optional sign and decimal point, digits)
  [^\d\n\r.-]*    # delimiting characters (anything but these)
  (-?\d*\.?\d+)?  # second number (optional)
)?                # end non-capturing group, make optional
Use with "case insensitive" flag.
- http://rubular./r/EyUNmoONJ7
- https://regex101./r/gTczcD/1
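A possible way to apply this kind of pattern in JavaScript is sketched below. It reuses the segment-splitting regex from the question, and `parseSegment` is a hypothetical helper name. The sketch escapes the decimal point as `\.`, which is an assumption about the intended pattern, and it handles at most two numbers per segment because the pattern has only two number groups:

```javascript
// Anchored, case-insensitive pattern: letters, then up to two numbers.
var segmentPattern = /^([a-z]+)(?:(-?\d*\.?\d+)[^\d\n\r.-]*(-?\d*\.?\d+)?)?/i;

// Hypothetical helper: returns [letters, number1?, number2?] or null.
function parseSegment(segment) {
  var m = segmentPattern.exec(segment);
  if (!m) return null;
  var result = [m[1]];
  // Groups that did not participate in the match are undefined.
  if (m[2] !== undefined) result.push(m[2]);
  if (m[3] !== undefined) result.push(m[3]);
  return result;
}

var pathData = "M-11.11,-22 L.33-44 ac55 66 h77 M88 .99 Z";
// Split into segments first (regex taken from the question), then parse each.
var parsedSegments = pathData.match(/[a-z]+[^a-z]*/ig).map(parseSegment);
console.log(parsedSegments);
```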
function parsePathData(pathData)
{
    // group 1: command letters, group 2: a signed decimal number
    var tokenizer = /([a-z]+)|([+-]?(?:\d+\.?\d*|\.\d+))/gi,
        match,
        current,
        commands = [];
    tokenizer.lastIndex = 0;
    while (match = tokenizer.exec(pathData))
    {
        if (match[1])
        {
            // letters start a new command; store the previous one first
            if (current) commands.push(current);
            current = [ match[1] ];
        }
        else
        {
            // a number belongs to the command currently being built
            if (!current) current = [];
            current.push(match[2]);
        }
    }
    if (current) commands.push(current);
    return commands;
}
var pathData = "M-11.11,-22 L.33-44 ac55 66 h77 M88 .99 Z";
var commands = parsePathData(pathData);
console.log(commands);
Output:
[ [ "M", "-11.11", "-22" ],
[ "L", ".33", "-44" ],
[ "ac", "55", "66" ],
[ "h", "77" ],
[ "M", "88", ".99" ],
[ "Z" ] ]
Your "pattern" consists of one or more letters, followed by a decimal number, followed by another number, delimited by either a comma or whitespace.
Regex: /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i
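A short sketch of how this regex behaves on individual segments follows; `parseTwoNumberSegment` is a hypothetical helper name. Because both number groups are mandatory in this pattern, segments with fewer than two numbers (such as h77 or Z) do not match at all:

```javascript
var twoNumberPattern = /([a-z]+)(-?(?:\d*\.)?\d+)(?:[,\s]+|(?=-))(-?(?:\d*\.)?\d+)/i;

// Hypothetical helper: returns [letters, number1, number2], or null
// when the segment does not contain letters followed by two numbers.
function parseTwoNumberSegment(segment) {
  var m = twoNumberPattern.exec(segment);
  return m ? [m[1], m[2], m[3]] : null;
}

console.log(parseTwoNumberSegment("M-11.11,-22"));
console.log(parseTwoNumberSegment("L.33-44"));
console.log(parseTwoNumberSegment("h77"));
```

Segments with zero or one number would therefore need separate handling, for example by making the later groups optional.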
You can try with this pattern:
/([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/
(a bit long, but it seems to work)
Note that the last number can be in the capture group \3 or \4
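Collecting the result from this pattern could look like the sketch below. `parseLooseSegment` is a hypothetical helper name, and the `i` flag is an addition assumed here because the sample strings start with uppercase letters, which `[a-z]` alone would not match:

```javascript
// Same pattern as above, with the case-insensitive flag added (assumption).
var loosePattern = /([a-z]+)(-?(?:\d*\.)?\d+)?(?:\s+|,|(-(?:\d*\.)?\d+))?(-?(?:\d*\.)?\d+)?/i;

// Hypothetical helper: gathers whichever number groups took part in the match.
function parseLooseSegment(segment) {
  var m = loosePattern.exec(segment);
  if (!m) return null;
  var result = [m[1]];
  for (var g = 2; g <= 4; g++) {
    // A negative number with no other delimiter lands in group 3,
    // a number after whitespace or a comma lands in group 4.
    if (m[g] !== undefined) result.push(m[g]);
  }
  return result;
}

console.log(parseLooseSegment("L.33-44"));
console.log(parseLooseSegment("M88 .99"));
```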