I have strings which consist of a name and two digits. I would like to extract the name and the digits into one variable for each. The problem I have is that some names have spaces in them. When I split on /\s+/ the name is split into two.
my (${st_name}, $val1, $val2) = split(/\s+/, $line, 3);
I have tried to split on /\d+/, I do not get the digits. I have tried to get the index of the first digit, not sure if it is really
my $index = index ($line, \d);
I would appreciate any assistance. Code tried:
use strict;
use warnings;
while (my $line = <DATA>){
my (${st_name}, $val1, $val2) = split(/\s+/, $line, 3); #doesn't work
my $index = index ($line, \d);
${st_name}=$line(0, $index);
my ($val1, $val2) = $line($index)
__DATA__
Maputsoe 2 1
Butha-Buthe (Butha-Buthe District) 2 1
Asked 16 hours ago by Zilore Mumba; edited 16 hours ago by Robert.
3 Answers
You can make a regular expression match and capture the pieces you want. It looks like you want some text, then a space, then a number, more space(s), and another number?
use strict;
use warnings;
while (my $line = <DATA>) {
    my ($st_name, $val1, $val2) = $line =~ m/^(.+)\s+(\d+)\s+(\d+)/;
    print "$st_name, $val1, $val2\n";
}
__DATA__
Maputsoe 2 1
Butha-Buthe (Butha-Buthe District) 2 1
This prints
Maputsoe, 2, 1
Butha-Buthe (Butha-Buthe District), 2, 1
The regular expression matches one or more (+) characters (.), followed by one or more whitespace characters (\s+), followed by one or more digits (\d+), and again whitespace and digits.
The expression /^(.*?)\s+(\d+)\s+(\d+)$/ should work.
Explanation:
- ^(.*?): Captures the name part. The .*? is a non-greedy match, so it takes as little as possible and stops at the whitespace before the final two numbers
- \s+: Matches one or more whitespace characters
- (\d+): Captures the first group of digits
- \s+: Matches one or more whitespace characters
- (\d+)$: Captures the second sequence of digits at the end of the line
use strict;
use warnings;
while (my $line = <DATA>) {
    if ($line =~ /^(.*?)\s+(\d+)\s+(\d+)$/) {
        my $st_name = $1;
        my $val1 = $2;
        my $val2 = $3;
        print "Name: $st_name, Val1: $val1, Val2: $val2\n";
    } else {
        warn "Line does not match the expected pattern: $line";
    }
}
__DATA__
Maputsoe 2 1
Butha-Buthe (Butha-Buthe District) 2 1
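The patterns above rely on backtracking to leave the last two numbers for the final capture groups. As a minimal sketch of that idea, assuming the two values are always the last two whitespace-separated integers on the line, the match can be wrapped in a small helper. The name parse_line and the third sample line are made up for illustration and are not part of the question or the answers.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into (name, val1, val2).
# Assumes the two values are the last two whitespace-separated integers.
# Returns an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    # \s*$ tolerates trailing whitespace as well as the newline.
    return $line =~ /^(.+?)\s+(\d+)\s+(\d+)\s*$/;
}

my @lines = (
    "Maputsoe 2 1\n",
    "Butha-Buthe (Butha-Buthe District) 2 1\n",
    "Area 51 12 34\n",    # made-up name that itself contains digits
);

for my $line (@lines) {
    my ($name, $val1, $val2) = parse_line($line)
        or next;          # skip lines that do not match
    print "$name | $val1 | $val2\n";
}
```

Because the pattern is anchored at the end of the line, a name that contains digits should still come out whole, with only the final two numbers treated as values.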
Your code is filled with nonsense. For example:
while (my $line = <DATA>){
my (${st_name}, $val1, $val2) = split(/\s+/, $line, 3); #doesn't work
- You never finished the while loop. How can it work?
- You don't have to use the curly braces for variables in regular code. $st_name is fine, and it doesn't do anything different than ${st_name}.
my $index = index ($line, \d);
You can't use \d unquoted. That will turn into an error. index takes a string as an argument, so you need to quote it, i.e. "\d". But you cannot use regexes with index, only string literals, and \d is a regex character class. So all in all, this is just a mess.
${st_name}=$line(0, $index);
my ($val1, $val2) = $line($index)
The idea that you could put parentheses on a variable to make it do something is quite strange, and certainly not a Perl idiom. That's just not how Perl works.
But the thing you are trying to do can be done with index and substr. With the exception that index can't search for a regex. So you would have to use a pattern match and pos instead. Then it would look something like:
my $line = "Butha-Buthe (Butha-Buthe District) 2 1";
if ($line =~ /\d/g) { # have to use /g
    my $pos = pos $line;
    $pos--; # back up one
    my $start = substr $line, 0, $pos;
    my $end = substr $line, $pos;
    print "Pos: $pos, start: '$start', end: '$end'\n";
}
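As a variation on the pos approach, the start offset of the last successful match is also available in the built-in @- array, so no /g flag and no decrement are needed. This is only a sketch; split_at_first_digit is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split $line at its first digit.
# Returns (text before the digit, text from the digit onward),
# or an empty list when the line contains no digit.
sub split_at_first_digit {
    my ($line) = @_;
    return unless $line =~ /\d/;
    my $pos = $-[0];    # offset where the match started
    return (substr($line, 0, $pos), substr($line, $pos));
}

my ($start, $end) = split_at_first_digit("Butha-Buthe (Butha-Buthe District) 2 1");
print "start: '$start', end: '$end'\n";

# index() only looks for a literal substring and returns -1 if it is absent.
print "literal ' 2' found at offset ", index("Maputsoe 2 1", " 2"), "\n";
```

Note that this splits at the first digit anywhere in the line, so it assumes the name itself contains no digits.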
Although it could be done much more simply like this:
my $line = "Butha-Buthe (Butha-Buthe District) 2 1";
if ($line =~ /(.+?)(\d.+)/) { # non-greedy, so the split is at the first digit even for multi-digit values
    my $start = $1;
    my $end = $2;
    print "Start: '$start', end: '$end'\n";
}
Of course you could simplify that down to
my ($start, $end) = ($line =~ /(.+?)(\d.+)/);
But my suspicion is that your data is actually tab-separated, because it kinda looks like it. Stack Overflow's formatting will not keep literal tabs in the code, so the tabs are written as \t escapes here. Then it would look like this:
use Data::Dumper;
my $line = "Butha-Buthe (Butha-Buthe District)\t2\t1"; # <- \t is a tab
my @fields = split /\t/, $line;
print Dumper \@fields;
And print this:
$VAR1 = [
          'Butha-Buthe (Butha-Buthe District)',
          '2',
          '1'
        ];
You can try using this split statement with your original code and see how that works.
... \t+ and not whitespace \s+. That would solve all your problems. – TLP, commented 14 hours ago
\t, not \t+ – ikegami, commented 4 hours ago
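The difference between splitting on \t and on \t+ only shows up when a field is empty. A small sketch, assuming tab-separated input; the sample record with an empty middle field is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical record with an empty middle field (two tabs in a row).
my $line = "Maputsoe\t\t1\n";
chomp $line;    # drop the newline so it does not stick to the last field

my @single = split /\t/,  $line;    # every tab is a separator
my @multi  = split /\t+/, $line;    # a run of tabs counts as one separator

print scalar(@single), " fields with /\\t/\n";
print scalar(@multi),  " fields with /\\t+/\n";
```

With /\t/ the empty field is kept as an empty string, so the columns stay aligned. With /\t+/ it is dropped and the later values shift left.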