php - Check if a string starts with a substring from an array of substrings and return that substring - Stack Overflow

I have an array with all German telephone area codes that is 5266 items long and looks like this:

$area_codes = array(
    '015019',
    '015020',
    '01511',
    '01512',
    '01514',
    '01515',
    '01516',
    '01517',
    '015180',
    '015181',
    '015182',
    '015183',
    '015184',
    '015185',
    'and so on'
);

I also have strings of telephone numbers that look like this:

015169999999

That is, the telephone number strings consist only of digits, without spaces, dashes, parentheses or other characters that are often used to visually structure telephone numbers.

I want to split the area code from the subscriber number and assign each to a separate variable. For the example number above, the expected result would be:

echo $area_code;
01516
echo $subscriber_number;
9999999

To identify the area code within the string, I am looping through the array of area codes:

foreach ($area_codes as $area_code) {
    if (str_starts_with($telephone_number, $area_code) == TRUE) {
        echo $area_code;
        $subscriber_number = str_replace($area_code, "", $telephone_number);
        echo $subscriber_number;
        break;
    }
}

How can I return the area code without looping?


Solutions like this require that I know where to split the string, but the area codes have different lengths.

Asked Mar 22 at 10:43 by Ben. Edited Apr 3 at 1:56 by mickmackusa.
  • I think that if you know the area code is at least x and at most y digits long, you can loop over the range from x to y (both inclusive), take that many leading characters from the telephone number, and search for them in the area_codes list. Without any loop I think it would be more difficult. Also, if area_codes is properly sorted, you can use binary search to make it more efficient. – Vansh Patel Commented Mar 22 at 10:50
  • Usually, phone numbers should have a fixed rule, such as the area code length for numbers starting with 0150 being 6 digits. You should only need to read a few specific digits to know the length of the area code. – shingo Commented Mar 22 at 11:51
  • @Ben I'd still like to see your prepared input array so I can see if it can be distilled into a branched regex pattern within the character limit. Can you pastebin it into your question? Future readers might like such a solution. – mickmackusa Commented Mar 25 at 20:41
  • @mickmackusa I don't currently have the time to properly implement any of the solutions, but here is the list of all German telephone area codes (as of today), so you can work with it, if you want: pastebin/NmkeYu5S Thank you for your help here and with my other questions. I greatly appreciate it. – Ben Commented Mar 26 at 5:55

5 Answers

You can still use the foreach, and then return an array with 2 values if there is a match, so you return early from the loop.

Note that you don't need the equals check here because str_starts_with returns a boolean.

if (str_starts_with($telephone_number, $code)) {

Example

$telephone_number = "015169999999";

foreach ($area_codes as $code) {
    if (str_starts_with($telephone_number, $code)) {
        $result = [$code, substr($telephone_number, strlen($code))];
        break;
    }
}

var_export($result ?? null);

Output

array (
  0 => '01516',
  1 => '9999999',
)

See a PHP demo.

I don't know whether a single phone number can match multiple area codes, but if you want the longest match to win, sort your array beforehand so that the longest strings come first.
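
The longest-first ordering can be sketched roughly as follows. This is only an illustration: it uses two sample codes from the question plus '0151', a hypothetical overlapping prefix that is not taken from the real list and is added only to show why the order matters.

```php
<?php
// '01516' and '015180' come from the question; '0151' is a HYPOTHETICAL
// overlapping prefix, added only to show why ordering matters.
$area_codes = ['0151', '01516', '015180'];

// Sort by length, longest first, so the most specific prefix is tried first.
usort($area_codes, fn(string $a, string $b): int => strlen($b) <=> strlen($a));

$telephone_number = '015169999999';
$result = null;
foreach ($area_codes as $code) {
    if (str_starts_with($telephone_number, $code)) {
        $result = [$code, substr($telephone_number, strlen($code))];
        break; // first hit is the longest matching code
    }
}

var_export($result);
```

Without the sort, the loop would stop at '0151' and report a subscriber number that still contains part of the area code.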

If you want a fast lookup, you need an associative array with the area codes as the keys. For example (assuming area codes have between 4 and 6 digits):

$area_codes = array(
    '015019',
    '015020',
    '01511',
    '01512',
    '01514',
    '01515',
    '01516',
    '01517',
    '015180',
    '015181',
    '015182',
    '015183',
    '015184',
    '015185',
);

// Build an associative array
$area_lookup = [];
foreach($area_codes as $code)
    $area_lookup[$code] = 1;

function splitNumber(array $area_lookup, string $telephone_number): array
{
    for($len=4;$len<=6;$len++)
    {
        $prefix = substr($telephone_number, 0, $len);
        if(isset($area_lookup[$prefix]))
            return ['area_code' => $prefix, 'subscriber_number' => substr($telephone_number, $len)];
    }
    throw new Exception('Invalid number');
}

$res1 = splitNumber($area_lookup, '015169999999');
$res2 = splitNumber($area_lookup, '015180999999');
var_dump($res1, $res2);

Output:

array(2) {
  ["area_code"]=>
  string(5) "01516"
  ["subscriber_number"]=>
  string(7) "9999999"
}
array(2) {
  ["area_code"]=>
  string(6) "015180"
  ["subscriber_number"]=>
  string(6) "999999"
}

(Demo)
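
If the minimum and maximum lengths are not known up front, they could be derived from the list itself. The following is a minimal sketch under that assumption; splitByKnownLengths() is a hypothetical helper name, and array_flip() is used here as a shortcut for building the key-based lookup.

```php
<?php
// Shortened sample list from the question.
$area_codes = ['015019', '015020', '01511', '01516', '015180'];

// Area codes become array keys, so isset() can be used for the lookup.
$area_lookup = array_flip($area_codes);

// Distinct code lengths that actually occur, longest first.
$lengths = array_unique(array_map('strlen', $area_codes));
rsort($lengths);

// Hypothetical helper: probes only the lengths present in the list.
function splitByKnownLengths(array $lookup, array $lengths, string $number): ?array
{
    foreach ($lengths as $len) {
        $prefix = substr($number, 0, $len);
        if (isset($lookup[$prefix])) {
            return ['area_code' => $prefix, 'subscriber_number' => substr($number, $len)];
        }
    }
    return null; // no known area code at the start of the number
}

var_export(splitByKnownLengths($area_lookup, $lengths, '015180999999'));
```

Probing the longest length first means a more specific code wins if a shorter code is ever a prefix of a longer one.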

Here is a regex based approach. We can form an alternation of all area codes, then build a regex. Note that more specific (i.e. longer) area codes come first in the alternation.

$area_codes = array(
    '015019',
    '015020',
    '015180',
    '015181',
    '015182',
    '015183',
    '015184',
    '015185',
    '01511',
    '01512',
    '01514',
    '01515',
    '01516',
    '01517'
);
$regex = "/^(".implode("|", $area_codes).")(\d+)$/";
$input = "015169999999";
preg_match($regex, $input, $matches);
echo "area code: " . $matches[1] . "\n";
echo "number: " . $matches[2];

This prints:

area code: 01516
number: 9999999

Edit:

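As a workaround in case the resulting alternation exceeds PCRE's compiled pattern size limit, we can fall back to checking one prefix at a time (this assumes $area_codes is still ordered longest first, and stops at the first match):

$input = "015169999999";
foreach ($area_codes as $area_code) {
    // preg_quote() guards against any regex metacharacters in the list
    $regex = "/^(" . preg_quote($area_code, "/") . ")(\d+)$/";
    if (preg_match($regex, $input, $matches)) {
        echo "area code: " . $matches[1] . "\n";
        echo "number: " . $matches[2];
        break; // stop at the first (most specific) match
    }
}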

My initial advice to use a regex pattern with preg_split() revealed itself as an inappropriate suggestion because:

  1. simply imploding the 5200-element array exceeded PCRE's character limit, and
  2. Germany's valid phone number list is "open" and therefore prone to expansion over time -- even a working, compacted regex would be difficult to maintain in the future. (Here's a broken regex that I found.)

Instead, I endorse a script similar to Olivier's answer because looking up an array key with isset() is a hash-table lookup, which generally outperforms scanning array values in PHP.

From your existing array, you can populate a segmented lookup array which is structured to efficiently facilitate your task.

var_export(
    array_reduce(
        $area_codes,
        function($result, $ac) {
            $result[strlen($ac)][$ac] = '';
            return $result;
        },
        []
    )
);

The printed result can simply be copied to a config file in your application. Once this is established, manual maintenance will be intuitive and simple. This structure also allows you to store the carrier name or region affiliated with each area code -- in case you want that later.

define('GERMAN_AREA_CODE_LOOKUP', [
    5 => [
        '01511' => 'T-Mobile',
        '01512' => 'T-Mobile',
        '01514' => 'T-Mobile',
        '01515' => 'T-Mobile',
        '01516' => 'T-Mobile',
        '01517' => 'T-Mobile',
    ],
    6 => [
        '015019' => 'T-Mobile',
        '015020' => 'T-Mobile',
        '015180' => 'T-Mobile',
        '015181' => 'T-Mobile',
        '015182' => 'T-Mobile',
        '015183' => 'T-Mobile',
        '015184' => 'T-Mobile',
        '015185' => 'T-Mobile',
    ]
]);
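
The copy step could also be scripted instead of done by hand. Below is a rough sketch, assuming a writable temp directory; the file name german_area_codes.php is hypothetical, and the krsort() call is an optional extra (not part of the answer's code) that puts the longest codes first.

```php
<?php
// Shortened sample list from the question.
$area_codes = ['015019', '01511', '01516', '015180'];

// Same segmentation as the array_reduce() call: length => [code => ''].
$segmented = [];
foreach ($area_codes as $ac) {
    $segmented[strlen($ac)][$ac] = '';
}
krsort($segmented); // optional: longest codes first

// HYPOTHETICAL file location; adjust to your application's config directory.
$configFile = sys_get_temp_dir() . '/german_area_codes.php';

// var_export(..., true) returns valid PHP source for the array.
file_put_contents($configFile, '<?php return ' . var_export($segmented, true) . ';' . PHP_EOL);

// The application can later load the array back with require.
$loaded = require $configFile;
var_export($loaded === $segmented);
```

Because the area codes start with a zero, PHP keeps them as string keys, so they survive the round trip unchanged.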

Now a small helper function can iterate as many times as needed to parse a phone number into its two halves:

function parseGermanPhoneNumber(string $phone): array
{
    foreach (GERMAN_AREA_CODE_LOOKUP as $length => $set) {
        if (
            sscanf($phone, "%{$length}s%s", $area_code, $subscriber_number)
            && isset($set[$area_code])
        ) {
            return compact('area_code', 'subscriber_number');
        }
    }
    throw new Exception("Phone number's area code not found in whitelist");
}

Tests: Demo

$tests = ['015169999999', '015180999999'];
var_export(
    array_map(
        parseGermanPhoneNumber(...),
        $tests
    )
);

A single regular expression can be used to efficiently validate the area code from a German phone number. This requires collapsing the valid values into condensed character classes and alternations so that the maximum character limit is not exceeded.

Voila -- a regex pattern that doesn't exceed PHP's regex character limit
(more than 5260 numeric strings reduced to a 5530-character pattern):

#^(0(3(0|3([15]|0([1-467]|5[1-6]|8[02-9]|9[34])|2([127-9]|0\d|3[0-57-9])|3([124578]|3[1-8]|6[1-9]|9[3-8])|4([1246]|3[2-9]|5[1246-8]|7[02-9])|6([1246]|[07][1-9]|3[1-8]|5[2-7])|7([1257-9]|0[1-48]|3[1-4]|4[1-8]|6[02-9])|8([1256]|3\d|4[13-9]|7[02-8])|9([145]|[27]\d|3[1-3]|6[2-9]|8[1-469]))|4([015]|2([135]|0[2-8]|[24][1-4]|6[1-3]|9[1-9])|3([1357]|2[124578]|4[1-8]|6[1-4]|8[1-6])|4([13578]|2[2-6]|4[13-6]|6[1-7]|9[1-8])|6([1246]|0[0-79]|3[235-9]|5[1-4689]|7[1-3]|9[12])|7([1356]|2[12]|4[1-356]|7[1-69]|8[1-35])|9([1346]|0[13-79]|2\d|5[3-6]|7[35-9]))|5([15]|0([14]|2[0-8]|3[23]|5[2-8])|2([1-3589]|[04]\d|6[3-8])|3([1357]|2[2-79]|4[1-3]|6[1-5]|8[3-9])|4([1246]|3[3-69]|5[1-6]|7[1-8])|6([1-4]|0\d|9[1-8])|7([13468]|2[2-8]|5[1-6]|7[1-5]|9[235-7])|8([13568]|2[0235-9]|4[1-4]|7[2-7]|9[1-5])|9([1246]|3\d|5[1-5]|7[13-5]))|6([15]|0([1356]|2\d|4[1-3]|7[124-7]|8[1-57])|2([1-489]|0\d|5[2-9])|3([124-6]|3[0-8]|7\d)|4([1347]|2[1-8]|5[0-489]|6[1-5]|8[1-4])|6([13]|0[1-8]|2[1-68]|4[02-9]|5[1-3]|9[1-5])|7([12579]|[08][1-5]|3\d|4[1-4]|6[1246])|8([1-356]|4\d|7[013-58])|9([135]|2\d|4[013-9]|6[1-9]))|7([15]|2([1-7]|0[02-46-9]|9[1-8])|3([1357]|[26]\d|4[1-46-9]|8[1-4])|4([145]|2[1-3]|3\d|6[2-578])|6([1-5]|0\d)|7([1-4]|5[24-7]))|8([15]|2(1|0[1-9]|2\d|3[1-4]|9[2-7])|3([1468]|[07]\d|2[0-8]|3[1-4]|5[1-6]|9[1-3])|4([1347]|2[2-9]|5\d|6[1246]|8[1-68])|6[0135-9]|7([1467]|[25]\d|3[1-35-8]|8[0-57-9]|9[1-467])|8([136]|2[1-8]|4[1-578]|5[0-689]|7[1-6]))|9([15]|0([12479]|[0358]\d|6[12])|2([1358]|0\d|2[1-6]|[49][1-8]|6[2-8])|3([1357]|2[0-57-9]|4[1-9]|6[1-6]|8[2-46-9]|9\d)|4([134679]|0\d|2[1-8]|5[1-9]|8[1-57-9])|6([1-9]|0[0-8])|7([136]|2[1-46-8]|4\d|5[1-4]|7[1-9])|8([147]|2\d|[36][1-3]|[58][1-9])|9([1468]|[29][1-9]|3[1-4]|5[1-79]|7[1-35-8])))|4(0|2(1|0[2-9]|2[1-4]|[346]\d|5[1-8]|7[1-7]|8[1-9]|9[2-8])|3(1|0[23578]|2[0-46-9]|3\d|4[02-46-9]|5[1-8]|6[1-7]|7[12]|8[1-5]|9[2-4])|4(1|[069][1-9]|2[1-356]|3[1-5]|4[1-7]|5[1-68]|7[1-57-9]|8\d)|5(1|0[1-689]|2[1-9]|3[1-79]|4[1-7]|5\d|6[1-4])|6(1|0[2-9]|2[1-7]|3\d|4[1-46]|51|6[1-8]|[78
][1-4])|7(1|0[2-8]|2[1-5]|3[1-7]|[47]\d|5[1-8]|6[1-9]|9[1-6])|8(1|0[2-6]|[245][1-9]|3[02-9]|[68][1-5]|7[1-7]|9[23])|9(1|0[23]|[25]\d|3[1-689]|[46][1-8]|7[1-7])|1([0367][1-9]|[28]\d|4[0-4689]|5[1-689]|9[1-5]))|6(9|1(1|0[1-9]|2[02-46-9]|3[0-689]|4[24-7]|5[0-2457-9]|6[1-7]|7[1-5]|8[1-8]|9[02568])|2(1|[04][1-79]|2[0-46-9]|[36][1-9]|[59][1-8]|7[124-6]|8[1-7])|3(1|[09][1-8]|[23][1-9]|4\d|5[1-35-9]|6[1-4]|7[1-5]|8[1-7])|4(1|[02]\d|3[0-689]|4[0-79]|5[1-8]|6[124-8]|7[1-9]|8[2-6])|5(1|[058]\d|2[2-7]|3[1-6]|4[1-5]|6[1-9]|7[1-58]|9[1-79])|6(1|[25]\d|[36][013-9]|[49][1-8]|7[02-8]|8[1-4])|7(1|0[1346-9]|[25][1-8]|[34][1-7]|[67][1-6]|8[1-9])|8(1|0[2-69]|2[14-7]|[35][1-8]|4[1-489]|6[14-9]|7[1-6]|8[178]|9[3478])|0(0[2-478]|2[0-46-9]|3[1-69]|4[1-9]|5\d|6[1-368]|7[1348]|8[1-7]|9[2-6]))|8(9|1(1|0[24-6]|2[1-4]|3[13-9]|[49][1-6]|5[1-378]|6[15-8]|7[016-9])|2(1|0[2-8]|[29][1-6]|3[0-46-9]|4[135-9]|5[0-47-9]|6[1-35-9]|7[1-46]|8[1-5])|3(1|0[2-46]|[23][0-8]|[48]\d|6[1-9]|7[02-9]|9[2-5])|4(1|0[2-7]|2[1-467]|3[1-5]|4[1-6]|5[02-46-9]|6\d)|5(1|0[1-79]|3[1-8]|4[1-9]|5[0-8]|6[1-5]|7[1-4]|8[1-6]|9[1-3])|6(1|2[1-489]|3[013-9]|4[0-29]|5[0-2467]|6[1-79]|7[017-9]|8[1-7])|7(1|0[2-9]|2[1-8]|[348][1-5]|5[1-46]|6[124-6]|7[1-4])|8(1|0[1-35-9]|2[1-5]|4[15-7]|5[16-8]|6[0-27-9])|0(2\d|3[1-689]|4[1-356]|[56][1-7]|[78][1-6]|9[1-5]))|1(6[023]|7\d|5(1([124-7]|8[0-6])|2[0-3569]|7([357-9]|0[0-46])|90|0(19|20)|310|5(1[01]|6\d)|6(30|7[89])|888))|2(0([1-389]|4[135]|5[1-468]|6[4-6])|1([14]|2(9)|0[2-4]|3[1-37]|5[0-46-9]|6[1-6]|7[13-5]|8[1-3]|9[1-356])|2([18]|[023][2-8]|4[1-8]|[59][1-7]|6[1-9]|7[1-5])|3([14]|0[1-9]|2[3-57]|[36]\d|[578][1-57-9]|9[1-5])|4(1|[02][1-9]|[35][1-6]|4[013-9]|6[1-5]|7[1-4]|8[24-6])|5(1|0[124-9]|[29]\d|3[2-68]|4[1-35-8]|[568][1-8]|7[1-5])|6(1|[07][1-8]|2[0-8]|[38]\d|[459][1-7]|6[1-467])|7(1|2[1-5]|3[2-9]|4[1-57]|5[0-589]|6[1-4]|7\d)|8(1|[07][1-4]|2[1-8]|3[1-9]|4[1-5]|5[0-35-9]|6[1-7])|9(1|0[2-5]|[2-5][1-578]|[69][1-4]|7[1-57]|8[1-5]))|5(1(1|0[1-3589]|2[136-9]|3[0-25-9]|[45][1-9]|6[1-8]|[78][1-7]|9\d)|
2(1|0[1-9]|2[1-68]|[37][1-8]|4[124-8]|5[0-57-9]|[68][1-6]|9[2-5])|3(1|[02]\d|3[1-79]|4[14-7]|[56][1-8]|7[1-9]|8[1-4])|4(1|0[1-79]|[235][1-9]|4[1-8]|6[124-8]|7[1-6]|[89][1-5])|5(1|0[2-9]|2[0-57-9]|[3-5][1-6]|6[1-5]|7[1-4]|8[2-6]|9[2-4])|6(1|0[1-9]|[2389][1-6]|4[1-8]|5\d|6[1-5]|7[1-7])|7(1|0[2-7]|[24][1-6]|3[1-4]|5[1-5]|6[13-9]|7[1-7])|8(1|0[2-8]|2\d|3[1-9]|4[0-689]|5[0-57-9]|6[1-5]|7[2-5]|8[23])|9(1|0[1-9]|[26][1-6]|3[1-79]|4[1-8]|5[1-7]|7[135-8])|0(2[1-8]|3[1-7]|4[1-5]|5[1-6]|6[02-9]|7[1-4]|8[2-6]))|7(1(1|2[1-9]|3[0-689]|4[1-8]|5[0-46-9]|[67][1-6]|8[1-4]|9[1-5])|2(1|0[2-4]|[256]\d|[37][1-7]|4[02-9])|3(1|0[02-9]|[28][1-9]|[36][1-7]|4[03-8]|5[1-8]|7[13-6]|9[1-5])|4(1|0[2-4]|2[02-9]|3[1-6]|4\d|5[1-9]|6[1-7]|7[1-8]|8[2-6])|5(1|0[2-6]|2[02457-9]|3[1-4]|4[1-6]|5[1-8]|6[1-9]|7\d|8[1-7])|6(1|02|[26]\d|[347][1-6]|5[1-7]|8[1-5])|7(1|0[2-9]|2\d|3[1-689]|4[1-8]|5[13-5]|6[1-5]|7[13-57])|8(1|0[2-8]|2[1-6]|3[1-9]|[45][1-4])|9(1|0[3-7]|[34]\d|5[0-57-9]|[67][1-7])|0([245][1-6]|3[1-4]|6[236]|7[1-3]|8[1-5]))|9(0(6|7[0-8]|8\d|9[0-479])|1(1|[06][1-7]|2[0236-9]|3[1-5]|4[1-9]|5[1-8]|[7-9]\d)|2(1|0[1-9]|2[0-357-9]|3[1-68]|4[1-6]|5[1-7]|[6-8]\d|9[2-5])|3(1|0[235-7]|2[13-6]|3[1-9]|[45]\d|6[03-79]|[79][1-8]|8[1-6])|4(1|[06][1-9]|2[0-46-9]|3[13-689]|4[1-8]|[57][1-4]|8[0-24]|9[1-357-9])|5(1|0[2-5]|2[1-9]|[357][1-6]|4[2-9]|6\d)|6(1|0[2-8]|2[124-8]|[35][1-9]|4[1-8]|6[1-6]|7[1-7]|8[1-3])|7(1|0[148]|2\d|3[2-8]|4[124-9]|6[1-6]|7[1-9])|8(1|0[2-5]|2[02-9]|[35][1-7]|4[1-8]|6[157-9]|7[1-6])|9(1|0[13-8]|2\d|3[1-35-8]|[47][1-8]|[56][1-6]))))\K#

If you are splitting (not validating), the pattern should end with \K to divide the two parts of the input string:

[$area_code, $subscriber_number] = preg_split($regex, $germanPhoneNumber);

If you are validating, end the regex with \K\d+$ instead: the area code will be in capture group 1, and the digits after the area code will be the fullstring match.

preg_match($regex, $germanPhoneNumber, $match);
$area_code = $match[1];
$subscriber_number = $match[0];
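
To see both variants side by side without the full pattern, here is a small sketch. The pattern below is hand-written for this illustration and covers only the sample codes from the question, so it is an assumption-laden stand-in for the generated regex, not a replacement for it.

```php
<?php
// Hand-written stand-in covering only the question's sample codes:
// 01511, 01512, 01514-01517, 015180-015185, 015019, 015020.
$prefix = '^(015(?:1(?:[124-7]|8[0-5])|019|020))';

$germanPhoneNumber = '015180999999';

// Splitting: \K resets the match start, so the match is zero-length and
// sits right after the area code; preg_split() cuts the string there.
$parts = preg_split('#' . $prefix . '\K#', $germanPhoneNumber);

// Validating: with \K\d+$ the fullstring match holds only the digits
// after the area code, while capture group 1 still holds the area code.
$ok = preg_match('#' . $prefix . '\K\d+$#', $germanPhoneNumber, $match);

var_export([$parts, $ok, $match]);
```

When the number does not start with a listed area code, preg_split() returns the whole input as a single element, so the destructuring assignment should be guarded by a count check.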

With this regex pattern, you can implement server-side AND client-side validation/parsing with a single, consistent, and portable source of truth.


The hard part is the flawless production and maintenance of that sufficiently compact yet accurate regex. To invest in read-time performance, I recommend building tools to remove the human error and write-time tedium of the regex maintenance.

Every time your array is updated, generate an updated regex. How much time and energy you put into automating this in your project will depend on your project design, expected frequency of area code updates, and your willingness to manually manage the regex generation. ...I mean, it can be fully automated whenever the mod-time of the area codes file is changed.
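
One possible shape for that mod-time check is sketched below. Everything in it is an assumption for illustration: loadRegexFromCache() is a hypothetical helper, the area codes are assumed to live one per line in a text file, and the naive builder merely stands in for a real regex generator.

```php
<?php
// Hypothetical helper: rebuilds the cached regex only when the area codes
// file is newer than the cache file (or the cache does not exist yet).
function loadRegexFromCache(string $sourceFile, string $cacheFile, callable $buildRegex): string
{
    clearstatcache(); // filemtime() results are cached within a request
    $isStale = !is_file($cacheFile)
        || filemtime($sourceFile) > filemtime($cacheFile);

    if ($isStale) {
        // Assumes one area code per line.
        $codes = file($sourceFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        file_put_contents($cacheFile, $buildRegex($codes), LOCK_EX);
    }
    return file_get_contents($cacheFile);
}

// Usage with throwaway temp files and a naive stand-in builder.
$source = tempnam(sys_get_temp_dir(), 'codes');
$cache = $source . '.regex';
file_put_contents($source, "015180\n01516\n");
$naiveBuilder = fn(array $codes): string => '#^(' . implode('|', $codes) . ')\K#';

echo loadRegexFromCache($source, $cache, $naiveBuilder), "\n";
```

The builder is passed in as a callable so that the caching logic stays independent of how the regex is actually generated.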

After some obsessive-compulsive toil, here is the PHP code that I used to produce the aforementioned regex: Demo

function subsetToSubPattern(array $digits): string {
    $count = count($digits);
    if ($count === 10) {
        return '\d';
    }
    if ($count < 2) {
        return implode($digits);
    }

    $result = [];
    $last = null;
    foreach ($digits as $d) {
        if ($last !== $d - 1) {
            unset($ref);
            $result[] =& $ref;
        }
        $ref[] = $d;
        $last = $d;
    }

    return sprintf(
        '[%s]',
        implode(
            array_map(
                fn($set) => count($set) > 2 ? "{$set[0]}-" . end($set) : implode($set),
                $result
            )
        )
    );
}

function buildTrie(array $strings): array {
    $trie = [];
    foreach ($strings as $string) {
        $node =& $trie;
        for ($i = 0, $len = strlen($string); $i < $len; ++$i) {
            $char = $string[$i];
            $node[$char] ??= [];
            $node =& $node[$char];
        }
        $node['$'] = true;
    }
    return $trie;
}

function trieToRegex(array $trie): string {
    $isTerminal = isset($trie['$']);
    unset($trie['$']);

    $groups = [];
    foreach ($trie as $char => $subTrie) {
        $groups[trieToRegex($subTrie)][] = $char;
    }

    $alternatives = [];
    foreach ($groups as $subPattern => $chars) {
        sort($chars);
        $prefix = ctype_digit(implode($chars)) ? subsetToSubPattern($chars) : implode('|', $chars);
        $alternatives[] = "$prefix$subPattern";
    }

    $regex = count($alternatives) > 1 ? '(' . implode('|', $alternatives) . ')' : ($alternatives[0] ?? '');

    return $isTerminal && $regex !== '' ? "($regex)" : $regex;
}

sort($area_codes);
$regex = trieToRegex(buildTrie($area_codes));
$regex = ctype_digit($regex) ? $regex : "($regex)";
$regex = "#^$regex\K#";  // adjust the wrappings & modifiers however you require

echo "$regex\n";

Note that a "trie" is an ideal structure for distilling the shared leading characters for the numbers in the input array. The trie stores strings by their prefixes. Shared prefixes are represented by a single path in the tree, which saves space and allows for fast lookups. These generated paths are then distilled into compact regex subpatterns using alternations, parenthetical grouping, and character classes (potentially using hyphenated numeric ranges).
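
As a side note on the fast lookups that a trie allows, the same structure could be walked directly to find the longest matching area code, without generating a regex at all. This is a minimal sketch and not part of the generator; sketchBuildTrie() and longestAreaCode() are hypothetical names, and the input is assumed to contain digits only.

```php
<?php
// Same trie shape as buildTrie(): nested arrays keyed by character,
// with a '$' key marking the end of a complete area code.
function sketchBuildTrie(array $strings): array
{
    $trie = [];
    foreach ($strings as $string) {
        $node =& $trie;
        foreach (str_split($string) as $char) {
            $node[$char] ??= [];
            $node =& $node[$char];
        }
        $node['$'] = true;
        unset($node); // drop the reference before the next string
    }
    return $trie;
}

// Walks the trie one character at a time and remembers the longest
// prefix that ended on a terminal node.
function longestAreaCode(array $trie, string $number): ?string
{
    $node = $trie;
    $best = null;
    for ($i = 0, $len = strlen($number); $i < $len; ++$i) {
        $char = $number[$i];
        if (!isset($node[$char])) {
            break; // no listed area code continues with this digit
        }
        $node = $node[$char];
        if (isset($node['$'])) {
            $best = substr($number, 0, $i + 1);
        }
    }
    return $best;
}

$trie = sketchBuildTrie(['01516', '015180', '015019']);
var_export(longestAreaCode($trie, '015180999999'));
```

The walk touches at most as many nodes as the longest area code has digits, regardless of how many codes are in the list.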

The above regex generator should be suitable for any flat array of single-byte strings of digits.
