
python - regex for ip-address, domain and url - Stack Overflow


Problem statement:

I am trying to write regexes for IP addresses, domains and URLs. These are my definitions:

IP Address:

93.114.205.169

Domain:

example.com
sub.example.com

Url:

93.114.205.169/path
example.com/path
sub.example.com/path

So a URL always has a path to a resource. An IP address or a domain must not have a path to a resource; otherwise it would be a URL. Also note that the IP address, domain and URL can optionally start with http or https, with or without www.


My attempt:

I have tried various regexes:

[[rules]]
id = "ip-address"
description = "Potential IP Address detected."
regex = '''\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'''
entropy = 2
keywords = ["ip"]


[[rules]]
id = "domain"
description = "Potential domain name detected."
regex = '''\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'''
entropy = 2
keywords = ["domain"]

[[rules]]
id = "url"
description = "Potential URL detected."
regex = '''\b(?:https?|ftp):\/\/(?:[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|\d{1,3}(?:\.\d{1,3}){3})(?::\d{1,5})?(?:\/[^\s\"<>]*)?\b'''
entropy = 2
keywords = ["http", "https", "ftp", "url"]

But these regexes classify an IP address as a URL. For example, http://93.114.205.169 is reported as a URL, whereas according to my definitions above it should be reported only as an IP address.

I then changed to these regexes:

[[rules]]
id = "ip-address"
description = "Potential IP Address detected."
regex = '''\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'''
entropy = 2
keywords = ["ip"]

[[rules]]
id = "domain"
description = "Potential domain name detected."
regex = '''\b[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'''
entropy = 2
keywords = ["domain"]

[[rules]]
id = "url"
description = "Potential URL detected."
regex = '''\b(?:https?|ftp):\/\/(?:[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}|\d{1,3}(?:\.\d{1,3}){3})(?::\d{1,5})?(?:\/[^\s\"<>]*)?\b'''
entropy = 2
keywords = ["http", "https", "ftp", "url"]

This has the same problem: http://93.114.205.169 is recognized as a URL, but it is now also recognized as an IP address. In other words, the ip-address rule does identify IP addresses, but it also extracts them from URLs, so http://93.114.205.169/path is reported as a URL and 93.114.205.169 as an IP address.


Could you suggest correct regexes for these definitions:

IP Address:

93.114.205.169

Domain:

example.com
sub.example.com

Url:

93.114.205.169/path
example.com/path
sub.example.com/path

The IP address, domain and URL can optionally start with http or https, with or without www.

asked Feb 16 at 15:45 by hululu
  • In case your input text is tokenized and "words" are space-separated, simply use (?<!\S) at the start and (?!\S) at the end of your patterns. – Wiktor Stribiżew Commented Feb 16 at 16:19
  • The \b means a boundary between a word and a non-word character. Numbers and the underscore are word characters as of course are the a-z (upper and lower) characters. Therefore the / character is a non-word character, thus the regex selecting an IP address cannot use the \b. Either use a negative lookahead and lookbehind to check that the characters around the IP address are NOT the / character, or a positive lookahead (and lookbehind) to look for a space or start/end of line. – Terry R Commented Feb 16 at 19:30
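The two comments above can be illustrated with a small sketch. It contrasts the \b-delimited IP pattern from the question with a whitespace-delimited variant; the optional http(s):// and www. prefix in the second pattern is an assumption taken from the question's wording, and the sample text is made up for illustration.

```python
import re

# Hypothetical sample: a bare IP address followed by a URL that contains the same IP.
text = "93.114.205.169 http://93.114.205.169/path"

# \b version from the question: '/' and ':' are non-word characters,
# so \b also succeeds around the IP that sits inside the URL.
loose_ip = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")

# Whitespace-delimited version: the match must start after whitespace (or at the
# start of the text) and must be followed by whitespace (or the end of the text).
strict_ip = re.compile(
    r"(?<!\S)(?:https?://)?(?:www\.)?(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?!\S)"
)

loose_hits = loose_ip.findall(text)
strict_hits = strict_ip.findall(text)

print(loose_hits)   # ['93.114.205.169', '93.114.205.169']
print(strict_hits)  # ['93.114.205.169']
```

This only works when entities are separated by whitespace; if an IP address can be followed by punctuation such as a comma, the trailing assertion would need to be relaxed.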

3 Answers

Try this code:

import re

reg_ip = re.compile(r'\.\d+\/?$')
reg_url = re.compile(r'\b\/\w*$')
reg_domain = re.compile(r'^((ht|f)tps?:\/\/)?(?=[a-zA-Z].*)[\w\.\-]+$')

checking = [
    '93.114.205.169',
    'example.com',
    'sub.example.com',
    '93.114.205.169/path',
    'example.com/path',
    'sub.example.com/path',
]

print('            checking\t\tIP      URL     DOMAIN')
for c in checking:
    print(f'%20s\t\t{bool(reg_ip.search(c))}\t{bool(reg_url.search(c))}\t{bool(reg_domain.search(c))}' % c)

and it will show:

            checking        IP      URL     DOMAIN
      93.114.205.169        True    False   False
         example.com        False   False   True
     sub.example.com        False   False   True
 93.114.205.169/path        False   True    False
    example.com/path        False   True    False
sub.example.com/path        False   True    False

Explanation:

  • reg_ip = re.compile(r'\.\d+\/?$') means an IP address must end with a dot followed by digits, optionally followed by a trailing /, as in 192.168.0.1 or 192.168.0.1/.
  • reg_url = re.compile(r'\b\/\w*$') means a URL must end with a single /, or with a single / followed by a resource path.
  • reg_domain = re.compile(r'^((ht|f)tps?:\/\/)?(?=[a-zA-Z].*)[\w\.\-]+$') means a domain must start with a letter (upper or lower case) after the optional protocol part, and otherwise consist only of letters, digits, _, ., and -.

Here are three patterns that will allow you to capture valid IP addresses, valid domains, and valid URLs anywhere in the text.

See the demos below each pattern. Please let me know if there are any issues or any that are not correctly captured, I would be curious to know and resolve if I can. The code samples are Python, the regex flavor is Python and works with the re module.

IP ADDRESS:
The IP address pattern is x.x.x.x where x is a number between 0 and 255. The pattern below matches that IP address requirement:

ip_address_pattern = r"(?:(?<=^)|(?<=\s))((?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])){3})(?=\s|$)"

IP address DEMO: https://regex101.com/r/hDYTV3/2
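As a quick sanity check of the "number between 0 and 255" requirement, here is a minimal sketch that tests a single-octet alternation against every integer from 0 to 999. The names octet and accepted are introduced only for this illustration; the alternation splits the 200 to 255 range into 25[0-5] and 2[0-4][0-9] so that values such as 209 are covered.

```python
import re

# One octet: 250-255, 200-249, 100-199, 10-99, 0-9 (no leading zeros).
octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"
octet_re = re.compile(octet)

# Collect every number in 0..999 whose decimal form is accepted as a whole.
accepted = [str(n) for n in range(1000) if octet_re.fullmatch(str(n))]

print(accepted[0], accepted[-1], len(accepted))  # 0 255 256
```

An exhaustive check like this is cheap for a single octet and catches gaps in the alternation that are easy to miss by eye.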

DOMAIN:
In the domain pattern are included the subdomain(s), domain and the top level domain (TLD):

domain_pattern = r"(?:(?<=^)|(?<=\s))(?:(?:ht|f)tps?://)?(?!\d+\.)((?:[^\W_]|-)[^\W_]*(?:-[^\W_]*)*(?:\.(?:[^\W_]|-)[^\W_]*(?:-[^\W_]*)*)*\.[^\W_][^\W_]*)(?:\s|$)"

DOMAIN DEMO: https://regex101.com/r/R4PZsf/10

URL:

url_pattern = r"(?:(?<=^)|(?<=\s))(?:(?:https?|ftp)://)?((?:[^\W_]|-)[^\W_]*(?:-[^\W_]*)*(?:\.(?:[^\W_]|-)[^\W_]*(?:-[^\W_]*)*)*\.[^\W_][^\W_]*)(?::\d+)?/(?:[\w~_.:/?#\[\]@!$&'*+,;=()-]*(?:(?:%[0-9a-fA-F][0-9a-fA-F])+[\w~_.:/?#\[\]@!$&'*+,;=()-]*)*)?(?=\s|$)"

# Accepted Characters in the path:
uri_chars = r"[\w~_.:/?#\[\]@!$&'*+,;=()-]"  
# Percentage must be followed by two hexadecimal characters
percent_encoding_pattern = r"%[0-9a-fA-F][0-9a-fA-F]"

URL DEMO: https://regex101.com/r/UhhGZU/4
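The long URL pattern is easier to read when it is assembled from its building blocks. The following is a sketch under stated assumptions, not the answerer's exact pattern: uri_chars and percent_encoding_pattern are quoted from above, while label, host, path and find_url_hosts are hypothetical names introduced here, and the boundary checks are written as (?<!\S) and (?!\S) instead of the original lookarounds.

```python
import re

# Building blocks quoted from the answer above.
uri_chars = r"[\w~_.:/?#\[\]@!$&'*+,;=()-]"
percent_encoding_pattern = r"%[0-9a-fA-F][0-9a-fA-F]"

# Hypothetical helper fragments for this sketch.
label = r"(?:[^\W_]|-)[^\W_]*(?:-[^\W_]*)*"        # one dot-separated host label
host = rf"{label}(?:\.{label})*\.[^\W_][^\W_]*"    # labels plus a final label (TLD or last octet)
path = rf"/(?:{uri_chars}*(?:(?:{percent_encoding_pattern})+{uri_chars}*)*)?"

url_re = re.compile(
    rf"(?<!\S)(?:(?:https?|ftp)://)?({host})(?::\d+)?{path}(?!\S)"
)

def find_url_hosts(text):
    """Return the host part of every whitespace-delimited URL in text."""
    return [m.group(1) for m in url_re.finditer(text)]

sample = "see https://sub.example.com/a%20b?x=1 and example.com and http://example.com/bad%zz"
print(find_url_hosts(sample))  # ['sub.example.com']
```

In the sample, example.com is skipped because it has no path, and the last token is skipped because %zz is not a valid percent-encoding.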

Some Details Are Not Clear

The "code" that you posted is not Python but rather appears to be some sort of configuration file. Without understanding how the input is being processed with this configuration, it is difficult to give you a precise answer. An example will illustrate this:

Based on your English description, it appears that a URL is essentially either an IP address or a domain specification followed by a path, which starts with a '/' character (I will assume that such a forward slash can but need not be followed by alphabetic characters, so that '200.12.119.1/' is a URL). Let's say that we have a regex for detecting IP addresses and that it matches, for example, '250.127.100.2' or 'http://250.127.100.2'. It would be erroneous for it to match an IP address within the string '250.127.100.2/somepath'.

We could create a single regex that was the "or-ing" of separate regular expressions for detecting a URL, a domain and an IP address such as:

ip_regex = r'some regex'
domain_regex = r'some regex'
url_regex = fr'(?:{ip_regex}|{domain_regex})/[a-zA-Z]*'
rex = f'{url_regex}|{ip_regex}|{domain_regex}'

So rex is a final or-ing of 3 subexpressions, with the match for a URL being the first alternative. If we were to use this regular expression with re.finditer, we could iterate over all the matches, and we would only match, for example, an IP address if it were not part of a larger URL match, since we try to match a URL first. But what you posted leaves it very much open to question whether this is even possible. Your actual Python code would need to take the individual regexes in your configuration file and join them together with a '|' between them.

The second and most likely alternative is that the input is being tested by individual regular expressions. So if we are just looking for say IP addresses, our regex for such a match now needs to use a negative lookahead to ensure that the candidate match is not followed by a '/' character.

Suggestions

First, if you really want to validate proper IP addresses, you would want to use something like:

(?x)
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])

There are more concise expressions for validating an IP address, but the above regex is the most readable. It would accept '123.255.12.1' but not '333.255.12.1'. We might even want to reject '123.255.12.1' depending on what precedes it and follows it. For example, the string '123.255.12.1.99' contains a couple of valid IP addresses, i.e. '123.255.12.1' and '255.12.1.99', but I suspect we might not wish to accept either. In this case, we might add some negative assertions:

(?x)
(?<![0-9.])
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
(?![0-9.])

Now we are ensuring that our candidate match is not preceded or followed by a digit or decimal point.
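A minimal sketch of the effect of these assertions on the '123.255.12.1.99' example. The names octet, unguarded, guarded and sample are introduced only for this illustration.

```python
import re

octet = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])"

# Without assertions: happily matches the first four octets of a five-part string.
unguarded = re.compile(rf"(?:{octet}\.){{3}}{octet}")

# With assertions: the match may not touch a neighbouring digit or period.
guarded = re.compile(rf"(?<![0-9.])(?:{octet}\.){{3}}{octet}(?![0-9.])")

sample = "123.255.12.1.99 and 123.255.12.1"

print(unguarded.findall(sample))  # ['123.255.12.1', '123.255.12.1']
print(guarded.findall(sample))    # ['123.255.12.1']
```

The guarded pattern reports only the second, standalone address and rejects every four-octet slice of the five-part string.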

The following program demonstrates the above points. The first 2 calls to re.finditer, where we are matching IP addresses and domains, use regexes that have negative lookahead assertions. These regexes are what you would use if the Python code that uses the configuration file needs to look for just one specific type of entity. The final call to re.finditer uses the "or-ing" of 3 regexes, of which the IP and domain ones do not require the negative lookahead assertion for '/', because an IP or domain is only matched if we can't match the longer URL.

Needless to say, if you need to initialize a configuration file, then where I use f-strings to join together previously defined regex expressions, you would need to do this manually. I would suggest then that you print out the regexes and remove the extraneous whitespace I use with the (?x) flag.

import re

prefix = r'(?:https?://)'

basic_ip_regex = fr'''
    (?:{prefix}|(?<![0-9.]))  # preceded by http:// or not preceded by a digit or period
    (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){{3}}
    (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
    (?![0-9.])  # Not followed by a digit or period
'''

ip_regex = fr'''(?x)
    {basic_ip_regex}
    (?!/)  # Not followed additionally by a /
'''

basic_domain_regex = fr'''
    (?:{prefix}|(?<!\.))  # optionally preceded by http:// or not preceded by a period
    (?:www\.)?(?:[a-zA-Z]+\.)+[a-zA-Z]++  # Match as many alpha characters as possible (possessive ++ needs Python 3.11+)
    (?!\.)  # not followed by a period
'''

domain_regex = fr'''(?x)
    {basic_domain_regex}
    (?!/)  # Not followed additionally by a /
'''

basic_url_regex = fr'''
    (?:
        {basic_ip_regex}
        |
        {basic_domain_regex}
    )
    /[a-zA-Z]*  # / by itself is a path
'''

url_regex = f'(?x){basic_url_regex}'

# If we can use finditer:
rex = fr'''(?x)
    (?P<url>){basic_url_regex}
    |
    (?P<ip>){basic_ip_regex}
    |
    (?P<domain>){basic_domain_regex}
'''

text = """
  123.45.67.89 # IP address
  http://123.45.67.89 # IP address
  https://123.45.67.89 # IP address
  123.45.67.89.99  # Invalid
  323.45.67.89  # Invalid
  booboo.com  # domain
  www.booboo.com  # domain
  http://www.booboo.com  # domain
  https://www.booboo.com  # domain
  123.45.67.89/abc  # URL
  http://23.45.67.89/abc  # URL
  https://23.45.67.89/abc  # URL
  https://booboo.com/abc  # URL
  123.45.67.89/  # URL
"""

# Just look for IP addresses:
for m in re.finditer(ip_regex, text):
    print('IP', m[0])

print('\n************\n')

# Just look for domains
for m in re.finditer(domain_regex, text):
    print('domain', m[0])

print('\n************\n')

# Just look for URLs
for m in re.finditer(url_regex, text):
    print('URL', m[0])

print('\n************\n')

# Look for everything:
for m in re.finditer(rex, text):
    print(m.lastgroup, m[0])

Prints:

IP 123.45.67.89
IP http://123.45.67.89
IP https://123.45.67.89

************

domain booboo.com
domain www.booboo.com
domain http://www.booboo.com
domain https://www.booboo.com

************

URL 123.45.67.89/abc
URL http://23.45.67.89/abc
URL https://23.45.67.89/abc
URL https://booboo.com/abc
URL 123.45.67.89/

************

ip 123.45.67.89
ip http://123.45.67.89
ip https://123.45.67.89
domain booboo.com
domain www.booboo.com
domain http://www.booboo.com
domain https://www.booboo.com
url 123.45.67.89/abc
url http://23.45.67.89/abc
url https://23.45.67.89/abc
url https://booboo.com/abc
url 123.45.67.89/
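For the configuration file, the verbose regexes need to be flattened to a single line, as suggested earlier. The following is a rough sketch of one way to do that, assuming that '#' and whitespace never occur as literal characters in the pattern (true for the patterns in this answer, but not in general). The helper name collapse_verbose and the sample pattern verbose_ip are hypothetical; the sample folds the "not followed by /" check into the final character class.

```python
import re

# Hypothetical verbose pattern, modelled on the IP regex above.
verbose_ip = r'''
    (?<![0-9.])   # not preceded by a digit or period
    (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}
    (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])
    (?![0-9./])   # not followed by a digit, period or slash
'''

def collapse_verbose(pattern):
    """Drop comments and whitespace from a (?x)-style pattern.

    Assumption: '#' and whitespace are never literal parts of the pattern.
    """
    parts = []
    for line in pattern.splitlines():
        code = line.split('#', 1)[0]         # drop the trailing comment, if any
        parts.append(''.join(code.split()))  # drop all remaining whitespace
    return ''.join(parts)

one_line = collapse_verbose(verbose_ip)
print(one_line)

# The flattened pattern should behave like the verbose one.
sample = "123.45.67.89 http://123.45.67.89/abc 123.45.67.89.99"
flat_hits = re.findall(one_line, sample)
verbose_hits = re.findall(verbose_ip, sample, flags=re.VERBOSE)
```

Comparing the flattened pattern against the verbose original on a few sample strings is a cheap way to confirm that nothing was lost before pasting it into the configuration file.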