I am trying to match bible verses that can be any of these formats:
1 John 4:5 - 6
2 john 4:5 - 4:6
3 john 4:5 - 3 John 4:6
John 4:5 - 6
john 4:5 - 4:6
John 4:5 - 1 John 4:6
1john4:6
john 4
john 4-5
1 john 4-5
- Any spaces in the above examples should be ignored when matching.
- Any of the above can appear anywhere in a string of text:
text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text
This is what I have, but it barely works and doesn't match correctly in a long string of text:
\b[a-zA-Z]+(?:\s+\d+)?(?::\d+(?:–\d+)?(?:,\s*\d+(?:–\d+)?)*)?
asked Mar 7, 2014 at 15:55 by user3071933; edited Mar 7, 2014 at 15:57 by Liam
- 3 What does it match, being that it 'barely works'? What doesn't it match? What should it match and should it not match? – George Stocker Commented Mar 7, 2014 at 15:59
- 6 So a regular expression to match an irregular pattern? Good luck! – David Thomas Commented Mar 7, 2014 at 16:00
- Writing something that organises your data is likely the best start; there's no point in letting your application code see the data until it is nice and tidy – Rob Sedgwick Commented Mar 7, 2014 at 16:02
- 2 I was thinking of meeting up with John 4:10-4:15 - what do you think? :-D – Code Jockey Commented Mar 7, 2014 at 16:07
4 Answers
Let's break down your format.
First of all, the main thing I see is that "there can be a dash followed by stuff" so let's split this problem up into two parts: first deal with the start bit, then the optional dash and end bit.
Your first bit is focussed around the name, and there may be a number before it. After it there is a number, which may be followed by a colon then another number. So we have:
(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?
Now for the bit after the dash. It's a number, which may be followed by the name and another number. The whole thing may then be followed by a colon and another number. And remember the whole thing is optional:
(?:\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?
Put the two together and wrap it in a literal with case-insensitivity (the outer group of the second part is non-capturing, so it doesn't shift the group numbers) and you get:
/(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?(?:\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?/i
Which, depending on how devout you are, may be described by any variety of colourful language.
But since when were Regexes pretty?
Anyway, in your result match, you will have:
1. Initial number
2. Name
3. Second number
4. Number after the colon
5. Number after the dash
6. Second name
7. Number after the name
8. Final number after the second colon
Of course, any of these can be empty, except for 2 and 3.
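As a rough sketch of how those eight groups could be read out in JavaScript: the pattern below is the combined one, written with a non-capturing group around the optional dash part so that the capture indexes run exactly from 1 to 8, plus the `g` flag for `matchAll`. The helper `findVerses` and its property names are made up for illustration; they are not part of any library.

```javascript
// Combined pattern. The optional "dash" part is wrapped in a non-capturing
// group (?:...) so the capture indexes are exactly 1..8.
const verseRe =
  /(\d*)\s*([a-z]+)\s*(\d+)(?::(\d+))?(?:\s*-\s*(\d+)(?:\s*([a-z]+)\s*(\d+))?(?::(\d+))?)?/gi;

// Hypothetical helper: the property names are illustrative only.
function findVerses(text) {
  return Array.from(text.matchAll(verseRe), (m) => ({
    raw: m[0].trim(), // the leading \s* can pick up a space before the name
    initialNumber: m[1], // "" when absent, because (\d*) always participates
    name: m[2],
    secondNumber: m[3],
    numberAfterColon: m[4], // undefined when the group did not participate
    numberAfterDash: m[5],
    secondName: m[6],
    numberAfterSecondName: m[7],
    finalNumber: m[8],
  }));
}

const sample =
  "text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text";
const verses = findVerses(sample);
console.log(verses.map((v) => v.raw));
// logs [ 'John 4:5 - 1 John 4:6', 'john 4-5' ]
```

Note that the meaning of group 5 depends on what follows the dash: in "John 4:5 - 1 John 4:6" it is the leading number of the second name, while in "john 4-5" it is the trailing number itself, so the caller still has to interpret the groups.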
This is as specific as one could get, utilizing stuff like an optional capital letter at the start so things like "jOhn" don't match.
(?:\d\s*)?[A-Z]?[a-z]+\s*\d+(?:[:-]\d+)?(?:\s*-\s*\d+)?(?::\d+|(?:\s*[A-Z]?[a-z]+\s*\d+:\d+))?
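One way to sanity-check this pattern is to run it against the formats listed in the question. The sketch below wraps the pattern in `^(?:...)$` for a whole-string test; that wrapper and the variable names are additions for illustration. It also suggests that the capital-letter rule only rejects "jOhn" when the match is anchored: in free text, an unanchored search can still pick up the tail of the word.

```javascript
// Pattern copied from the answer; only the ^(?:...)$ wrapper is added here.
const source = String.raw`(?:\d\s*)?[A-Z]?[a-z]+\s*\d+(?:[:-]\d+)?(?:\s*-\s*\d+)?(?::\d+|(?:\s*[A-Z]?[a-z]+\s*\d+:\d+))?`;
const wholeString = new RegExp(`^(?:${source})$`);

// The formats listed in the question.
const formats = [
  "1 John 4:5 - 6",
  "2 john 4:5 - 4:6",
  "3 john 4:5 - 3 John 4:6",
  "John 4:5 - 6",
  "john 4:5 - 4:6",
  "John 4:5 - 1 John 4:6",
  "1john4:6",
  "john 4",
  "john 4-5",
  "1 john 4-5",
];

// Collect any format the anchored pattern does not accept in full.
const failures = formats.filter((s) => !wholeString.test(s));
console.log(failures.length); // 0

// Anchored, the odd capitalisation is rejected...
console.log(wholeString.test("jOhn 4:5")); // false

// ...but an unanchored search still finds a match starting at the capital O.
const partial = new RegExp(source).exec("jOhn 4:5");
console.log(partial[0]); // "Ohn 4:5"
```

If the pattern is meant for scanning longer text, putting a `\b` in front of it is one possible way to keep the capitalisation rule meaningful.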
You can try this:
/(?:\d+ ?)?[a-z]+ ?\d+(?:(?::\d+)?(?: ?- ?(?:\d+ [a-z]+ )?\d+(?::\d+)?)?)?/i
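A minimal sketch of using this pattern to pull references out of a longer string. Because the pattern allows at most one space between parts, the sketch first collapses runs of whitespace; that normalisation step, the `g` flag, and the helper name `extractRefs` are assumptions added here, not part of the answer.

```javascript
// Pattern from the answer, with the g flag added so match() returns all hits.
const refRe =
  /(?:\d+ ?)?[a-z]+ ?\d+(?:(?::\d+)?(?: ?- ?(?:\d+ [a-z]+ )?\d+(?::\d+)?)?)?/gi;

// Hypothetical helper: collapse whitespace runs, then collect every match.
function extractRefs(text) {
  const normalised = text.replace(/\s+/g, " ");
  return normalised.match(refRe) ?? [];
}

const longText =
  "text this is text John 4:5 - 1 John 4:6 text text john 4-5 more text";
console.log(extractRefs(longText));
// logs [ 'John 4:5 - 1 John 4:6', 'john 4-5' ]

// Extra spaces are tolerated only because of the normalisation step.
console.log(extractRefs("read John   4:5 -  6 today"));
// logs [ 'John 4:5 - 6' ]
```

Be aware that any word followed by a number (for example "see 1") also satisfies the pattern, so real text may need a list of known book names to filter the results.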
FWIW, I've found RegexPal to be a huge help in these cases. Here's what I ended up with:
([\d ]*[a-zA-Z]+( \d*:\d*)?)(( - )| )?(((\d* )?[a-zA-Z]+ )?\d*([:-]+\d*)?)
Which breaks down as:
// zero or more digits or spaces
[\d ]*
// any number of upper/lowercase letters
[a-zA-Z]+
// optionally: a space, any number of digits, a colon,
// and any number of digits again
( \d*:\d*)?
// an optional hyphen with a space either side, or a space.
(( - )| )?
Repeat for the other side of the optional hyphen except for this difference:
// one or more of either a colon or a hyphen
[:-]+