Saturday, 21 November 2015

Python Weekly #6 : Regular Expressions - a starter for 10

Regular expressions

or "How I stopped worrying and learned to love regular expressions"
(with apologies to Stanley Kubrick)

In a previous post (Batteries Included - 5 modules every python developer should know) I mentioned that everyone should learn about regular expressions and the re package in the Standard Library. I will be completely honest at this point and say that when I wrote that post, I really did not understand the package myself, but I did understand how important it is, and that I really should learn it one day. That day is now - I have started to learn regular expressions, and some of my early experiments are documented in this article.

In general terms, a regular expression (or regex) is a way of executing a very flexible search (and potentially replace) type operation against a piece of text, without having to worry about parsing each character at a time. Python has the re package which implements a powerful set of regex syntax, and I will be using this in this article.

Useful sites

A few useful sites which would be worth bookmarking :
This article will work through building a regex to extract telephone numbers from some arbitary text. Throughout this article, I am going to use the findall method to find matching sub strings. Depending on your uses, it could be more advantageous to use some of the other methods within the re package. For instance if you just need the first match, it is likely to be much more efficient to use the search method, or if you plan to simply separate the text into various sections - then the split method is probably what you need.
Lets start by simply trying to find numbers within your text, and the regex will build with complexity and functionality as we go.
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"\d+", text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334', '456', '789', '07664', '282', '777', '8', '6', '08', '00', '18','00']
And there is the power of regex in a nutshell - one simple line. Let's explain that one line :
return re.findall(r"\d+", text_str)
The key part of this is the first argument to the findall - this is the regular expression (regex):
  • r : this signifies a raw string - which means that python won't try to do anything with the '\' characters within the string - i.e. what gets passed to the findall method is exactly what you type. It is strongly recommended that you always use raw strings when constructing regexs.
  • \d : A special regex sequence which specifically matches a single digit
  • + : A special character (a repeater) which means that you should match at least one or more of the previous matched sequence - i.e. we match one or more digits
The second argument to the findall method is termed the 'target text', which is the text to be searched or matched. The findall method searches the target text to find all the non-overlapping sub strings with match the regex - in our case all substrings which are numbers. This regex is greedy, which means that each match is the longest possible substring it can be - which is what we wanted here.

Greedy vs not-greedy

Many discussions and tutorials use the terms 'greedy' and 'non-greedy' without clear explanations. Hopefully these examples will make things clear.
  • A regex of r"\d+" with a target text of "1234" will match "1234" - it is a greedy match, and uses the longest possible string.
  • A regex of r"\d+?" with a target text of "1234" will match "1" - it is non-greedy match - and uses the shortest possible string.
As you can see - findall returns a list of the matched sub-strings if there are any, or any empty list if there are no matches, but we don't actually want all of those matches (as you can see it has found the text relating to the opening times, as well as just the phone numbers.

Let's start building up a more complex regex - that will just extract the phone numbers (or at least things that look like phone numbers). From now on I will use the verbose flag - so that we can document the regex as we go.
Lets start with our first format - that will match a phone number of the format nnnn nnnn nnn (4 digits, 4 digits, 3 digits), with simple spaces separating the various number groups.
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...         (?x)              # Verbose mode
...         \d{4}             # 4 digits
...         \s                # space
...         \d{4}             # 4 digits
...         \s                # space
...         \d{3}           # 3 digits - followed by a non digit
...         """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334 456 789']
This seems to do what we want - but just for one format, and it would probably be inefficient to do separate searches for each possible format, so we will use the '|' combination operator to combine multiple searches into a single regex
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...         (?x)                   # Verbose mode
...         \d{4}\s\d{4}\s\d{3}    # nnnn nnnn nnn
...         |                      # Or
...         \d{5}\s\d{3}\s\d{3}    # nnnnn nnn nnn
...         """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334 456 789', '07664 282 777']
There is no limit to how complex the search sequence can be between the '|' characters, and brackets are not needed to group together the regex patterns either side of the '|'.
 Our regex has found both of the phone numbers, and importantly it no longer finds the time information. So how could we extend it ? In the uk - all full format numbers will always start with a zero :
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...         (?x)                   # Verbose mode
...         0\d{3}\s\d{4}\s\d{3}    # 0nnn nnnn nnn
...         |                      # Or
...         0\d{4}\s\d{3}\s\d{3}    # 0nnnn nnn nnn
...         """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334 456 789', '07664 282 777']

We continue to add functionality : In the UK we have a habit of putting brackets around the local code (the first block of numbers), but the brackets are optional - so let's extend our regex:
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...     (?x)                    # Verbose mode
...     \(?                     # Optional opening bracket
...     0\d{3}                  # 0nnn dialing code
...     \)?                     # Optional closing bracket
...     \s\d{4}\s\d{3}          # remainder of number : nnnn nnn
...     |                       # Or
...     \(?                     # Optional opening bracket
...     0\d{4}                  # 0nnnn dialing code 
...     \)?                     # Optional closing bracket
...     \s\d{3}\s\d{3}          # remainder of number : nnn nnn
...     """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334 456 789', '07664 282 777']
>>> find_nums( "Call us on (0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334 456 789', '07664 282 777']
There is a problem - this simple regex is very naive - and will match a number which is probably malformed : 
>>> import re
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
['01334) 456 789', '07664 282 777']
and this is probably not what you want to happen - it certainly looks wrong, so lets make our regex more robust :
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...      (?x)                    # Verbose mode
...      (                       # Outer Group 0nnn nnnn nnn or 0nnnn nnn nnnn
...          (                   # Group 1 - 
...               \(0\d{3}\)     # match (0nnn)
...               |              # or
...               0\d{3}         # match 0nnn
...          )                   # Close Group 1
...          \s\d{4}\s\d{3}      # remainder of number : nnnn nnn
...          |                   # Or
...          (                   # Group 2 
...               \(0\d{4}\)     # match (0nnnn)
...               |              # or
...               0\d{4}         # match 0nnnn
...          )                   # Close Group 2
...          \s\d{3}\s\d{3}      # remainder of number : nnn nnn
...      )                       # Close Outer Group
...          """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('0133 4456 789', '', ''), ('07664 282 777', '', '')]
>>> find_nums( "Call us on (0133) 4456 789 or (07664) 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('(0133) 4456 789', '(', ''), ('(07664) 282 777', '', '(')]
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('07664 282 777', '', '')]
Because we are now using groups - the findall method returns a list of tuples (where each element of the tuple is associated with a group in the regex, in the same order that the group starts in the regex - if you are not sure simply count along group starting '(' in your regex - be beware of other '(' in your regex which don't start groups.

Beware

One issue that is commonly faced is that the regex syntax uses the same symbols for lots of different things, and they can be tough to read. This is why the verbose mode is so useful, as you can clearly annotate exactly what your regex does, and where groups start (and end).
In our case we have a outer group which surrounds all of our regex - and this appears as the 1st element in our tuple. The other elements are associated with the groups we have used to capture the opening brackets, and they can be ignored in this example.

This seems like the finished article - we can detect both common number formats (with and without brackets), and we can reject numbers with incorrect brackets. But - there is an issue (we are ignoring that there are a number of other UK number formats or UK numbers with the international prefix). We tested for a misplaced closing bracket - but what about a misplaced opening bracket ?
>>> find_nums( "Call us on (0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('(0133 4456 789', '' ''), ('07664 282 777', '', '')]
By including another test case we can see that our current regex will ignore that one of our numbers has a leading brace and no closing brace, and match against it anyway. The problem we have is that although the regex allows for brackets around the dialling code, it doesn't exclude a '(' in the case which is meant to be matching the un-bracketed local code, so lets fix our regex. We will use a special syntax called a Negative Lookbehind Assertion : the syntax is (?<!...). This looks back along the string being matched and generates a positive result if the preceeding characters do not match the given pattern - in our case (?<!\() will only match if the previous character is not a '(' - which is exactly what we need.
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...      (?x)                    # Verbose mode
...      (                       # Outer Group 0nnn nnnn nnn or 0nnnn nnn nnnn
...          (                   # Group 1 - 
...               \(0\d{3}\)     # match (0nnn)
...               |              # or
...               (?<!\()0\d{3}         # match 0nnn
...          )                   # Close Group 1
...          \s\d{4}\s\d{3}      # remainder of number : nnnn nnn
...          |                   # Or
...          (                   # Group 2 
...               \(0\d{4}\)     # match (0nnnn)
...               |              # or
...               (?<!\()0\d{4}         # match 0nnnn
...          )                   # Close Group 2
...          \s\d{3}\s\d{3}      # remainder of number : nnn nnn
...      )                       # Close Outer Group
...          """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('0133 4456 789', '0133', ''), ('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133) 4456 789 or (07664) 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('(0133) 4456 789', '(0133)', ''), ('(07664) 282 777', '', '(07664)')]
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [(('07664 282 777', '', '07664')]
This now seems to do exactly what we expect - well formed phone numbers are extracted from an arbitrary piece of text. The lesson from this step is if there is an optional substring, your regex might need two paths: one which looks for the optional substring, and a second path which works on the premise that the option substring. There is one final issue to address, which can be illustrated by the following test case :
>>> find_nums( "Call me on 0133 4456 78977")
 [('0133 4456 789', '', '')]
Our regex matches numbers which are too long, but which contain a valid number format - we need to ensure that there are no digits after the last one we match. This is relatively easy - We can use a Negative Lookahead Assertion : the syntax is (?!...). This looks back forward along the string being matched and generates a positive result if the next characters do not match the given pattern - in our case (?!\d) will only match if the next characters are not digits - which is exactly what we need.
>>> import re
>>> def find_nums_re(text_str):
>>>     return re.findall(r"""
...      (?x)                        # Verbose mode
...      (                           # Outer Group 0nnn nnnn nnn or 0nnnn nnn nnnn
...          (                       # Group 1 - 
...               \(0\d{3}\)         # match (0nnn)
...               |                  # or
...               (?<!\()0\d{3}      # match 0nnn
...          )                       # Close Group 1
...          \s\d{4}\s\d{3}(?!\d)    # remainder of number : nnnn nnn
...          |                       # Or
...          (                       # Group 2 
...               \(0\d{4}\)         # match (0nnnn)
...               |                  # or
...               (?<!\()0\d{4}      # match 0nnnn
...          )                       # Close Group 2
...          \s\d{3}\s\d{3}(?!\d)    # remainder of number : nnn nnn
...      )                           # Close Outer Group
...          """, text_str)
>>>
>>> find_nums( "Call us on 0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('0133 4456 789', '0133', ''), ('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133) 4456 789 or (07664) 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('(0133) 4456 789', '(0133)', ''), ('(07664) 282 777', '', '(07664)')]
>>> find_nums( "Call us on 0133) 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('07664 282 777', '', '07664')]
>>> find_nums( "Call us on (0133 4456 789 or 07664 282 777 between 8am to 6pm (08:00 to 18:00)")
 [('07664 282 777', '', '07664')]
>>> find_nums( "Call me on 0133 4456 78977")
 []

Non-capturing regex

There is a concept called a non-capturing regex, the syntax is (?:...). The intention is that these patterns match against the 'target text', consume the characters that are matched, but unlike ordinary groups, the results aren't captured as a separate group in the results (i.e. no extra elements in the reported tuple from the findall method, or reported as group in the results from the search or match methods.
At first glance it seems that we could have used them instead of the Lookahead or Lookbehind assertions (which we used in the last two iterations), as all we want to do is ensure that certain substrings exist (or don't), but we don't care what they are. But, in reality, there is a wrinkle :
It is true that uhe non-capturing group (?[^\(]) will only match if there isn't a '(', and there wont be a group created for whatever is there, but the character sequence that is there will be reported as part of the containing group, and since the entire regex is treated as group, the character sequence will be reported as the content of the regex - which is not what we want in this case.
The main benefit of a non-capturing regex is that they can be added to existing sequences without impacting the numbering of the existing groups, and there is a small performance improvement.

Final Comments

It should be noted that we are only testing our regex pattern with 5 test cases - which is definitely not sufficient for production software, but it does seem like we have most of the issues solved.

For those of you who like picture, this is a visualization (created by Debugex) of our final regex.

No comments:

Post a Comment