
kickstart_regex


Groups

So far we have only been printing the whole match object. What if we only want to print or process the actual match or just a specific part of the match?

This is where groups come into play.

import re

match = re.search(r"\d{5}", "07231")
if match:
    print(match.group())    # will only print the match 07231

In the above example we print only the actual match by calling match.group(). This requires a match object: if re.search finds no match, it returns None, and calling group() on None raises an AttributeError. Checking that we actually got a match object is therefore good practice to avoid runtime errors.
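As a minimal sketch of the failure case described above, the following snippet searches a string that has no five-digit run (the input "0723" is an assumption chosen for illustration) and shows what happens with and without the check.

```python
import re

# No five-digit run in this string, so re.search returns None.
match = re.search(r"\d{5}", "0723")
print(match)

try:
    match.group()    # None has no group() method
except AttributeError as error:
    print("Calling group() failed:", error)

# The guarded version skips the call when there is no match.
result = match.group() if match else None
print(result)
```

The guarded form keeps the program running and lets us decide ourselves how to handle a missing match.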

Let’s see how we can define groups in a RegEx and how to use them in multiple ways.

Groups: Remember sub-matches

We define groups in a RegEx with round brackets ( ). Everything within this group will be captured and can be referenced later on.

Let’s illustrate this with an example.

files is a list of filenames. We now want only PDF files beginning with a specific string like invoice.

Start by writing a RegEx that will match on files beginning with invoice and ending with pdf.

import re

files = [
    "holiday1999.png",
    "invoice_car_insurance.pdf",
    "invoice_telekom2021.pdf",
    "invoice_vattenfall2021.pdf",
    "manual_mazda_cx5.pdf",
    "passport.jpg",
    "resumee.pdf",
]

# place your regex here
pattern = r"..."
filtered = [re.match(pattern, file).group() for file in files if re.match(pattern, file)]

assert len(filtered) == 3, "list length is 3"
assert "invoice_car_insurance.pdf" in filtered
assert "invoice_telekom2021.pdf" in filtered
assert "invoice_vattenfall2021.pdf" in filtered
print(filtered)
print("Good RegEx")

If your RegEx is correct, the list contains the 3 invoice files. What if we only want the filename without the .pdf extension? We could post-process the list and strip the extension, but it would be nice to have this done by our RegEx.

We can do this with groups. Your RegEx might look like this: pattern = r"^invoice_[\w]+\.pdf$". To reference the filename without the extension, we put round brackets around that part of the pattern.

import re

files = [
    "holiday1999.png",
    "invoice_car_insurance.pdf",
    "invoice_telekom2021.pdf",
    "invoice_vattenfall2021.pdf",
    "manual_mazda_cx5.pdf",
    "passport.jpg",
    "resumee.pdf",
]

# Usage of round brackets around the file name
pattern = r"^(invoice_[\w]+)\.pdf$"

# we reference the group (we use `group(1)` explicitly)
filtered = [re.match(pattern, file).group(1) for file in files if re.match(pattern, file)]

assert len(filtered) == 3, "list length is 3"
assert "invoice_car_insurance" in filtered
assert "invoice_telekom2021" in filtered
assert "invoice_vattenfall2021" in filtered
print(filtered)
print("Good RegEx")

As the printed filtered list shows, the filenames now contain only the content of our group.

Groups are referenced by index. Every opening round bracket creates a new index that we can reference later on. The first group has index 1, the next has index 2, and so on. Index 0 refers to the whole match, so group(0) is the same as group(). If we try to access a group that does not exist, we get an IndexError.
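The indexing rules can be sketched with a pattern that has two groups. The filename and the pattern here are made up for this illustration and are not part of the exercise above.

```python
import re

# "invoice_2021.pdf" is a made-up filename used only for this illustration.
m = re.match(r"^(\w+)_(\d{4})\.pdf$", "invoice_2021.pdf")

if m:
    whole = m.group(0)   # index 0 is the complete match, same as m.group()
    first = m.group(1)   # content of the first pair of brackets
    second = m.group(2)  # content of the second pair of brackets
    both = m.groups()    # all groups as a tuple, starting with group 1
    print(whole, first, second, both)

    try:
        m.group(3)       # the pattern only defines two groups
    except IndexError as error:
        print("No third group:", error)
```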

So in summary, groups allow us to reference sub-matches later on.

Groups: Packing things together

Another common use-case is to be able to use quantifiers for multiple characters. Have a look at these examples.


import re

# Without a group the quantifier only applies to the preceding character: "ab", then "c" 3 times
m = re.search(r"abc{3}", "abccc")
print(m.group())

# How can we match "abc" 3 times?
m = re.search(r"abcabcabc", "abcabcabc")
print(m.group())

# This looks simpler
m = re.search(r"(abc){3}", "abcabcabc")
print(m.group())

# With this we can match any 3 characters out of a, b and c, 3 times in a row (9 characters in total)
m = re.search(r"([abc]{3}){3}", "abccbacbca")
print(m.group())

As the examples above show, groups let us apply a quantifier to a whole pattern instead of a single character.
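One detail is worth checking when a quantifier is attached to a group: in Python's re module a repeated group only remembers its last repetition. The following sketch reuses the patterns from this section; the variable names are chosen just for this illustration.

```python
import re

# Without a group, {3} applies only to the "c", so "abcabcabc" does not match.
no_group = re.search(r"abc{3}", "abcabcabc")
print(no_group)

# With a group, {3} applies to the whole "abc" sequence.
with_group = re.search(r"(abc){3}", "abcabcabc")
print(with_group.group())    # the whole match
print(with_group.group(1))   # only the last repetition of the group

# The same holds for the character class variant.
char_class = re.search(r"([abc]{3}){3}", "abccbacbca")
print(char_class.group(), char_class.group(1))
```

So group(1) is useful for the last repetition only. To get the complete repeated text, use group() or wrap the quantified group in another group.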

Groups: Alternation

Another use case for groups is using alternations. The meta character | means or and we can combine multiple regular expressions within a group with that.

Suppose we want to extract the salutation of a letter. The salutation may be “Dear Sir” or “Dear Madam”.

We could write a RegEx which matches one or the other like this: r"Dear (Sir|Madam)". This matches both cases, but not if both Sir and Madam are missing. Be aware that we can use any meta character or “sub-RegEx” within a group, not just string literals as in this example.
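A short sketch of the salutation pattern in action. The test strings are assumptions made up for this illustration. The last lines show why the brackets matter: without them, | splits the whole pattern into two alternatives.

```python
import re

pattern = r"Dear (Sir|Madam)"

found = []
for text in ["Dear Sir,", "Dear Madam,", "Dear customer,"]:
    m = re.match(pattern, text)
    # group(1) holds whichever alternative matched
    found.append(m.group(1) if m else None)
print(found)

# Without the brackets the alternatives are "Dear Sir" OR "Madam",
# so a lone "Madam" is enough for a match.
ungrouped = re.search(r"Dear Sir|Madam", "Hello Madam")
print(ungrouped.group())
```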

Exercise (Valid mobile number)

We will now apply our new knowledge about groups in an exercise.

We want to validate some mobile numbers. For that we write a function that returns True if the passed string contains a valid mobile number, and False otherwise.

We define valid numbers as follows:

import re

valid_1 = "+49179/123456789"
valid_2 = "0179/123456789"
invalid = "+490179/123456789"

# Tip: (aaa|bbb) matches either `aaa` or `bbb`

def is_valid(number):
    # Replace ... with valid RegEx
    return bool(re.match(r"...", number))

assert is_valid(valid_1) == True, "Check valid number with +49"
assert is_valid(valid_2) == True, "Check valid number with 0179"
assert is_valid(invalid) == False, "Check invalid number"
print("Good RegEx")

Alternation and capturing example

The following example combines alternation with capture groups and prints each group separately.

import re

number = "0179/123456789"
number_2 = "+49179/123456789"

m = re.match(r"(\+\d{5}|\d{4})/(\d{9})$", number_2)
print("Complete number:", m.group())
print("First part:", m.group(1))
print("Second part:", m.group(2))

# This will not work, because we have no group with index 3
# print("Third part:", m.group(3))    # IndexError

Exercise (Valid hour)

We want to write a RegEx which verifies valid times in 24-hour HH:MM format.

import re

def valid_hour(string):
    # insert regex here
    return re.match(r"...", string) is not None

assert valid_hour("00:00") is True
assert valid_hour("23:59") is True
assert valid_hour("24:00") is False
assert valid_hour("25:59") is False
assert valid_hour("15:20") is True
assert valid_hour("23:60") is False
print("Good RegEx")

Could you solve the exercise? If yes, congratulations! This was no easy task at all!

Hints

This exercise is not easy. Try to separate the problem into smaller sub-problems.

For that try to solve the “minute-problem” first.

If you get stuck, use RegEx101 and use this test string for the minute problem. Can you find a pattern?

:00
:01
:02
:03
:04
:05
:06
:07
:08
:09

:10
:11
:12
:13
:14
:15
:16
:17
:18
:19

:20
:21
:22
:23
:29

:30
:39

:40
:49

:50
:59

Note: Not every minute is listed in the test data. Several minutes were skipped for those starting with a 2, and only two minutes are listed for those starting with 3, 4 or 5. Feel free to extend the test data if the pattern is not clear to you.

More hints

If the pattern did not emerge, have a look at the last digit. It does not matter which digit we have here, every one is valid. So we can use \d.

For the first digit only 0 to 5 is valid, so we can use [0-5].

We can combine these two and use: [0-5]\d.
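If you prefer checking this in Python instead of RegEx101, the following sketch runs the minute pattern against every two-digit string from 00 to 99. It uses re.fullmatch, which this chapter has not used so far, so that the whole string has to match the pattern.

```python
import re

minute = r"[0-5]\d"

# Keep every two-digit string that the minute pattern accepts completely.
valid = [f"{n:02d}" for n in range(100) if re.fullmatch(minute, f"{n:02d}")]
print(valid[0], valid[-1], len(valid))
```

The list should run from 00 to 59 and contain 60 entries, one for each valid minute.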

With this knowledge, try to solve the “hour problem”.

More hints

If you have solved the “minute problem”, continue with the hour problem separately. If you get stuck on the “minute problem”, you can try to solve the hour problem in isolation first, as the two parts are independent of each other.

If you use RegEx101, use this test string:

00:
01:
02:
03:
04:
05:
06:
07:
08:
09:

10:
11:
12:
13:
14:
15:
16:
17:
18:
19:

20:
21:
22:
23:

Try to find a pattern. :)

More hints

If we look at the test data above, we see that if the first digit is a 0 or a 1, any second digit is valid: [01]\d

If the first digit is a 2, only [0-3] is valid as the second digit: 2[0-3]
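The two observations can be checked separately before combining them. This is only a sketch: the variable names are made up for this illustration, re.fullmatch is used so that the whole string has to match, and putting the two parts together is still left to you.

```python
import re

low_hours = r"[01]\d"    # first digit 0 or 1, any second digit
high_hours = r"2[0-3]"   # first digit 2, second digit 0 to 3

# Try every two-digit string from 00 to 29 against each part.
candidates = [f"{n:02d}" for n in range(30)]
low = [h for h in candidates if re.fullmatch(low_hours, h)]
high = [h for h in candidates if re.fullmatch(high_hours, h)]
print(low[0], low[-1], high)
```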

Exercise (Valid IP address revisited)

We now want to accept only IP addresses in which each of the four parts lies in the range 0 to 255.

import re

def better_ip_validator(ip_address):
    # Replace ... with valid RegEx
    m = re.match(r"...", ip_address)
    return m is not None

assert better_ip_validator("192.168.1.1") is True
assert better_ip_validator("192.168.1.11") is True
assert better_ip_validator("192.168.1.111") is True
assert better_ip_validator("192.168.1.255") is True
assert better_ip_validator("192.168.1.256") is False
assert better_ip_validator("192.168.1.999") is False
assert better_ip_validator("192.168.1.x") is False
assert better_ip_validator("192.168.1.xx") is False
assert better_ip_validator("192.168.1.xxx") is False
print("Good RegEx!")

This was a hard one! Nicely done if you got it!

But don’t be frustrated if not. It is not easy at all and needs patience ;)


There is another use case for groups: the so-called lookaround groups. We will have a look at them in the next chapter.
