Overview | Back (Character classes) | Next (Groups) |
Anchors
In this chapter we will learn about anchors. With anchors we can match positions. For example if we want to match specific characters only at the beginning or end of the line.
The following anchors exist:
Be aware that a position is not a character
^ Matches on beginning of the line
$ Matches on the end of a line
\b Matches on a word boundary (beginning or end of a word)
Matches, without consuming any characters, immediately between a character
matched by \w and a character not matched by \w (in either order)
So here we have the caret ^
again. If you remember from the previous chapter we had the ^
specified as a negated character class.
This is why the character classes can be seen as a separate mini-language within RegEx. Within a character class a ^
means negating, outside of character classes it means beginning of the line
.
So don’t confuse these two different uses of the caret ^
. When you encounter a ^
, make sure that you are clear about the context in which you are using it and therefore the meaning of the ^
in that given context.
Examples with anchors
import re
s = "Error on columbia01. A Fatal Error occurred!"
print(re.findall(r"^Error", s))
s = "Subject: fire. Dear Sir/Madam, I am writing to inform you of a fire"
print(re.findall(r"fire$", s))
s = "A38, 42, 36, 48"
print("With word boundaries:", re.findall(r"\b[0-9]{2}\b", s))
# compare the result of the above print statement with the following findall
print("Without word boundaries:", re.findall(r"[0-9]{2}", s))
By using word boundaries you can avoid getting matches within other words, as seen in the last example.
Exercise “valid filenames” revisited
Let’s attempt the valid filenames problem in a more real-world context. In the Windows OS the following symbols may not be used in a filename: \/:*?"<>|
This means our restrictions are as follows:
- at least 1 arbitrary character except for
\/:*?"<>|
- followed by a dot (
"."
) - followed by exactly 3 arbitrary characters, but not
\/:*?"<>|
# set this to True, if you would like to have the actual match printed
DEBUG = False
def valid_filename(filename):
# insert regex here
m = re.search(r'...', filename)
if DEBUG:
if m:
print("The actual match is:", m.group())
else:
print("The actual match is: None")
return m is not None
assert valid_filename("test.txt") is True
assert valid_filename("1.txt") is True
assert valid_filename(".txt") is False
assert valid_filename("test") is False
assert valid_filename("test.tt") is False
assert valid_filename("test.text") is False
assert valid_filename("te:st.txt") is False
assert valid_filename("te*st.txt") is False
assert valid_filename("test.t?t") is False
assert valid_filename("my-test-file.txt") is True
print("Good RegEx")
Attention: The example above uses search, so you are forced to use anchors
Exercise “is integer”
Write a function with a regular expression which evaluates whether a given string is an integer.
is_integer(string) -> bool
The function should evaluate whether a given string is a valid integer.
We define valid integers as this:
Consists of 1 or more digits
May optionally begin with -
Does not contain any other non-digit characters.
import re
# set this to True if you would like to have the actual match printed
DEBUG = False
def is_integer(string):
# Replace ... with regex
m = re.search(r"...", string)
if DEBUG:
if m:
print("The actual match is:", m.group())
else:
print("The actual match is: None")
return m is not None
assert is_integer("42") is True
assert is_integer("-5000") is True
assert is_integer("-") is False
assert is_integer(" 42") is False
assert is_integer("0.0") is False
assert is_integer("+999") is False
assert is_integer("--500") is False
print("Good RegEx")
Attention: The example above uses search, so you are forced to use anchors
We finish this chapter with a little smile.
Overview | Back (Character classes) | Next (Groups) |