Regular expressions
The findall method returns a list of all non-overlapping matches, as strings:
import re
my_string = "My emails are xyz@mail.xor and zyx@liam.rox"
# Raw string, so that \S reaches the regex engine as written
for i, email in enumerate(re.findall(r"\S+@\S+", my_string)):
    print("Email {}: {}".format(i, email))
Email 0: xyz@mail.xor
Email 1: zyx@liam.rox
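One caveat is worth illustrating: when the pattern contains capture groups, findall returns the groups rather than the whole matches. A minimal sketch, reusing my_string from above (the two-group pattern is only an illustration, not part of the original example):

```python
import re

my_string = "My emails are xyz@mail.xor and zyx@liam.rox"

# No capture groups: a list of whole matches.
whole = re.findall(r"\S+@\S+", my_string)

# Two capture groups: a list of (user, domain) tuples instead.
parts = re.findall(r"(\S+)@(\S+)", my_string)

print(whole)  # ['xyz@mail.xor', 'zyx@liam.rox']
print(parts)  # [('xyz', 'mail.xor'), ('zyx', 'liam.rox')]
```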
If the position is needed as well, one can use finditer. It returns an iterator of match objects, not a list.
The main methods are .group(group_index), which returns the matched string, .start(group_index), which returns the position of the first character of the matched string in the original string, and .end(group_index), which returns the position just past its last character.
group_index is needed when the pattern contains more than one pair of parentheses (); index 0 refers to the whole match.
for i, email in enumerate(re.finditer(r"\S+@\S+", my_string)):
    print("Email {}: {}, starts at position: {} and ends at position: {}"
          .format(i, email.group(0), email.start(0), email.end(0)))
Email 0: xyz@mail.xor, starts at position: 14 and ends at position: 26
Email 1: zyx@liam.rox, starts at position: 31 and ends at position: 43
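To show where group_index matters, here is a small sketch with two pairs of parentheses. The split into a user group and a domain group is an assumption made for illustration:

```python
import re

my_string = "My emails are xyz@mail.xor and zyx@liam.rox"

# Group 0 is the whole match, group 1 the user name, group 2 the domain.
for m in re.finditer(r"(\S+)@(\S+)", my_string):
    print(m.group(0), m.span(0))
    print(m.group(1), m.span(1))
    print(m.group(2), m.span(2))

# Collect the (start, end) positions of each group for every match.
spans = [(m.span(1), m.span(2)) for m in re.finditer(r"(\S+)@(\S+)", my_string)]
print(spans)  # [((14, 17), (18, 26)), ((31, 34), (35, 43))]
```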
- (?=foo): lookahead (asserts that what immediately follows the current position in the string is foo)
- (?<=foo): lookbehind (asserts that what immediately precedes the current position in the string is foo)
- (?!foo): negative lookahead (asserts that what immediately follows the current position in the string is not foo)
- (?<!foo): negative lookbehind (asserts that what immediately precedes the current position in the string is not foo)
Example:
re.sub(r'(?<=\()(\d+)(?=\)\.txt$)', '6', 'my_file(5).txt')
returns ‘my_file(6).txt’.
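The four assertions can be compared side by side with a short sketch. The sample strings below are made up for illustration:

```python
import re

# Lookahead: word characters that are followed by ".txt".
ahead = re.findall(r"\w+(?=\.txt)", "a.txt b.csv c.txt")

# Lookbehind: digits that are preceded by "id=".
behind = re.findall(r"(?<=id=)\d+", "id=42 n=7")

# Negative lookahead: "foo" only when it is not followed by "bar".
neg_ahead = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Negative lookbehind: whole numbers not preceded by "$".
neg_behind = re.findall(r"(?<!\$)\b\d+", "$100 and 200")

print(ahead)       # ['a', 'c']
print(behind)      # ['42']
print(neg_ahead)   # [7]
print(neg_behind)  # ['200']
```

In every case the asserted text is checked but not consumed, so it does not appear in the match.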
A good cheat sheet by tartley is reproduced here:
Non-special chars match themselves. Exceptions are special characters:
\ Escape special char or start a sequence.
. Match any char except newline, see re.DOTALL
^ Match start of the string, see re.MULTILINE
$ Match end of the string, see re.MULTILINE
[] Enclose a set of matchable chars
R|S Match either regex R or regex S.
() Create capture group, & indicate precedence
After ‘[’, enclose a set, the only special chars are:
] End the set, if not the 1st char
- A range, eg. a-c matches a, b or c
^ Negate the set only if it is the 1st char
Quantifiers (append ‘?’ for non-greedy):
{m} Exactly m repetitions
{m,n} From m (default 0) to n (default infinity)
* 0 or more. Same as {,}
+ 1 or more. Same as {1,}
? 0 or 1. Same as {,1}
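As an illustration of the quantifiers (not part of the original cheat sheet), the sketch below contrasts a greedy and a non-greedy match and shows a bounded repetition. The sample strings are assumptions:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the first closing '>'.
lazy = re.findall(r"<.+?>", text)

# {m,n}: between 2 and 3 digits per match.
bounded = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(bounded)  # ['22', '333', '444']
```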
Special sequences:
\A Start of string
\b Match empty string at word (\w+) boundary
\B Match empty string not at word boundary
\d Digit
\D Non-digit
\s Whitespace [ \t\n\r\f\v], see LOCALE,UNICODE
\S Non-whitespace
\w Alphanumeric: [0-9a-zA-Z_], see LOCALE
\W Non-alphanumeric
\Z End of string
\g<id> In sub() replacement strings only: insert prev
named or numbered group,
'<' & '>' are literal, e.g. \g<0>
or \g<name> (not \g0 or \gname)
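In Python's re module the \g<...> form belongs to the replacement string of sub() and subn(). A minimal sketch, with group names (y, m, d) chosen only for illustration:

```python
import re

# Named groups in the pattern, referenced by name in the replacement.
swapped = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
                 r"\g<d>/\g<m>/\g<y>", "2024-01-31")

# \g<0> stands for the whole match.
wrapped = re.sub(r"\d+", r"[\g<0>]", "a1b22")

# \g<1>0 means "group 1, then a literal 0"; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1")

print(swapped)  # 31/01/2024
print(wrapped)  # a[1]b[22]
print(padded)   # a10
```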
Special character escapes are much like those already escaped in Python string
literals. Hence regex ‘\n’ is same as regex ‘\\n’:
\a ASCII Bell (BEL)
\f ASCII Formfeed
\n ASCII Linefeed
\r ASCII Carriage return
\t ASCII Tab
\v ASCII Vertical tab
\\ A single backslash
\xHH Two digit hexadecimal character goes here
\OOO Three digit octal char (or just use an
initial zero, e.g. \0, \09)
\DD Decimal number 1 to 99, match
previous numbered group
Extensions. Do not cause grouping, except ‘P<name>’:
(?iLmsux) Match empty string, sets re.X flags
(?:...) Non-capturing version of regular parens
(?P<name>...) Create a named capturing group.
(?P=name) Match whatever matched prev named group
(?#...) A comment; ignored.
(?=...) Lookahead assertion, match without consuming
(?!...) Negative lookahead assertion
(?<=...) Lookbehind assertion, match if preceded
(?<!...) Negative lookbehind assertion
(?(id)y|n) Match 'y' if group 'id' matched, else 'n'
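A few of these extensions are easier to follow in action. The sketch below is an illustration, not part of the original cheat sheet; the conditional pattern is adapted from the example in the Python re documentation, and the sample strings are assumptions:

```python
import re

# (?P<word>...) names a group; (?P=word) must match the same text again.
m = re.search(r"\b(?P<word>\w+) (?P=word)\b", "this is is a test")
doubled = m.group("word")

# (?:...) groups without capturing, so findall returns only the digit group.
digits = re.findall(r"(?:ab)+(\d)", "abab1 ab2")

# (?(1)>|$): require '>' if group 1 (the '<') matched, else require end of string.
addr = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")
bracketed = addr.match("<user@host.com>") is not None
bare = addr.match("user@host.com") is not None
unbalanced = addr.match("<user@host.com") is not None

print(doubled)                      # is
print(digits)                       # ['1', '2']
print(bracketed, bare, unbalanced)  # True True False
```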
Flags for re.compile(), etc. Combine with '|':
re.I == re.IGNORECASE Ignore case
re.L == re.LOCALE Make \w, \b, and \s locale dependent
re.M == re.MULTILINE Multiline
re.S == re.DOTALL Dot matches all (including newline)
re.U == re.UNICODE Make \w, \b, \d, and \s unicode dependent
re.X == re.VERBOSE Verbose (unescaped whitespace in pattern
is ignored, and '#' marks comment lines)
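As a sketch of how the flags combine (the sample text and the phone pattern are assumptions made for illustration):

```python
import re

text = "first line\nSecond LINE"

# Without flags, ^ and $ anchor to the whole string and matching is case-sensitive.
plain = re.findall(r"^\w+ line$", text)

# re.I | re.M: ignore case, and let ^ and $ match at every line boundary.
flagged = re.findall(r"^\w+ line$", text, re.I | re.M)

# re.S lets '.' match a newline as well.
no_dotall = re.search(r"a.b", "a\nb") is None
dotall = re.search(r"a.b", "a\nb", re.S) is not None

# re.X allows whitespace and comments inside the pattern.
phone = re.compile(r"""
    (\d{3})  # prefix
    -        # literal dash
    (\d{4})  # line number
""", re.X)
phone_groups = phone.match("555-1234").groups()

print(plain)              # []
print(flagged)            # ['first line', 'Second LINE']
print(no_dotall, dotall)  # True True
print(phone_groups)       # ('555', '1234')
```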
Module level functions:
compile(pattern[, flags]) -> RegexObject
match(pattern, string[, flags]) -> MatchObject
search(pattern, string[, flags]) -> MatchObject
findall(pattern, string[, flags]) -> list of strings
finditer(pattern, string[, flags]) -> iter of MatchObjects
split(pattern, string[, maxsplit, flags]) -> list of strings
sub(pattern, repl, string[, count, flags]) -> string
subn(pattern, repl, string[, count, flags]) -> (string, int)
escape(string) -> string
purge() # the re cache
RegexObjects (returned from compile()):
.match(string[, pos, endpos]) -> MatchObject
.search(string[, pos, endpos]) -> MatchObject
.findall(string[, pos, endpos]) -> list of strings
.finditer(string[, pos, endpos]) -> iter of MatchObjects
.split(string[, maxsplit]) -> list of strings
.sub(repl, string[, count]) -> string
.subn(repl, string[, count]) -> (string, int)
.flags # int, Passed to compile()
.groups # int, Number of capturing groups
.groupindex # {}, Maps group names to ints
.pattern # string, Passed to compile()
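A compiled pattern can be inspected and reused, roughly as sketched below. The group names user and domain are hypothetical, chosen for illustration:

```python
import re

pat = re.compile(r"(?P<user>\S+)@(?P<domain>\S+)")

n_groups = pat.groups            # number of capturing groups
names = dict(pat.groupindex)     # group name -> group number
source = pat.pattern             # the pattern string passed to compile()

# pos and endpos restrict the search to a slice of the string.
numbers = re.compile(r"\d+")
from_pos = numbers.findall("12 34 56", 3)
to_endpos = numbers.findall("12 34 56", 0, 5)

print(n_groups)   # 2
print(names)      # {'user': 1, 'domain': 2}
print(from_pos)   # ['34', '56']
print(to_endpos)  # ['12', '34']
```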
MatchObjects (returned from match() and search()):
.expand(template) -> string, Backslash & group expansion
.group([group1...]) -> string or tuple of strings, 1 per arg
.groups([default]) -> tuple of all groups, non-matching=default
.groupdict([default]) -> {}, Named groups, non-matching=default
.start([group]) -> int, Start/end of substring match by group
.end([group]) -> int, Group defaults to 0, the whole match
.span([group]) -> tuple (match.start(group), match.end(group))
.pos int, Passed to search() or match()
.endpos int, "
.lastindex int, Index of last matched capturing group
.lastgroup string, Name of last matched capturing group
.re regex, As passed to search() or match()
.string string, "
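The match object methods listed above can be tried on a single match. A minimal sketch, with a made-up input string and hypothetical group names (user, host); the third group is optional and deliberately left unmatched:

```python
import re

m = re.search(r"(?P<user>\w+)@(?P<host>\w+)(\.\w+)?", "mail: bob@example now")

whole = m.group(0)                         # the whole match
pair = m.group("user", "host")             # several groups at once, as a tuple
groups = m.groups()                        # unmatched groups are None
groups_filled = m.groups("")               # ...or the given default
named = m.groupdict()                      # named groups only
span_all = m.span()                        # (start, end) of the whole match
span_user = m.span("user")
expanded = m.expand(r"\g<host>:\g<user>")  # template expansion, as in sub()

print(whole)                       # bob@example
print(groups)                      # ('bob', 'example', None)
print(span_all, span_user)         # (6, 17) (6, 9)
print(m.lastindex, m.lastgroup)    # 2 host
print(m.start(3))                  # -1, group 3 did not take part in the match
```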