Regular expressions

Findall method returns a list of the strings containing the matches:

import re

my_string = "My emails are xyz@mail.xor and zyx@liam.rox"

for i, email in enumerate(re.findall("\S+@\S+", my_string)):
    print("Email {}: {}".format(i, email))

Email 0: xyz@mail.xor Email 1: zyx@liam.rox

If the position is needed as well one can use finditer. It returns a list of re.MatchObject objects. The main methods are .group(group_index), that returns the matched string, .start(group_index), that returns the position of the first character of the matched string in the original string, and .end(group_index). group_index is needed when more than one parentheses () is used.

for i, email in enumerate(re.finditer("\S+@\S+", my_string)):
    print("Email {}: {}, starts at position: {} and ends at position: {}"
        .format(i, email.group(0), email.start(0), email.end(0)))

Email 0: xyz@mail.xor, starts at position: 14 and ends at position: 26 Email 1: zyx@liam.rox, starts at position: 31 and ends at position: 43

(?=foo): lookahead (asserts that what immediately follows the current position in the string is foo)
(?<=foo): lookbehind (asserts that what immediately precedes the current position in the string is foo**)
(?!foo): negative lookahead (asserts that what immediately follows the current position in the string is not foo)
(?<!foo): negative lookbehind (asserts that what immediately precedes the current position in the string is not foo)

Example:

re.sub('(?<=\()(\d+)(?=\)\.txt$)', '6', 'my_file(5).txt')

returns ‘my_file(6).txt’.

A good cheat sheet by tartley is reported here:

Non-special chars match themselves. Exceptions are special characters:

\       Escape special char or start a sequence.
.       Match any char except newline, see re.DOTALL
^       Match start of the string, see re.MULTILINE
$       Match end of the string, see re.MULTILINE
[]      Enclose a set of matchable chars
R|S     Match either regex R or regex S.
()      Create capture group, & indicate precedence

After ‘[’, enclose a set, the only special chars are:

]   End the set, if not the 1st char
-   A range, eg. a-c matches a, b or c
^   Negate the set only if it is the 1st char

Quantifiers (append ‘?’ for non-greedy):

{m}     Exactly m repetitions
{m,n}   From m (default 0) to n (default infinity)
*       0 or more. Same as {,}
+       1 or more. Same as {1,}
?       0 or 1. Same as {,1}

Special sequences:

\A  Start of string
\b  Match empty string at word (\w+) boundary
\B  Match empty string not at word boundary
\d  Digit
\D  Non-digit
\s  Whitespace [ \t\n\r\f\v], see LOCALE,UNICODE
\S  Non-whitespace
\w  Alphanumeric: [0-9a-zA-Z_], see LOCALE
\W  Non-alphanumeric
\Z  End of string
\g<id>  Match prev named or numbered group,
        '<' & '>' are literal, e.g. \g<0>
        or \g<name> (not \g0 or \gname)

Special character escapes are much like those already escaped in Python string literals. Hence regex ‘\n’ is same as regex ‘\\n’:

\a  ASCII Bell (BEL)
\f  ASCII Formfeed
\n  ASCII Linefeed
\r  ASCII Carriage return
\t  ASCII Tab
\v  ASCII Vertical tab
\\  A single backslash
\xHH   Two digit hexadecimal character goes here
\OOO   Three digit octal char (or just use an
       initial zero, e.g. \0, \09)
\DD    Decimal number 1 to 99, match
       previous numbered group

Extensions. Do not cause grouping, except ‘P<name>’:

(?iLmsux)     Match empty string, sets re.X flags
(?:...)       Non-capturing version of regular parens
(?P<name>...) Create a named capturing group.
(?P=name)     Match whatever matched prev named group
(?#...)       A comment; ignored.
(?=...)       Lookahead assertion, match without consuming
(?!...)       Negative lookahead assertion
(?<=...)      Lookbehind assertion, match if preceded
(?<!...)      Negative lookbehind assertion
(?(id)y|n)    Match 'y' if group 'id' matched, else 'n'

Flags for re.compile(), etc. Combine with '|':

re.I == re.IGNORECASE   Ignore case
re.L == re.LOCALE       Make \w, \b, and \s locale dependent
re.M == re.MULTILINE    Multiline
re.S == re.DOTALL       Dot matches all (including newline)
re.U == re.UNICODE      Make \w, \b, \d, and \s unicode dependent
re.X == re.VERBOSE      Verbose (unescaped whitespace in pattern
                        is ignored, and '#' marks comment lines)

Module level functions:

compile(pattern[, flags]) -> RegexObject
match(pattern, string[, flags]) -> MatchObject
search(pattner, string[, flags]) -> MatchObject
findall(pattern, string[, flags]) -> list of strings
finditer(pattern, string[, flags]) -> iter of MatchObjects
split(pattern, string[, maxsplit, flags]) -> list of strings
sub(pattern, repl, string[, count, flags]) -> string
subn(pattern, repl, string[, count, flags]) -> (string, int)
escape(string) -> string
purge() # the re cache

RegexObjects (returned from compile()):

.match(string[, pos, endpos]) -> MatchObject
.search(string[, pos, endpos]) -> MatchObject
.findall(string[, pos, endpos]) -> list of strings
.finditer(string[, pos, endpos]) -> iter of MatchObjects
.split(string[, maxsplit]) -> list of strings
.sub(repl, string[, count]) -> string
.subn(repl, string[, count]) -> (string, int)
.flags      # int, Passed to compile()
.groups     # int, Number of capturing groups
.groupindex # {}, Maps group names to ints
.pattern    # string, Passed to compile()

MatchObjects (returned from match() and search()):

.expand(template) -> string, Backslash & group expansion
.group([group1...]) -> string or tuple of strings, 1 per arg
.groups([default]) -> tuple of all groups, non-matching=default
.groupdict([default]) -> {}, Named groups, non-matching=default
.start([group]) -> int, Start/end of substring match by group
.end([group]) -> int, Group defaults to 0, the whole match
.span([group]) -> tuple (match.start(group), match.end(group))
.pos       int, Passed to search() or match()
.endpos    int, "
.lastindex int, Index of last matched capturing group
.lastgroup string, Name of last matched capturing group
.re        regex, As passed to search() or match()
.string    string, "