Regular Expression

Introduction

The re module was added in Python 1.5 and provides Perl-style regular expression patterns. Its predecessor, the regex module, was removed completely in Python 2.5.

Match, Search, and more

`match()` checks for a match only at the beginning of the string.

`search()` checks for a match anywhere in the string.

`findall()` finds all substrings where the RE matches and returns them as a list.

`finditer()` finds all substrings where the RE matches and returns them as an iterator of match objects.
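To make the difference between the four functions concrete, here is a minimal sketch that runs each of them against the same sample sentence. The variable names and the `[Ww]hat` pattern are only for illustration.

```python
import re

sent = 'What is first and second? what! *'

# match() only succeeds if the pattern matches at index 0.
at_start = re.match('first', sent)       # None: 'first' is not at the start
at_start_ok = re.match('What', sent)     # match object covering 'What'

# search() scans the whole string and returns the first match.
anywhere = re.search('first', sent)

# findall() returns every non-overlapping match as a list of strings.
all_words = re.findall('[Ww]hat', sent)

# finditer() yields match objects, so positions are available too.
positions = [(m.group(), m.start()) for m in re.finditer('[Ww]hat', sent)]

print(at_start)             # None
print(at_start_ok.group())  # What
print(anywhere.span())      # (8, 13)
print(all_words)            # ['What', 'what']
print(positions)            # [('What', 0), ('what', 26)]
```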


import re

pat = ['first', 'second']
sent = 'What is first and second? what! *'

# search() returns a match object, or None when nothing matches
match = re.search(pat[0], sent)
if match:
    print("found")
else:
    print("not found")


import re

pat = ['first', 'second']
sent = 'What is first and second? what! *'

# try every pattern in the list against the same sentence
for p in pat:
    if re.search(p, sent):
        print("found")
    else:
        print("not found")

Deal with return value


import re

pat = 'first'
sent = 'What is first and second? what! *'

match = re.search(pat, sent)
s = match.start()   # index of the first matched character
e = match.end()     # index just past the last matched character

print('"%s" is found in "%s" from index %d to %d ("%s")' %
      (match.re.pattern, match.string, s, e, sent[s:e]))
print("match.group(): ", match.group())

output

"first" is found in "What is first and second? what! *" from index 8 to 13 ("first")
match.group():  first
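Note that `search()` returns None when nothing matches, so calling `start()` or `group()` on the result would raise an AttributeError. A minimal sketch of a guard, where `describe` is a hypothetical helper name and not part of the re module:

```python
import re

def describe(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # No match: there is no match object to query.
        return 'no match for %r' % pattern
    return '%r found at %d..%d' % (match.group(), match.start(), match.end())

sent = 'What is first and second? what! *'
print(describe('first', sent))  # 'first' found at 8..13
print(describe('third', sent))  # no match for 'third'
```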

Replace


import re

pat = 'first'
sent = 'What is first and second? what! *'

# sub() returns a new string; the original is left unchanged
result = re.sub(pat, "1st", sent)

print(result)

output

 
What is 1st and second? what! *
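`re.sub()` replaces every occurrence by default. The sketch below, using a made-up sample sentence, shows the optional `count` argument and a function used as the replacement.

```python
import re

sent = 'first come, first served'  # hypothetical sample text

# Default: every occurrence is replaced.
replaced_all = re.sub('first', '1st', sent)

# count=1 limits the substitution to the first occurrence.
replaced_once = re.sub('first', '1st', sent, count=1)

# The replacement can be a function that receives the match object.
shouted = re.sub('first', lambda m: m.group().upper(), sent)

print(replaced_all)   # 1st come, 1st served
print(replaced_once)  # 1st come, first served
print(shouted)        # FIRST come, FIRST served
```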

\d	[0-9]
\D	[^0-9]
\s	Any whitespace-> [ \t\n\r\f\v]
\S	Any non-whitespace-> [^ \t\n\r\f\v]
\w	Any alphanumeric-> [a-zA-Z0-9_]
\W	Any non-alphanumeric-> [^a-zA-Z0-9_]
[\s,.]	Any whitespace or ',' or '.'
.	Anything except a newline
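A short sketch of a few of these classes in action. The sample text is made up, and `re.split()` is used here only to show the `[\s,.]` class as a separator. Raw strings (`r'...'`) keep the backslashes intact.

```python
import re

text = 'Room 42, floor 3.'  # hypothetical sample text

digits = re.findall(r'\d', text)        # each single digit
words = re.findall(r'\w+', text)        # runs of alphanumerics/underscore
pieces = re.split(r'[\s,.]+', text)     # split on whitespace, ',' or '.'
no_newline = re.findall(r'.', 'a\nb')   # '.' skips the newline

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'floor', '3']
print(pieces)      # ['Room', '42', 'floor', '3', '']
print(no_newline)  # ['a', 'b']
```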

Repeating things

ca*t	ct, cat, caaat, etc.
a[bcd]*b	a, then zero or more letters from [bcd], then b
ca+t	cat, caat, but won't match ct
home-?brew	homebrew or home-brew
a/{1,3}b	a/b, a//b, a///b, but won't match ab
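The table above can be checked with a small sketch. It assumes Python 3.4 or later for `re.fullmatch()`, and `matches` is a hypothetical helper that keeps only the candidates matched in full.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = matches(r'ca*t', ['ct', 'cat', 'caaat', 'cbt'])
plus = matches(r'ca+t', ['ct', 'cat', 'caat'])
optional = matches(r'home-?brew', ['homebrew', 'home-brew', 'home--brew'])
counted = matches(r'a/{1,3}b', ['ab', 'a/b', 'a///b', 'a////b'])

print(star)      # ['ct', 'cat', 'caaat']
print(plus)      # ['cat', 'caat']
print(optional)  # ['homebrew', 'home-brew']
print(counted)   # ['a/b', 'a///b']
```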

Compile

import re

p = re.compile('ab*c')
match = p.search("Where is abc?")   # keep the match object returned by search()
print(match.group())

output

abc
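Compiling is mostly useful when the same pattern is applied many times. A minimal sketch with a made-up list of input lines:

```python
import re

# Compile once, then reuse the pattern object on several strings.
pattern = re.compile('ab*c')

# Hypothetical sample input.
lines = ['Where is abc?', 'ac is enough', 'no luck here', 'abbbc too']

hits = []
for line in lines:
    match = pattern.search(line)
    if match:                      # skip lines with no match
        hits.append(match.group())

print(hits)  # ['abc', 'ac', 'abbbc']
```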

More Metacharacters


Grouping

Groups are marked by the '(' and ')' metacharacters. For example, (ab)* matches zero or more repetitions of ab.

p = re.compile('(ab)*')
print(p.match('ababababab').span())

output

(0, 10)

p = re.compile('(a)b')
m = p.match('ab')
m.group()

output

'ab'
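Besides the whole match, each group can be retrieved by number. The sketch below uses a made-up `'10-20'` input to show `group()`, `groups()` and `span()` with group indexes, and that a repeated group keeps only its last repetition.

```python
import re

# Two groups: group(0) is the whole match, group(1) and group(2) the parts.
m = re.match(r'(\d+)-(\d+)', '10-20')   # hypothetical sample input
whole = m.group(0)
first, second = m.group(1), m.group(2)
both = m.groups()
second_span = m.span(2)

# A repeated group only remembers its last repetition.
rep = re.match('(ab)*', 'ababab')
last_rep = rep.group(1)

print(whole)          # 10-20
print(first, second)  # 10 20
print(both)           # ('10', '20')
print(second_span)    # (3, 5)
print(last_rep)       # ab
```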

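a|b	"Or" operator: matches any string that matches either a or b.
^	Matches at the beginning of the string; in MULTILINE mode, also at the beginning of each line.
$	Matches at the end of the string (or just before a trailing newline); in MULTILINE mode, also at the end of each line.
\A	Matches only at the start of the string.
\Z	Matches only at the end of the string.
\b	Word boundary: the empty string at the beginning or end of a word.
\B	The empty string when not at the beginning or end of a word.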
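A short sketch of these metacharacters on a made-up two-line string. The sample text and variable names are only for illustration.

```python
import re

text = 'cat\nconcat cat'  # hypothetical sample: two lines

either = re.findall(r'dog|cat', 'cat and dog')       # alternation

start_only = re.findall(r'^c\w+', text)              # ^ at string start only
each_line = re.findall(r'^c\w+', text, re.MULTILINE) # ^ at every line start

whole_words = re.findall(r'\bcat\b', text)           # 'cat' as a full word
inside_word = re.findall(r'\Bcat\b', text)           # 'cat' ending 'concat'

at_end = re.search(r'cat\Z', text).start()           # only the final 'cat'

print(either)       # ['cat', 'dog']
print(start_only)   # ['cat']
print(each_line)    # ['cat', 'concat']
print(whole_words)  # ['cat', 'cat']
print(inside_word)  # ['cat']
print(at_end)       # 11
```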

Chris IJ Hwang
