Chris IJ Hwang

I am a Quantitative Analyst/Developer and Data Scientist with a background in the finance, education, and IT industries. This site contains some exercises, projects, and studies that I have worked on. If you have any questions, feel free to contact me at ih138 at columbia dot edu.





Regular Expression

Introduction

The re module was added in Python 1.5 and provides Perl-style regular expression patterns. Its predecessor, the regex module, was removed completely in Python 2.5.

Match, Search, and more

`match()` checks for a match only at the beginning of the string.

`search()` checks for a match anywhere in the string.

`findall()` finds all substrings where the RE matches and returns them as a list.

`finditer()` finds all substrings where the RE matches and returns them as an iterator of match objects.
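
To make the difference between the four functions concrete, here is a minimal sketch that runs each of them against the same sentence. The pattern `[Ww]hat` is only an illustrative choice, and the code assumes Python 3 (`print()` as a function).

```python
import re

sent = 'What is first and second? what! *'

# match() only looks at the start of the string, so 'first' is not found
print(re.match('first', sent))             # None
print(re.match('What', sent).group())      # What

# search() scans the whole string and returns the first hit
print(re.search('first', sent).span())     # (8, 13)

# findall() returns every non-overlapping match as a list of strings
print(re.findall('[Ww]hat', sent))         # ['What', 'what']

# finditer() yields match objects, so positions are available as well
spans = [m.span() for m in re.finditer('[Ww]hat', sent)]
print(spans)                               # [(0, 4), (26, 30)]
```

`match()` and `search()` return `None` when nothing is found, which is why the result is usually tested with `if` before it is used.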


import re

pat = ['first', 'second']
sent = 'What is first and second? what! *'

# search() returns a match object, or None when the pattern is absent
match = re.search(pat[0], sent)
if match:
    print("found")
else:
    print("not found")

import re

pat = ['first', 'second']
sent = 'What is first and second? what! *'

# Try each pattern in turn; both occur in the sentence
for p in pat:
    if re.search(p, sent):
        print("found")
    else:
        print("not found")

Deal with return value

import re

pat = 'first'
sent = 'What is first and second? what! *'

match = re.search(pat, sent)
s = match.start()  # index of the first matched character
e = match.end()    # index one past the last matched character

print('"%s" is found in "%s" from index %d to %d ("%s")' %
      (match.re.pattern, match.string, s, e, sent[s:e]))
print("match.group(): %s" % match.group())

output

"first" is found in "What is first and second? what! *" from index 8 to 13 ("first")
match.group(): first
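
The example above assumes the pattern is found. When it is not, `re.search()` returns `None`, and calling `start()` or `group()` on it raises `AttributeError`. A minimal sketch of a guard follows; the helper name `describe` is hypothetical and used only for illustration.

```python
import re

def describe(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:
        # No match object exists, so start()/end()/group() cannot be called
        return "no match for %r" % pattern
    start, end = match.span()  # same values as match.start(), match.end()
    return "%r found at [%d:%d]" % (match.group(), start, end)

sent = 'What is first and second? what! *'
print(describe('first', sent))   # 'first' found at [8:13]
print(describe('third', sent))   # no match for 'third'
```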

Replace

import re

pat = 'first'
sent = 'What is first and second? what! *'

# sub() returns a new string with every match of pat replaced
result = re.sub(pat, "1st", sent)

print(result)

output

What is 1st and second? what! *
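
By default `re.sub()` replaces every occurrence. As a small sketch of two related options, `count` limits the number of replacements and `re.subn()` also reports how many were made. The sample sentence below is made up for illustration.

```python
import re

sent = 'first come, first served'  # hypothetical sample sentence

# Replace every occurrence (the default)
print(re.sub('first', '1st', sent))            # 1st come, 1st served

# Replace only the leftmost occurrence
print(re.sub('first', '1st', sent, count=1))   # 1st come, first served

# subn() returns the new string together with the replacement count
new_sent, n = re.subn('first', '1st', sent)
print(n)                                       # 2
```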
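
`\d` matches any decimal digit; equivalent to `[0-9]`.

`\D` matches any non-digit; equivalent to `[^0-9]`.

`\s` matches any whitespace character; equivalent to `[ \t\n\r\f\v]`.

`\S` matches any non-whitespace character; equivalent to `[^ \t\n\r\f\v]`.

`\w` matches any alphanumeric character or underscore; equivalent to `[a-zA-Z0-9_]`.

`\W` matches any character that is not alphanumeric or underscore; equivalent to `[^a-zA-Z0-9_]`.

`[\s,.]` matches any whitespace character, `,`, or `.`.

`.` matches anything except a newline.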
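
The classes above can be checked quickly with `findall()`. This is a minimal sketch on a made-up sample string; the patterns are written as raw strings (`r'...'`) so that Python does not interpret the backslashes itself.

```python
import re

text = 'a1 b_2,c.3'  # hypothetical sample string

print(re.findall(r'\d', text))       # ['1', '2', '3']
print(re.findall(r'\s', text))       # [' ']
print(re.findall(r'\w', text))       # ['a', '1', 'b', '_', '2', 'c', '3']
print(re.findall(r'\W', text))       # [' ', ',', '.']
print(re.findall(r'[\s,.]', text))   # [' ', ',', '.']

# '.' matches every character except the newline
print(re.findall(r'.', 'a\nb'))      # ['a', 'b']
```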
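
Repeating things

`ca*t` matches `ct`, `cat`, `caaat`, and so on (zero or more `a`).

`a[bcd]*b` matches `a`, then zero or more letters from `[bcd]`, then `b`.

`ca+t` matches `cat`, `caat`, and so on, but not `ct` (one or more `a`).

`home-?brew` matches `homebrew` or `home-brew` (the `-` is optional).

`a/{1,3}b` matches `a/b`, `a//b`, and `a///b`, but not `ab` (one to three `/`).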
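
The repetition rules above can be verified with a short sketch. It uses `re.fullmatch()` (available in Python 3.4 and later) so that a candidate only counts when the whole string matches; the helper name `matches` and the candidate lists are assumptions made for illustration.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

print(matches('ca*t', ['ct', 'cat', 'caaat', 'cbt']))
# ['ct', 'cat', 'caaat']
print(matches('a[bcd]*b', ['ab', 'abcdb', 'abeb']))
# ['ab', 'abcdb']
print(matches('ca+t', ['ct', 'cat', 'caat']))
# ['cat', 'caat']
print(matches('home-?brew', ['homebrew', 'home-brew', 'home--brew']))
# ['homebrew', 'home-brew']
print(matches('a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b']))
# ['a/b', 'a//b', 'a///b']
```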

Compile

import re

# Compile the pattern once, then reuse the pattern object
p = re.compile('ab*c')
match = p.search("Where is abc?")
print(match.group())

output

abc

More Metacharacters

Grouping

Groups are marked by the '(', ')' metacharacters. (ab)* will match zero or more repetitions of ab.

import re

p = re.compile('(ab)*')
# span() returns the (start, end) indices of the match
print(p.match('ababababab').span())

output

(0, 10)

p = re.compile('(a)b')
m = p.match('ab')
print(m.group())

output

ab
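
`group()` with no argument returns the whole match. The text captured by each pair of parentheses is available by number, counting from 1. A minimal sketch follows; the second pattern and its sample text are made up for illustration.

```python
import re

m = re.match('(a)b', 'ab')
print(m.group(0))   # ab  (the whole match, same as m.group())
print(m.group(1))   # a   (text captured by the first group)
print(m.groups())   # ('a',)

# Two groups: hypothetical pattern for a page range such as "10-25"
m2 = re.search(r'(\d+)-(\d+)', 'pages 10-25')
print(m2.group(1), m2.group(2))   # 10 25
print(m2.groups())                # ('10', '25')
```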
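
`a|b` is the "or" operator: it matches any string that matches either `a` or `b`.

`^` matches at the beginning of the string; with the `re.MULTILINE` flag it also matches at the beginning of each line.

`$` matches at the end of the string or just before a trailing newline; with the `re.MULTILINE` flag it also matches at the end of each line.

`\A` matches only at the start of the string.

`\Z` matches only at the end of the string.

`\b` matches a word boundary, that is, the empty string at the beginning or end of a word.

`\B` matches the empty string when it is not at the beginning or end of a word.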
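
The anchors are easiest to tell apart on a string that contains a newline. The sketch below uses made-up sample strings and shows how `re.MULTILINE` changes `^` and `$` while `\A` stays tied to the start of the string.

```python
import re

text = 'first line\nsecond line'  # hypothetical two-line sample

# Without MULTILINE, ^ only matches at the start of the string
print(re.findall(r'^\w+', text))                  # ['first']

# With MULTILINE, ^ and $ also match at each line boundary
print(re.findall(r'^\w+', text, re.MULTILINE))    # ['first', 'second']
print(re.findall(r'\w+$', text, re.MULTILINE))    # ['line', 'line']

# \A ignores MULTILINE and still matches only at the very start
print(re.findall(r'\A\w+', text, re.MULTILINE))   # ['first']

# \b requires a word boundary, \B requires its absence
words = 'line up the outline'
print(re.search(r'\bline\b', words).start())      # 0
print(re.search(r'\Bline', words).start())        # 15

# a|b matches either alternative
print(re.findall('a|b', 'cab'))                   # ['a', 'b']
```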