Python For Everyone: Advanced Course
by @jeremy_carson
Presentation developed with Remark

Details

This presentation was developed with Remark, a Markdown presentation API.

Course materials
  Python Advanced Course [HTML] [PDF]
  Raw Markdown Data [course data]
  Source code for presentation and any examples [link]

License

Unless otherwise specified, all Python Presentation source code is released under the MIT license.

This tutorial uses examples from several sources; I tried to provide attribution where applicable. Many online tutorials I referenced do not include an explicit license, so I attributed them to their websites.

The MIT License (MIT)

Copyright (c) 2014 Jeremy Carson

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Advanced Fundamentals

Regular Expressions

Regular expressions (regexes, REs, regex patterns) are an algebraic way of describing patterns in text.

Features:
  Incredibly useful for parsing semistructured data
  Pattern matching/searching (e.g. exact match vs. partial match)
  String splitting
  Substring extraction

Semistructured data:

    name,phone
    jenny,867-5309
    other,867-5308
    jason,1234567
    mike,(978) 407-1866

Row = (Name, PhoneNumber)
Phone numbers are not well formatted (but are good enough for REs).
No str.find() call is smart enough to extract these data. That requires pattern matching.
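To make the str.find() limitation concrete, here is a minimal sketch using the sample rows above. The `\d{3}-\d{4}` pattern is only an illustrative assumption (it is not the pattern the course settles on), and the dashes are assumed to be plain ASCII hyphens.

```python
import re

# Sample rows copied from the slide (header row omitted).
rows = [
    "jenny,867-5309",
    "other,867-5308",
    "jason,1234567",
    "mike,(978) 407-1866",
]

# str.find() only reports where a fixed substring starts (-1 if absent).
# It says nothing about whether the text around it looks like a phone number.
dash_positions = [row.find("-") for row in rows]

# A regex describes a shape instead: three digits, a dash, four digits.
# Illustrative pattern only; it deliberately misses "1234567".
seven_digit = re.compile(r"\d{3}-\d{4}")
shaped = [bool(seven_digit.search(row)) for row in rows]

print(dash_positions)  # [9, 9, -1, 14]
print(shaped)          # [True, True, False, True]
```

The position list tells us where a dash sits, but only the pattern can say "this part of the row looks like a phone number".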
Regular expressions are a language unto themselves. Read up:
  Official Python Regex Tutorial [link]
  Official Python Regex Documentation [link]
  Good Regular Expression Site [link]

Example:

    \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

Regex metacharacters:

    . ^ $ * + ? { } [ ] \ | ( )

The descriptions here are not definitive. Several metacharacters have different meanings depending on context. See the [official docs] for more.

[]
  Matches any one of the characters inside the square brackets, for ONE character position.
  A - inside brackets is a range separator: [abcdefg] could be written as [a-g].
  A ^ at the start of the brackets negates the class: [^abc] matches any character except a, b, or c.

\
  Escape character; introduces sequences with special meaning, e.g. \d is equivalent to [0-9].
  You can also use the backslash to remove special meanings, e.g. finding square brackets: [\[\]]
  Special sequences can be included in character classes, e.g. [\s,.] matches any whitespace character, a comma, or a period.

*
  Not a wildcard. It means that the previous character can be matched 0 or more times, e.g. ca*t matches "ct", "cat", "caaaaaaaat".

+
  The previous character can be matched 1 or more times.

Let's go back to the phone numbers:

    data = [
        "jenny,867-5309",
        "other,867-5308",
        "jason,1234567",
        "mike,(978) 407-1866",
    ]

Attempt 1

    import re

    p = re.compile(r"\d+-\d+")
    for phone in data:
        result = p.findall(phone)
        if result:
            print(result)

re.compile(<regex>)
  Creates a new regular expression program (a compiled pattern) from a regex string.
  Prefix regex strings with r (raw strings) to avoid problems parsing \
  r"\test" is equivalent to "\\test"
  This is important for regexes: \d is to be evaluated by the regex engine, not by Python.

findall(<string>)
  Returns all substrings of <string> that match the regex.

Attempt 2

    p = re.compile(r"\d+-*\d+")
    for phone in data:
        result = p.findall(phone)
        if result:
            print(result)

Better, but it doesn't get the last phone number (the area code comes out as a separate match).
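Attempt 1 and Attempt 2 can be compared side by side with a short sketch like the one below. It assumes the dashes in the data are plain ASCII hyphens; the names `attempt_1`, `attempt_2`, `results_1` and `results_2` are introduced here only for illustration.

```python
import re

data = [
    "jenny,867-5309",
    "other,867-5308",
    "jason,1234567",
    "mike,(978) 407-1866",
]

attempt_1 = re.compile(r"\d+-\d+")   # digits, exactly one dash, digits
attempt_2 = re.compile(r"\d+-*\d+")  # digits, zero or more dashes, digits

# Collect the matches per row so the two patterns are easy to compare.
results_1 = [attempt_1.findall(row) for row in data]
results_2 = [attempt_2.findall(row) for row in data]

for row, first, second in zip(data, results_1, results_2):
    print(f"{row!r}: attempt 1 -> {first}, attempt 2 -> {second}")
```

Attempt 1 finds nothing for "jason,1234567" because it insists on a dash. Attempt 2 picks that row up, but for "mike,(978) 407-1866" it returns '978' and '407-1866' as two separate matches, since the parenthesis and space interrupt the run of digits.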
Attempt 3 (right off the slide???):

    (?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$

Found this online. I hope the user was just trolling, because this is awful.

Attempt 4 (work smarter):

    p = re.compile(r"(?:\d{10}|\d{7})")
    for phone in data:
        phone = re.sub(r"[()\- ]", "", phone)
        result = p.findall(phone)
        if result:
            print(result)

Why not remove all the whitespace and parens and just look for sequences of numbers?

re.compile(r"(?:\d{10}|\d{7})")
  Search for 10 digits in a row or 7 digits in a row.

re.sub(r"[()\- ]", "", phone)
  Remove spaces, parens, and dashes. This way all our phone numbers become just strings of digits.

match()
  Determine if the RE matches at the beginning of the string.
search()
  Scan through a string, looking for any location where this RE matches.
findall()
  Find all substrings where the RE matches, and return them as a list.
finditer()
  Find all substrings where the RE matches, and return them as an iterator.

match and search return None if no match is found.

findall pulls all substring matches:

    import re

    p = re.compile("[a-z]{3}")  # match 3 letters
    result = p.findall("abcdefghi")
    print(result)  # prints ['abc', 'def', 'ghi']

finditer pulls all substring match objects:

    p = re.compile("[a-z]{3}")  # match 3 letters
    result = p.finditer("abcdefghi")
    print([i.group() for i in result])

group() always returns the match.
Note: result is an iterator; once used, it cannot be used again.

Up until now we've just been matching strings. ( ) can be used to extract substrings from matches:

    # Match groups of 3 letters, assume there are 3 groups.
    p = re.compile(r"(?P<id>[a-z]{3})([a-z]{3})([a-z]{3})")
    result_iterator = p.finditer("abcdefghi")
    for result in result_iterator:
        print(result.group(0))     # abcdefghi - always the full match
        print(result.group(1))     # abc
        print(result.group(2))     # def
        print(result.group(3))     # ghi
        print(result.group("id"))  # abc

There are lots more ways of doing groups. Groups with ? at the beginning are treated differently:

(?P<id>...)
  Creates a named group, which allows result.group("id").
(?:...)
  Creates a noncapturing group, e.g. (?:[abc]|). Useful if you are using groups for structure and want to avoid capturing unnecessary data.

Regular expressions are a completely new language and require time and practice. The best advice I can give is: when creating a regex, work through your desired comparison left to right. Start by matching the pieces first, then the whole.
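As one way to apply the "pieces first, then the whole" advice, the sketch below builds a pattern for the name,phone rows from three small pieces and uses named groups to pull the fields out. The helper `parse_row` and the piece names are hypothetical, names are assumed to be lowercase letters only, and the cleanup step reuses the re.sub call from Attempt 4.

```python
import re

# Piece 1: the name, assumed here to be lowercase letters only.
name_piece = r"(?P<name>[a-z]+)"
# Piece 2: a literal comma separates the two fields.
separator_piece = r","
# Piece 3: after cleanup, the phone is 10 digits or 7 digits in a row.
phone_piece = r"(?P<phone>\d{10}|\d{7})"

# The whole pattern is just the pieces joined left to right.
row_pattern = re.compile(name_piece + separator_piece + phone_piece)


def parse_row(row):
    """Return (name, digits) for a row, or None if the row does not match."""
    cleaned = re.sub(r"[()\- ]", "", row)  # same cleanup as Attempt 4
    match = row_pattern.fullmatch(cleaned)
    if match is None:
        return None
    return match.group("name"), match.group("phone")


data = [
    "jenny,867-5309",
    "other,867-5308",
    "jason,1234567",
    "mike,(978) 407-1866",
]

parsed = [parse_row(row) for row in data]
print(parsed)
```

Each piece can be tested on its own before the pieces are joined, which is usually easier than debugging one long expression. Note that the cleanup also strips spaces from the name field, so this sketch would need adjusting for names that contain spaces.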