ebook img

Python Regular Expressions: A Little Guide PDF

53 Pages·2018·1.367 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Python Regular Expressions: A Little Guide

Python REGEX A Little Guide Scientific Programmer Thisbookisforsaleathttp://leanpub.com/pythonregex Thisversionwaspublishedon2018-09-09 ThisisaLeanpubbook.Leanpubempowersauthorsand publisherswiththeLeanPublishingprocess.LeanPublishingis theactofpublishinganin-progressebookusinglightweighttools andmanyiterationstogetreaderfeedback,pivotuntilyouhave therightbookandbuildtractiononceyoudo. ©2018ScientificProgrammer Contents PythonRegEx . . . . . . . . . . . . . . . . . . . . . . . . . 1 Pythonregexmatchfunction . . . . . . . . . . . . . . . . 4 Pythonregexsearchfunction . . . . . . . . . . . . . . . . 8 Pythonregexmatchvs.searchfunctions . . . . . . . . . 12 Pythonregexgroupfunctions . . . . . . . . . . . . . . . . 14 Pythonregexsubfunctionforsearchandreplace . . . . 16 Pythonregexsplitfunction . . . . . . . . . . . . . . . . 18 Pythonregexfindallfunction . . . . . . . . . . . . . . . 19 Pythonregexcompilefunction . . . . . . . . . . . . . . . 21 Pythonregexfinditerfunction . . . . . . . . . . . . . . 24 PythonRegex-LookaroundsandGreedySearch . . . . 25 PythonLookahead . . . . . . . . . . . . . . . . . . . . . . 27 PythonLookbehind . . . . . . . . . . . . . . . . . . . . . 28 PythonLazyandGreedySearch . . . . . . . . . . . . . . 29 Project 2: Parsing data from a HTML file with Python andREGEX . . . . . . . . . . . . . . . . . . . . . . 32 Project3:PDFscrapinginPython+REGEX . . . . . . . 34 Project4:WebscrapinginPython+REGEX . . . . . . . 37 Project5:AmazonwebcrawlinginPython+REGEX . . 40 Quiz#1-REGEXPatterns . . . . . . . . . . . . . . . . . 42 Quiz#2PythonREGEXFunctions . . . . . . . . . . . . . 45 References . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 CONTENTS 1 Python RegEx Hellocoders!Let’sstartourquestwithregularexpressions(RegEx). InPython,themodulereprovidesfullsupportforPerl-likeregular expressions in Python. We need to remember that there are many characters in Python, which would have special meaning when they are used in regular expression. To avoid bugs while dealing withregularexpressions,weuserawstringsasr'expression'. The re module in Python provides multiple methods to perform queries on an input string. Here are the most commonly used methods: • re.match() • re.search() • re.split() • re.sub() • re.findall() • re.compile() We will look at these function and related flags with examples in thenextsection. Python Regular Expression Patterns List Thefollowingtableliststheregularexpressionsyntaxthatisavail- able in Python. Note that any Regex can be concatenated to form new regular expressions; if X and Y are both regular expressions, thenXYisalsoaregularexpression. CONTENTS 2 Pattern Description . Matchesanysinglecharacterexcept newline.Usingmoptionallowsittomatch newlineaswell. ^ Matchesthestartofthestring,andin re.MULTILINE(seethenextlessononhowto changetomultiline)modealsomatches immediatelyaftereachnewline. $ Matchesendofline.Inre.MULTILINEmode alsomatchesbeforeanewline. [.] Matchesanysinglecharacterinbrackets. [^.] Matchesanysinglecharacternotinbrackets. * Matches0ormoreoccurrencesofpreceding expression. + Matches1ormoreoccurrenceofpreceding expression. ? Matches0or1occurrenceofpreceding expression. {n} Matchesexactlynnumberofoccurrencesof precedingexpression. {n,} Matchesnormoreoccurrencesofpreceding expression. {n, m} Matchesatleastnandatmostmoccurrences ofprecedingexpression.Forexample,x{3,5} willmatchfrom3to5'x'characters. Pattern Description xy Matcheseitherxory. \d Matchesdigits.Equivalentto[0-9]. \D Matchesnondigits. \w Matcheswordcharacters. \W Matchesnonwordcharacters. \z Matchesendofstring. \G Matchespointwherelastmatchfinished. \b Matchestheemptystring,butonlyatthe beginningorendofaword.Boundary betweenwordandnon-wordand/Bis oppositeof /b.Exampler"\btwo\b"for searchingtwofrom'one two three'. CONTENTS 3 Pattern Description \B Matchesnonwordboundaries. \n, \t Matchesnewlines,carriagereturns,tabs,etc. \s Matcheswhitespace. \S Matchesnonwhitespace. \A Matchesbeginningofstring. \Z Matchesendofstring.Ifanewlineexists,it matchesjustbeforenewline. Groups and Lookarounds Moredetailslater: Pattern Description (re) Groupsregularexpressionsandremembers matchedtext. (?: re) Groupsregularexpressionswithout rememberingmatchedtext.Forexample, theexpression(?:x{6})*matchesany multipleofsix‘x’characters. (?#...) Comment. (?= ...) Matchesif...matchesnext,butdoesn’t consumeanyofthestring.Thisiscalleda lookaheadassertion.Forexample, Scientific (?=Python)willmatch Scientificonlyifit’sfollowedbyPython. (?!...) Matchesif...doesn’tmatchnext.Thisisa negativelookaheadassertion. (?<=...) Matchesifthecurrentpositioninthestring isprecededbyamatchfor...thatendsat thecurrentposition. CONTENTS 4 Python regex match function The match function attempts to match a re pattern to string with optionalflags. Hereisthesyntaxforthisfunction− 1 re.match(pattern, string, flags=0) Where, • patternistheregularexpressiontobematched, • stringisthestringtobesearchedtomatchthepatternatthe beginningofstringand • flags,whichyoucanspecifydifferentflagsusingbitwiseOR (|). Match Flags Modifier Description re.I Performscase-insensitivematching. re.L Interpretswordsaccordingtothecurrent locale.Thisinterpretationaffectsthe alphabeticgroup(\wand\W),aswellas wordboundarybehavior(\band\B). re.M Makes$matchtheendofalineandmakes ^matchthestartofanyline. re.S Makesaperiod(dot)matchanycharacter, includinganewline. re.U InterpretslettersaccordingtotheUnicode characterset.Thisflagaffectsthebehavior of \w,\W,\b,\B. re.X Itignoreswhitespace(exceptinsideaset[] orwhenescapedbyabackslashandtreats unescaped#asacommentmarker. CONTENTS 5 Return values • There.matchfunctionreturnsamatch objectonsuccessand Noneuponfailure.- • Use group(n) or groups() function of match object to get matchedexpression,e.g.,group(n=0)returnsentirematch(or specificsubgroupn) • The function groups() returns all matching subgroups in a tuple(emptyifthereweren’tany). Example1 Let’sfindthewordsbeforeandafterthewordto: 1 #!/usr/bin/python 2 import re 3 4 line = "Learn to Analyze Data with Scientific Python"; 5 6 m = re.match( r'(.*) to (.*?) .*', line, re.M|re.I) 7 8 if m: 9 print "m.group() : ", m.group() 10 print "m.group(1) : ", m.group(1) 11 print "m.group(2) : ", m.group(2) 12 else: 13 print "No match!!" Thefirstgroup(.*)identifiedthestring:Learnandthenextgroup (*.?)identifiedthestring:Analyze.Output: CONTENTS 6 1 m.group() : Learn to Analyze Data with Scientific Python 2 m.group(1) : Learn 3 m.group(2) : Analyze Example2 groups([default])returnsatuplecontainingallthesubgroupsof thematch,from1uptohowevermanygroupsareinthepattern. 1 #!/usr/bin/python 2 import re 3 4 line = "Learn Data, Python"; 5 6 m = re.match( r'(\w+) (\w+)', line, re.M|re.I) 7 8 if m: 9 print "m.group() : ", m.groups() 10 print "m.group (1,2)", m.group(1, 2) 11 else: 12 print "No match!!" Output: 1 m.group() : ('Learn', 'Data') 2 m.group (1,2) ('Learn', 'Data') Example3 groupdict([default])returnsadictionarycontainingallthenamed subgroupsofthematch,keyedbythesubgroupname. CONTENTS 7 1 #!/usr/bin/python 2 import re 3 4 number = "124.13"; 5 6 m = re.match( r'(?P<Expotent>\d+)\.(?P<Fraction>\d+)', nu\ 7 mber) 8 9 if m: 10 print "m.groupdict() : ", m.groupdict() 11 else: 12 print "No match!!" Output: m.groupdict() : {'Expotent': '124', 'Fraction': '13'} Example4 Start, end. How can we match the start or end of a string? We can use the “A” and “Z” metacharacters. We precede them with a backslash. We match strings that start with a certain letter, and thosethatendwithanother. 1 import re 2 3 values = ["Learn", "Live", "Python"]; 4 5 for value in values: 6 # Match the start of a string. 7 result = re.match("\AL.+", value) 8 if result: 9 print("START MATCH [L]:", value) 10 11 # Match the end of a string. 12 result2 = re.match(".+n\Z", value)

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.