Regex for matching Python multiline string with escaped characters
I'm parsing Python source code, and I've got regular expressions for single- and double-quoted strings (obtained by reading ridgerunner's answer in this thread).
single_quote_re = "'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'"
double_quote_re = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"'
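As a quick sanity check of the two patterns above, here is a minimal sketch that runs them over one line of source. The sample `line` is made up for illustration; only the two regexes come from the question.

```python
import re

# The two patterns from the question, copied as written.
single_quote_re = "'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'"
double_quote_re = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"'

# Hypothetical sample input: one single-quoted and one double-quoted
# literal, each containing a backslash-escaped quote.
line = r'''a = 'it\'s'; b = "say \"hi\""'''

# findall returns the contents of capture group 1, i.e. the string
# body without the surrounding quotes.
singles = re.findall(single_quote_re, line)
doubles = re.findall(double_quote_re, line)

print(singles)
print(doubles)
```

Each pattern consumes runs of ordinary characters and lets `\\.` step over any escaped character, so an escaped quote never ends the match.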
I'm trying to handle Python multiline strings (three double-quotes).
s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string 2 here"""'
# correct output of findall should be:
# ['string one\'s end isn\'t here; \\""" it\'s here ','string 2 here']
I tried messing around a bit, but it's still not right.
multiline_string_re = '"""([^(""")\\\\]*(?:\\\\.[^(""")\\\\]*)*)"""'
There's gotta be a way to say that """ isn't preceded by a backslash (in other words, the first double-quote isn't escaped).
Edit: I should be getting closer; I've tried the following:
r'(?<!\\)""".*(?<!\\)"""'
# matches the entire string; not what I'm going for.
r'(?<!\\)"""[^((?<!\\)""")](?<!\\)"""'
# matches just the space between the 2 strings ('""" """') in the sample string s (see the code above, prior to the edit).
r'(?<!\\)"""([^((?<!\\)""")]*(?:\\.[^((?<!\\)""")]*)*)(?<!\\)"""'
# same result as before, but with the triple quotes shaved off (' ').
# note: I do indeed want the triple quotes excluded.
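One likely reason these attempts fall short: inside square brackets, `(`, `)` and repeated quotes are ordinary characters, so `[^(""")\\]` is a plain character class and not "anything except a triple quote". The sketch below checks this against the first multiline attempt and the sample string from the question; the variable names `char_class` and `attempt_re` are mine.

```python
import re

# Inside [...] the parentheses and the repeated quotes are ordinary
# characters, so this class just excludes '(', ')', '"' and '\'.
char_class = re.compile(r'[^(""")\\]')

# The question's first multiline attempt, copied as written.
attempt_re = '"""([^(""")\\\\]*(?:\\\\.[^(""")\\\\]*)*)"""'

# The sample string from the question.
s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string 2 here"""'

print(char_class.match('a'))   # a match object: 'a' is allowed
print(char_class.match('('))   # None: '(' is excluded by the class
print(char_class.match('"'))   # None: a single quote char is excluded too

# The body can never step over a bare double quote, so the first
# literal is not matched in full.
print(re.findall(attempt_re, s))
```

Because the class rejects every double quote, the body has no way to consume the two quotes left over after the escaped one, which is the gap the answer's `""` alternative fills.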
Update: the solution, thanks to sln, appears to be """[^"\\]*(?:(?:\\.|"")[^"\\]*)*"""
multiline_string_re = '"""[^"\\\\]*(?:(?:\\\\.|"")[^"\\\\]*)*"""'
re.findall(multiline_string_re, s, re.DOTALL)
# result:
# ['"""string one\'s end isn\'t here; \\""" it\'s here """', '"""string 2 here"""']
The updated solution, again from sln:
multiline_single_re = "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
multiline_double_re = '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'
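To see what the `{1,2}(?!...)` part buys, here is a minimal sketch that joins the two updated patterns with `|` and runs them over a made-up snippet with lone and doubled quotes inside the literals. The `combined` pattern, the `find_triple_quoted` helper and the `source` sample are illustrative assumptions, not part of the original answer.

```python
import re

# The two updated patterns, copied as written.
multiline_single_re = "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
multiline_double_re = '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'

# Assumption: joining them with | is enough for a first pass.
# DOTALL lets the escape branch (\\.) also cover backslash-newline.
combined = re.compile(multiline_double_re + "|" + multiline_single_re, re.DOTALL)


def find_triple_quoted(source):
    """Return every triple-quoted literal in source, delimiters included."""
    # No capture groups in the pattern, so findall yields whole matches.
    return combined.findall(source)


# Hypothetical input: quotes inside the body, an escaped quote, and a
# single-quoted block that spans two lines.
source = 'x = """say "hi" and ""bye"" \\""" done"""\ny = \'\'\'it\'s\nfine\'\'\''

for literal in find_triple_quoted(source):
    print(literal)
```

One or two quotes are accepted inside the body only when no further quote follows, so a run of three quotes is always left for the closing delimiter.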
Here is a test case using a regex in Perl. If you're going to allow backslash escapes
as well as the doubled double-quote form "", just modify one of the
regexes you've cited to allow a doubled double quote.
The single-quote escaping has been removed from the source string.
use strict;
use warnings;

$/ = undef;          # slurp mode: read all of DATA at once
my $str = <DATA>;

while ($str =~ /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"/sg) {
    print "found $&\n";
}

__DATA__
"""string one's end isn't here; \""" it's here """ """string 2 here"""
Output >>
found """string one's end isn't here; \""" it's here """
found """string 2 here"""
Note that for validity and error processing, the regex would need to contain
pass-through constructs (alternation) that can be processed in the body of a while loop.
Example: /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)/sg
then
while () {
    # if group 1 matched, and it is not whitespace = possible error
}
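In Python, the same pass-through idea could be sketched with `re.finditer`. This is only an illustration under my own assumptions: the `scan` helper and its return shape are hypothetical, and only the regex itself comes from the answer.

```python
import re

# The answer's regex with the pass-through alternative: any character
# that is not part of a string literal lands in group 1.
token_re = re.compile(r'"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)', re.DOTALL)


def scan(text):
    """Split text into matched string literals and stray characters.

    Returns (strings, stray), where stray holds (offset, char) pairs
    for every non-whitespace character outside a literal.
    """
    strings, stray = [], []
    for m in token_re.finditer(text):
        if m.group(1) is None:
            # First alternative matched: a complete string literal.
            strings.append(m.group(0))
        elif not m.group(1).isspace():
            # Pass-through character that is not whitespace: possible error.
            stray.append((m.start(), m.group(1)))
    return strings, stray


strings, stray = scan('"""string 2 here""" x "abc"')
print(strings)
print(stray)
```

An unterminated literal fails the first alternative, so its opening quote falls through to group 1 and is reported as a stray character, which is where error handling would hook in.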
Add - In reply to comments.
After some research on Python block literals,
it appears you have to handle not only escaped characters, but also
up to 2 double quotes in the body, i.e. " or "".
To change the regex is simple: add a 1-2 quantifier and constrain it with a lookahead assertion.
Below are the raw and string'd regex parts that you can pick and choose from.
Tested in Perl, it works.
Good luck!
# Raw -
# (?s:
#   """[^"\\]*(?:(?:\\.|"{1,2}(?!"))[^"\\]*)*"""
#   |
#   '''[^'\\]*(?:(?:\\.|'{1,2}(?!'))[^'\\]*)*'''
# )

# String'd -
# '(?s:'
# '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'
# '|'
# "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
# ')'

(?s:                     # dot-all
   # double quote literal block
   """                   # """ block open
   [^"\\]*               # 0 - many chars, not " nor \
   (?:                   # grp start
       (?:
           \\ .          # escape
         |               # or
           "{1,2}        # 1 - 2 "
           (?! " )       # not followed by a "
       )
       [^"\\]*           # 0 - many chars, not " nor \
   )*                    # grp end, 0 - many times
   """                   # """ block close
 |                       # or,
   # single quote literal block
   '''                   # ''' block open
   [^'\\]*               # 0 - many chars, not ' nor \
   (?:                   # grp start
       (?:
           \\ .          # escape
         |               # or
           '{1,2}        # 1 - 2 '
           (?! ' )       # not followed by a '
       )
       [^'\\]*           # 0 - many chars, not ' nor \
   )*                    # grp end, 0 - many times
   '''                   # ''' block close
)
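For use from Python, the commented form above could be compiled with `re.VERBOSE`. The sketch below makes two changes of my own: the delimiters are written as `"{3}` and `'{3}` so the pattern fits inside a raw triple-quoted Python string, and the `(?s:` wrapper is replaced by the `re.DOTALL` flag. The name `block_literal_re` is hypothetical.

```python
import re

# Assumption: "{3} and '{3} stand in for the literal triple quotes so
# the pattern can live inside a raw '''...''' Python string.
block_literal_re = re.compile(r'''
    "{3}                  # block open: three double quotes
    [^"\\]*               # 0 - many chars that are not a quote or backslash
    (?:
        (?: \\.           # a backslash escape
          | "{1,2}(?!")   # or 1 - 2 quotes not followed by another quote
        )
        [^"\\]*
    )*
    "{3}                  # block close
  |
    '{3}                  # same structure for the single quote block
    [^'\\]*
    (?:
        (?: \\.
          | '{1,2}(?!')
        )
        [^'\\]*
    )*
    '{3}
''', re.VERBOSE | re.DOTALL)

# The sample string from the question.
s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string 2 here"""'

matches = block_literal_re.findall(s)
for literal in matches:
    print(literal)
```

The matches include the delimiters; if only the body is wanted, slicing each match with `[3:-3]` is one simple option.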