Regex for matching Python multiline string with escaped characters


I'm parsing Python source code, and I've got regular expressions for single- and double-quoted strings (obtained by reading ridgerunner's answer to this thread).

single_quote_re = "'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'"
double_quote_re = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"'
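As a quick check of what these two patterns capture, here is a minimal sketch. The `src` line is a made-up sample, not taken from real source code, and each pattern is run on its own, so a quote of one kind inside a string of the other kind would still confuse it.

```python
import re

# Patterns copied from the question; group 1 captures the string body.
single_quote_re = "'([^'\\\\]*(?:\\\\.[^'\\\\]*)*)'"
double_quote_re = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"'

# Hypothetical sample line (an assumption for illustration only).
src = r'''a = 'it\'s' + "say \"hi\"" + 'plain' '''

single_bodies = re.findall(single_quote_re, src)
double_bodies = re.findall(double_quote_re, src)

print(single_bodies)
print(double_bodies)
```

The escaped quotes stay inside the captured bodies because `\\.` consumes the backslash together with the character after it.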

I'm now trying to handle Python multiline strings (three double quotes).

s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string 2 here"""'
# correct output for findall should be:
#     ['string one\'s end isn\'t here; \\""" it\'s here ','string 2 here']

I tried messing around with it a bit, but it's still not right.

multiline_string_re = '"""([^(""")\\\\]*(?:\\\\.[^(""")\\\\]*)*)"""' 

There's got to be a way to say that """ isn't preceded by a backslash (in other words, that the first double quote isn't escaped).
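One likely reason the attempt above misbehaves: inside square brackets, parentheses and a repeated `"""` are not a group, just individual characters. The sketch below (the `sample` string is a made-up input) suggests that the attempted class rejects exactly the same characters as the simpler `[^()"\\]`.

```python
import re

# The character class from the attempt, written as a raw string.
attempt_class = r'[^(""")\\]'
# A class listing the same characters once each.
plain_class = r'[^()"\\]'

# Hypothetical sample containing every character of interest.
sample = 'ab(c)"d\\e'

rejected_by_attempt = [ch for ch in sample if not re.fullmatch(attempt_class, ch)]
rejected_by_plain = [ch for ch in sample if not re.fullmatch(plain_class, ch)]

print(rejected_by_attempt)
print(rejected_by_attempt == rejected_by_plain)
```

So the class excludes any single `"`, `(`, `)` or backslash. It cannot express "not the three-character sequence `"""`".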

Edit: I should be getting closer; I've tried the following:

r'(?<!\\)""".*(?<!\\)"""'
# matches the entire string; not what I'm going for.

r'(?<!\\)"""[^((?<!\\)""")](?<!\\)"""'
# matches just the space between the 2 strings ('""" """') in sample string s (see code above, prior to the edit).

r'(?<!\\)"""([^((?<!\\)""")]*(?:\\.[^((?<!\\)""")]*)*)(?<!\\)"""'
# same result as before, but with the triple quotes shaved off (' ').
# note: I do indeed want the triple quotes excluded.

Update: the solution, thanks to sln, appears to be """[^"\\]*(?:(?:\\.|"")[^"\\]*)*"""

import re

multiline_string_re = '"""[^"\\\\]*(?:(?:\\\\.|"")[^"\\\\]*)*"""'
re.findall(multiline_string_re, s, re.DOTALL)
# result:
# ['"""string one\'s end isn\'t here; \\""" it\'s here """', '"""string 2 here"""']
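Since the goal stated earlier was to get the bodies without the surrounding triple quotes, one possible variation is to wrap the body in a capture group. This is a sketch only; the name `multiline_body_re` and the added group are assumptions for illustration, not part of sln's answer.

```python
import re

# Sample string from the question.
s = '"""string one\'s end isn\'t here; \\""" it\'s here """ """string 2 here"""'

# Same pattern, with a capture group around the body (added here for
# illustration), so findall returns only the text between the delimiters.
multiline_body_re = r'"""([^"\\]*(?:(?:\\.|"")[^"\\]*)*)"""'

bodies = re.findall(multiline_body_re, s, re.DOTALL)
print(bodies)
```

With a single capture group, `re.findall` returns the group contents instead of the whole match.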

The updated solution, again thanks to sln:

multiline_single_re = "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
multiline_double_re = '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'
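To see what the `{1,2}` quantifier and lookahead buy, here is a small sketch that runs both patterns over bodies containing one and two unescaped quote characters. The two `*_src` strings are made-up inputs.

```python
import re

# Patterns copied from the updated solution.
multiline_single_re = "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
multiline_double_re = '"""[^"\\\\]*(?:(?:\\\\.|"{1,2}(?!"))[^"\\\\]*)*"""'

# Hypothetical inputs: bodies with single and doubled quotes inside.
double_src = '"""say "hi" and ""bye"" now""" tail'
single_src = "'''it's ''fine'' here''' tail"

double_found = re.findall(multiline_double_re, double_src, re.DOTALL)
single_found = re.findall(multiline_single_re, single_src, re.DOTALL)

print(double_found)
print(single_found)
```

A run of one or two quotes is accepted only when the next character is not another quote, so the closing delimiter is never swallowed into the body.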

Here is a test case using the regex in Perl. If you are going to allow the
escaped double quote in the form "", just modify one of the
regexes you've cited to allow a doubled double quote.

The source string has had its single-quote escaping removed.

 use strict;
 use warnings;

 # Slurp the whole DATA section in one read.
 $/ = undef;
 my $str = <DATA>;

 while ($str =~ /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"/sg)
 {
     print "found $&\n";
 }

__DATA__
"""string one's end isn't here; \""" it's here """ """string 2 here"""

Output >>

 found """string one's end isn't here; \""" it's here """
 found """string 2 here"""

Note that for validity and error processing, the regex would need to contain
pass-through constructs (alternation) that can be processed in the body of the while loop.
For example, /"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)/sg,
then
while(){
// if group 1 matched, and it is not whitespace = possible error
}
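The same pass-through idea can be sketched in Python with `re.finditer`. The names `token_re`, `strings` and `suspects` and the input `text` are made up for illustration; only the pattern itself comes from the example above.

```python
import re

# The string regex plus a catch-all (.) alternative captured in group 1.
token_re = re.compile(r'"[^"\\]*(?:(?:\\.|"")[^"\\]*)*"|(.)', re.DOTALL)

# Hypothetical input: one string literal followed by a stray character.
text = '"""ok""" x'

strings = []
suspects = []
for m in token_re.finditer(text):
    stray = m.group(1)
    if stray is None:
        # The first alternative matched: a complete string literal.
        strings.append(m.group(0))
    elif not stray.isspace():
        # Group 1 matched something other than whitespace: possible error.
        suspects.append((m.start(), stray))

print(strings)
print(suspects)
```

Because every character is consumed by one alternative or the other, anything landing in group 1 is by construction outside a recognised string.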

Addendum, in reply to comments.

After some research on Python block literals,
it appears you have to handle not only escaped characters, but also
up to 2 double quotes in the body, i.e. " or "".

Changing the regex is simple: add a 1-2 quantifier and constrain it with a lookahead assertion.

Below are the raw and string'd regex parts you can pick and choose from.
Tested in Perl, and it works.
Good luck!

 # raw -
 #   (?s:
 #   """[^"\\]*(?:(?:\\.|"{1,2}(?!"))[^"\\]*)*"""
 #   |
 #   '''[^'\\]*(?:(?:\\.|'{1,2}(?!'))[^'\\]*)*'''
 #   )
 # string'd -
 #   '(?s:'
 #   '"""[^"\\\]*(?:(?:\\\.|"{1,2}(?!"))[^"\\\]*)*"""'
 #   '|'
 #   "'''[^'\\\\]*(?:(?:\\\\.|'{1,2}(?!'))[^'\\\\]*)*'''"
 #   ')'

 (?s:                # dot-all
      # double quote literal block
      """                 # """ block open
      [^"\\]*             # 0 - many non " nor \
      (?:                 # grp start
           (?:
                \\ .                # escape
             |                      # or
                "{1,2}              # 1 - 2 "
                (?! " )             # not followed by "
           )
           [^"\\]*             # 0 - many non " nor \
      )*                  # grp end, 0 - many times
      """                 # """ block close
   |                      # or,
      # single quote literal block
      '''                 # ''' block open
      [^'\\]*             # 0 - many non ' nor \
      (?:                 # grp start
           (?:
                \\ .                # escape
             |                      # or
                '{1,2}              # 1 - 2 '
                (?! ' )             # not followed by '
           )
           [^'\\]*             # 0 - many non ' nor \
      )*                  # grp end, 0 - many times
      '''                 # ''' block close
 )
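For use from Python, the expanded layout can be kept by compiling with `re.VERBOSE`. The following is a sketch under a couple of assumptions: the name `block_literal_re` and the sample source are made up, and `'{3}` replaces the literal three single quotes only because the pattern is written inside a `'''`-delimited Python string.

```python
import re

# Combined pattern in expanded form. re.VERBOSE ignores the layout and
# the trailing comments; re.DOTALL lets an escape cover a newline.
block_literal_re = re.compile(r'''
    """                          # double-quote block open
    [^"\\]*                      # anything but a quote or backslash
    (?:
        (?: \\. | "{1,2}(?!") )  # an escape, or 1-2 quotes not followed by another
        [^"\\]*
    )*
    """                          # double-quote block close
  |
    '{3}                         # single-quote block open
    [^'\\]*
    (?:
        (?: \\. | '{1,2}(?!') )
        [^'\\]*
    )*
    '{3}                         # single-quote block close
''', re.VERBOSE | re.DOTALL)

# Hypothetical sample source with one block literal of each kind.
double_block = '"""a "quoted" word"""'
single_block = "'''it's'''"
src = "x = " + double_block + "\ny = " + single_block

found = block_literal_re.findall(src)
print(found)
```

The pattern has no capturing groups, so `findall` returns the complete literals, delimiters included.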
