Python - re.sub() not working as I expect
I have the string given below.
appcodename: mozilla<br>appversion: 5.0 (x11; linux x86_64) applewebkit/537.36 (khtml, gecko) ubuntu chromium/41.0.2272.76 chrome/41.0.2272.76 safari/537.36<br>
I want to extract mozilla from the above string. I use the following Python program.
import re
import json

with open('data.txt', 'rb') as f:
    data = json.load(f)

message = data['message']
appcodename = re.sub('.+appcodename: ([^<br>])(.*)', r'\1', message, 1)
print('appcode name {}'.format(appcodename))
The output is:
appcode name m
What is wrong with my regex?
The problem with your regex is twofold:
First, you are using a negated character class, [^<br>], which matches any single character except <, b, r and > (their order is irrelevant). This does not cause a problem in this particular case, but it is not advised to use a negated class to prevent matches of a specific sequence of characters.
Second, ([^<br>]) can match only 1 character, whereas you want to match mozilla, which is several characters long.
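To make both points concrete, here is a small sketch of how the class behaves on its own. The strings are made up for illustration and are not taken from data.txt.

```python
import re

sample = 'mozilla<br>'

# Without a quantifier, the class consumes exactly one character.
first_char = re.match(r'[^<br>]', sample).group(0)
print(first_char)  # m

# The order inside the brackets does not matter: both classes exclude
# the same four characters <, b, r and >.
text = 'a<b>r-x'
kept = re.findall(r'[^<br>]', text)
kept_reordered = re.findall(r'[^>rb<]', text)
print(kept)                    # ['a', '-', 'x']
print(kept == kept_reordered)  # True
```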
A quick and dirty fix:
appcodename = re.sub('.*appcodename: ([^<br>]+)(.*)',r'\1',message,1)
.* allows a match even if the string begins with appcodename, and ([^<br>]+) allows matching of more than 1 character.
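A rough check of the quick fix, using a shortened copy of the string from the question. The firebird value is a hypothetical input I made up to show where the negated class still falls short.

```python
import re

original = '.+appcodename: ([^<br>])(.*)'
quick_fix = '.*appcodename: ([^<br>]+)(.*)'

# Shortened version of the string from the question.
message = 'appcodename: mozilla<br>appversion: 5.0<br>'

# .+ needs at least one character before "appcodename", so nothing matches
# here and re.sub returns the string unchanged.
unchanged = re.sub(original, r'\1', message, count=1)
print(unchanged == message)  # True

fixed = re.sub(quick_fix, r'\1', message, count=1)
print(fixed)  # mozilla

# Hypothetical value containing "r" and "b": the class stops at the first
# excluded character, before any <br> tag is reached.
truncated = re.sub(quick_fix, r'\1', 'appcodename: firebird<br>', count=1)
print(truncated)  # fi
```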
As mentioned above, the negated character class is not advised. Thus, the next step is to make the above better:
appcodename = re.sub(r'.*appcodename: ((?:(?!<br>).)+).*',r'\1',message,1)
(?:(?!<br>).)+ is a bit slow (it uses a negative lookahead, (?! ... )), but it will match any number of characters as long as <br> is not within those characters. It checks each character and, each time, makes sure there is no <br> at that position before attempting to match it. Next, writing the regex as a raw string is advised to avoid unexpected behaviours.
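As a sketch of the difference, the same hypothetical firebird value that trips up the negated class is handled by the lookahead version. The last lines show why the raw prefix matters for backslashes: without it, Python reads \1 as an octal escape.

```python
import re

pattern = r'.*appcodename: ((?:(?!<br>).)+).*'

# Hypothetical value containing "r" and "b", which [^<br>]+ would cut short.
message = 'appcodename: firebird<br>appversion: 5.0<br>'
result = re.sub(pattern, r'\1', message, count=1)
print(result)  # firebird

# In a normal string literal, \1 is the single character with code 1;
# in a raw string it stays as a backslash followed by "1".
plain_len = len('\1')
raw_len = len(r'\1')
print(plain_len, raw_len)  # 1 2
```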
Finally, replacing what comes before and after the part you want is not practical; matching it directly will make things simpler:
appcodename = re.search(r'appcodename: ((?:(?!<br>).)+)', message).group(1)
At this point, you might use this instead, which does not use a negative lookahead and is simpler to read, I believe:
appcodename = re.search(r'appcodename: (.+?)<br>', message).group(1)
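One caveat with calling .group(1) directly: re.search returns None when nothing matches, so the call raises AttributeError. A minimal sketch of a guarded version follows; extract_appcodename is a helper name I made up for illustration.

```python
import re

def extract_appcodename(message):
    """Return the text between 'appcodename: ' and the next <br>, or None."""
    match = re.search(r'appcodename: (.+?)<br>', message)
    # re.search returns None when the pattern is not found.
    return match.group(1) if match else None

found = extract_appcodename('appcodename: mozilla<br>appversion: 5.0<br>')
missing = extract_appcodename('appversion: 5.0<br>')
print(found)    # mozilla
print(missing)  # None
```

The lazy .+? stops at the first <br> that follows, so the value itself may still contain the letters b and r.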