Alternatives to variable-width lookbehind in Python regex
I've decided to jump into the deep end of the Python pool and start converting some of my R code to Python, and I'm stuck on something that is very important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is very unstructured. As a result, I've come to rely on the lookaround feature of regex, and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce spaces in between letters when I OCR the file, I'd get to the value I want like this:
oacctnum <- str_extract(textblock[indexval], "(?<=orig\\s?:\\s?/\\s?)[a-z0-9]+")
In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width one. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:
oacctnum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textblock[indexval])
oacctnum = re.search(r"(?<=orig:/)([a-z0-9]+)", oacctnum).group(1)
Is there a more efficient way to do this? Because while this example is trivial, this issue comes up in very complex ways with the data I work with, and I'd hate to have to do this kind of preprocessing for every line of text I analyze.
Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.
You need to use capture groups in the case you described:
"(?<=orig\\s?:\\s?/\\s?)[a-z0-9]+"
will become
r"orig\s?:\s?/\s?([a-z0-9]+)"
The value will be in .group(1). Note that raw strings are preferred.
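As a small illustration of why raw strings are preferred for patterns, here is a sketch comparing the two spellings of the same pattern. The variable names `escaped` and `raw` are only illustrative.

```python
import re

# In a normal string literal the backslash must be doubled to reach the regex engine.
escaped = "orig\\s?:\\s?/\\s?([a-z0-9]+)"
# In a raw string the pattern is written exactly as the regex engine sees it.
raw = r"orig\s?:\s?/\s?([a-z0-9]+)"

# Both literals produce the same pattern text.
print(escaped == raw)

# Without the r prefix, "\b" is a backspace character rather than a word boundary,
# so the two literals below have different lengths.
print(len("\b"), len(r"\b"))
```

The raw form avoids the doubled backslashes that the R version needs and keeps escapes such as \b from being interpreted by Python before the regex engine sees them.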
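Here is some sample code:
import re

# Match the prefix normally and capture the value in group 1 instead of using a lookbehind
p = re.compile(r'orig\s?:\s?/\s?([a-z0-9]+)', re.IGNORECASE)
test_str = "orig:/texthere"
print(re.search(p, test_str).group(1))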
Unless you need overlapping matches, using capturing groups instead of a look-behind is rather straightforward.
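To apply this to many lines without any preprocessing, the capture-group pattern can be wrapped in a small helper. The following is only a sketch: the function names `extract_account` and `extract_all_accounts` are hypothetical, and the sample strings are made up to mimic OCR output with stray spaces.

```python
import re

# Same pattern as in the answer: the optional whitespace lives in the
# consumed prefix, and only the wanted value is captured in group 1.
ACCT_RE = re.compile(r"orig\s?:\s?/\s?([a-z0-9]+)", re.IGNORECASE)


def extract_account(line):
    """Return the first captured value in line, or None if there is no match."""
    m = ACCT_RE.search(line)
    return m.group(1) if m else None


def extract_all_accounts(text):
    """Return every captured value in text (findall yields group 1 only)."""
    return ACCT_RE.findall(text)


if __name__ == "__main__":
    for sample in ["orig:/texthere", "ORIG : / ABC123 more", "nothing here"]:
        print(repr(sample), "->", extract_account(sample))
```

Checking for None before calling .group(1) avoids the AttributeError that re.search(...).group(1) raises on lines that do not contain the prefix.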