Alternatives to variable-width lookbehind in Python regex

- August 15, 2010

I've decided to jump into the deep end of the Python pool and start converting some of my R code to Python, and I'm stuck on something that is important to me. In my line of work, I spend a lot of time parsing text data, which, as we all know, is unstructured. As a result, I've come to rely on the lookaround feature of regex, and R's lookaround functionality is quite robust. For example, if I'm parsing a PDF that might introduce spaces in between letters when I OCR the file, I'd get the value I want with something like this:

oacctnum <- str_extract(textblock[indexval], "(?<=orig\\s?:\\s?/\\s?)[a-z0-9]+")

In Python, this isn't possible because the use of ? makes the lookbehind a variable-width expression as opposed to a fixed-width one. This functionality is important enough to me that it deters me from wanting to use Python, but instead of giving up on the language I'd like to know the Pythonista way of addressing this issue. Would I have to preprocess the string before extracting the text? Something like this:

oacctnum = re.sub(r"(?<=\b\w)\s(?=\w\b)", "", textblock[indexval])
oacctnum = re.search(r"(?<=orig:/)([a-z0-9])", oacctnum).group(1)

Is there a more efficient way to do this? While this example is trivial, the issue comes up in complex ways with the data I work with, and I'd hate to have to do this kind of preprocessing for every line of text I analyze.

Lastly, I apologize if this is not the right place to ask this question; I wasn't sure where else to post it. Thanks in advance.

You need to use capture groups in the case you described:

"(?<=orig\\s?:\\s?/\\s?)[a-z0-9]+"

will become

r"orig\s?:\s?/\s?([a-z0-9]+)"

The value will be in .group(1). Note that raw strings are preferred, since they avoid doubling every backslash.

Here is some sample code:

import re

# Capture the account number instead of asserting its prefix with a lookbehind
p = re.compile(r'orig\s?:\s?/\s?([a-z0-9]+)', re.IGNORECASE)
test_str = "orig:/texthere"
print(re.search(p, test_str).group(1))
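To make the difference concrete, here is a minimal sketch that checks both patterns against the standard-library re module. The helper name `compiles` and the sample string are made up for illustration; the two patterns are the ones from the question and the answer.

```python
import re

# The lookbehind from the question, written as a Python raw string.
LOOKBEHIND = r"(?<=orig\s?:\s?/\s?)[a-z0-9]+"
# The capture-group rewrite from the answer.
CAPTURE = r"orig\s?:\s?/\s?([a-z0-9]+)"


def compiles(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern, re.IGNORECASE)
        return True
    except re.error:
        # re rejects lookbehinds whose width is not fixed
        return False


lookbehind_ok = compiles(LOOKBEHIND)
capture_ok = compiles(CAPTURE)

# Hypothetical OCR output with stray spaces around the separators.
sample = "ORIG : / AB12345"
account = re.search(CAPTURE, sample, re.IGNORECASE).group(1)
```

The optional `\s?` pieces sit outside any lookbehind in the second pattern, so they can stay variable-width, and no preprocessing pass over the text is needed.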


Unless you need overlapping matches, using capturing groups instead of a look-behind is rather straightforward.
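For the "every line of text" concern in the question, one possible approach is to compile the pattern once and reuse it. The sketch below is only an illustration: the function names `extract_accounts` and `first_account` and the sample block are hypothetical, not part of the original answer.

```python
import re

# Compiled once and reused for every block of text.
ACCT_RE = re.compile(r"orig\s?:\s?/\s?([a-z0-9]+)", re.IGNORECASE)


def extract_accounts(text):
    """Return every captured account number found in text, in order."""
    return [m.group(1) for m in ACCT_RE.finditer(text)]


def first_account(text):
    """Return the first captured account number, or None if there is none."""
    m = ACCT_RE.search(text)
    return m.group(1) if m else None


# Made-up sample with inconsistent spacing, as OCR output might have.
block = "ORIG:/A100 paid\nnote only\norig : /b200 and Orig: / C300"
accounts = extract_accounts(block)
missing = first_account("note only")
```

Returning None for a non-matching line avoids the AttributeError that calling .group(1) directly on a failed search would raise.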
