Programming tip

Iterate over the lines of a string

itbloger 2020. 8. 12. 07:53


I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

I'm using this string as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not just an iterable. I need an iterator that iterates over the individual lines of that string the way a file object would iterate over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting and then again by the parser. It doesn't matter in my test case, since the string there is very short; I am just asking out of curiosity. Python has so many useful and efficient built-ins for this kind of thing, but I could find nothing that fits this need.
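For illustration only, a minimal sketch of the kind of consumer that needs a real iterator: the parser below and its skip-a-line rule are hypothetical, but they show why calling next() on the input matters and why a plain iterable (such as a bare list) would not be enough.

def parse_skipping(lines):
    # 'lines' must be an iterator: next() advances the same object the
    # for-loop is consuming, so the following line is skipped entirely.
    for line in lines:
        if line.startswith('#'):
            next(lines, None)   # hypothetical rule: skip the line after a '#' marker
            continue
        print(line)

sample = "keep me\n# marker\nskipped\nalso kept\n"
parse_skipping(iter(sample.splitlines()))
# keep me
# also kept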


Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0: break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

if __name__ == '__main__':
    for f in f1, f2, f3:
        print list(f())

Running this as the main script confirms that the three functions give the same results. Measuring with timeit (and with foo * 100, to get a substantial string for more accurate measurement):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

The list() calls are needed to ensure the iterators are actually traversed, not just built.

IOW, the naive implementation is so much faster it isn't even funny: 6 times faster than my attempt with find calls, which in turn is 4 times faster than a lower-level approach.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (esp. by loops of += of very small pieces) can be quite slow.
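The same comparison can also be run from inside the script instead of the shell; a minimal sketch, assuming foo and the three functions above are in scope (1000 calls per function, mirroring the shell runs):

import timeit

big = foo * 100     # the "substantial string" mentioned above
for f in (f1, f2, f3):
    t = timeit.timeit(lambda: list(f(big)), number=1000)
    print('%s: %.6f sec for 1000 calls' % (f.__name__, t))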

Edit: added @Jacob's proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

not quite as good as the .find based approach -- still, worth keeping in mind because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions -- and so should many loops which lack such tweaks and should have them -- though I believe my code is also right, since I was able to check its output against the other functions).

But the split-based approach still rules.

An aside: possibly better style for f4 would be:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

at least, it's a bit less verbose. The need to strip trailing \ns unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part whereof is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it's also innocuous). Maybe worth trying, also:

    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof -- but I'm stopping here, since it's pretty much a theoretical exercise compared with the splitlines-based one, which remains the simplest and fastest.
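Spelled out as a full function, that imap-based variant would look roughly like this sketch (Python 2, like the rest of the code here; the name f5 is just a label, not from the original answer, and foo is assumed to be the string defined above):

from cStringIO import StringIO
import itertools

def f5(foo=foo):
    # iterate the file-like wrapper line by line and strip the trailing
    # newline from each line lazily, as the caller consumes the iterator
    stri = StringIO(foo)
    return itertools.imap(lambda s: s.strip('\n'), stri)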


I'm not sure what you mean by "then again by the parser". After the splitting has been done, there's no further traversal of the string, only a traversal of the list of split strings. This will probably actually be the fastest way to accomplish this, so long as the size of your string isn't absolutely huge. The fact that Python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.

If your string is very large, the disadvantage is in memory usage: you'll have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this, building a string as needed, though it still pays the "splitting" penalty. However, if your string is that large, you generally want to avoid even the unsplit string being in memory. It would be better just to read the string from a file, which already allows you to iterate through it as lines.
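For example, a minimal sketch of that file-based approach ('input.txt' is a hypothetical file name, and do_something_with stands in for the real per-line processing, as in the snippet below):

def do_something_with(line):
    print(line)     # placeholder for the real per-line work

# A file object yields one line at a time, so only the current line needs
# to be held in memory.
with open('input.txt') as f:
    for line in f:
        do_something_with(line.rstrip('\n'))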

However if you do have a huge string in memory already, one approach would be to use StringIO, which presents a file-like interface to a string, including allowing iterating by line (internally using .find to find the next newline). You then get:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)

If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration

Regex-based searching is sometimes faster than the generator approach:

import re

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))
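A quick sanity check against the foo from the question (repeated inline so the example stands alone); note that, as written, the pattern only matches lines that end in '\n', so a final line without a trailing newline would be silently dropped:

foo = "\nthis is \na multi-line string.\n"   # the sample string from the question
print(list(f4(foo)))
# ['', 'this is ', 'a multi-line string.']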

I suppose you could roll your own:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

I'm not sure how efficient this implementation is, but that will only iterate over your string once.

Mmm, generators.
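For illustration, running it against the foo from the question (repeated inline so the example stands alone) yields the same lines as splitlines():

foo = "\nthis is \na multi-line string.\n"   # the sample string from the question
for line in parse(foo):
    print(repr(line))
# ''
# 'this is '
# 'a multi-line string.'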

Edit:

Of course you'll also want to add in whatever type of parsing actions you want to take, but that's pretty simple.

Reference URL: https://stackoverflow.com/questions/3054604/iterate-over-the-lines-of-a-string
