programing tip

값이 목록에 있는지 확인하는 가장 빠른 방법

itbloger 2020. 9. 30. 08:54
반응형

값이 목록에 있는지 확인하는 가장 빠른 방법


값이 목록 (수백만 개의 값이 포함 된 목록)에 존재하는지 여부와 해당 색인이 무엇인지 알 수있는 가장 빠른 방법은 무엇입니까?

이 예 에서처럼 목록의 모든 값이 고유하다는 것을 알고 있습니다.

내가 시도하는 첫 번째 방법은 (실제 코드에서 3.8 초) :

a = [4,2,3,1,5,6]

if a.count(7) == 1:
    b=a.index(7)
    "Do something with variable b"

내가 시도하는 두 번째 방법은 (2 배 더 빠름 : 실제 코드의 경우 1.9 초) :

a = [4,2,3,1,5,6]

try:
    b=a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"

Stack Overflow 사용자가 제안한 방법 (실제 코드의 경우 2.74 초) :

a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)

내 실제 코드에서 첫 번째 방법은 3.81 초, 두 번째 방법은 1.88 초가 걸립니다. 좋은 개선이지만 :

저는 Python / 스크립팅을 사용하는 초보자인데 동일한 작업을 수행하고 처리 시간을 더 절약 할 수있는 더 빠른 방법이 있습니까?

내 응용 프로그램에 대한 더 구체적인 설명 :

Blender API에서 입자 목록에 액세스 할 수 있습니다.

particles = [1, 2, 3, 4, etc.]

여기에서 입자의 위치에 액세스 할 수 있습니다.

particles[x].location = [x,y,z]

그리고 각 입자에 대해 다음과 같이 각 입자 위치를 검색하여 이웃이 존재하는지 테스트합니다.

if [x+1,y,z] in particles.location
    "Find the identity of this neighbour particle in x:the particle's index
    in the array"
    particles.index([x+1,y,z])

7 in a

가장 명확하고 빠른 방법입니다.

을 (를) 사용하는 것도 고려할 수 set있지만 목록에서 해당 집합을 구성하는 것은 빠른 멤버십 테스트로 절약되는 것보다 더 많은 시간이 걸릴 수 있습니다. 확실한 유일한 방법은 벤치마킹을 잘하는 것입니다. (이것은 또한 필요한 작업에 따라 다릅니다)


다른 사람들이 언급했듯이 in큰 목록의 경우 매우 느릴 수 있습니다. 다음은 in, set의 성능을 비교 한 것입니다 bisect. 시간 (초)은 로그 스케일입니다.

여기에 이미지 설명 입력

테스트 용 코드 :

import random
import bisect
import matplotlib.pyplot as plt
import math
import time

def method_in(a,b,c):
    start_time = time.time()
    for i,x in enumerate(a):
        if x in b:
            c[i] = 1
    return(time.time()-start_time)   

def method_set_in(a,b,c):
    start_time = time.time()
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = 1
    return(time.time()-start_time)

def method_bisect(a,b,c):
    start_time = time.time()
    b.sort()
    for i,x in enumerate(a):
        index = bisect.bisect_left(b,x)
        if index < len(a):
            if x == b[index]:
                c[i] = 1
    return(time.time()-start_time)

def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    Nls = [x for x in range(1000,20000,1000)]
    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        time_method_in.append(math.log(method_in(a,b,c)))
        time_method_set_in.append(math.log(method_set_in(a,b,c)))
        time_method_bisect.append(math.log(method_bisect(a,b,c)))

    plt.plot(Nls,time_method_in,marker='o',color='r',linestyle='-',label='in')
    plt.plot(Nls,time_method_set_in,marker='o',color='b',linestyle='-',label='set')
    plt.plot(Nls,time_method_bisect,marker='o',color='g',linestyle='-',label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

항목을 set. 세트 조회는 매우 효율적입니다.

시험:

s = set(a)
if 7 in s:
  # do stuff

edit In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.


def check_availability(element, collection: iter):
    return element in collection

Usage

check_availability('a', [1,2,3,4,'a','b','c'])

I believe this is the fastest way to know if a chosen value is in an array.


a = [4,2,3,1,5,6]

index = dict((y,x) for x,y in enumerate(a))
try:
   a_index = index[7]
except KeyError:
   print "Not found"
else:
   print "found"

This will only be a good idea if a doesn't change and thus we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.


It sounds like your application might gain advantage from the use of a Bloom Filter data structure.

In short, a bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often then the "found" result, you might see a speed up by adding a Bloom Filter.

For details, Wikipedia provides a good overview of how Bloom Filters work, and a web search for "python bloom filter library" will provide at least a couple useful implementations.


Be aware that the in operator tests not only equality (==) but also identity (is), the in logic for lists is roughly equivalent to the following (it's actually written in C and not Python though, at least in CPython):

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised, for example, numpy.NAN has the unusual property of being not being equal to itself:

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

To distinguish between these unusual cases you could use any() like:

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

Note the in logic for lists with any() would be:

any(element is target or element == target for element in lst)

However, I should emphasize that this is an edge case, and for the vast majority of cases the in operator is highly optimised and exactly what you want of course (either with a list or with a set).


Or use __contains__:

sequence.__contains__(value)

Demo:

>>> l=[1,2,3]
>>> l.__contains__(3)
True
>>> 

This is not the code, but the algorithm for very fast searching.

If your list and the value you are looking for are all numbers, this is pretty straightforward. If strings: look at the bottom:

  • -Let "n" be the length of your list
  • -Optional step: if you need the index of the element: add a second column to the list with current index of elements (0 to n-1) - see later
  • Order your list or a copy of it (.sort())
  • Loop through:
    • Compare your number to the n/2th element of the list
      • If larger, loop again between indexes n/2-n
      • If smaller, loop again between indexes 0-n/2
      • If the same: you found it
  • Keep narrowing the list until you have found it or only have 2 numbers (below and above the one you are looking for)
  • This will find any element in at most 19 steps for a list of 1.000.000 (log(2)n to be precise)

If you also need the original position of your number, look for it in the second, index column.

If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.

Of course, this needs the investment of the sorted() method, but if you keep reusing the same list for checking, it may be worth it.


@Winston Ewert's solution yields a big speed-up for very large lists, but this stackoverflow answer indicates that the the try:/except:/else: construct will be slowed down if the except branch is often reached. An alternative is to take advantage of the .get() method for the dict:

a = [4,2,3,1,5,6]

index = dict((y, x) for x, y in enumerate(a))

b = index.get(7, None)
if b is not None:
    "Do something with variable b"

The .get(key, default) method is just for the case when you can't guarantee a key will be in the dict. If key is present, it returns the value (as would dict[key]), but when it is not, .get() returns your default value (here None). You need to make sure in this case that the chosen default will not be in a.


For me it was 0.030 sec (real), 0.026 sec (user), and 0.004 sec (sys).

try:
print("Started")
x = ["a", "b", "c", "d", "e", "f"]

i = 0

while i < len(x):
    i += 1
    if x[i] == "e":
        print("Found")
except IndexError:
    pass

Code to check whether two elements exist in array whose product equals k:

n = len(arr1)
for i in arr1:
    if k%i==0:
        print(i)

참고 URL : https://stackoverflow.com/questions/7571635/fastest-way-to-check-if-a-value-exists-in-a-list

반응형