
What is the fastest way to compute sin and cos together?

itbloger 2020. 8. 22. 08:16


I would like to compute both the sine and cosine of a value together (for example, to create a rotation matrix). Of course I could compute them separately, one after the other, as in a = cos(x); b = sin(x);, but I wonder whether there is a faster way when both values are needed.

Edit: To summarize the answers so far:

  • Vlad said that there is the asm instruction FSINCOS which computes both (in roughly the same time as a call to FSIN alone).

  • As Chi noticed, this optimization is sometimes already done by the compiler (when optimization flags are used).

  • caf pointed out that the functions sincos and sincosf are probably available and can be called directly, just by including math.h.

  • The lookup-table approach of tanascius is controversial. (On my machine and in my benchmark scenario, however, it runs 3x faster than sincos, with almost the same accuracy for 32-bit floating point.)

  • Joel Goodwin linked to an interesting approach to a highly accurate approximation technique (for me, this is even faster than the table lookup).


Modern Intel/AMD processors have the FSINCOS instruction, which computes the sine and the cosine at the same time. If you need strong optimization, you should probably use it.

Here is a small example: http://home.broadpark.no/~alein/fsincos.html

Here is another example (for MSVC): http://www.codeguru.com/forum/showthread.php?t=328669

Here is yet another example (using gcc): http://www.allegro.cc/forums/thread/588470

I hope one of them helps. (I haven't used this instruction myself.)

Since it is supported at the processor level, I expect it to be much faster than table lookups.

Edit:
Wikipedia suggests that FSINCOS was added with the 387 processors, so you will hardly find a processor that doesn't support it.

Edit:
Intel's documentation states that FSINCOS is only about 5 times slower than FDIV (i.e., floating-point division).

Edit:
Please note that not all modern compilers optimize a sine-and-cosine computation into a single FSINCOS call. In particular, my VS 2008 didn't do it that way.

Edit:
The first example link is dead, but there is still a version on the Wayback Machine.
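If you would rather call the instruction directly instead of relying on the compiler, a minimal GCC extended-asm sketch (my own snippet, not taken from the links above; it assumes the usual x87 register constraints) could look like this:

/* FSINCOS leaves cos(x) in st(0) and sin(x) in st(1);
 * "=t" binds the top of the x87 stack, "=u" the register below it. */
static void fsincos_x87(double x, double *s, double *c)
{
    __asm__ ("fsincos" : "=t" (*c), "=u" (*s) : "0" (x));
}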


Modern x86 processors have an fsincos instruction which does exactly what you're asking: it computes sin and cos at the same time. A good optimizing compiler should detect code that computes sin and cos for the same value and use the fsincos instruction for it.

It took some twiddling of compiler flags for this to work, but:

$ gcc --version
i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5488)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat main.c
#include <math.h> 

struct Sin_cos {double sin; double cos;};

struct Sin_cos fsincos(double val) {
  struct Sin_cos r;
  r.sin = sin(val);
  r.cos = cos(val);
  return r;
}

$ gcc -c -S -O3 -ffast-math -mfpmath=387 main.c -o main.s

$ cat main.s
    .text
    .align 4,0x90
.globl _fsincos
_fsincos:
    pushl   %ebp
    movl    %esp, %ebp
    fldl    12(%ebp)
    fsincos
    movl    8(%ebp), %eax
    fstpl   8(%eax)
    fstpl   (%eax)
    leave
    ret $4
    .subsections_via_symbols

Tada, it uses the fsincos instruction!


When you need performance, you could use a precomputed sin/cos table (a single table will do, stored e.g. as a dictionary). Well, it depends on the accuracy you need (maybe the table would become too big), but it should be really fast.
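As a rough illustration of the idea (the table size and the nearest-entry lookup without interpolation are my own assumptions, so the accuracy is limited to roughly 2*pi/TABLE_SIZE), a plain C sketch could look like this:

#include <math.h>

#define TABLE_SIZE 4096                 /* resolution is an assumption */
static double sin_table[TABLE_SIZE];    /* sin over one full period    */

static void init_table(void)
{
    for (int i = 0; i < TABLE_SIZE; ++i)
        sin_table[i] = sin(2.0 * M_PI * i / TABLE_SIZE);
}

/* One table serves both functions: cos(x) = sin(x + pi/2), i.e. a quarter-period shift. */
static void table_sincos(double x, double *s, double *c)
{
    double t = x / (2.0 * M_PI);
    int i = (int)((t - floor(t)) * TABLE_SIZE);   /* nearest-entry index, no interpolation */
    *s = sin_table[i];
    *c = sin_table[(i + TABLE_SIZE / 4) % TABLE_SIZE];
}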


Technically, you’d achieve this by using complex numbers and Euler’s Formula. Thus, something like (C++)

complex<double> res = exp(complex<double>(0, x));
// or equivalent
complex<double> res = polar<double>(1, x);
double sin_x = res.imag();
double cos_x = res.real();

should give you sine and cosine in one step. How this is done internally is a question of the compiler and library being used. It could (and might) well take longer to do it this way (just because Euler’s Formula is mostly used to compute the complex exp using sin and cos – and not the other way round) but there might be some theoretical optimisation possible.


Edit

The headers in <complex> for GNU C++ 4.2 are using explicit calculations of sin and cos inside polar, so it doesn’t look too good for optimisations there unless the compiler does some magic (see the -ffast-math and -mfpmath switches as written in Chi’s answer).


You could compute either and then use the identity:

cos²(x) = 1 - sin²(x)

but as @tanascius says, a precomputed table is the way to go.
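One wrinkle the identity hides is that the square root only recovers the magnitude of the cosine, so the sign still has to come from the quadrant. A small C sketch of that (my own helper, not part of the answer):

#include <math.h>

static void sincos_identity(double x, double *s, double *c)
{
    *s = sin(x);
    double mag = sqrt(1.0 - (*s) * (*s));    /* |cos(x)| */
    double r = remainder(x, 2.0 * M_PI);     /* reduce to [-pi, pi] */
    /* cos is negative when the reduced angle lies in the 2nd or 3rd quadrant */
    *c = (fabs(r) > M_PI / 2.0) ? -mag : mag;
}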


If you use the GNU C library, then you can do:

#define _GNU_SOURCE
#include <math.h>

and you will get declarations of the sincos(), sincosf() and sincosl() functions that calculate both values together - presumably in the fastest way for your target architecture.
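For completeness, a minimal usage example (assuming glibc; link with -lm):

#define _GNU_SOURCE
#include <math.h>
#include <stdio.h>

int main(void)
{
    double s, c;
    sincos(0.5, &s, &c);    /* both values in a single library call */
    printf("sin = %f, cos = %f\n", s, c);
    return 0;
}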


There is very interesting stuff on this forum page, which is focused on finding good approximations that are fast: http://www.devmaster.net/forums/showthread.php?t=5784

Disclaimer: Not used any of this stuff myself.

Update 22 Feb 2018: Wayback Machine is the only way to visit the original page now: https://web.archive.org/web/20130927121234/http://devmaster.net/posts/9648/fast-and-accurate-sine-cosine


Many C math libraries, as caf indicates, already have sincos(). The notable exception is MSVC.

  • Sun has had sincos() since at least 1987 (twenty-three years; I have a hard-copy man page)
  • HPUX 11 had it in 1997 (but isn't in HPUX 10.20)
  • Added to glibc in version 2.1 (Feb 1999)
  • Became a built-in in gcc 3.4 (2004), __builtin_sincos().

And regarding lookup tables, Eric S. Raymond in The Art of Unix Programming (2004) (Chapter 12) says explicitly that this is a Bad Idea (at the present moment in time):

"Another example is precomputing small tables--for example, a table of sin(x) by degree for optimizing rotations in a 3D graphics engine will take 365 × 4 bytes on a modern machine. Before processors got enough faster than memory to demand caching, this was an obvious speed optimization. Nowadays it may be faster to recompute each time rather than pay for the percentage of additional cache misses caused by the table.

"But in the future, this might turn around again as caches grow larger. More generally, many optimizations are temporary and can easily turn into pessimizations as cost ratios change. The only way to know is to measure and see." (from the Art of Unix Programming)

But, judging from the discussion above, not everyone agrees.


I don't believe that lookup tables are necessarily a good idea for this problem. Unless your accuracy requirements are very low, the table needs to be very large, and modern CPUs can do a lot of computation while a value is fetched from main memory. This is not one of those questions that can be properly answered by argument (not even mine); test, measure, and consider the data.

But I'd look to the fast implementations of SinCos that you find in libraries such as AMD's ACML and Intel's MKL.


If you are willing to use a commercial product, and are calculating a number of sin/cos calculations at the same time (so you can use vectored functions), you should check out Intel's Math Kernel Library.

It has a sincos function.

According to that documentation, it averages 13.08 clocks/element on a Core 2 Duo in high-accuracy mode, which I think will be even faster than fsincos.
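If I read the MKL VML documentation correctly, the vectorized entry point is the v?SinCos family; a hedged sketch for the double-precision variant (the exact header may vary by MKL version) could look like:

#include <mkl.h>   /* pulls in the VML declarations; check your MKL version */

void mkl_sincos_batch(const MKL_INT n, const double *angles,
                      double *sines, double *cosines)
{
    /* Computes sin and cos for all n inputs in one vectorized call. */
    vdSinCos(n, angles, sines, cosines);
}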


This article shows how to construct a parabolic algorithm that generates both the sine and the cosine:

DSP Trick: Simultaneous Parabolic Approximation of Sin and Cos

http://www.dspguru.com/dsp/tricks/parabolic-approximation-of-sin-and-cos
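In case the link goes away, the commonly quoted parabolic form looks roughly like this (my recollection of the trick, not necessarily the article's exact coefficients; without an extra refinement step the error is a few percent):

#include <math.h>

/* Parabola that matches sin at 0, +-pi and pi/2 on the interval [-pi, pi]. */
static double parabolic_sin(double x)
{
    const double B =  4.0 / M_PI;
    const double C = -4.0 / (M_PI * M_PI);
    return B * x + C * x * fabs(x);
}

static void parabolic_sincos(double x, double *s, double *c)
{
    x = remainder(x, 2.0 * M_PI);        /* reduce to [-pi, pi] */
    *s = parabolic_sin(x);
    double xc = x + M_PI / 2.0;          /* cos(x) = sin(x + pi/2) */
    if (xc > M_PI) xc -= 2.0 * M_PI;     /* keep the shifted argument in range */
    *c = parabolic_sin(xc);
}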


When performance is critical for this kind of thing it is not unusual to introduce a lookup table.


For a creative approach, how about expanding the Taylor series? Since they have similar terms, you could do something like the following pseudo:

numerator = x
denominator = 1
sine = x
cosine = 1
op = -1
fact = 1

while (not enough precision) {
    fact++
    denominator *= fact
    numerator *= x

    cosine += op * numerator / denominator

    fact++
    denominator *= fact
    numerator *= x

    sine += op * numerator / denominator

    op *= -1
}

This means you do something like this: starting at x and 1 for sin and cosine, follow the pattern - subtract x^2 / 2! from cosine, subtract x^3 / 3! from sine, add x^4 / 4! to cosine, add x^5 / 5! to sine...

I have no idea whether this would be performant. If you need less precision than the built in sin() and cos() give you, it may be an option.
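For what it's worth, here is a runnable C version of the pseudocode above, with a fixed iteration count standing in for the precision test (my choice; it only makes sense for arguments already reduced to a small range):

static void taylor_sincos(double x, double *sine, double *cosine)
{
    double numerator = x, denominator = 1.0;
    double s = x, c = 1.0;
    double op = -1.0;
    int fact = 1;

    for (int i = 0; i < 8; ++i) {             /* 8 iterations ~ terms up to x^17 */
        fact++; denominator *= fact; numerator *= x;
        c += op * numerator / denominator;    /* x^2/2!, x^4/4!, ... */

        fact++; denominator *= fact; numerator *= x;
        s += op * numerator / denominator;    /* x^3/3!, x^5/5!, ... */

        op = -op;
    }
    *sine = s;
    *cosine = c;
}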


There is a nice solution in the CEPHES library which can be pretty fast and you can add/remove accuracy quite flexibly for a bit more/less CPU time.

Remember that cos(x) and sin(x) are the real and imaginary parts of exp(ix). So we want to calculate exp(ix) to get both. We precalculate exp(iy) for some discrete values of y between 0 and 2pi. We shift x to the interval [0, 2pi). Then we select the y that is closest to x and write
exp(ix) = exp(iy + i(x - y)) = exp(iy) exp(i(x - y)).

We get exp(iy) from the lookup table. And since |x-y| is small (at most half the distance between the y-values), the Taylor series will converge nicely in just a few terms, so we use that for exp(i(x-y)). And then we just need a complex multiplication to get exp(ix).

Another nice property of this is that you can vectorize it using SSE.
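A scalar C99 sketch of the idea (the table size, the rounding to the nearest entry, and the cubic Taylor correction are my own choices; the real payoff of this answer is the SSE-vectorized version):

#include <complex.h>
#include <math.h>

#define N 256                          /* table resolution is an assumption  */
static double complex exp_table[N];    /* exp_table[k] = exp(i * 2*pi*k / N) */

static void init_exp_table(void)
{
    for (int k = 0; k < N; ++k)
        exp_table[k] = cexp(I * (2.0 * M_PI * k / N));
}

static void expi_sincos(double x, double *s, double *c)
{
    double t = x / (2.0 * M_PI);
    double frac = t - floor(t);                         /* shift x into [0, 2*pi) */
    int k0 = (int)(frac * N + 0.5);                     /* nearest table angle y  */
    double d = (frac - (double)k0 / N) * 2.0 * M_PI;    /* x - y, |d| <= pi/N     */
    int k = k0 % N;                                     /* wrap k0 == N back to 0 */
    /* short Taylor series for exp(i*d): 1 + i*d - d^2/2 - i*d^3/6 */
    double complex corr = 1.0 + I * d - d * d / 2.0 - I * d * d * d / 6.0;
    double complex e = exp_table[k] * corr;             /* one complex multiply   */
    *s = cimag(e);
    *c = creal(e);
}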


You may want to have a look at http://gruntthepeon.free.fr/ssemath/, which offers an SSE vectorized implementation inspired from CEPHES library. It has good accuracy (maximum deviation from sin/cos on the order of 5e-8) and speed (slightly outperforms fsincos on a single call basis, and a clear winner over multiple values).


I have posted a solution involving inline ARM assembly capable of computing both the sine and cosine of two angles at a time here: Fast sine/cosine for ARMv7+NEON


An accurate yet fast approximation of the sin and cos functions simultaneously, in JavaScript, can be found here: http://danisraelmalta.github.io/Fmath/ (easily ported to C/C++).


Have you thought of declaring lookup tables for the two functions? You'd still have to "calculate" sin(x) and cos(x), but it'd be decidedly faster, if you don't need a high degree of accuracy.


The MSVC compiler may use the (internal) SSE2 functions

 ___libm_sse2_sincos_ (for x86)
 __libm_sse2_sincos_  (for x64)

in optimized builds if appropriate compiler flags are specified (at minimum /O2 /arch:SSE2 /fp:fast). The names of these functions seem to imply that they do not compute separate sin and cos, but both "in one step".

For example:

void sincos(double const x, double & s, double & c)
{
  s = std::sin(x);
  c = std::cos(x);
}

Assembly (for x86) with /fp:fast:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    ___libm_sse2_sincos_
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
mov     eax, DWORD PTR _c$[esp-4]
shufpd  xmm0, xmm0, 1
movsd   QWORD PTR [eax], xmm0
ret     0

Assembly (for x86) without /fp:fast but with /fp:precise instead (which is the default) calls separate sin and cos:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_sin_precise
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_cos_precise
mov     eax, DWORD PTR _c$[esp-4]
movsd   QWORD PTR [eax], xmm0
ret     0

So /fp:fast is mandatory for the sincos optimization.

But please note that

___libm_sse2_sincos_

is maybe not as precise as

__libm_sse2_sin_precise
__libm_sse2_cos_precise

due to the missing "precise" at the end of its name.

On my "slightly" older system (Intel Core 2 Duo E6750) with the latest MSVC 2019 compiler and appropriate optimizations, my benchmark shows that the sincos call is about 2.4 times faster than separate sin and cos calls.

Source: https://stackoverflow.com/questions/2683588/what-is-the-fastest-way-to-compute-sin-and-cos-together
