
How do I calculate r-squared using Python and Numpy?



I am using Python and Numpy to calculate a best-fit polynomial of arbitrary degree. I pass in a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (the correlation coefficient) and r-squared (the coefficient of determination). I am comparing my results with Excel's best-fit trendline capability and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for a linear best fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

My function looks like this:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0, 1]

    # r
    results['correlation'] = correlation
    # r-squared
    results['determination'] = correlation**2

    return results

From the numpy.polyfit documentation, it is fitting a linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function

E(y|x) = p_d * x**d + p_{d-1} * x**(d-1) + ... + p_1 * x + p_0

So you just need to calculate the R-squared for that fit. The Wikipedia page on linear regression gives full details. You are interested in R^2, which you can calculate in a couple of ways, the easiest probably being

SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST

where y_bar is the mean of the y's and y_ihat is the fitted value for each point.

I am not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = p(x)                         # or [p(z) for z in x]
    ybar = numpy.sum(y)/len(y)          # or sum(y)/len(y)
    ssreg = numpy.sum((yhat-ybar)**2)   # or sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar)**2)    # or sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results

A very late reply, but just in case someone needs a ready-made function for this:

scipy.stats.linregress

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)

as in @Adam Marples's answer.
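
For completeness, here is a minimal usage sketch (the data below is made up purely for illustration):

import numpy as np
from scipy import stats

# Illustrative data: a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# linregress returns the slope, intercept, r, two-sided p-value and standard error
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print(r_value**2)   # r-squared of the linear fit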


And from yanl (yet-another-library), sklearn.metrics has an r2_score function;

from sklearn.metrics import r2_score

coefficient_of_determination = r2_score(y, p(x))
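
For instance, a minimal, self-contained sketch combining this with numpy.polyfit (the quadratic data below is made up for illustration):

import numpy as np
from sklearn.metrics import r2_score

# Illustrative data with a roughly quadratic trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2])

# Fit a quadratic and evaluate it at the original x values
p = np.poly1d(np.polyfit(x, y, 2))
coefficient_of_determination = r2_score(y, p(x))
print(coefficient_of_determination)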

I have been using this successfully, where x and y are array-like.

import scipy.stats

def rsquared(x, y):
    """Return R^2 where x and y are array-like."""
    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
    return r_value**2

I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef, foolishly not realizing that the original question already uses corrcoef and was in fact asking about higher-order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which, while off-topic, are potentially useful to someone.


statsmodels has the capability to calculate the r^2 of a polynomial fit directly; here are two methods.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Construct the columns for the different powers of x
def get_r2_statsmodels(x, y, k=1):
    xpoly = np.column_stack([x**i for i in range(k+1)])    
    return sm.OLS(y, xpoly).fit().rsquared

# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula(x, y, k=1):
    formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
    data = {'x': x, 'y': y}
    return smf.ols(formula, data).fit().rsquared # or rsquared_adj

To further take advantage of statsmodels, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared.

model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()
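
The results object also exposes individual statistics as attributes, for example (a short sketch continuing from the fit above; attribute names are from the statsmodels OLS results API):

results.params         # fitted coefficients
results.rsquared       # r-squared
results.rsquared_adj   # adjusted r-squared
results.pvalues        # p-values of the individual coefficients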

Below is my original Answer where I benchmarked various linear regression r^2 methods...

The corrcoef function used in the Question calculates the correlation coefficient, r, only for a single linear regression, so it doesn't address the question of r^2 for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r.

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2

These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:

  • Pure Python (direct r calculation)
    • 1000 loops, best of 3: 1.59 ms per loop
  • Numpy polyfit (applicable to n-th degree polynomial fits)
    • 1000 loops, best of 3: 326 µs per loop
  • Numpy Manual (direct r calculation)
    • 10000 loops, best of 3: 62.1 µs per loop
  • Numpy corrcoef (direct r calculation)
    • 10000 loops, best of 3: 56.6 µs per loop
  • Scipy (linear regression with r as an output)
    • 1000 loops, best of 3: 676 µs per loop
  • Statsmodels (can do n-th degree polynomial and many other fits)
    • 1000 loops, best of 3: 422 µs per loop

The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef is the best tool for calculating r for a simple linear regression.

Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.

import numpy as np
from scipy import stats
import statsmodels.api as sm
import math

n=1000
x = np.random.rand(1000)*10
x.sort()
y = 10 * x + (5+np.random.randn(1000)*10-5)

x_list = list(x)
y_list = list(y)

def get_r2_numpy(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
    return r_squared

def get_r2_scipy(x, y):
    _, _, r_value, _, _ = stats.linregress(x, y)
    return r_value**2

def get_r2_statsmodels(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit().rsquared

def get_r2_python(x_list, y_list):
    n = len(x_list)
    x_bar = sum(x_list)/n
    y_bar = sum(y_list)/n
    x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
    y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
    zx = [(xi-x_bar)/x_std for xi in x_list]
    zy = [(yi-y_bar)/y_std for yi in y_list]
    r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
    return r**2

def get_r2_numpy_manual(x, y):
    zx = (x-np.mean(x))/np.std(x, ddof=1)
    zy = (y-np.mean(y))/np.std(y, ddof=1)
    r = np.sum(zx*zy)/(len(x)-1)
    return r**2

def get_r2_numpy_corrcoef(x, y):
    return np.corrcoef(x, y)[0, 1]**2

print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)

The Wikipedia article on r-squared suggests that it may be used for general model fitting rather than just linear regression.
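
For illustration, here is a minimal sketch (my own example, not from the original answers) applying R^2 = 1 - SS_res/SS_tot to a non-polynomial model fitted with scipy.optimize.curve_fit:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # An arbitrary non-polynomial model: a * exp(b * x)
    return a * np.exp(b * x)

x = np.linspace(0, 1, 50)
y = 2.0 * np.exp(1.5 * x) + np.random.randn(50) * 0.1

params, _ = curve_fit(model, x, y)
y_hat = model(x, *params)

ss_res = np.sum((y - y_hat)**2)       # residual sum of squares
ss_tot = np.sum((y - np.mean(y))**2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(r_squared)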


Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):

from __future__ import division 
import numpy as np

def compute_r2_weighted(y_true, y_pred, weight):
    sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    tse = (weight * (y_true - np.average(
        y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse

Example:

from __future__ import print_function, division
import numpy as np
import sklearn.metrics

def compute_r2_weighted(y_true, y_pred, weight):
    sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    tse = (weight * (y_true - np.average(
        y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse    

def compute_r2(y_true, y_predicted):
    sse = sum((y_true - y_predicted)**2)
    tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
    r2_score = 1 - (sse / tse)
    return r2_score, sse, tse

def main():
    '''
    Demonstrate the use of compute_r2_weighted() and checks the results against sklearn
    '''        
    y_true = [3, -0.5, 2, 7]
    y_pred = [2.5, 0.0, 2, 8]
    weight = [1, 5, 1, 2]
    r2_score = sklearn.metrics.r2_score(y_true, y_pred)
    print('r2_score: {0}'.format(r2_score))  
    r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
    print('r2_score: {0}'.format(r2_score))
    r2_score = sklearn.metrics.r2_score(y_true, y_pred,weight)
    print('r2_score weighted: {0}'.format(r2_score))
    r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
    print('r2_score weighted: {0}'.format(r2_score))

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

outputs:

r2_score: 0.9486081370449679
r2_score: 0.9486081370449679
r2_score weighted: 0.9573170731707317
r2_score weighted: 0.9573170731707317

This corresponds to the formula:

R^2 = 1 - SSE/SST

SSE = \sum_{i} w_i (y_i - f_i)^2
SST = \sum_{i} w_i (y_i - y_av)^2

where f_i is the predicted value from the fit, y_av is the mean of the observed data, y_i is the observed data value, and w_i is the weighting applied to each data point (usually w_i = 1). SSE is the sum of squares due to error and SST is the total sum of squares.


If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f (mirror)


R-squared is a statistic that only applies to linear regression.

Essentially, it measures how much variation in your data can be explained by the linear regression.

So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean:

\sum_{i}(y_{i} - y_bar)^2

where y_bar is the mean of the y's.

Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean

\sum_{i}(yHat_{i} - y_bar)^2

and find the ratio of those two.

Now, all you would have to do for a polynomial fit is plug in the y_hat's from that model, but it's not accurate to call that r-squared.
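
As a minimal sketch of that plug-in, reusing the numpy.polyfit approach from the earlier answer (the data here is made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2])

# y_hat from a degree-2 polynomial model
y_hat = np.poly1d(np.polyfit(x, y, 2))(x)

y_bar = np.mean(y)
ss_reg = np.sum((y_hat - y_bar)**2)   # "regression sum of squares"
ss_tot = np.sum((y - y_bar)**2)       # "total sum of squares"
print(ss_reg / ss_tot)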

Here is a link I found that speaks to it a little.

Source URL: https://stackoverflow.com/questions/893657/how-do-i-calculate-r-squared-using-python-and-numpy
