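Xiao He, who works at a handmade doll shop, wants to analyze the relationship between the number of dolls and their cost, and has collected the data in the table below. When plotted, the data look roughly linear, but not strictly so: the points appear to fluctuate randomly above and below a straight line. In fact, the data were generated by the "force of nature" according to the formula below, where b is a random variable that follows a normal distribution with mean 0 and variance 1.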
# 10,7.7
$$y_{i} = x_{i} + b_{i}$$
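To make the data-generating process concrete, here is a minimal sketch that draws points from the formula above. The function name `simulate_costs`, the seed, and the range of x values are assumptions made for illustration only; they are not the author's actual data.

```python
import numpy as np


def simulate_costs(n=20, seed=0):
    """Draw n points from y_i = x_i + b_i, with b_i ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    # Hypothetical doll counts; the real x values are not shown in the post.
    x = np.arange(10, 10 + n, dtype=float)
    # Noise term b_i: mean 0, variance 1 (so standard deviation 1).
    b = rng.normal(loc=0.0, scale=1.0, size=n)
    y = x + b
    # Two columns (x, y), the same layout as a two-column CSV file.
    return np.column_stack([x, y])


data = simulate_costs()
print(data.shape)
```

With the default arguments the array has 20 rows and 2 columns, the same shape as the data set used in this post.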
%matplotlib inline
import sys
'3.6.4 (default, Jan 21 2018, 16:48:17) \n[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]'
# _*_ coding=utf8 _*_
if __name__ == '__main__':
(20, 2)
linearModel(data)
/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/linalg/basic.py:1226: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
warnings.warn(mesg, RuntimeWarning)
Training a linear regression model with the third-party library statsmodels
import statsmodels.api as sm
datapath = "./data/simple_example.csv"
                            OLS Regression Results

Dep. Variable:                      y   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.960
Method:                 Least Squares   F-statistic:                     460.5
Date:                Tue, 07 Aug 2018   Prob (F-statistic):           2.85e-14
Time:                        11:40:29   Log-Likelihood:                -31.374
No. Observations:                  20   AIC:                             66.75
Df Residuals:                      18   BIC:                             68.74
Df Model:                           1
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
const         -0.9495      0.934     -1.017      0.323      -2.912       1.013
x              1.0330      0.048     21.458      0.000       0.932       1.134

Omnibus:                        0.745   Durbin-Watson:                   2.345
Prob(Omnibus):                  0.689   Jarque-Bera (JB):                0.673
Skew:                           0.074   Prob(JB):                        0.714
Kurtosis:                       2.113   Cond. No.                         66.3

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<F test: F=array([[ 460.4584822]]), p=2.848465414495684e-14, df_denom=18, df_num=1>
<F test: F=array([[ 1.03355794]]), p=0.32279564008314576, df_denom=18, df_num=1>
<F test: F=array([[ 2442.62159921]]), p=1.2108814742372977e-22, df_denom=18, df_num=2>
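In the fitted model y = ax + b, the estimate of the intercept b (the `const` row) is -0.9495, but under the null hypothesis b = 0 its p-value is as high as 32.3%. Statistically, such a parameter is not significant and should be dropped from the model. By the same reasoning, the estimate of the slope a (the `x` row) is 1.033 with a p-value below 0.01, so a is significant and should stay in the model. The model therefore needs to be adjusted by refitting it without the intercept.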
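The 32.3% figure can be checked by hand from the `const` row of the summary. The sketch below recomputes the t statistic and the two-sided p-value from the rounded coefficient and standard error, using the 18 residual degrees of freedom; small differences from the summary come from that rounding.

```python
from scipy import stats

# Values copied from the const row of the summary (rounded as printed).
coef, std_err, df_resid = -0.9495, 0.934, 18

# t statistic for H0: intercept = 0.
t_stat = coef / std_err
# Two-sided p-value from Student's t distribution with 18 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df_resid)

print(round(t_stat, 3), round(p_value, 3))
# The squared t statistic is close to the F value of the matching F test.
print(round(t_stat ** 2, 3))
```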
import os
                            OLS Regression Results

Dep. Variable:                      y   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.960
Method:                 Least Squares   F-statistic:                     460.5
Date:                Tue, 07 Aug 2018   Prob (F-statistic):           2.85e-14
Time:                        11:42:08   Log-Likelihood:                -31.374
No. Observations:                  20   AIC:                             66.75
Df Residuals:                      18   BIC:                             68.74
Df Model:                           1
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
const         -0.9495      0.934     -1.017      0.323      -2.912       1.013
x              1.0330      0.048     21.458      0.000       0.932       1.134

Omnibus:                        0.745   Durbin-Watson:                   2.345
Prob(Omnibus):                  0.689   Jarque-Bera (JB):                0.673
Skew:                           0.074   Prob(JB):                        0.714
Kurtosis:                       2.113   Cond. No.                         66.3

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<F test: F=array([[ 460.4584822]]), p=2.848465414495684e-14, df_denom=18, df_num=1>
<F test: F=array([[ 1.03355794]]), p=0.32279564008314576, df_denom=18, df_num=1>
<F test: F=array([[ 0.99654631]]), p=0.3886267976063851, df_denom=18, df_num=2>
                            OLS Regression Results

Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     4876.
Date:                Tue, 07 Aug 2018   Prob (F-statistic):           2.26e-24
Time:                        11:42:08   Log-Likelihood:                -31.933
No. Observations:                  20   AIC:                             65.87
Df Residuals:                      19   BIC:                             66.86
Df Model:                           1
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
x              0.9862      0.014     69.825      0.000       0.957       1.016

Omnibus:                        0.489   Durbin-Watson:                   2.218
Prob(Omnibus):                  0.783   Jarque-Bera (JB):                0.561
Skew:                           0.033   Prob(JB):                        0.755
Kurtosis:                       2.182   Cond. No.                         1.00

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
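The last summary has no `const` row and 19 residual degrees of freedom, which is what a fit without the intercept looks like (in statsmodels, calling `sm.OLS(y, x)` without `sm.add_constant`). One caveat when comparing it with the earlier fit: for a model without a constant, statsmodels reports the uncentered R-squared, so 0.996 is not directly comparable with 0.962. The sketch below shows the closed-form slope and both versions of R-squared on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for the post's data (assumption: y = x + N(0, 1) noise).
rng = np.random.default_rng(0)
x = np.arange(10, 30, dtype=float)
y = x + rng.normal(size=x.size)

# Least-squares slope for the no-intercept model y = a * x.
a_hat = np.sum(x * y) / np.sum(x * x)
resid = y - a_hat * x
ssr = np.sum(resid ** 2)

# Uncentered R^2 compares against sum(y^2): used when there is no constant.
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)
# Centered R^2 compares against variation around the mean of y.
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

print(a_hat, r2_uncentered, r2_centered)
```

Because the uncentered total sum of squares is never smaller than the centered one, the uncentered R-squared is at least as large as the centered value for the same residuals. Criteria such as AIC and BIC, which appear in both summaries, are a more even-handed way to compare the two models.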