Linear regression is the cornerstone of all models
Xiao He, who runs a handmade doll shop, wants to analyze the relationship between the number of dolls produced and the production cost, and has collected the data in the table below. Plotted, the data look roughly linear, though not strictly so: the points scatter randomly above and below a straight line. In fact, the data were generated by the "force of nature" according to the formula below, where ε is a random variable following a normal distribution with mean 0 and variance 1.
[Data table: 20 rows of (quantity x, cost y); one row, for example, is 10, 7.7]
$$y_{i} = x_{i} + \varepsilon_{i}$$
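For illustration, data of this shape could be simulated as follows; the seed and the quantity values are assumptions made for this sketch, not the original dataset:

import numpy as np
import pandas as pd

np.random.seed(0)                       # assumed seed, only for reproducibility
x = np.arange(1, 21)                    # assumed quantities; the real values live in the CSV
epsilon = np.random.normal(0, 1, 20)    # noise with mean 0 and variance 1
y = x + epsilon                         # the generating formula above
data = pd.DataFrame({"x": x, "y": y})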
The machine-learning approach
The steps are as follows:
1. Determine the type of problem.
2. Define a loss function that makes the model's predicted cost close to the actual cost.
3. Extract features (possibly removing mis-recorded or clearly anomalous data points).
4. Choose the model and estimate its parameters (here, go straight to a linear model).
5. Evaluate the model (the mean squared error should be minimized).
%matplotlib inline
import sys
sys.version
'3.6.4 (default, Jan 21 2018, 16:48:17) \n[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]'
# -*- coding: utf-8 -*-
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv("./data/simple_example.csv")
    print(data.shape)    # prints (20, 2): 20 observations, 2 columns
    linearModel(data)
/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/linalg/basic.py:1226: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
warnings.warn(mesg, RuntimeWarning)
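The body of linearModel does not survive in these notes. A minimal sketch consistent with the five steps above, assuming scikit-learn's LinearRegression (whose SciPy least-squares backend would also be consistent with the warning), might be:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def linearModel(data):
    # Step 4: fit a linear model of cost y against quantity x
    model = LinearRegression()
    model.fit(data[["x"]], data["y"])
    # Step 5: evaluate the fit with the mean squared error
    pred = model.predict(data[["x"]])
    print("coefficients: %s, intercept: %s" % (model.coef_, model.intercept_))
    print("MSE: %s" % mean_squared_error(data["y"], pred))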
The statistical approach
1. Assume the form of the conditional probability of y given x (written out below).
2. Estimate the parameters.
3. Derive the distributions of the parameter estimates.
4. Run hypothesis tests and construct confidence intervals.
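For step 1, the standard assumption behind linear regression, written out for reference with a as the slope, b as the intercept, and σ² as the noise variance, is

$$y_{i} \mid x_{i} \sim N(a x_{i} + b,\ \sigma^{2})$$

Steps 2 to 4 then estimate a and b, derive the distributions of those estimates, and test hypotheses about them.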
We use the third-party library statsmodels to train the linear regression model.
import statsmodels.api as sm

datapath = "./data/simple_example.csv"
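The remainder of the cell is not preserved; a minimal sketch that would produce the summary below, assuming the CSV holds the two columns x and y, is:

import pandas as pd

data = pd.read_csv(datapath)
X = sm.add_constant(data["x"])   # adds the 'const' column that carries the intercept
model = sm.OLS(data["y"], X)     # ordinary least squares for y = a*x + const
re = model.fit()
print(re.summary())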
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.962
Model: OLS Adj. R-squared: 0.960
Method: Least Squares F-statistic: 460.5
Date: Tue, 07 Aug 2018 Prob (F-statistic): 2.85e-14
Time: 11:40:29 Log-Likelihood: -31.374
No. Observations: 20 AIC: 66.75
Df Residuals: 18 BIC: 68.74
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.9495 0.934 -1.017 0.323 -2.912 1.013
x 1.0330 0.048 21.458 0.000 0.932 1.134
==============================================================================
Omnibus: 0.745 Durbin-Watson: 2.345
Prob(Omnibus): 0.689 Jarque-Bera (JB): 0.673
Skew: 0.074 Prob(JB): 0.714
Kurtosis: 2.113 Cond. No. 66.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Test the hypothesis that the coefficient of x equals 0:
<F test: F=array([[ 460.4584822]]), p=2.848465414495684e-14, df_denom=18, df_num=1>
Test the hypothesis that the coefficient of const equals 0:
<F test: F=array([[ 1.03355794]]), p=0.32279564008314576, df_denom=18, df_num=1>
<F test: F=array([[ 2442.62159921]]), p=1.2108814742372977e-22, df_denom=18, df_num=2>
The estimate of the intercept b is -0.9495, but under the hypothesis b = 0 this estimate has a p-value as high as 32.3%; statistically such a parameter is insignificant and should be dropped. The slope a is estimated as 1.033 with a p-value below 0.01, so a is significant and belongs in the model. The model therefore needs to be adjusted.
import os
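The rest of this cell is also missing. A sketch that would produce the output below, re-fitting the full model, printing the three F tests, and then re-fitting without the insignificant intercept, could look like this (variable names are assumptions carried over from the sketch above):

X = sm.add_constant(data["x"])
re = sm.OLS(data["y"], X).fit()
print(re.summary())
print("Test the hypothesis that the coefficient of x equals 0:")
print(re.f_test("x = 0"))
print("Test the hypothesis that the coefficient of const equals 0:")
print(re.f_test("const = 0"))
print("Test that the coefficient of x equals 1 and the coefficient of const equals 0 simultaneously:")
print(re.f_test("x = 1, const = 0"))

# Drop the intercept and re-fit: y = a*x + noise
resNew = sm.OLS(data["y"], data["x"]).fit()
print(resNew.summary())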
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.962
Model: OLS Adj. R-squared: 0.960
Method: Least Squares F-statistic: 460.5
Date: Tue, 07 Aug 2018 Prob (F-statistic): 2.85e-14
Time: 11:42:08 Log-Likelihood: -31.374
No. Observations: 20 AIC: 66.75
Df Residuals: 18 BIC: 68.74
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.9495 0.934 -1.017 0.323 -2.912 1.013
x 1.0330 0.048 21.458 0.000 0.932 1.134
==============================================================================
Omnibus: 0.745 Durbin-Watson: 2.345
Prob(Omnibus): 0.689 Jarque-Bera (JB): 0.673
Skew: 0.074 Prob(JB): 0.714
Kurtosis: 2.113 Cond. No. 66.3
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Test the hypothesis that the coefficient of x equals 0:
<F test: F=array([[ 460.4584822]]), p=2.848465414495684e-14, df_denom=18, df_num=1>
Test the hypothesis that the coefficient of const equals 0:
<F test: F=array([[ 1.03355794]]), p=0.32279564008314576, df_denom=18, df_num=1>
Test the hypothesis that the coefficient of x equals 1 and the coefficient of const equals 0 simultaneously:
<F test: F=array([[ 0.99654631]]), p=0.3886267976063851, df_denom=18, df_num=2>
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.996
Model: OLS Adj. R-squared: 0.996
Method: Least Squares F-statistic: 4876.
Date: Tue, 07 Aug 2018 Prob (F-statistic): 2.26e-24
Time: 11:42:08 Log-Likelihood: -31.933
No. Observations: 20 AIC: 65.87
Df Residuals: 19 BIC: 66.86
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x 0.9862 0.014 69.825 0.000 0.957 1.016
==============================================================================
Omnibus: 0.489 Durbin-Watson: 2.218
Prob(Omnibus): 0.783 Jarque-Bera (JB): 0.561
Skew: 0.033 Prob(JB): 0.755
Kurtosis: 2.182 Cond. No. 1.00
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Comparing the two approaches: when building models, data scientists usually define technical metrics (such as mean squared error) to measure prediction accuracy. But we also need the parameter estimates themselves to be reliable, and here they are not: the machine-learning result implies that the fixed cost of producing the dolls is negative, which contradicts reality. The model has not captured the true underlying relationship in the data. To improve prediction accuracy, practitioners often extract more features and build ever more complex models, and everyone is keen on higher model complexity. But once overfitting sets in, the more complex the model, the worse its errors.
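As one concrete check on prediction accuracy, the in-sample mean squared errors of the two statsmodels fits can be compared; re and resNew are the assumed names of the full and intercept-free results from the sketches above:

import numpy as np

mse_full = np.mean((data["y"] - re.predict(X)) ** 2)
mse_reduced = np.mean((data["y"] - resNew.predict(data["x"])) ** 2)
print("full model MSE: %.4f, intercept-free MSE: %.4f" % (mse_full, mse_reduced))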