Getting Started with Data Science 01

Linear regression is the cornerstone of all models

Xiao He, who runs a handmade doll shop, wants to analyze the relationship between the number of dolls produced and the production cost, and he has collected the data in the table below. Plotted, the points look roughly linear, but not strictly so: they appear to fluctuate randomly around a straight line. In fact, the data were generated by "nature" according to the formula below, where $\epsilon_i$ is a random variable following a normal distribution with mean 0 and variance 1.

```
10,7.7
10,9.87
11,11.18
12,10.43
13,12.36
14,14.15
15,15.73
16,16.4
17,18.86
18,16.13
19,18.21
20,18.37
21,22.61
22,19.83
23,22.67
24,22.7
25,25.16
26,25.55
27,28.21
28,28.12
...
```
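
To eyeball the linear trend before any modeling, here is a minimal sketch that plots the raw data, assuming the rows above are saved as `./data/simple_example.csv` with an `x,y` header:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw data and draw a simple scatter plot of quantity vs. cost
data = pd.read_csv("./data/simple_example.csv")
data.plot.scatter(x="x", y="y")
plt.show()
```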


$$y_{i} = x_{i} + \epsilon_{i}$$
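
For completeness, a minimal sketch of how data like this could be simulated under the formula above; the x values, the seed, and the output file name are assumptions, not the original generating script:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumed seed, for reproducibility only
x = np.arange(10, 30)  # assumed quantities, roughly matching the table above
eps = np.random.normal(loc=0, scale=1, size=len(x))  # epsilon_i ~ N(0, 1)
y = x + eps  # y_i = x_i + epsilon_i
pd.DataFrame({"x": x, "y": y}).to_csv("simulated_example.csv", index=False)
```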

The machine-learning approach

The steps are as follows:
1. Determine the type of problem.
2. Define a loss function that makes the predicted cost close to the actual cost (see the sketch after this list).
3. Extract features (possibly removing mis-recorded or clearly abnormal data).
4. Choose the model and estimate its parameters (here, a plain linear model).
5. Evaluate the model (the mean squared error should be as small as possible).
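
Step 2's loss here is the squared difference between predicted and actual cost, averaged over the data; a minimal sketch of that definition:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the loss minimized in step 2 and checked in step 5."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)
```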

```python
%matplotlib inline
```
```python
import sys
sys.version
```

```
'3.6.4 (default, Jan 21 2018, 16:48:17) \n[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]'
```
```python
# -*- coding: utf-8 -*-

"""
Machine-learning implementation
"""

import os
import sys

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model


def data_load(path):
    # Scratch notes on pandas indexing, kept commented out for reference:
    # data = pd.read_csv(path)
    # data = pd.read_csv(path, delimiter=',', header=0)
    # # Select columns
    # data['x']
    # data[['x', 'y']]
    # data.x
    #
    # # Select rows
    # data[:10]
    # data[10:]
    # data[2:3]
    pass


def read_data(path):
    # data = pd.read_csv(path, delimiter=',', header=0)
    data = pd.read_csv(path)
    return data


def linearModel(data):
    features = ["x"]
    labels = ["y"]

    train_data = data[:15]
    test_data = data[15:]
    # Create and train the model
    model = train_model(train_data, features, labels)
    # Evaluate the model
    error, score = evaluate_model(model, test_data, features, labels)
    # Visualize the result
    visual_model(model, data, features, labels, error, score)


def train_model(train_data, features, labels):
    model = linear_model.LinearRegression()
    model.fit(train_data[features], train_data[labels])
    return model


def evaluate_model(model, test_data, features, labels):
    # Mean squared error on the test set
    error = np.mean(
        (model.predict(test_data[features]) - test_data[labels])**2)
    # Coefficient of determination (R^2)
    score = model.score(test_data[features], test_data[labels])
    return error, score


def visual_model(model, data, features, labels, error, score):
    """
    Visualize the model
    """

    # Use a font that can render the Chinese labels in Matplotlib
    plt.rcParams['font.family'] = ['Heiti']
    # Create a figure
    fig = plt.figure(figsize=(6, 6), dpi=80)
    # Draw a single subplot inside it
    ax = fig.add_subplot(111)
    # Chinese text in Matplotlib must be unicode;
    # in Python 3, str does not need to be decoded
    if sys.version_info[0] == 3:
        ax.set_title(u'%s' % "线性回归示例")
    else:
        ax.set_title(u'%s' % "线性回归示例".decode("utf-8"))
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    # Scatter plot: blue dots are the raw data
    if sys.version_info[0] == 3:
        ax.scatter(data[features], data[labels], color='b',
                   label=u'%s: $y = x + \\epsilon$' % "真实值")
    else:
        ax.scatter(data[features], data[labels], color='b',
                   label=u'%s: $y = x + \\epsilon$' % "真实值".decode("utf-8"))
    # Print a different label depending on the sign of the intercept
    if model.intercept_[0] > 0:
        # Line plot: the red line is the model prediction
        if sys.version_info[0] == 3:
            ax.plot(data[features], model.predict(data[features]), color='r',
                    label=u'%s: $y = %.3fx$ + %.3f'
                    % ("预测值", model.coef_[0][0], model.intercept_[0]))
        else:
            ax.plot(data[features], model.predict(data[features]), color='r',
                    label=u'%s: $y = %.3fx$ + %.3f'
                    % ("预测值".decode("utf-8"), model.coef_[0][0],
                       model.intercept_[0]))
    else:
        if sys.version_info[0] == 3:
            ax.plot(data[features], model.predict(data[features]), color='r',
                    label=u'%s: $y = %.3fx$ - %.3f'
                    % ("预测值", model.coef_[0][0], abs(model.intercept_[0])))
        else:
            ax.plot(data[features], model.predict(data[features]), color='r',
                    label=u'%s: $y = %.3fx$ - %.3f'
                    % ("预测值".decode("utf-8"), model.coef_[0][0],
                       abs(model.intercept_[0])))
    legend = plt.legend(shadow=True)
    legend.get_frame().set_facecolor('#6F93AE')
    # Show the mean squared error and the coefficient of determination
    if sys.version_info[0] == 3:
        ax.text(0.99, 0.01,
                u'%s%.3f\n%s%.3f'
                % ("均方差:", error, "决定系数:", score),
                style='italic', verticalalignment='bottom',
                horizontalalignment='right',
                transform=ax.transAxes, color='m', fontsize=13)
    else:
        ax.text(0.99, 0.01,
                u'%s%.3f\n%s%.3f'
                % ("均方差:".decode("utf-8"), error,
                   "决定系数:".decode("utf-8"), score),
                style='italic', verticalalignment='bottom',
                horizontalalignment='right',
                transform=ax.transAxes, color='m', fontsize=13)
    # Show the figure. This blocks until all figure windows are closed;
    # in a Python shell, pass block=False to avoid blocking.
    plt.show()
```
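
Beyond evaluating on held-out rows, the trained model can also price an unseen quantity; a minimal usage sketch (the quantity 30 is an arbitrary example, not from the original text):

```python
# Train on the first 15 rows, as linearModel does, then predict a new point
data = read_data("./data/simple_example.csv")
model = train_model(data[:15], ["x"], ["y"])
# Predict the cost of producing 30 dolls; with y ≈ x, expect a value near 30
print(model.predict(pd.DataFrame({"x": [30]})))
```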


Why build a model at all? There are two goals: to use the model to predict unseen data, or to use it to analyze the data, uncover its internal patterns, and support decision-making. When using a model for prediction, we want its accuracy to be as high as possible; when using it for analysis, we care more about whether the model is reliable.
```python
if __name__ == '__main__':
    # home_path = os.path.dirname(os.path.abspath(__file__))
    if os.name == "posix":
        data_path = "./data/simple_example.csv"

    data = read_data(data_path)
    print(data.shape)
    # linearModel(data)
```

```
(20, 2)
```
```python
linearModel(data)
```

```
/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/linalg/basic.py:1226: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)
```

*(Figure: scatter of the raw data with the fitted regression line, annotated with the mean squared error and the coefficient of determination.)*

The statistical approach

1. Assume a form for the conditional probability (distribution).
2. Estimate the parameters.
3. Derive the distribution of the parameter estimates (sketched below).
4. Run hypothesis tests and build confidence intervals.
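
For reference, writing the model as $y_i = a x_i + b + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma^2)$, the least-squares estimates of step 2 and the t-statistic behind the tests of step 4 are:

$$\hat{a} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\,\bar{x}, \qquad t = \frac{\hat{a} - a_0}{\mathrm{se}(\hat{a})}$$

where $a_0$ is the hypothesized value (0 in the tests below) and $\mathrm{se}(\hat{a})$ is the standard error of the estimate.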

The third-party library statsmodels is used below to train the linear regression model.

```python
import statsmodels.api as sm


def linear_model2(data):
    features = ["x"]
    labels = ["y"]
    Y = data[labels]
    # Add a constant (intercept) column to the features
    X = sm.add_constant(data[features])
    result = train_model2(X, Y)
    model_summary(result)


def train_model2(X, Y):
    model = sm.OLS(Y, X)
    result = model.fit()
    return result


def model_summary(result):
    # Overall statistical summary
    print(result.summary())
    # F-tests on the individual and joint hypotheses
    print(result.f_test("x=0"))
    print(result.f_test("const=0"))
    print(result.f_test(["x=0", "const=0"]))
```
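
Besides `f_test`, the fitted result object also exposes the parameter confidence intervals directly; a minimal sketch using the `result` returned by `train_model2` above:

```python
# 95% confidence intervals for const and x (alpha=0.05 is the default)
print(result.conf_int(alpha=0.05))
```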
```python
datapath = "./data/simple_example.csv"
data = read_data(datapath)
linear_model2(data)
```
```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.960
Method:                 Least Squares   F-statistic:                     460.5
Date:                Tue, 07 Aug 2018   Prob (F-statistic):           2.85e-14
Time:                        11:40:29   Log-Likelihood:                -31.374
No. Observations:                  20   AIC:                             66.75
Df Residuals:                      18   BIC:                             68.74
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.9495      0.934     -1.017      0.323      -2.912       1.013
x              1.0330      0.048     21.458      0.000       0.932       1.134
==============================================================================
Omnibus:                        0.745   Durbin-Watson:                   2.345
Prob(Omnibus):                  0.689   Jarque-Bera (JB):                0.673
Skew:                           0.074   Prob(JB):                        0.714
Kurtosis:                       2.113   Cond. No.                         66.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<F test: F=array([[ 460.4584822]]), p=2.848465414495684e-14, df_denom=18, df_num=1>
<F test: F=array([[ 1.03355794]]), p=0.32279564008314576, df_denom=18, df_num=1>
<F test: F=array([[ 2442.62159921]]), p=1.2108814742372977e-22, df_denom=18, df_num=2>
```

The estimate of the intercept b is -0.9495, but under the hypothesis b = 0 its p-value is as high as 32.3%; statistically, such a parameter is not significant and should be dropped. By the same reasoning, the estimate of a is 1.033 with a p-value below 0.01, so a is significant and belongs in the model. The model therefore needs to be adjusted; the next listing refits it without the constant term.

```python
import os
import sys

import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import matplotlib.pyplot as plt
import pandas as pd


def modelSummary(re):
    """
    Analyze the statistical properties of the linear model
    """

    # Overall statistical summary
    print(re.summary())
    # On Windows, make sure the command prompt (cmd) can display Chinese
    # F-test: is the coefficient a of x significant?
    print("检验假设x的系数等于0:")
    print(re.f_test("x=0"))
    # F-test: is the constant b significant?
    print("检测假设const的系数等于0:")
    print(re.f_test("const=0"))
    # F-test: do a = 1 and b = 0 hold jointly?
    print("检测假设x的系数等于1和const的系数等于0同时成立:")
    print(re.f_test(["x=1", "const=0"]))


def visualizeModel(re, data, features, labels):
    """
    Visualize the model
    """

    # Standard deviation of the prediction, and the lower and upper
    # bounds of the 95% prediction band
    prstd, preLow, preUp = wls_prediction_std(re, alpha=0.05)
    # Use a font that can render the Chinese labels in Matplotlib
    plt.rcParams['font.family'] = ['Heiti']
    # Create a figure
    fig = plt.figure(figsize=(6, 6), dpi=80)
    # Draw a single subplot inside it
    ax = fig.add_subplot(111)
    # Chinese text in Matplotlib must be unicode;
    # in Python 3, str does not need to be decoded
    if sys.version_info[0] == 3:
        ax.set_title(u'%s' % "线性回归统计分析示例")
    else:
        ax.set_title(u'%s' % "线性回归统计分析示例".decode("utf-8"))
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    # Scatter plot: blue dots are the raw data
    if sys.version_info[0] == 3:
        ax.scatter(data[features], data[labels], color='b',
                   label=u'%s: $y = x + \\epsilon$' % "真实值")
    else:
        ax.scatter(data[features], data[labels], color='b',
                   label=u'%s: $y = x + \\epsilon$' % "真实值".decode("utf-8"))
    # Dashed red lines mark the 95% confidence band
    if sys.version_info[0] == 3:
        ax.plot(data[features], preUp, "r--", label=u'%s' % "95%置信区间")
        ax.plot(data[features], re.predict(data[features]), color='r',
                label=u'%s: $y = %.3fx$'
                % ("预测值", re.params[features[0]]))
    else:
        ax.plot(data[features], preUp, "r--",
                label=u'%s' % "95%置信区间".decode("utf-8"))
        ax.plot(data[features], re.predict(data[features]), color='r',
                label=u'%s: $y = %.3fx$'
                % ("预测值".decode("utf-8"), re.params[features[0]]))
    ax.plot(data[features], preLow, "r--")
    legend = plt.legend(shadow=True)
    legend.get_frame().set_facecolor('#6F93AE')
    plt.show()


def trainModel(X, Y):
    """
    Train the model
    """

    model = sm.OLS(Y, X)
    re = model.fit()
    return re


def linearModel2(data):
    """
    Walk through the statistical analysis of a linear regression

    Parameters
    ----------
    data : DataFrame, the modeling data
    """

    features = ["x"]
    labels = ["y"]
    Y = data[labels]
    # Add a constant (intercept) column
    X = sm.add_constant(data[features])
    # Build the model
    re = trainModel(X, Y)
    # Analyze the model
    modelSummary(re)
    # const is not significant, so drop the constant term
    resNew = trainModel(data[features], Y)
    # Print the summary of the new model
    print(resNew.summary())
    # Visualize the result
    visualizeModel(resNew, data, features, labels)
```
```python
linearModel2(data)
```
```
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.960
Method:                 Least Squares   F-statistic:                     460.5
Date:                Tue, 07 Aug 2018   Prob (F-statistic):           2.85e-14
Time:                        11:42:08   Log-Likelihood:                -31.374
No. Observations:                  20   AIC:                             66.75
Df Residuals:                      18   BIC:                             68.74
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.9495      0.934     -1.017      0.323      -2.912       1.013
x              1.0330      0.048     21.458      0.000       0.932       1.134
==============================================================================
Omnibus:                        0.745   Durbin-Watson:                   2.345
Prob(Omnibus):                  0.689   Jarque-Bera (JB):                0.673
Skew:                           0.074   Prob(JB):                        0.714
Kurtosis:                       2.113   Cond. No.                         66.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
检验假设x的系数等于0:
<F test: F=array([[ 460.4584822]]), p=2.848465414495684e-14, df_denom=18, df_num=1>
检测假设const的系数等于0:
<F test: F=array([[ 1.03355794]]), p=0.32279564008314576, df_denom=18, df_num=1>
检测假设x的系数等于1和const的系数等于0同时成立:
<F test: F=array([[ 0.99654631]]), p=0.3886267976063851, df_denom=18, df_num=2>
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     4876.
Date:                Tue, 07 Aug 2018   Prob (F-statistic):           2.26e-24
Time:                        11:42:08   Log-Likelihood:                -31.933
No. Observations:                  20   AIC:                             65.87
Df Residuals:                      19   BIC:                             66.86
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.9862      0.014     69.825      0.000       0.957       1.016
==============================================================================
Omnibus:                        0.489   Durbin-Watson:                   2.218
Prob(Omnibus):                  0.783   Jarque-Bera (JB):                0.561
Skew:                           0.033   Prob(JB):                        0.755
Kurtosis:                       2.182   Cond. No.                         1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

*(Figure: scatter of the raw data, the fitted line $y = 0.986x$, and the dashed 95% confidence band.)*

Comparing the two approaches: when building models, data scientists usually define technical metrics that measure how accurate a model's predictions are. But we also need the parameter estimates themselves to be reliable, and here they are not: the machine-learning result implies that the fixed cost of producing dolls is negative, which contradicts reality. That model has not captured the true relationship underlying the data. To improve predictive accuracy, it is common to extract more features and build ever more complex models on top of them, and everyone is drawn to higher model complexity. Yet once overfitting sets in, the more complex the model, the more wrong it is.