Kaggle Summary

These notes mainly cover the machine learning workflow.

The tools are scikit-learn together with pandas, NumPy, matplotlib, and seaborn.


import pandas as pd

input_df = pd.read_csv('data/raw/train.csv', header=0)
submit_df  = pd.read_csv('data/raw/test.csv',  header=0)

# Combine the training and test sets
df = pd.concat([input_df, submit_df])

# Rebuild the index
df.reset_index(inplace=True)

# Drop the 'index' column that reset_index() created
df.drop('index', axis=1, inplace=True)

print(df.shape[1], "columns:", df.columns.values)
print("Row count:", df.shape[0])

The output is as follows:


12 columns: ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
Row count: 1309

We can see there are 12 columns (the Survived target plus 11 others).

Check how complete the data is:

def observe(df):
    # Print the missing-value count and dtype of every column
    print("column: ", df.shape[1])
    for col in df.columns:
        print(col, "missing ", pd.isnull(df[col]).sum(), " type:", df[col].dtypes)

observe(df)

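Cabin is missing for most rows, so I think this feature can be ignored.
Age is missing for relatively few rows, and Age is an important feature, so it should be kept.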

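How to handle missing data

Drop the rows that contain a missing value: when only a small amount of data is affected, this is simple and quick.
Assign a special value to mark the value as missing: this suits categorical variables best. A missing value is data that does not exist, so assigning it something like the mean is not meaningful, unless some underlying cause makes the missing values affect the correlation with another value. This approach does not suit continuous variables. For binary variables, however, we can encode a missing value as 0, with True as 1 and False as -1.
Assign the mean to the missing value: this simple approach is very common and is good enough for unimportant features. The mean can also be computed in combination with other variables. For categorical variables, the most common value may work better than the mean.
Use a machine learning algorithm or model to predict the missing data: this probably only works well when there is a lot of data.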
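The first three strategies above can be sketched on a small made-up frame. The rows are invented for illustration, and filling Age with the median and Embarked with the most common value is an assumption, not necessarily what these notes applied to the real data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": ["C85", None, None, "E46"],
    "Fare": [7.25, 71.28, np.nan, 26.55],
})

# 1. Drop rows that are missing a value in a given column
dropped = toy.dropna(subset=["Fare"])

# 2. Mark missing categorical values with a special placeholder
toy["Cabin"] = toy["Cabin"].fillna("U0")

# 3. Fill with a typical value: median for numeric, most common for categorical
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(len(dropped))
print(toy)
```

The fourth strategy, predicting the missing values with a model, follows the same pattern but replaces the fill value with the output of a regressor trained on the rows where the value is known.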

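Variable transformation

The purpose of variable transformation is to convert the data into a format the model can use. Different implementations of Random Forest accept different data types. Scikit-learn requires all data to be numeric, so we have to convert the raw data to numeric form.

All data falls into two kinds: 1. Quantitative variables can be ordered in some way; Age is a good example. 2. Qualitative variables describe some aspect of an object that cannot be expressed mathematically; Embarked is an example.

Dummy Variables
Dummy variables are categorical or binary indicator variables. They work well when a qualitative variable takes only a few distinct values that each occur frequently. Take Embarked as an example: it contains only three values, 'S', 'C', and 'Q', and the following code converts it to dummies: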

embark_dummies  = pd.get_dummies(df['Embarked'])
df = df.join(embark_dummies)
df.drop(['Embarked'], axis=1,inplace=True)

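Factorizing
Dummies do not handle a nominal attribute such as Cabin (the cabin number) well, because it takes too many distinct values. Pandas therefore offers a method called factorize(), which represents a categorical variable with numbers by mapping each category to an ID. This mapping produces a single feature, unlike dummies, which produce several.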

import re

# Replace missing values with "U0" (fillna avoids chained assignment)
df['Cabin'] = df['Cabin'].fillna('U0')

# create feature for the alphabetical part of the cabin number
df['CabinLetter'] = df['Cabin'].map( lambda x : re.compile("([a-zA-Z]+)").search(x).group())

# convert the distinct cabin letters with incremental integer values
df['CabinLetter'] = pd.factorize(df['CabinLetter'])[0]

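Scaling
Scaling maps values from a very wide range into a small one (usually -1 to 1, or 0 to 1). We often need to scale features to the same range, because otherwise features with a wide range carry more weight. For example, Age may only range from 0 to 100 while income may range from 0 to 10,000,000, and in models that are sensitive to the magnitude of values this affects the result.

The following code scales Age: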

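from sklearn import preprocessing

# StandardScaler subtracts the mean from each value, then scales to unit variance
scaler = preprocessing.StandardScaler()
# fit_transform expects 2-D input, so pass Age as a one-column DataFrame
df['Age_scaled'] = scaler.fit_transform(df[['Age']]).ravel()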

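Binning
Binning discretizes continuous data by looking at its "neighbors" (the surrounding values). The values are distributed into a number of "buckets" or bins, just as the bins of a histogram split the data into chunks. The following code bins Fare.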

# Divide all fares into quartiles
df['Fare_bin'] = pd.qcut(df['Fare'], 4)

# qcut() creates a new variable that identifies the quartile range, but we can't use
# the interval labels directly, so either factorize or create dummies from the result.
# factorize() returns (codes, uniques); keep only the integer codes.
df['Fare_bin_id'] = pd.factorize(df['Fare_bin'])[0]
df = pd.concat([df, pd.get_dummies(df['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))], axis=1)

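An important part of feature extraction is understanding the data deeply and being able to derive new features for prediction. Model selection and parameter selection are the core of machine learning, and feature extraction is arguably the most important step of all.

One example of feature extraction is deriving country, region, and city information from a phone number, or from GPS coordinates. Any qualitative variable that describes something may hold useful features, and temporal information is also very useful.

The Titanic data is very simple and does not need much processing. Below we only extract a few variables from Name, Cabin, and Ticket.

Name: extract the title.
Cabin
The cabin value contains the deck and the room number. Decks differ in position, in the number of lifeboats, in the age distribution of passengers, and so on. Room numbers differ in distance from the deck, in distance from the lifeboats, and so on. Extracting the deck and the room number from Cabin as two features is therefore important.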
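A minimal sketch of these extractions, on two rows written in the Titanic format. The regular expressions and the column names Title, Deck, and Room are assumptions chosen for illustration, and "U0" stands in for a missing cabin.

```python
import re
import pandas as pd

# Toy rows in the Titanic format, used for illustration only
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Cabin": ["C85", None],
})

def extract_title(name):
    # The title sits between the comma and the first period: "Braund, Mr. Owen"
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"

toy["Title"] = toy["Name"].map(extract_title)

# Treat a missing cabin as "U0": unknown deck "U", room 0
cabin = toy["Cabin"].fillna("U0")
toy["Deck"] = cabin.str.extract(r"([a-zA-Z]+)", expand=False)
# Cabins without digits yield NaN, which is mapped to room 0 here
rooms = pd.to_numeric(cabin.str.extract(r"([0-9]+)", expand=False))
toy["Room"] = rooms.fillna(0).astype(int)

print(toy[["Title", "Deck", "Room"]])
```

Title and Deck are still strings at this point, so they would need factorize() or dummies before being passed to scikit-learn.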

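There are many machine learning models. Those used for classification include:

Regression algorithms: Logistic Regression, Ordinary Least Squares, and so on.
Decision trees: CART, ID3, Random Forest, and so on.
Bayesian: Naive Bayes, BBN, and so on.
Instance-based algorithms: KNN, LVQ, and so on.
Ensemble models, association rules, neural networks, deep learning, and so on.
With so many models it is easy to get lost. Which one suits this scenario?

Random Forest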


from sklearn.ensemble import RandomForestClassifier

# Assumes df now holds only numeric feature columns plus 'Survived'
n_train = input_df.shape[0]
X = df[:n_train].drop('Survived', axis=1).values
y = df[:n_train]['Survived'].values.astype(int)

X_test = df[n_train:].drop('Survived', axis=1).values
random_forest = RandomForestClassifier(oob_score=True, n_estimators=1000)
random_forest.fit(X, y)

Y_pred = random_forest.predict(X_test)
# Accuracy on the training data itself, so it is optimistic
print(random_forest.score(X, y))
# PassengerId comes from the original test file
submission = pd.DataFrame({
    "PassengerId": submit_df["PassengerId"],
    "Survived": Y_pred.astype(int)
})
submission.to_csv('result.csv', index=False)

GBDT

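from sklearn.ensemble import GradientBoostingClassifier

# Assumes df now holds only numeric feature columns plus 'Survived'
n_train = input_df.shape[0]
X = df[:n_train].drop('Survived', axis=1).values
y = df[:n_train]['Survived'].values.astype(int)

X_test = df[n_train:].drop('Survived', axis=1).values
GBDT = GradientBoostingClassifier(n_estimators=1000)
GBDT.fit(X, y)

Y_pred = GBDT.predict(X_test)
# Accuracy on the training data itself, so it is optimistic
print(GBDT.score(X, y))
# PassengerId comes from the original test file
submission = pd.DataFrame({
    "PassengerId": submit_df["PassengerId"],
    "Survived": Y_pred.astype(int)
})
submission.to_csv('result2.csv', index=False)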

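Tuning and optimization

I looked at the data again to see which other features could be used, did some more searching on Google, and came up with three new features: title, family size, and surname.

Title: different titles imply different social status, and people of different status see life and events differently. Did status also affect the chance of getting a place in a lifeboat? Might one group have had a higher survival rate?

Family size: does a family of seven have a better chance of escaping than a family of two? Is it harder for a large family to escape?

Surname: this feature really exists to support the family feature. Are people with the same surname more likely to belong to the same family?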
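A minimal sketch of the family size and surname features, assuming family size is counted as SibSp + Parch + 1 (the passenger included) and the surname is the part of Name before the comma. The column names FamilySize, Surname, and SurnameId are hypothetical.

```python
import pandas as pd

# Toy rows in the Titanic format, used for illustration only
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "SibSp": [1, 1],
    "Parch": [0, 2],
})

# Family size: siblings/spouses plus parents/children plus the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Surname: the part of Name before the first comma
toy["Surname"] = toy["Name"].str.split(",").str[0].str.strip()

# One integer ID per surname so the feature is numeric
toy["SurnameId"] = pd.factorize(toy["Surname"])[0]

print(toy[["FamilySize", "Surname", "SurnameId"]])
```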

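For parameter tuning, scikit-learn provides two methods, GridSearch and RandomizedSearch (the GridSearchCV and RandomizedSearchCV classes). In both cases you specify the range of values for each parameter in a dictionary and pass that dictionary to the search, which then fits the model on combinations of those values. GridSearch tests every possible combination. RandomizedSearch requires you to say how many combinations to test, then picks and combines values at random.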
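The two searches can be compared on synthetic data. This is a sketch only: the data comes from make_classification rather than the Titanic set, and the parameter ranges are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Grid search: tries every combination (2 x 2 = 4 candidates here)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [1, 3]},
    cv=3,
)
grid.fit(X_demo, y_demo)

# Randomized search: samples n_iter combinations from the given ranges
rand = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions={
        "max_depth": randint(2, 8),
        "min_samples_leaf": randint(1, 5),
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
rand.fit(X_demo, y_demo)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

Grid search cost grows with the product of the list lengths, while randomized search cost is fixed by n_iter, which is why the latter is often preferred when many parameters are tuned at once.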

I used Random Forest with max_depth=5 to guard against overfitting, and raised n_estimators to 30000.

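First, Error = Bias + Variance (more precisely, the expected squared error is the squared bias plus the variance plus irreducible noise). Error reflects the accuracy of the model as a whole. Bias reflects the gap between the model's output on the samples and the true value, that is, the accuracy of the model itself. Variance reflects the gap between each individual output of the model and the model's expected output, that is, the stability of the model.
Take a shooting exercise as an example. The goal is to hit the 10 ring, but the shot lands on the 7 ring, so the Error is 3. There are two possible reasons: first, the aim was off, for instance the shooter was actually aiming at the 9 ring rather than the 10; second, the gun itself is unstable, so that although it was aimed at the 9 ring it only hit the 7. In this single shot the Bias is 1, reflecting the gap between what the model expects and the true target, and the error due to Variance is 2: although the aim was at the 9 ring, the model's lack of stability produced a gap between the actual result and the model's expectation.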
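Over many shots the same idea can be checked numerically. In terms of squared error against a fixed target, the mean squared error equals the squared bias plus the variance. The scores below are hypothetical, chosen so that the shooter averages the 9 ring.

```python
import numpy as np

target = 10.0
# Hypothetical scores from repeated shots by the same shooter
shots = np.array([9.5, 8.0, 10.0, 7.0, 10.5])

bias = shots.mean() - target          # systematic offset of the aim
variance = shots.var()                # spread around the shooter's own mean
mse = np.mean((shots - target) ** 2)  # average squared error against the target

# mse and bias**2 + variance agree up to floating-point rounding
print(bias, variance, mse, bias ** 2 + variance)
```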

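High variance with low bias means "overfitting": the model fits the samples too closely and does not carry over well to new data. High bias with low variance means "underfitting": the model does not learn well from the samples and struggles to predict new data. Bias and variance usually cannot both be minimized. Lowering a model's bias tends to raise its variance to some degree, and vice versa.

For example, if the model has high variance, a common remedy is to reduce the number of features. But this also increases bias, so the balance between the two needs careful thought.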

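import numpy as np
import matplotlib.pyplot as plt
# learning_curve lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import learning_curve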
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds; None uses
        scikit-learn's default. Specific cross-validation objects can be
        passed, see the sklearn.model_selection module for the possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
title = "Learning Curves"
plot_learning_curve(RandomForestClassifier(oob_score=True, n_estimators=30000, max_depth=5), title, X, y, ylim=(0.5, 1.01), cv=None, n_jobs=4, train_sizes=[50, 100, 150, 200, 250, 350, 400])
plt.show()
