###加载数据集
数据集为Bike-Sharing-Dataset
[u'1', u'2011-01-01', u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
17379
###为线性回归模型准备数据集
|
|
Mapping of first categorical feature column: {u'1': 0, u'3': 1, u'2': 2, u'4': 3}
|
|
Feature vector length for categorical features: 57
Feature vector length for numerical features: 4
Total feature vector length: 61
|
|
|
|
|
|
Raw data: [u'1', u'0', u'1', u'0', u'0', u'6', u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13', u'16']
Label: 16.0
Linear Model feature vector:
[1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.24,0.2879,0.81,0.0]
Linear Model feature vector length: 61
###为决策回归树准备数据集
|
|
|
|
Label: 16.0
Decision Tree feature vector: [1.0,0.0,1.0,0.0,0.0,6.0,0.0,1.0,0.24,0.2879,0.81,0.0]
Decision Tree feature vector length: 12
###线性回归和决策回归树的帮助文档
|
|
Help on method train in module pyspark.mllib.regression:
train(cls, data, iterations=100, step=1.0, miniBatchFraction=1.0, initialWeights=None, regParam=0.0, regType=None, intercept=False, validateData=True) method of __builtin__.type instance
Train a linear regression model using Stochastic Gradient
Descent (SGD).
This solves the least squares regression formulation
f(weights) = 1/n ||A weights-y||^2^
(which is the mean squared error).
Here the data matrix has n rows, and the input RDD holds the
set of rows of A, each with its corresponding right hand side
label y. See also the documentation for the precise formulation.
:param data: The training data, an RDD of
LabeledPoint.
:param iterations: The number of iterations
(default: 100).
:param step: The step parameter used in SGD
(default: 1.0).
:param miniBatchFraction: Fraction of data to be used for each
SGD iteration (default: 1.0).
:param initialWeights: The initial weights (default: None).
:param regParam: The regularizer parameter
(default: 0.0).
:param regType: The type of regularizer used for
training our model.
:Allowed values:
- "l1" for using L1 regularization (lasso),
- "l2" for using L2 regularization (ridge),
- None for no regularization
(default: None)
:param intercept: Boolean parameter which indicates the
use or not of the augmented representation
for training data (i.e. whether bias
features are activated or not,
default: False).
:param validateData: Boolean parameter which indicates if
the algorithm should validate data
before training. (default: True)
|
|
Help on method trainRegressor in module pyspark.mllib.tree:
trainRegressor(cls, data, categoricalFeaturesInfo, impurity='variance', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0) method of __builtin__.type instance
Train a DecisionTreeModel for regression.
:param data: Training data: RDD of LabeledPoint.
Labels are real numbers.
:param categoricalFeaturesInfo: Map from categorical feature
index to number of categories.
Any feature not in this map is treated as continuous.
:param impurity: Supported values: "variance"
:param maxDepth: Max depth of tree.
E.g., depth 0 means 1 leaf node.
Depth 1 means 1 internal node + 2 leaf nodes.
:param maxBins: Number of bins used for finding splits at each
node.
:param minInstancesPerNode: Min number of instances required at
child nodes to create the parent split
:param minInfoGain: Min info gain required to create a split
:return: DecisionTreeModel
Example usage:
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.tree import DecisionTree
>>> from pyspark.mllib.linalg import SparseVector
>>>
>>> sparse_data = [
... LabeledPoint(0.0, SparseVector(2, {0: 0.0})),
... LabeledPoint(1.0, SparseVector(2, {1: 1.0})),
... LabeledPoint(0.0, SparseVector(2, {0: 0.0})),
... LabeledPoint(1.0, SparseVector(2, {1: 2.0}))
... ]
>>>
>>> model = DecisionTree.trainRegressor(sc.parallelize(sparse_data), {})
>>> model.predict(SparseVector(2, {1: 1.0}))
1.0
>>> model.predict(SparseVector(2, {1: 0.0}))
0.0
>>> rdd = sc.parallelize([[0.0, 1.0], [0.0, 0.0]])
>>> model.predict(rdd).collect()
[1.0, 0.0]
###训练回归模型
|
|
Linear Model predictions: [(16.0, 110.54561916607503), (40.0, 107.92836240337226), (32.0, 107.45239452594706), (13.0, 107.46860142170017), (1.0, 107.19840815670955)]
|
|
Decision Tree predictions: [(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945), (13.0, 14.284023668639053), (1.0, 14.284023668639053)]
Decision Tree depth: 5
Decision Tree number of nodes: 63
###评估回归模型
|
|
|
|
Linear Model - Mean Squared Error: 27408.1527
Linear Model - Mean Absolute Error: 127.4103
Linear Model - Root Mean Squared Log Error: 1.4847
|
|
Decision Tree - Mean Squared Error: 11560.7978
Decision Tree - Mean Absolute Error: 71.0969
Decision Tree - Root Mean Squared Log Error: 0.6259
|
|
###目标变量分布
|
|
|
|
|
|
###对目标变量作对数变换后重新训练并评估
|
|
Linear Model - Mean Squared Error: 37837.3776
Linear Model - Mean Absolute Error: 133.6572
Linear Model - Root Mean Squared Log Error: 1.3315
Non log-transformed predictions:
[(16.0, 110.54561916607503), (40.0, 107.92836240337226), (32.0, 107.45239452594706)]
Log-transformed predictions:
[(15.999999999999998, 45.336304060846494), (40.0, 42.455963122588777), (32.0, 41.297013243855893)]
|
|
|
|
Decision Tree - Mean Squared Error: 14781.5760
Decision Tree - Mean Absolute Error: 76.4131
Decision Tree - Root Mean Squared Log Error: 0.6406
Non log-transformed predictions:
[(16.0, 54.913223140495866), (40.0, 54.913223140495866), (32.0, 53.171052631578945)]
Log-transformed predictions:
[(15.999999999999998, 37.530779787154522), (40.0, 37.530779787154522), (32.0, 7.2797070993907287)]
###为数据集划分训练集与测试集
|
|
|
|
Train data size: 13934
Test data size: 3445
Total data size: 17379
Train + Test size: 17379
|
|
###对线性回归模型的参数调优
|
|
|
|
[1, 5, 10, 20, 50, 100, 200]
[2.8779465130028195, 2.0390187660391499, 1.7761565324837874, 1.5828778102209107, 1.4382263191764473, 1.4050638054019449, 1.4191482045051593]
|
|
[0.01, 0.025, 0.05, 0.1, 1.0]
[1.7761565324837874, 1.4379348243997032, 1.4189071944747718, 1.5027293911925559, nan]
|
|
[0.0, 0.01, 0.1, 1.0, 5.0, 10.0, 20.0]
[1.5027293911925559, 1.5020646031965639, 1.4961903335175231, 1.4479313176192781, 1.4113329999970989, 1.5381692234875386, 1.8279640526059082]
|
|
[0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
[1.5027293911925559, 1.5026938950690176, 1.5023761634555699, 1.499412856617814, 1.4713669769550108, 1.7596682962964314, 4.7551250073268614]
|
|
L1 (1.0) number of zero weights: 4
L1 (10.0) number of zero weights: 33
L1 (100.0) number of zero weights: 58
|
|
[False, True]
[1.4479313176192781, 1.4798261513419801]
|
|
|
|
[1, 2, 3, 4, 5, 10, 20]
[1.0280339660196287, 0.92686672078778276, 0.81807794023407532, 0.74060228537329209, 0.63583503599563096, 0.42659354467941862, 0.45291736653588244]
|
|
[2, 4, 8, 16, 32, 64, 100]
[1.3069788763726049, 0.81923394899750324, 0.75745322513058744, 0.62430742982038667, 0.63583503599563096, 0.63583503599563096, 0.63583503599563096]
###使用IPython的交互插件进行参数调优
需要IPython notebook