Selecting Data for Modeling

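Your dataset has too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

We'll start by picking a few variables using our intuition. Later courses will show you statistical techniques to prioritize variables automatically.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the columns property of the DataFrame (the bottom line of code below).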

In [1]:

import pandas as pd

melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Out[1]:

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
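The columns property returns a pandas Index rather than a plain list. The sketch below shows two common ways of working with it; toy_data is a hypothetical three-row stand-in with made-up values, not the Melbourne file.

```python
import pandas as pd

# Hypothetical stand-in for the Melbourne data (all values are made up)
toy_data = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [1035000.0, 1465000.0, 1600000.0],
    'Suburb': ['Abbotsford', 'Abbotsford', 'Richmond'],
})

# tolist() converts the Index of column labels into a plain Python list
column_names = toy_data.columns.tolist()
print(column_names)

# Membership tests work directly on the Index
has_price = 'Price' in toy_data.columns
print(has_price)
```

Checking membership this way is a quick guard against misspelled column names before selecting them.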

In [2]:

# The Melbourne data has some missing values (some houses for which some variables weren't recorded).
# We'll learn to handle missing values in a later tutorial.
# Your Iowa data doesn't have missing values in the columns you use.
# So we will take the simplest option for now, and drop houses with missing values from our data.
# Don't worry about this much for now, though the code is:

# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
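To see what dropna(axis=0) does, here is a minimal sketch on a hypothetical four-row table (toy_data and its values are invented for illustration). It counts the missing values per column and then drops every row that contains at least one of them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Melbourne data (all values are made up)
toy_data = pd.DataFrame({
    'Rooms': [2, 3, 4, 3],
    'BuildingArea': [79.0, np.nan, 142.0, 210.0],
    'Price': [1035000.0, 1465000.0, 1600000.0, np.nan],
})

# Count missing values per column before dropping anything
missing_per_column = toy_data.isnull().sum()
print(missing_per_column)

# axis=0 drops rows (houses) that contain at least one missing value
complete_rows = toy_data.dropna(axis=0)
print(len(toy_data), "rows before,", len(complete_rows), "rows after")

# The surviving rows keep their original index labels
print(list(complete_rows.index))
```

Because the surviving rows keep their original index labels, gaps in the index are expected after dropna. That is consistent with the index values 1, 2, 4, 6, 7 shown in the X.head() output later in this lesson.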

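There are many ways to select a subset of your data. The Pandas course covers these in more depth, but we will focus on two approaches for now.

Dot notation, which we use to select the "prediction target"
Selecting with a column list, which we use to select the "features"

Selecting The Prediction Target

You can pull out a variable with dot notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is: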

In [3]:

y = melbourne_data.Price
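As a small illustration, dot notation and bracket notation return the same Series. The sketch below uses a hypothetical toy_data table with made-up values rather than the Melbourne file.

```python
import pandas as pd

# Hypothetical stand-in for the Melbourne data (all values are made up)
toy_data = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [1035000.0, 1465000.0, 1600000.0],
})

# Dot notation and single-bracket notation select the same column
y_dot = toy_data.Price
y_bracket = toy_data['Price']

# Both results are one-dimensional Series objects with identical contents
print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. Bracket notation works for any column name.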

Choosing “Features”

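The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example: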

In [4]:

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By convention, this data is called X.

In [5]:

X = melbourne_data[melbourne_features]
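Selecting with a list of column names returns a DataFrame, even when the list holds a single name. The following sketch shows this on a hypothetical toy_data table (names and values are invented for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the Melbourne data (all values are made up)
toy_data = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Bathroom': [1.0, 2.0, 1.0],
    'Price': [1035000.0, 1465000.0, 1600000.0],
})

# A list of column names selects those columns, in the order given
toy_features = ['Rooms', 'Bathroom']
X_toy = toy_data[toy_features]
print(type(X_toy).__name__)
print(X_toy.shape)

# A one-item list still yields a two-dimensional DataFrame, not a Series
single_feature = toy_data[['Rooms']]
print(type(single_feature).__name__)
```

The target column Price is left out of X_toy because it was not in the list, which is what keeps the features and the prediction target separate.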

Let's quickly review the data we'll be using to predict house prices, using the describe method and the head method, which shows the top few rows.

In [6]:

X.describe()

Out[6]:

	Rooms	Bathroom	Landsize	Lattitude	Longtitude
count	6196.000000	6196.000000	6196.000000	6196.000000	6196.000000
mean	2.931407	1.576340	471.006940	-37.807904	144.990201
std	0.971079	0.711362	897.449881	0.075850	0.099165
min	1.000000	1.000000	0.000000	-38.164920	144.542370
25%	2.000000	1.000000	152.000000	-37.855438	144.926198
50%	3.000000	1.000000	373.000000	-37.802250	144.995800
75%	4.000000	2.000000	628.000000	-37.758200	145.052700
max	8.000000	8.000000	37000.000000	-37.457090	145.526350

In [7]:

X.head()

Out[7]:

	Rooms	Bathroom	Landsize	Lattitude	Longtitude
1	2	1.0	156.0	-37.8079	144.9934
2	3	2.0	134.0	-37.8093	144.9944
4	4	1.0	120.0	-37.8072	144.9941
6	3	2.0	245.0	-37.8024	144.9993
7	2	1.0	256.0	-37.8060	144.9954

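Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection.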
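One way to look for such surprises is to treat the describe output as a table and compare statistics. The sketch below is only an illustration on a hypothetical five-row toy_X with made-up values; it compares each column's maximum with its 75th percentile, where a very large ratio hints at an outlier worth a closer look.

```python
import pandas as pd

# Hypothetical stand-in for the feature table (all values are made up)
toy_X = pd.DataFrame({
    'Rooms': [2, 3, 4, 3, 2],
    'Landsize': [156.0, 134.0, 120.0, 245.0, 37000.0],
})

# describe() returns a DataFrame indexed by statistic name
summary = toy_X.describe()

# Individual statistics can be looked up by row label and column name
print(summary.loc['max', 'Landsize'])
print(summary.loc['75%', 'Landsize'])

# Ratio of the maximum to the 75th percentile, one value per column
max_to_q3 = summary.loc['max'] / summary.loc['75%']
print(max_to_q3)
```

In the describe output above, Landsize shows the same kind of pattern: a 75th percentile of 628 against a maximum of 37000.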

Building Your Model

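You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Use the fitted model to generate predictions.
Evaluate: Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.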

In [8]:

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

Out[8]:

DecisionTreeRegressor(random_state=1)

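Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.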
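A quick way to check the reproducibility claim is to fit two models with the same random_state and compare their predictions. The sketch below assumes synthetic data (X_toy and y_toy are generated, not taken from the Melbourne file) and is meant only as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (all values are made up)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'Rooms': rng.integers(1, 6, size=50),
    'Landsize': rng.uniform(100.0, 800.0, size=50),
})
y_toy = 200000.0 * X_toy['Rooms'] + 500.0 * X_toy['Landsize']

# Two models defined with the same random_state and fitted on the same data
model_a = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)
model_b = DecisionTreeRegressor(random_state=1).fit(X_toy, y_toy)

# Their predictions should match exactly
same_predictions = np.array_equal(model_a.predict(X_toy), model_b.predict(X_toy))
print(same_predictions)
```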

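We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the predict function works.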

In [9]:

print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
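The fourth step, Evaluate, is not shown above. A minimal sketch of it follows, assuming synthetic data (toy_X and toy_y are generated for illustration) and using mean absolute error from sklearn.metrics as one possible accuracy measure; the lesson itself does not prescribe a metric.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (all values are made up)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'Rooms': rng.integers(1, 6, size=100),
    'Landsize': rng.uniform(100.0, 800.0, size=100),
})
toy_y = 200000.0 * toy_X['Rooms'] + 500.0 * toy_X['Landsize']

# Define and fit
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy_X, toy_y)

# Predict on the same rows the model was trained on
toy_predictions = toy_model.predict(toy_X)

# Evaluate: average absolute gap between predicted and actual prices
in_sample_mae = mean_absolute_error(toy_y, toy_predictions)
print(in_sample_mae)
```

An unrestricted decision tree can usually reproduce its own training rows almost perfectly, so an error measured this way tends to be close to zero and says little about how the model would do on new houses. A fair evaluation would use data the model has not seen during fitting.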

Your Turn

Try it out yourself in the Model Building exercise.