Introduction

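Selecting the specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation, so one of the first things to learn when working with data in Python is how to select the data points relevant to you quickly and effectively.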

import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

Native accessors

Native Python objects provide good ways of indexing data. Pandas carries all of these over, which makes it easy to get started.

Consider this DataFrame:

In [2]:

reviews

Out[2]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston…	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth…	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
…	…	…	…	…	…	…	…	…	…	…	…	…	…
129969	France	A dry style of Pinot Gris, this is crisp with …	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte…	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car…	Gewürztraminer	Domaine Schoffit

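129971 rows × 13 columns

In Python, we can access a property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.

Hence, to access the country property of reviews we can use: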

In [3]:

reviews.country

Out[3]:

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

If we have a Python dictionary, we can access its values using the indexing operator ([]). We can do the same with the columns of a DataFrame:

In [4]:

reviews['country']

Out[4]:

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

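These are the two ways of selecting a specific Series out of a DataFrame. Neither is more or less syntactically valid than the other, but the indexing operator [] has the advantage that it can handle column names with reserved characters in them (for example, if we had a country providence column, reviews.country providence wouldn't work).

Doesn't a pandas Series look a bit like a fancy dictionary? It pretty much is, so it's no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more: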
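As a minimal sketch of the two accessors, the following uses a small hypothetical DataFrame (not the wine dataset) whose second column name contains a space, mirroring the country providence example:

```python
import pandas as pd

# Hypothetical toy data; "country providence" mirrors the example in the text.
df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "country providence": ["Sicily", "Douro"],
})

# Both accessors return the same Series when the name is a valid identifier.
by_attribute = df.country
by_brackets = df["country"]
same = by_attribute.equals(by_brackets)

# `df.country providence` would be a SyntaxError, so brackets are required here.
spaced = df["country providence"]

print(same)
print(list(spaced))
```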

In [5]:

reviews['country'][0]

Out[5]:

'Italy'

Indexing in pandas

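The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. For a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you should use.

Index-based selection

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we can use the following: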

In [6]:

reviews.iloc[0]

Out[6]:

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

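Both loc and iloc are row-first, column-second. This is the opposite of the native accessors shown above, where we select the column first and the row second (as in reviews['country'][0]).

This means that it's marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following: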

In [7]:

reviews.iloc[:, 0]

Out[7]:

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object
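To make the row-first, column-second order concrete, here is a small sketch on a hypothetical three-row DataFrame. The data is made up for illustration; only the iloc calls matter:

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
df = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 90]}
)

first_row = df.iloc[0]      # row at position 0, returned as a Series
first_col = df.iloc[:, 0]   # every row, column at position 0
single = df.iloc[0, 1]      # row 0, column 1: a single value

print(first_row["country"], list(first_col), single)
```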

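On its own, the : operator, which also comes from native Python, means "everything". When combined with other selectors, however, it can be used to indicate a range of values. For example, to select the country column from just the first, second, and third rows, we would do: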

In [8]:

reviews.iloc[:3, 0]

Out[8]:

0       Italy
1    Portugal
2          US
Name: country, dtype: object

Or, to select just the second and third entries, we would do:

In [9]:

reviews.iloc[1:3, 0]

Out[9]:

1    Portugal
2          US
Name: country, dtype: object

It's also possible to pass a list:

In [10]:

reviews.iloc[[0, 1, 2], 0]

Out[10]:

0       Italy
1    Portugal
2          US
Name: country, dtype: object

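Finally, it's worth knowing that negative numbers can be used in selection. They count backward from the end of the values. For example, here are the last five rows of the dataset: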

In [11]:

reviews.iloc[-5:]

Out[11]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
129966	Germany	Notes of honeysuckle and cantaloupe sweeten th…	Brauneberger Juffer-Sonnenuhr Spätlese	90	28.0	Mosel	NaN	NaN	Anna Lee C. Iijima	NaN	Dr. H. Thanisch (Erben Müller-Burggraef) 2013 …	Riesling	Dr. H. Thanisch (Erben Müller-Burggraef)
129967	US	Citation is given as much as a decade of bottl…	NaN	90	75.0	Oregon	Oregon	Oregon Other	Paul Gregutt	@paulgwine	Citation 2004 Pinot Noir (Oregon)	Pinot Noir	Citation
129968	France	Well-drained gravel soil gives this wine its c…	Kritt	90	30.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Gresser 2013 Kritt Gewurztraminer (Als…	Gewürztraminer	Domaine Gresser
129969	France	A dry style of Pinot Gris, this is crisp with …	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte…	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car…	Gewürztraminer	Domaine Schoffit
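The iloc forms covered in this section (slice, list, and negative positions) can be tried side by side on a small hypothetical DataFrame. The values below are placeholders, not rows from the dataset:

```python
import pandas as pd

# Hypothetical six-row frame with the default 0..5 index.
df = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US", "US", "France", "France"]}
)

first_three = df.iloc[:3, 0]        # positions 0, 1, 2
second_third = df.iloc[1:3, 0]      # positions 1, 2 (end is excluded)
picked = df.iloc[[0, 1, 2], 0]      # explicit list of positions
last_two = df.iloc[-2:]             # negative positions count from the end

print(list(second_third))
print(list(last_two.index))
```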

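Label-based selection

The second paradigm for selection is the one followed by the loc operator: label-based selection. In this paradigm, it is the data's index value, not its position, that matters.

For example, to get the country of the first entry in reviews, we would now do the following: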

In [12]:

reviews.loc[0, 'country']

Out[12]:

'Italy'

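iloc is conceptually simpler than loc because it ignores the dataset's index. When we use iloc, we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the index to do its work. Since datasets usually have meaningful indices, it is usually easier to do things with loc. For example, here is one operation that is much easier with loc: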

In [13]:

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Out[13]:

	taster_name	taster_twitter_handle	points
0	Kerin O’Keefe	@kerinokeefe	87
1	Roger Voss	@vossroger	87
…	…	…	…
129969	Roger Voss	@vossroger	90
129970	Roger Voss	@vossroger	90

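129971 rows × 3 columns

Choosing between `loc` and `iloc`

When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind: the two methods use slightly different indexing schemes.

iloc uses the Python standard library's indexing scheme, where the first element of the range is included and the last one is excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.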
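The difference between the two paradigms is easier to see when the index is not a plain run of integers. The sketch below assumes a hypothetical DataFrame with string labels (wine_a, wine_b), which are not part of the original dataset:

```python
import pandas as pd

# Hypothetical frame with string labels as the index.
df = pd.DataFrame(
    {"country": ["Italy", "Portugal"], "points": [87, 88]},
    index=["wine_a", "wine_b"],
)

by_label = df.loc["wine_b", "points"]   # uses the index label and column name
by_position = df.iloc[1, 1]             # uses row and column positions only

# Selecting columns by name (in any order) is straightforward with loc.
subset = df.loc[:, ["points", "country"]]

print(by_label, by_position, list(subset.columns))
```

With iloc, the same column subset would require knowing that points and country sit at positions 1 and 0.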

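Why the change? Remember that loc can index any standard library type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetically ordered entries between Apples and Potatoes", then it is a lot more convenient to write df.loc['Apples':'Potatoes'] than something like df.loc['Apples':'Potatoet'] (t comes after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] will return 1001 of them! To get 1000 elements using loc, you need to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those of iloc.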
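The off-by-one difference can be checked directly. This sketch builds a hypothetical DataFrame with the default index 0,...,1000 and a second one with made-up produce labels to show inclusive slicing on strings:

```python
import pandas as pd

# Hypothetical frame with 1001 rows, labelled 0..1000.
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])      # end position excluded
n_loc = len(df.loc[0:1000])        # end label included
n_loc_fixed = len(df.loc[0:999])   # one lower to match iloc

# Hypothetical sorted string index; the slice includes both endpoints.
produce = pd.DataFrame(
    {"stock": [5, 3, 8, 2]},
    index=["Apples", "Bananas", "Potatoes", "Tomatoes"],
)
sliced = produce.loc["Apples":"Potatoes"]

print(n_iloc, n_loc, n_loc_fixed)
print(list(sliced.index))
```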

Manipulating the index

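Label-based selection derives its power from the labels in the index. Critically, the index we use is not immutable. We can manipulate the index in any way we see fit.

The set_index() method can be used to do the job. Here is what happens when we call set_index with the title field: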

In [14]:

reviews.set_index("title")

Out[14]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	variety	winery
title
Nicosia 2013 Vulkà Bianco (Etna)	Italy	Aromas include tropical fruit, broom, brimston…	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	White Blend	Nicosia
Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portugal	This is ripe and fruity, a wine that is smooth…	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Portuguese Red	Quinta dos Avidagos
…	…	…	…	…	…	…	…	…	…	…	…	…
Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	France	A dry style of Pinot Gris, this is crisp with …	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Pinot Gris	Domaine Marcel Deiss
Domaine Schoffit 2012 Lieu-dit Harth Cuvée Caroline Gewurztraminer (Alsace)	France	Big, rich and off-dry, this is powered by inte…	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Gewürztraminer	Domaine Schoffit

129971 rows × 12 columns

This is useful if you can come up with an index for the dataset that is better than the current one.
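One detail worth noting is that set_index() returns a new DataFrame by default and leaves the original untouched, which is why the call above only displays a result. A minimal sketch with hypothetical titles:

```python
import pandas as pd

# Hypothetical two-row frame; the titles are placeholders.
df = pd.DataFrame(
    {"title": ["Wine A 2013", "Wine B 2011"], "points": [87, 90]}
)

indexed = df.set_index("title")   # new frame; df itself is unchanged

# The title column has moved into the index, so loc can look rows up by title.
points_b = indexed.loc["Wine B 2011", "points"]

print(list(df.columns), list(indexed.columns), points_b)
```

To keep the re-indexed version, assign the result to a name, as done with indexed here.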

Conditional selection

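So far we have been indexing various slices of the data using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we are interested specifically in better-than-average wines produced in Italy.

We can start by checking whether each wine is Italian or not: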

In [15]:

reviews.country == 'Italy'

Out[15]:

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

This operation produced a Series of True/False booleans based on the country of each record. This result can then be used inside loc to select the relevant data:

In [16]:

reviews.loc[reviews.country == 'Italy']

Out[16]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston…	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here’s a bright, informal red that opens with …	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
…	…	…	…	…	…	…	…	…	…	…	…	…	…
129961	Italy	Intense aromas of wild cherry, baking spice, t…	NaN	90	30.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	COS 2013 Frappato (Sicilia)	Frappato	COS
129962	Italy	Blackberry, cassis, grilled herb and toasted a…	Sàgana Tenuta San Giacomo	90	40.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	Cusumano 2012 Sàgana Tenuta San Giacomo Nero d…	Nero d’Avola	Cusumano

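19540 rows × 13 columns

This DataFrame has about 20,000 rows. The original had about 130,000. That means that around 15% of the wines originate from Italy.

We also wanted to know which ones are better than average. Wines are reviewed on an 80-to-100 point scale, so this could mean wines that earned at least 90 points.

We can use the ampersand (&) to bring the two questions together: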
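The 15% figure follows from the two row counts reported in the outputs above. As a quick arithmetic check:

```python
# Row counts taken from the outputs shown above.
italian_rows = 19540
total_rows = 129971

share = italian_rows / total_rows
print(f"{share:.1%}")
```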

In [17]:

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

Out[17]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
120	Italy	Slightly backward, particularly given the vint…	Bricco Rocche Prapó	92	70.0	Piedmont	Barolo	NaN	NaN	NaN	Ceretto 2003 Bricco Rocche Prapó (Barolo)	Nebbiolo	Ceretto
130	Italy	At the first it was quite muted and subdued, b…	Bricco Rocche Brunate	91	70.0	Piedmont	Barolo	NaN	NaN	NaN	Ceretto 2003 Bricco Rocche Brunate (Barolo)	Nebbiolo	Ceretto
…	…	…	…	…	…	…	…	…	…	…	…	…	…
129961	Italy	Intense aromas of wild cherry, baking spice, t…	NaN	90	30.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	COS 2013 Frappato (Sicilia)	Frappato	COS
129962	Italy	Blackberry, cassis, grilled herb and toasted a…	Sàgana Tenuta San Giacomo	90	40.0	Sicily & Sardinia	Sicilia	NaN	Kerin O’Keefe	@kerinokeefe	Cusumano 2012 Sàgana Tenuta San Giacomo Nero d…	Nero d’Avola	Cusumano

6648 行 × 13 列
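The parentheses around each condition matter: in Python, & binds more tightly than == and >=, so without them the expression would be parsed differently and would not build the intended mask. A minimal sketch on a hypothetical four-row DataFrame, with the masks named for readability:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 87, 95, 85],
})

is_italian = df.country == "Italy"   # boolean Series, one entry per row
is_good = df.points >= 90

# Element-wise AND: keep rows where both masks are True.
both = df.loc[is_italian & is_good]

print(list(both.index))
```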

Suppose we would buy any wine that is made in Italy or that is rated above average. For this we use a pipe (|):

In [18]:

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]

Out[18]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston…	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here’s a bright, informal red that opens with …	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
…	…	…	…	…	…	…	…	…	…	…	…	…	…
129969	France	A dry style of Pinot Gris, this is crisp with …	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte…	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car…	Gewürztraminer	Domaine Schoffit

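61937 rows × 13 columns

Pandas comes with a few built-in conditional selectors, two of which we will highlight here.

The first is isin. isin lets you select data whose value "is in" a list of values. For example, here is how we can use it to select wines only from Italy or France: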
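The row counts for "and" and "or" are related: the number of rows matching either condition equals the sum of the two individual counts minus the number matching both. The sketch below checks this on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 87, 95, 85],
})

is_italian = df.country == "Italy"
is_good = df.points >= 90

either = df.loc[is_italian | is_good]   # element-wise OR

# Summing a boolean Series counts its True values.
n_either = len(either)
n_expected = is_italian.sum() + is_good.sum() - (is_italian & is_good).sum()

print(n_either, n_expected)
```

Applied to the counts reported above (19540 Italian wines, 6648 matching both conditions, 61937 matching either), the same relation would imply 49045 wines with at least 90 points, assuming those counts all come from the same data.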

In [19]:

reviews.loc[reviews.country.isin(['Italy', 'France'])]

Out[19]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston…	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
6	Italy	Here’s a bright, informal red that opens with …	Belsito	87	16.0	Sicily & Sardinia	Vittoria	NaN	Kerin O’Keefe	@kerinokeefe	Terre di Giurfo 2013 Belsito Frappato (Vittoria)	Frappato	Terre di Giurfo
…	…	…	…	…	…	…	…	…	…	…	…	…	…
129969	France	A dry style of Pinot Gris, this is crisp with …	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte…	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car…	Gewürztraminer	Domaine Schoffit

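41633 rows × 13 columns

The second is isnull (and its companion notnull). These methods let you highlight values which are (or are not) empty (NaN). For example, to filter out wines lacking a price tag in the dataset, here is what we would do: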
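isin can be read as shorthand for a chain of equality tests joined with |. The following sketch, on hypothetical data, shows that the two forms select the same rows:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"country": ["Italy", "France", "US", "Italy"]})

with_isin = df.loc[df.country.isin(["Italy", "France"])]
with_or = df.loc[(df.country == "Italy") | (df.country == "France")]

print(with_isin.equals(with_or))
print(list(with_isin.index))
```

The isin form tends to stay readable as the list of accepted values grows.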

In [20]:

reviews.loc[reviews.price.notnull()]

Out[20]:

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
1	Portugal	This is ripe and fruity, a wine that is smooth…	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and…	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
…	…	…	…	…	…	…	…	…	…	…	…	…	…
129969	France	A dry style of Pinot Gris, this is crisp with …	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte…	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car…	Gewürztraminer	Domaine Schoffit

120975 rows × 13 columns

Assigning data

Going the other way, assigning data to a DataFrame is easy. You can assign a constant value:
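A dedicated method is needed because NaN never compares equal to anything, including itself, so an equality test cannot find missing values. A small sketch with hypothetical prices:

```python
import pandas as pd

# Hypothetical toy data with one missing price.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "price": [float("nan"), 15.0, 14.0],
})

priced = df.loc[df.price.notnull()]    # rows that have a price
unpriced = df.loc[df.price.isnull()]   # rows that do not

# Comparing against NaN with == matches nothing.
found_by_equality = (df.price == float("nan")).any()

print(list(priced.title), list(unpriced.title), found_by_equality)
```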

In [21]:

reviews['critic'] = 'everyone'
reviews['critic']

Out[21]:

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

Or an iterable of values:

In [22]:

reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']

Out[22]:

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
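Both kinds of assignment can be tried on a small hypothetical DataFrame. The sketch also shows that an iterable has to match the number of rows; a list of the wrong length raises a ValueError instead of being padded:

```python
import pandas as pd

# Hypothetical three-row frame.
df = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

df["critic"] = "everyone"                       # constant is broadcast to every row
df["index_backwards"] = range(len(df), 0, -1)   # one value per row: 3, 2, 1

# A list shorter than the frame is rejected.
try:
    df["too_short"] = [1, 2]
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(list(df["critic"]), list(df["index_backwards"]))
```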

Your turn

If you haven't started the exercise yet, you can **get started here**.