Introduction

我们经常会收到一些带有列名、索引名或其他命名约定的数据,而这些约定我们并不满意。在这种情况下,您将学习如何使用 Pandas 函数将有问题的条目更改为更合适的名称。

您还将学习如何合并来自多个 DataFrame 和/或 Series 的数据。

要开始本主题的练习,请点击此处

Renaming

我们首先要介绍的函数是 rename(),它允许你更改索引名称和/或列名称。例如,要将数据集中的 points 列更改为 score,我们可以这样做:

In [1]:

import pandas as pd
pd.set_option('display.max_rows', 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

In [2]:

reviews.rename(columns={'points': 'score'})

Out[2]:

country description designation score price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth… Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
129969 France A dry style of Pinot Gris, this is crisp with … NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte… Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… Gewürztraminer Domaine Schoffit

129971 行 × 13 列

rename() 函数允许您通过分别指定 indexcolumn 关键字参数来重命名索引或列的值。它支持多种输入格式,但通常使用 Python 字典最为方便。以下是使用它重命名索引中某些元素的示例。

In [3]:

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

Out[3]:

country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
firstEntry Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
secondEntry Portugal This is ripe and fruity, a wine that is smooth… Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
129969 France A dry style of Pinot Gris, this is crisp with … NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte… Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… Gewürztraminer Domaine Schoffit

129971 行 × 13 列

您可能经常重命名列,但很少重命名索引值。因此,使用 set_index() 通常更方便。

行索引和列索引都可以拥有各自的 name 属性。可以使用配套的 rename_axis() 方法来更改这些名称。例如:

In [4]:

reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

Out[4]:

fields country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston… Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth… Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
129969 France A dry style of Pinot Gris, this is crisp with … NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte… Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… Gewürztraminer Domaine Schoffit

129971 行 × 13 列

Combining

在对数据集执行操作时,我们有时需要以非同寻常的方式组合不同的 DataFrame 和/或 Series。Pandas 提供了三种核心方法来实现这一点。按复杂度依次递增,这三种方法分别是 concat()join()merge()merge() 的大部分功能也可以用 join() 更简单地实现,因此我们将省略它,并重点介绍前两个函数。

最简单的组合方法是 concat()。给定一个元素列表,此函数会沿某个轴将这些元素合并在一起。

当数据位于不同的 DataFrame 或 Series 对象中,但具有相同的字段(列)时,这种方法非常有用。例如:YouTube 视频数据集,它根据原产国(例如,本例中为加拿大和英国)对数据进行拆分。如果我们想同时研究多个国家/地区,可以使用 concat() 将它们合并在一起:

In [5]:

canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../input/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

Out[5]:

video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|”Walk”|”On”|”Water”|”Aftermath/Shady/In… 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem’s new track Walk on Water ft. Beyoncé i…
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|”bad unboxing”|”unboxing”|”fan mail”|”id… 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las…
38914 -DRsfNObKIQ 18.14.06 Eleni Foureira - Fuego - Cyprus - LIVE - First… Eurovision Song Contest 24 2018-05-08T20:32:32.000Z Eurovision Song Contest|”2018”|”Lisbon”|”Cypru… 14317515 151870 45875 26766 https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg False False False Eleni Foureira represented Cyprus at the first…
38915 4YFo4bdMO8Q 18.14.06 KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A… SuperDuperKyle 10 2018-05-11T04:06:35.000Z Kyle|”SuperDuperKyle”|”Ikuyo”|”2 Chainz”|”Soph… 607552 18271 274 1423 https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg False False False Debut album ‘Light of Mine’ out now: http://ky

79797 行 × 16 列

就复杂度而言,中间的组合器是 join()join() 可以组合具有共同索引的不同 DataFrame 对象。例如,要提取恰好在同一天在加拿大和英国都流行的视频,我们可以执行以下操作:

In [6]:

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

Out[6]:

video_id_CAN channel_title_CAN category_id_CAN publish_time_CAN tags_CAN views_CAN likes_CAN dislikes_CAN comment_count_CAN thumbnail_link_CAN tags_UK views_UK likes_UK dislikes_UK comment_count_UK thumbnail_link_UK comments_disabled_UK ratings_disabled_UK video_error_or_removed_UK description_UK
title trending_date
!! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 18.04.01 PNn8sECd7io Markiplier 20 2018-01-03T19:33:53.000Z getting over it|”markiplier”|”funny moments”|”… 835930 47058 1023 8250 https://i.ytimg.com/vi/PNn8sECd7io/default.jpg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
#1 Fortnite World Rank - 2,323 Solo Wins! 18.09.03 DvPW66IFhMI AlexRamiGaming 20 2018-03-09T07:15:52.000Z PS4 Battle Royale|”PS4 Pro Battle Royale”|”Bat… 212838 5199 542 11 https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 18.07.05 Wt9Gkpmbt44 TheBigJackpot 24 2018-05-07T06:58:59.000Z Slot Machine|”win”|”Gambling”|”Big Win”|”raja”… 28973 2167 175 10 https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE 18.04.04 Az72jrKbANA Right Side Broadcasting Network 25 2018-04-03T23:12:37.000Z YouTube shooter|”YouTube active shooter”|”acti… 103513 1722 181 76 https://i.ytimg.com/vi/Az72jrKbANA/default.jpg NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

40900 行 × 28 列

此处 lsuffixrsuffix 参数是必需的,因为该数据在英国和加拿大数据集中的列名相同。如果不是这样(比如说,我们事先重命名了它们),我们就不需要它们了。

Your turn

如果你还没有开始练习,你可以**从这里开始**。