kagglePandas重命名和合并
Introduction
我们经常会收到一些带有列名、索引名或其他命名约定的数据,而这些约定我们并不满意。在这种情况下,您将学习如何使用 Pandas 函数将有问题的条目更改为更合适的名称。
您还将学习如何合并来自多个 DataFrame 和/或 Series 的数据。
要开始本主题的练习,请点击此处。
Renaming
我们首先要介绍的函数是 rename(),它允许你更改索引名称和/或列名称。例如,要将数据集中的 points 列更改为 score,我们可以这样做:
In [1]:
import pandas as pd |
In [2]:
reviews.rename(columns={'points': 'score'}) |
Out[2]:
| country | description | designation | score | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston… | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth… | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 129969 | France | A dry style of Pinot Gris, this is crisp with … | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
| 129970 | France | Big, rich and off-dry, this is powered by inte… | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… | Gewürztraminer | Domaine Schoffit |
129971 行 × 13 列
rename() 函数允许您通过分别指定 index 或 column 关键字参数来重命名索引或列的值。它支持多种输入格式,但通常使用 Python 字典最为方便。以下是使用它重命名索引中某些元素的示例。
In [3]:
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'}) |
Out[3]:
| country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| firstEntry | Italy | Aromas include tropical fruit, broom, brimston… | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| secondEntry | Portugal | This is ripe and fruity, a wine that is smooth… | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 129969 | France | A dry style of Pinot Gris, this is crisp with … | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
| 129970 | France | Big, rich and off-dry, this is powered by inte… | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… | Gewürztraminer | Domaine Schoffit |
129971 行 × 13 列
您可能经常重命名列,但很少重命名索引值。因此,使用 set_index() 通常更方便。
行索引和列索引都可以拥有各自的 name 属性。可以使用配套的 rename_axis() 方法来更改这些名称。例如:
In [4]:
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns') |
Out[4]:
| fields | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wines | |||||||||||||
| 0 | Italy | Aromas include tropical fruit, broom, brimston… | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth… | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 129969 | France | A dry style of Pinot Gris, this is crisp with … | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
| 129970 | France | Big, rich and off-dry, this is powered by inte… | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car… | Gewürztraminer | Domaine Schoffit |
129971 行 × 13 列
Combining
在对数据集执行操作时,我们有时需要以非同寻常的方式组合不同的 DataFrame 和/或 Series。Pandas 提供了三种核心方法来实现这一点。按复杂度依次递增,这三种方法分别是 concat()、join() 和 merge()。merge() 的大部分功能也可以用 join() 更简单地实现,因此我们将省略它,并重点介绍前两个函数。
最简单的组合方法是 concat()。给定一个元素列表,此函数会沿某个轴将这些元素合并在一起。
当数据位于不同的 DataFrame 或 Series 对象中,但具有相同的字段(列)时,这种方法非常有用。例如:YouTube 视频数据集,它根据原产国(例如,本例中为加拿大和英国)对数据进行拆分。如果我们想同时研究多个国家/地区,可以使用 concat() 将它们合并在一起:
In [5]:
canadian_youtube = pd.read_csv("../input/youtube-new/CAvideos.csv") |
Out[5]:
| video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n1WpP7iowLc | 17.14.11 | Eminem - Walk On Water (Audio) ft. Beyoncé | EminemVEVO | 10 | 2017-11-10T17:00:03.000Z | Eminem|”Walk”|”On”|”Water”|”Aftermath/Shady/In… | 17158579 | 787425 | 43420 | 125882 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | False | False | False | Eminem’s new track Walk on Water ft. Beyoncé i… |
| 1 | 0dBIkQ4Mz1M | 17.14.11 | PLUSH - Bad Unboxing Fan Mail | iDubbbzTV | 23 | 2017-11-13T17:00:00.000Z | plush|”bad unboxing”|”unboxing”|”fan mail”|”id… | 1014651 | 127794 | 1688 | 13030 | https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg | False | False | False | STill got a lot of packages. Probably will las… |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 38914 | -DRsfNObKIQ | 18.14.06 | Eleni Foureira - Fuego - Cyprus - LIVE - First… | Eurovision Song Contest | 24 | 2018-05-08T20:32:32.000Z | Eurovision Song Contest|”2018”|”Lisbon”|”Cypru… | 14317515 | 151870 | 45875 | 26766 | https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg | False | False | False | Eleni Foureira represented Cyprus at the first… |
| 38915 | 4YFo4bdMO8Q | 18.14.06 | KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A… | SuperDuperKyle | 10 | 2018-05-11T04:06:35.000Z | Kyle|”SuperDuperKyle”|”Ikuyo”|”2 Chainz”|”Soph… | 607552 | 18271 | 274 | 1423 | https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg | False | False | False | Debut album ‘Light of Mine’ out now: http://ky… |
79797 行 × 16 列
就复杂度而言,中间的组合器是 join()。join() 可以组合具有共同索引的不同 DataFrame 对象。例如,要提取恰好在同一天在加拿大和英国都流行的视频,我们可以执行以下操作:
In [6]:
left = canadian_youtube.set_index(['title', 'trending_date']) |
Out[6]:
| video_id_CAN | channel_title_CAN | category_id_CAN | publish_time_CAN | tags_CAN | views_CAN | likes_CAN | dislikes_CAN | comment_count_CAN | thumbnail_link_CAN | … | tags_UK | views_UK | likes_UK | dislikes_UK | comment_count_UK | thumbnail_link_UK | comments_disabled_UK | ratings_disabled_UK | video_error_or_removed_UK | description_UK | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| title | trending_date | |||||||||||||||||||||
| !! THIS VIDEO IS NOTHING BUT PAIN !! | Getting Over It - Part 7 | 18.04.01 | PNn8sECd7io | Markiplier | 20 | 2018-01-03T19:33:53.000Z | getting over it|”markiplier”|”funny moments”|”… | 835930 | 47058 | 1023 | 8250 | https://i.ytimg.com/vi/PNn8sECd7io/default.jpg | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| #1 Fortnite World Rank - 2,323 Solo Wins! | 18.09.03 | DvPW66IFhMI | AlexRamiGaming | 20 | 2018-03-09T07:15:52.000Z | PS4 Battle Royale|”PS4 Pro Battle Royale”|”Bat… | 212838 | 5199 | 542 | 11 | https://i.ytimg.com/vi/DvPW66IFhMI/default.jpg | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 🚨 BREAKING NEWS 🔴 Raja Live all Slot Channels Welcome 🎰 | 18.07.05 | Wt9Gkpmbt44 | TheBigJackpot | 24 | 2018-05-07T06:58:59.000Z | Slot Machine|”win”|”Gambling”|”Big Win”|”raja”… | 28973 | 2167 | 175 | 10 | https://i.ytimg.com/vi/Wt9Gkpmbt44/default.jpg | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 🚨Active Shooter at YouTube Headquarters - LIVE BREAKING NEWS COVERAGE | 18.04.04 | Az72jrKbANA | Right Side Broadcasting Network | 25 | 2018-04-03T23:12:37.000Z | YouTube shooter|”YouTube active shooter”|”acti… | 103513 | 1722 | 181 | 76 | https://i.ytimg.com/vi/Az72jrKbANA/default.jpg | … | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
40900 行 × 28 列
此处 lsuffix 和 rsuffix 参数是必需的,因为该数据在英国和加拿大数据集中的列名相同。如果不是这样(比如说,我们事先重命名了它们),我们就不需要它们了。
Your turn
如果你还没有开始练习,你可以**从这里开始**。
