{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Fusionner, Joindre et Concaténer\n",
"\n",
"Il y a 3 façons principales de combiner des DataFrames ensemble : Fusion, assemblage et concaténation. Dans ce notebook, nous discuterons de ces 3 méthodes avec des exemples.\n",
"\n",
"____"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exemple DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n",
" 'B': ['B0', 'B1', 'B2', 'B3'],\n",
" 'C': ['C0', 'C1', 'C2', 'C3'],\n",
" 'D': ['D0', 'D1', 'D2', 'D3']},\n",
" index=[0, 1, 2, 3])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n",
" 'B': ['B4', 'B5', 'B6', 'B7'],\n",
" 'C': ['C4', 'C5', 'C6', 'C7'],\n",
" 'D': ['D4', 'D5', 'D6', 'D7']},\n",
" index=[4, 5, 6, 7]) "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n",
" 'B': ['B8', 'B9', 'B10', 'B11'],\n",
" 'C': ['C8', 'C9', 'C10', 'C11'],\n",
" 'D': ['D8', 'D9', 'D10', 'D11']},\n",
" index=[8, 9, 10, 11])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D\n",
"0 A0 B0 C0 D0\n",
"1 A1 B1 C1 D1\n",
"2 A2 B2 C2 D2\n",
"3 A3 B3 C3 D3"
],
"text/html": "
\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n A1 | \n B1 | \n C1 | \n D1 | \n
\n \n 2 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n 3 | \n A3 | \n B3 | \n C3 | \n D3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 5
}
],
"source": [
"df1"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D\n",
"4 A4 B4 C4 D4\n",
"5 A5 B5 C5 D5\n",
"6 A6 B6 C6 D6\n",
"7 A7 B7 C7 D7"
],
"text/html": "\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n
\n \n \n \n 4 | \n A4 | \n B4 | \n C4 | \n D4 | \n
\n \n 5 | \n A5 | \n B5 | \n C5 | \n D5 | \n
\n \n 6 | \n A6 | \n B6 | \n C6 | \n D6 | \n
\n \n 7 | \n A7 | \n B7 | \n C7 | \n D7 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"df2"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D\n",
"8 A8 B8 C8 D8\n",
"9 A9 B9 C9 D9\n",
"10 A10 B10 C10 D10\n",
"11 A11 B11 C11 D11"
],
"text/html": "\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n
\n \n \n \n 8 | \n A8 | \n B8 | \n C8 | \n D8 | \n
\n \n 9 | \n A9 | \n B9 | \n C9 | \n D9 | \n
\n \n 10 | \n A10 | \n B10 | \n C10 | \n D10 | \n
\n \n 11 | \n A11 | \n B11 | \n C11 | \n D11 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 7
}
],
"source": [
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Concaténation\n",
"\n",
"La concaténation colle fondamentalement les DataFrames ensemble. Gardez à l'esprit que les dimensions doivent correspondre à l'axe sur lequel vous concaténez. Vous pouvez utiliser **pd.concat** et passer dans une liste de DataFrames à concaténer ensemble :"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D\n",
"0 A0 B0 C0 D0\n",
"1 A1 B1 C1 D1\n",
"2 A2 B2 C2 D2\n",
"3 A3 B3 C3 D3\n",
"4 A4 B4 C4 D4\n",
"5 A5 B5 C5 D5\n",
"6 A6 B6 C6 D6\n",
"7 A7 B7 C7 D7\n",
"8 A8 B8 C8 D8\n",
"9 A9 B9 C9 D9\n",
"10 A10 B10 C10 D10\n",
"11 A11 B11 C11 D11"
],
"text/html": "\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n A1 | \n B1 | \n C1 | \n D1 | \n
\n \n 2 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n 3 | \n A3 | \n B3 | \n C3 | \n D3 | \n
\n \n 4 | \n A4 | \n B4 | \n C4 | \n D4 | \n
\n \n 5 | \n A5 | \n B5 | \n C5 | \n D5 | \n
\n \n 6 | \n A6 | \n B6 | \n C6 | \n D6 | \n
\n \n 7 | \n A7 | \n B7 | \n C7 | \n D7 | \n
\n \n 8 | \n A8 | \n B8 | \n C8 | \n D8 | \n
\n \n 9 | \n A9 | \n B9 | \n C9 | \n D9 | \n
\n \n 10 | \n A10 | \n B10 | \n C10 | \n D10 | \n
\n \n 11 | \n A11 | \n B11 | \n C11 | \n D11 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 8
}
],
"source": [
"pd.concat([df1,df2,df3])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D A B C D A B C D\n",
"0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN\n",
"1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN\n",
"2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN\n",
"3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN\n",
"4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN\n",
"5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN\n",
"6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN\n",
"7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN\n",
"8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8\n",
"9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9\n",
"10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10\n",
"11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11"
],
"text/html": "\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n A | \n B | \n C | \n D | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n A0 | \n B0 | \n C0 | \n D0 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 1 | \n A1 | \n B1 | \n C1 | \n D1 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 2 | \n A2 | \n B2 | \n C2 | \n D2 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 3 | \n A3 | \n B3 | \n C3 | \n D3 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 4 | \n NaN | \n NaN | \n NaN | \n NaN | \n A4 | \n B4 | \n C4 | \n D4 | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 5 | \n NaN | \n NaN | \n NaN | \n NaN | \n A5 | \n B5 | \n C5 | \n D5 | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 6 | \n NaN | \n NaN | \n NaN | \n NaN | \n A6 | \n B6 | \n C6 | \n D6 | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 7 | \n NaN | \n NaN | \n NaN | \n NaN | \n A7 | \n B7 | \n C7 | \n D7 | \n NaN | \n NaN | \n NaN | \n NaN | \n
\n \n 8 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n A8 | \n B8 | \n C8 | \n D8 | \n
\n \n 9 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n A9 | \n B9 | \n C9 | \n D9 | \n
\n \n 10 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n A10 | \n B10 | \n C10 | \n D10 | \n
\n \n 11 | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n NaN | \n A11 | \n B11 | \n C11 | \n D11 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 9
}
],
"source": [
"pd.concat([df1,df2,df3],axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_____\n",
"## Exemple DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
" 'A': ['A0', 'A1', 'A2', 'A3'],\n",
" 'B': ['B0', 'B1', 'B2', 'B3']})\n",
" \n",
"right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
" 'C': ['C0', 'C1', 'C2', 'C3'],\n",
" 'D': ['D0', 'D1', 'D2', 'D3']}) "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key A B\n",
"0 K0 A0 B0\n",
"1 K1 A1 B1\n",
"2 K2 A2 B2\n",
"3 K3 A3 B3"
],
"text/html": "\n\n
\n \n \n | \n key | \n A | \n B | \n
\n \n \n \n 0 | \n K0 | \n A0 | \n B0 | \n
\n \n 1 | \n K1 | \n A1 | \n B1 | \n
\n \n 2 | \n K2 | \n A2 | \n B2 | \n
\n \n 3 | \n K3 | \n A3 | \n B3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 11
}
],
"source": [
"left"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key C D\n",
"0 K0 C0 D0\n",
"1 K1 C1 D1\n",
"2 K2 C2 D2\n",
"3 K3 C3 D3"
],
"text/html": "\n\n
\n \n \n | \n key | \n C | \n D | \n
\n \n \n \n 0 | \n K0 | \n C0 | \n D0 | \n
\n \n 1 | \n K1 | \n C1 | \n D1 | \n
\n \n 2 | \n K2 | \n C2 | \n D2 | \n
\n \n 3 | \n K3 | \n C3 | \n D3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 12
}
],
"source": [
"right"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fusion\n",
"\n",
"La fonction **merge** vous permet de fusionner des DataFrames ensemble en utilisant une logique similaire à celle de la fusion de Tables SQL. Par exemple :"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key A B C D\n",
"0 K0 A0 B0 C0 D0\n",
"1 K1 A1 B1 C1 D1\n",
"2 K2 A2 B2 C2 D2\n",
"3 K3 A3 B3 C3 D3"
],
"text/html": "\n\n
\n \n \n | \n key | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n K1 | \n A1 | \n B1 | \n C1 | \n D1 | \n
\n \n 2 | \n K2 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n 3 | \n K3 | \n A3 | \n B3 | \n C3 | \n D3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 13
}
],
"source": [
"pd.merge(left,right,how='inner',on='key')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exemple plus compliqué :"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n",
" 'key2': ['K0', 'K1', 'K0', 'K1'],\n",
" 'A': ['A0', 'A1', 'A2', 'A3'],\n",
" 'B': ['B0', 'B1', 'B2', 'B3']})\n",
" \n",
"right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n",
" 'key2': ['K0', 'K0', 'K0', 'K0'],\n",
" 'C': ['C0', 'C1', 'C2', 'C3'],\n",
" 'D': ['D0', 'D1', 'D2', 'D3']})"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K1 K0 A2 B2 C1 D1\n",
"2 K1 K0 A2 B2 C2 D2"
],
"text/html": "\n\n
\n \n \n | \n key1 | \n key2 | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n K0 | \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n K1 | \n K0 | \n A2 | \n B2 | \n C1 | \n D1 | \n
\n \n 2 | \n K1 | \n K0 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 15
}
],
"source": [
"pd.merge(left, right, on=['key1', 'key2'])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K0 K1 A1 B1 NaN NaN\n",
"2 K1 K0 A2 B2 C1 D1\n",
"3 K1 K0 A2 B2 C2 D2\n",
"4 K2 K1 A3 B3 NaN NaN\n",
"5 K2 K0 NaN NaN C3 D3"
],
"text/html": "\n\n
\n \n \n | \n key1 | \n key2 | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n K0 | \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n K0 | \n K1 | \n A1 | \n B1 | \n NaN | \n NaN | \n
\n \n 2 | \n K1 | \n K0 | \n A2 | \n B2 | \n C1 | \n D1 | \n
\n \n 3 | \n K1 | \n K0 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n 4 | \n K2 | \n K1 | \n A3 | \n B3 | \n NaN | \n NaN | \n
\n \n 5 | \n K2 | \n K0 | \n NaN | \n NaN | \n C3 | \n D3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 16
}
],
"source": [
"pd.merge(left, right, how='outer', on=['key1', 'key2'])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K1 K0 A2 B2 C1 D1\n",
"2 K1 K0 A2 B2 C2 D2\n",
"3 K2 K0 NaN NaN C3 D3"
],
"text/html": "\n\n
\n \n \n | \n key1 | \n key2 | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n K0 | \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n K1 | \n K0 | \n A2 | \n B2 | \n C1 | \n D1 | \n
\n \n 2 | \n K1 | \n K0 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n 3 | \n K2 | \n K0 | \n NaN | \n NaN | \n C3 | \n D3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 17
}
],
"source": [
"pd.merge(left, right, how='right', on=['key1', 'key2'])"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" key1 key2 A B C D\n",
"0 K0 K0 A0 B0 C0 D0\n",
"1 K0 K1 A1 B1 NaN NaN\n",
"2 K1 K0 A2 B2 C1 D1\n",
"3 K1 K0 A2 B2 C2 D2\n",
"4 K2 K1 A3 B3 NaN NaN"
],
"text/html": "\n\n
\n \n \n | \n key1 | \n key2 | \n A | \n B | \n C | \n D | \n
\n \n \n \n 0 | \n K0 | \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n 1 | \n K0 | \n K1 | \n A1 | \n B1 | \n NaN | \n NaN | \n
\n \n 2 | \n K1 | \n K0 | \n A2 | \n B2 | \n C1 | \n D1 | \n
\n \n 3 | \n K1 | \n K0 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n 4 | \n K2 | \n K1 | \n A3 | \n B3 | \n NaN | \n NaN | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 18
}
],
"source": [
"pd.merge(left, right, how='left', on=['key1', 'key2'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assemblage\n",
"Join est une méthode pratique pour combiner les colonnes de deux DataFrames potentiellement indexées différemment résultant en un seul DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],\n",
" 'B': ['B0', 'B1', 'B2']},\n",
" index=['K0', 'K1', 'K2']) \n",
"\n",
"right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],\n",
" 'D': ['D0', 'D2', 'D3']},\n",
" index=['K0', 'K2', 'K3'])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D\n",
"K0 A0 B0 C0 D0\n",
"K1 A1 B1 NaN NaN\n",
"K2 A2 B2 C2 D2"
],
"text/html": "\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n
\n \n \n \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n K1 | \n A1 | \n B1 | \n NaN | \n NaN | \n
\n \n K2 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 20
}
],
"source": [
"left.join(right)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" A B C D\n",
"K0 A0 B0 C0 D0\n",
"K1 A1 B1 NaN NaN\n",
"K2 A2 B2 C2 D2\n",
"K3 NaN NaN C3 D3"
],
"text/html": "\n\n
\n \n \n | \n A | \n B | \n C | \n D | \n
\n \n \n \n K0 | \n A0 | \n B0 | \n C0 | \n D0 | \n
\n \n K1 | \n A1 | \n B1 | \n NaN | \n NaN | \n
\n \n K2 | \n A2 | \n B2 | \n C2 | \n D2 | \n
\n \n K3 | \n NaN | \n NaN | \n C3 | \n D3 | \n
\n \n
\n
"
},
"metadata": {},
"execution_count": 21
}
],
"source": [
"left.join(right, how='outer')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bon travail!"
]
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.7.9 64-bit ('pyfinance': conda)",
"metadata": {
"interpreter": {
"hash": "e89404a230d8800c54ad520c7b67d1bd9bb833a07b37dd3e521a178a3dc34904"
}
}
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.9-final"
}
},
"nbformat": 4,
"nbformat_minor": 1
}