{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Fusionner, Joindre et Concaténer\n", "\n", "Il y a 3 façons principales de combiner des DataFrames ensemble : Fusion, assemblage et concaténation. Dans ce notebook, nous discuterons de ces 3 méthodes avec des exemples.\n", "\n", "____" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exemple DataFrames" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']},\n", " index=[0, 1, 2, 3])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],\n", " 'B': ['B4', 'B5', 'B6', 'B7'],\n", " 'C': ['C4', 'C5', 'C6', 'C7'],\n", " 'D': ['D4', 'D5', 'D6', 'D7']},\n", " index=[4, 5, 6, 7]) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],\n", " 'B': ['B8', 'B9', 'B10', 'B11'],\n", " 'C': ['C8', 'C9', 'C10', 'C11'],\n", " 'D': ['D8', 'D9', 'D10', 'D11']},\n", " index=[8, 9, 10, 11])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1\n", "2 A2 B2 C2 D2\n", "3 A3 B3 C3 D3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
\n
" }, "metadata": {}, "execution_count": 5 } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D\n", "4 A4 B4 C4 D4\n", "5 A5 B5 C5 D5\n", "6 A6 B6 C6 D6\n", "7 A7 B7 C7 D7" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCD
4A4B4C4D4
5A5B5C5D5
6A6B6C6D6
7A7B7C7D7
\n
" }, "metadata": {}, "execution_count": 6 } ], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D\n", "8 A8 B8 C8 D8\n", "9 A9 B9 C9 D9\n", "10 A10 B10 C10 D10\n", "11 A11 B11 C11 D11" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCD
8A8B8C8D8
9A9B9C9D9
10A10B10C10D10
11A11B11C11D11
\n
" }, "metadata": {}, "execution_count": 7 } ], "source": [ "df3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Concaténation\n", "\n", "La concaténation colle fondamentalement les DataFrames ensemble. Gardez à l'esprit que les dimensions doivent correspondre à l'axe sur lequel vous concaténez. Vous pouvez utiliser **pd.concat** et passer dans une liste de DataFrames à concaténer ensemble :" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D\n", "0 A0 B0 C0 D0\n", "1 A1 B1 C1 D1\n", "2 A2 B2 C2 D2\n", "3 A3 B3 C3 D3\n", "4 A4 B4 C4 D4\n", "5 A5 B5 C5 D5\n", "6 A6 B6 C6 D6\n", "7 A7 B7 C7 D7\n", "8 A8 B8 C8 D8\n", "9 A9 B9 C9 D9\n", "10 A10 B10 C10 D10\n", "11 A11 B11 C11 D11" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCD
0A0B0C0D0
1A1B1C1D1
2A2B2C2D2
3A3B3C3D3
4A4B4C4D4
5A5B5C5D5
6A6B6C6D6
7A7B7C7D7
8A8B8C8D8
9A9B9C9D9
10A10B10C10D10
11A11B11C11D11
\n
" }, "metadata": {}, "execution_count": 8 } ], "source": [ "pd.concat([df1,df2,df3])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D A B C D A B C D\n", "0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN\n", "1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN\n", "2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN\n", "3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN\n", "4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN\n", "5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN\n", "6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN\n", "7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN\n", "8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8\n", "9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9\n", "10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10\n", "11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCDABCDABCD
0A0B0C0D0NaNNaNNaNNaNNaNNaNNaNNaN
1A1B1C1D1NaNNaNNaNNaNNaNNaNNaNNaN
2A2B2C2D2NaNNaNNaNNaNNaNNaNNaNNaN
3A3B3C3D3NaNNaNNaNNaNNaNNaNNaNNaN
4NaNNaNNaNNaNA4B4C4D4NaNNaNNaNNaN
5NaNNaNNaNNaNA5B5C5D5NaNNaNNaNNaN
6NaNNaNNaNNaNA6B6C6D6NaNNaNNaNNaN
7NaNNaNNaNNaNA7B7C7D7NaNNaNNaNNaN
8NaNNaNNaNNaNNaNNaNNaNNaNA8B8C8D8
9NaNNaNNaNNaNNaNNaNNaNNaNA9B9C9D9
10NaNNaNNaNNaNNaNNaNNaNNaNA10B10C10D10
11NaNNaNNaNNaNNaNNaNNaNNaNA11B11C11D11
\n
" }, "metadata": {}, "execution_count": 9 } ], "source": [ "pd.concat([df1,df2,df3],axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_____\n", "## Exemple DataFrames" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", " 'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3']})\n", " \n", "right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']}) " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key A B\n", "0 K0 A0 B0\n", "1 K1 A1 B1\n", "2 K2 A2 B2\n", "3 K3 A3 B3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keyAB
0K0A0B0
1K1A1B1
2K2A2B2
3K3A3B3
\n
" }, "metadata": {}, "execution_count": 11 } ], "source": [ "left" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key C D\n", "0 K0 C0 D0\n", "1 K1 C1 D1\n", "2 K2 C2 D2\n", "3 K3 C3 D3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keyCD
0K0C0D0
1K1C1D1
2K2C2D2
3K3C3D3
\n
" }, "metadata": {}, "execution_count": 12 } ], "source": [ "right" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fusion\n", "\n", "La fonction **merge** vous permet de fusionner des DataFrames ensemble en utilisant une logique similaire à celle de la fusion de Tables SQL. Par exemple :" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key A B C D\n", "0 K0 A0 B0 C0 D0\n", "1 K1 A1 B1 C1 D1\n", "2 K2 A2 B2 C2 D2\n", "3 K3 A3 B3 C3 D3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
keyABCD
0K0A0B0C0D0
1K1A1B1C1D1
2K2A2B2C2D2
3K3A3B3C3D3
\n
" }, "metadata": {}, "execution_count": 13 } ], "source": [ "pd.merge(left,right,how='inner',on='key')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exemple plus compliqué :" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],\n", " 'key2': ['K0', 'K1', 'K0', 'K1'],\n", " 'A': ['A0', 'A1', 'A2', 'A3'],\n", " 'B': ['B0', 'B1', 'B2', 'B3']})\n", " \n", "right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],\n", " 'key2': ['K0', 'K0', 'K0', 'K0'],\n", " 'C': ['C0', 'C1', 'C2', 'C3'],\n", " 'D': ['D0', 'D1', 'D2', 'D3']})" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key1 key2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K1 K0 A2 B2 C1 D1\n", "2 K1 K0 A2 B2 C2 D2" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
key1key2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2
\n
" }, "metadata": {}, "execution_count": 15 } ], "source": [ "pd.merge(left, right, on=['key1', 'key2'])" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key1 key2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K0 K1 A1 B1 NaN NaN\n", "2 K1 K0 A2 B2 C1 D1\n", "3 K1 K0 A2 B2 C2 D2\n", "4 K2 K1 A3 B3 NaN NaN\n", "5 K2 K0 NaN NaN C3 D3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
key1key2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN
5K2K0NaNNaNC3D3
\n
" }, "metadata": {}, "execution_count": 16 } ], "source": [ "pd.merge(left, right, how='outer', on=['key1', 'key2'])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key1 key2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K1 K0 A2 B2 C1 D1\n", "2 K1 K0 A2 B2 C2 D2\n", "3 K2 K0 NaN NaN C3 D3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
key1key2ABCD
0K0K0A0B0C0D0
1K1K0A2B2C1D1
2K1K0A2B2C2D2
3K2K0NaNNaNC3D3
\n
" }, "metadata": {}, "execution_count": 17 } ], "source": [ "pd.merge(left, right, how='right', on=['key1', 'key2'])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " key1 key2 A B C D\n", "0 K0 K0 A0 B0 C0 D0\n", "1 K0 K1 A1 B1 NaN NaN\n", "2 K1 K0 A2 B2 C1 D1\n", "3 K1 K0 A2 B2 C2 D2\n", "4 K2 K1 A3 B3 NaN NaN" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
key1key2ABCD
0K0K0A0B0C0D0
1K0K1A1B1NaNNaN
2K1K0A2B2C1D1
3K1K0A2B2C2D2
4K2K1A3B3NaNNaN
\n
" }, "metadata": {}, "execution_count": 18 } ], "source": [ "pd.merge(left, right, how='left', on=['key1', 'key2'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assemblage\n", "Join est une méthode pratique pour combiner les colonnes de deux DataFrames potentiellement indexées différemment résultant en un seul DataFrame." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],\n", " 'B': ['B0', 'B1', 'B2']},\n", " index=['K0', 'K1', 'K2']) \n", "\n", "right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],\n", " 'D': ['D0', 'D2', 'D3']},\n", " index=['K0', 'K2', 'K3'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D\n", "K0 A0 B0 C0 D0\n", "K1 A1 B1 NaN NaN\n", "K2 A2 B2 C2 D2" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
\n
" }, "metadata": {}, "execution_count": 20 } ], "source": [ "left.join(right)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " A B C D\n", "K0 A0 B0 C0 D0\n", "K1 A1 B1 NaN NaN\n", "K2 A2 B2 C2 D2\n", "K3 NaN NaN C3 D3" ], "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
ABCD
K0A0B0C0D0
K1A1B1NaNNaN
K2A2B2C2D2
K3NaNNaNC3D3
\n
" }, "metadata": {}, "execution_count": 21 } ], "source": [ "left.join(right, how='outer')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bon travail!" ] } ], "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3.7.9 64-bit ('pyfinance': conda)", "metadata": { "interpreter": { "hash": "e89404a230d8800c54ad520c7b67d1bd9bb833a07b37dd3e521a178a3dc34904" } } }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9-final" } }, "nbformat": 4, "nbformat_minor": 1 }