Commit e4247ec

author
Dennis Bakhuis
committed
tip 8
1 parent 9a22eb6 commit e4247ec

5 files changed

Lines changed: 134 additions & 0 deletions

@@ -0,0 +1,134 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Pandas tip #8: Explode your DataFrame\n",
+    "A column in a Pandas DataFrame can hold practically any type. Not all types are ideal, but some are useful as intermediate values. One of these is a list, which adds a nested, higher-order structure to your tabular data.\n",
+    "\n",
+    "For example, in a DataFrame of words I used a list to store the positions at which each word was mentioned in a large corpus. This nicely groups all relevant data, but to make it useful we need to flatten it again. Flattening in Python is nicely done using a list comprehension. In Pandas I found this nifty method: .explode().\n",
+    "\n",
+    "The .explode() method flattens the list in each row, independent of the length of each list, and duplicates the values in the row's other columns (and its index label) for every list element. It is probably not used very often, but it is definitely a 'nice to have' method in your toolbox."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's generate some random data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "categories = [list('AB'), list('ABC'), list('ABCD')]\n",
+    "n_samples = 100\n",
+    "\n",
+    "rng = np.random.default_rng()\n",
+    "df = pd.DataFrame({\n",
+    "    'client_id': np.arange(n_samples),\n",
+    "    'product_category': [categories[i] for i in rng.integers(len(categories), size=n_samples)],  # pick by index: rng.choice on a ragged list of lists fails in recent NumPy\n",
+    "}).set_index('client_id')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Flattening a list in Python is easy using list comprehensions:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "categories"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "[item for sublist in categories for item in sublist]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Flattening a column is called exploding in Pandas:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.explode('product_category')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
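
Because the notebook builds its data randomly, the output of .explode() differs on every run. As a rough, deterministic sketch of the same idea, the snippet below uses a tiny hand-written frame (the three clients and their lists are made up for illustration and are not part of the commit). It shows that the index label is repeated for every list element, that an empty list is kept as a single NaN row, and one possible way to regroup the exploded rows back into lists.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook:
# client 2 has an empty list to show how explode treats it.
df = pd.DataFrame(
    {
        "client_id": [0, 1, 2],
        "product_category": [["A", "B"], ["A", "B", "C"], []],
    }
).set_index("client_id")

# One output row per list element; the index label repeats per element.
exploded = df.explode("product_category")
print(exploded)

# The empty list does not disappear: it becomes one row holding NaN.
print(exploded["product_category"].isna().sum())

# One way to go back: drop the NaN rows and collect values per client.
# Note that client 2 (the empty list) is lost in this round trip.
regrouped = (
    exploded.dropna()
    .groupby(level="client_id")["product_category"]
    .agg(list)
)
print(regrouped)
```

If the repeated index labels get in the way, `df.explode('product_category', ignore_index=True)` (available since pandas 1.1) returns a fresh 0..n-1 index instead, at the cost of losing the link to `client_id` unless it is kept as a regular column.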
