Commit bea2608

author
Dennis Bakhuis
committed
tip 10
1 parent 96a076e commit bea2608

3 files changed

Lines changed: 363 additions & 0 deletions

@@ -0,0 +1,363 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Pandas tip #10: filter your rows and columns\n",
8+
"Tabular data can consist of a large number of columns and sometimes you want to select a subset of columns in a smart way. For example, you have a dataset that contains the color combination for a car and you want to get all the columns about colors.\n",
9+
"\n",
10+
"I used to .loc[] until I dropped and used a list comprehension to select the columns I want. This works very well but also is quite long and therefore, less readable. For such cases Pandas almost always offers a neater way to solve that problem: .filter().\n",
11+
"\n",
12+
"The .filter() method helps you to select a subset of the DataFrame, but it only filters the labels, not the content. There are three parameters that can be used for filtering: items, like, and regex. The first parameter is simply a list of label names and must match exactly. The second parameter works similar to the `LIKE` keyword in SQL and is used to filter labels that contains the substring passed to like. With the regex parameter we can pass a regex as a selection criteria.\n",
13+
"\n",
14+
"Pandas offers many of such small improvements and I think those make the code much more readable with sometimes even a small performance gain."
15+
]
16+
},
17+
{
18+
"cell_type": "markdown",
19+
"metadata": {},
20+
"source": [
21+
"Lets generate some random data:"
22+
]
23+
},
24+
{
25+
"cell_type": "code",
26+
"execution_count": 1,
27+
"metadata": {},
28+
"outputs": [],
29+
"source": [
30+
"import numpy as np\n",
31+
"import pandas as pd\n",
32+
"\n",
33+
"colors = ['red', 'blue', 'yellow', 'green', 'purple'] \n",
34+
"n_samples = 100\n",
35+
"\n",
36+
"rng = np.random.default_rng()\n",
37+
"df = pd.DataFrame({\n",
38+
" 'car_serial_id': rng.integers(0, 1000, size=n_samples), \n",
39+
" 'body_color': rng.choice(colors, size=n_samples),\n",
40+
" 'door_color': rng.choice(colors, size=n_samples),\n",
41+
" 'roof_color': rng.choice(colors, size=n_samples),\n",
42+
"})"
43+
]
44+
},
45+
{
46+
"cell_type": "markdown",
47+
"metadata": {},
48+
"source": [
49+
"Select all rows containing color:"
50+
]
51+
},
52+
{
53+
"cell_type": "code",
54+
"execution_count": 3,
55+
"metadata": {},
56+
"outputs": [
57+
{
58+
"data": {
59+
"text/html": [
60+
"<div>\n",
61+
"<style scoped>\n",
62+
" .dataframe tbody tr th:only-of-type {\n",
63+
" vertical-align: middle;\n",
64+
" }\n",
65+
"\n",
66+
" .dataframe tbody tr th {\n",
67+
" vertical-align: top;\n",
68+
" }\n",
69+
"\n",
70+
" .dataframe thead th {\n",
71+
" text-align: right;\n",
72+
" }\n",
73+
"</style>\n",
74+
"<table border=\"1\" class=\"dataframe\">\n",
75+
" <thead>\n",
76+
" <tr style=\"text-align: right;\">\n",
77+
" <th></th>\n",
78+
" <th>body_color</th>\n",
79+
" <th>door_color</th>\n",
80+
" <th>roof_color</th>\n",
81+
" </tr>\n",
82+
" </thead>\n",
83+
" <tbody>\n",
84+
" <tr>\n",
85+
" <th>0</th>\n",
86+
" <td>blue</td>\n",
87+
" <td>red</td>\n",
88+
" <td>red</td>\n",
89+
" </tr>\n",
90+
" <tr>\n",
91+
" <th>1</th>\n",
92+
" <td>green</td>\n",
93+
" <td>red</td>\n",
94+
" <td>blue</td>\n",
95+
" </tr>\n",
96+
" <tr>\n",
97+
" <th>2</th>\n",
98+
" <td>green</td>\n",
99+
" <td>blue</td>\n",
100+
" <td>green</td>\n",
101+
" </tr>\n",
102+
" <tr>\n",
103+
" <th>3</th>\n",
104+
" <td>yellow</td>\n",
105+
" <td>red</td>\n",
106+
" <td>blue</td>\n",
107+
" </tr>\n",
108+
" <tr>\n",
109+
" <th>4</th>\n",
110+
" <td>red</td>\n",
111+
" <td>blue</td>\n",
112+
" <td>purple</td>\n",
113+
" </tr>\n",
114+
" <tr>\n",
115+
" <th>...</th>\n",
116+
" <td>...</td>\n",
117+
" <td>...</td>\n",
118+
" <td>...</td>\n",
119+
" </tr>\n",
120+
" <tr>\n",
121+
" <th>95</th>\n",
122+
" <td>green</td>\n",
123+
" <td>red</td>\n",
124+
" <td>blue</td>\n",
125+
" </tr>\n",
126+
" <tr>\n",
127+
" <th>96</th>\n",
128+
" <td>red</td>\n",
129+
" <td>blue</td>\n",
130+
" <td>red</td>\n",
131+
" </tr>\n",
132+
" <tr>\n",
133+
" <th>97</th>\n",
134+
" <td>purple</td>\n",
135+
" <td>yellow</td>\n",
136+
" <td>yellow</td>\n",
137+
" </tr>\n",
138+
" <tr>\n",
139+
" <th>98</th>\n",
140+
" <td>green</td>\n",
141+
" <td>yellow</td>\n",
142+
" <td>red</td>\n",
143+
" </tr>\n",
144+
" <tr>\n",
145+
" <th>99</th>\n",
146+
" <td>red</td>\n",
147+
" <td>blue</td>\n",
148+
" <td>yellow</td>\n",
149+
" </tr>\n",
150+
" </tbody>\n",
151+
"</table>\n",
152+
"<p>100 rows × 3 columns</p>\n",
153+
"</div>"
154+
],
155+
"text/plain": [
156+
" body_color door_color roof_color\n",
157+
"0 blue red red\n",
158+
"1 green red blue\n",
159+
"2 green blue green\n",
160+
"3 yellow red blue\n",
161+
"4 red blue purple\n",
162+
".. ... ... ...\n",
163+
"95 green red blue\n",
164+
"96 red blue red\n",
165+
"97 purple yellow yellow\n",
166+
"98 green yellow red\n",
167+
"99 red blue yellow\n",
168+
"\n",
169+
"[100 rows x 3 columns]"
170+
]
171+
},
172+
"execution_count": 3,
173+
"metadata": {},
174+
"output_type": "execute_result"
175+
}
176+
],
177+
"source": [
178+
"df.loc[\n",
179+
" :,\n",
180+
" [x.endswith('_color') for x in df.columns]\n",
181+
"]"
182+
]
183+
},
184+
{
185+
"cell_type": "markdown",
186+
"metadata": {},
187+
"source": [
188+
"It is much easier using the filter command:"
189+
]
190+
},
191+
{
192+
"cell_type": "code",
193+
"execution_count": 4,
194+
"metadata": {},
195+
"outputs": [
196+
{
197+
"data": {
198+
"text/html": [
199+
"<div>\n",
200+
"<style scoped>\n",
201+
" .dataframe tbody tr th:only-of-type {\n",
202+
" vertical-align: middle;\n",
203+
" }\n",
204+
"\n",
205+
" .dataframe tbody tr th {\n",
206+
" vertical-align: top;\n",
207+
" }\n",
208+
"\n",
209+
" .dataframe thead th {\n",
210+
" text-align: right;\n",
211+
" }\n",
212+
"</style>\n",
213+
"<table border=\"1\" class=\"dataframe\">\n",
214+
" <thead>\n",
215+
" <tr style=\"text-align: right;\">\n",
216+
" <th></th>\n",
217+
" <th>body_color</th>\n",
218+
" <th>door_color</th>\n",
219+
" <th>roof_color</th>\n",
220+
" </tr>\n",
221+
" </thead>\n",
222+
" <tbody>\n",
223+
" <tr>\n",
224+
" <th>0</th>\n",
225+
" <td>blue</td>\n",
226+
" <td>red</td>\n",
227+
" <td>red</td>\n",
228+
" </tr>\n",
229+
" <tr>\n",
230+
" <th>1</th>\n",
231+
" <td>green</td>\n",
232+
" <td>red</td>\n",
233+
" <td>blue</td>\n",
234+
" </tr>\n",
235+
" <tr>\n",
236+
" <th>2</th>\n",
237+
" <td>green</td>\n",
238+
" <td>blue</td>\n",
239+
" <td>green</td>\n",
240+
" </tr>\n",
241+
" <tr>\n",
242+
" <th>3</th>\n",
243+
" <td>yellow</td>\n",
244+
" <td>red</td>\n",
245+
" <td>blue</td>\n",
246+
" </tr>\n",
247+
" <tr>\n",
248+
" <th>4</th>\n",
249+
" <td>red</td>\n",
250+
" <td>blue</td>\n",
251+
" <td>purple</td>\n",
252+
" </tr>\n",
253+
" <tr>\n",
254+
" <th>...</th>\n",
255+
" <td>...</td>\n",
256+
" <td>...</td>\n",
257+
" <td>...</td>\n",
258+
" </tr>\n",
259+
" <tr>\n",
260+
" <th>95</th>\n",
261+
" <td>green</td>\n",
262+
" <td>red</td>\n",
263+
" <td>blue</td>\n",
264+
" </tr>\n",
265+
" <tr>\n",
266+
" <th>96</th>\n",
267+
" <td>red</td>\n",
268+
" <td>blue</td>\n",
269+
" <td>red</td>\n",
270+
" </tr>\n",
271+
" <tr>\n",
272+
" <th>97</th>\n",
273+
" <td>purple</td>\n",
274+
" <td>yellow</td>\n",
275+
" <td>yellow</td>\n",
276+
" </tr>\n",
277+
" <tr>\n",
278+
" <th>98</th>\n",
279+
" <td>green</td>\n",
280+
" <td>yellow</td>\n",
281+
" <td>red</td>\n",
282+
" </tr>\n",
283+
" <tr>\n",
284+
" <th>99</th>\n",
285+
" <td>red</td>\n",
286+
" <td>blue</td>\n",
287+
" <td>yellow</td>\n",
288+
" </tr>\n",
289+
" </tbody>\n",
290+
"</table>\n",
291+
"<p>100 rows × 3 columns</p>\n",
292+
"</div>"
293+
],
294+
"text/plain": [
295+
" body_color door_color roof_color\n",
296+
"0 blue red red\n",
297+
"1 green red blue\n",
298+
"2 green blue green\n",
299+
"3 yellow red blue\n",
300+
"4 red blue purple\n",
301+
".. ... ... ...\n",
302+
"95 green red blue\n",
303+
"96 red blue red\n",
304+
"97 purple yellow yellow\n",
305+
"98 green yellow red\n",
306+
"99 red blue yellow\n",
307+
"\n",
308+
"[100 rows x 3 columns]"
309+
]
310+
},
311+
"execution_count": 4,
312+
"metadata": {},
313+
"output_type": "execute_result"
314+
}
315+
],
316+
"source": [
317+
"df.filter(like='color', axis=1)"
318+
]
319+
},
320+
{
321+
"cell_type": "markdown",
322+
"metadata": {},
323+
"source": [
324+
"If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis)."
325+
]
326+
},
327+
{
328+
"cell_type": "code",
329+
"execution_count": null,
330+
"metadata": {},
331+
"outputs": [],
332+
"source": []
333+
},
334+
{
335+
"cell_type": "code",
336+
"execution_count": null,
337+
"metadata": {},
338+
"outputs": [],
339+
"source": []
340+
}
341+
],
342+
"metadata": {
343+
"kernelspec": {
344+
"display_name": "Python 3",
345+
"language": "python",
346+
"name": "python3"
347+
},
348+
"language_info": {
349+
"codemirror_mode": {
350+
"name": "ipython",
351+
"version": 3
352+
},
353+
"file_extension": ".py",
354+
"mimetype": "text/x-python",
355+
"name": "python",
356+
"nbconvert_exporter": "python",
357+
"pygments_lexer": "ipython3",
358+
"version": "3.7.7"
359+
}
360+
},
361+
"nbformat": 4,
362+
"nbformat_minor": 4
363+
}
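The notebook describes the three `.filter()` parameters (items, like, and regex) but only demonstrates `like`. The sketch below runs all three side by side on a small, hand-written frame. The data values and the extra `colorful_trim` column are assumptions added purely for illustration; they are not part of the notebook. The extra column shows that `like='color'` matches the substring anywhere in the label, so an anchored regex such as `'_color$'` is the closer equivalent of the `endswith('_color')` list comprehension.

```python
import pandas as pd

# Hypothetical frame that reuses the notebook's column names. The values and
# the 'colorful_trim' column are made up to show where like= and regex= differ.
df = pd.DataFrame({
    'car_serial_id': [101, 102, 103],
    'body_color': ['red', 'blue', 'green'],
    'door_color': ['blue', 'blue', 'yellow'],
    'roof_color': ['red', 'purple', 'green'],
    'colorful_trim': [True, False, True],
})

# items: exact label names; names that are not present are skipped silently.
by_items = df.filter(items=['body_color', 'roof_color', 'not_a_column'])

# like: keeps every label that contains the substring, wherever it occurs.
by_like = df.filter(like='color')

# regex: keeps labels in which the pattern is found; '$' anchors the match
# to the end of the label, which mirrors str.endswith('_color').
by_regex = df.filter(regex='_color$')

# The longer .loc[] version from the notebook, for comparison.
by_mask = df.loc[:, [c.endswith('_color') for c in df.columns]]

# axis=0 applies the same label filtering to the row index instead.
by_rows = df.filter(items=[0, 2], axis=0)

print(list(by_items.columns))
print(list(by_like.columns))
print(list(by_regex.columns))
print(by_regex.equals(by_mask))
print(list(by_rows.index))
```

On the notebook's own data the two approaches return the same columns, because every label containing `color` also ends in `_color`. The difference only appears once a label such as the hypothetical `colorful_trim` exists.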
