Commit 833eb84

tip 6
Dennis Bakhuis committed
1 parent c2fb76a commit 833eb84

2 files changed

Lines changed: 127 additions & 0 deletions

@@ -0,0 +1,127 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Pandas tip #6: Selecting a range\n",
8+
"Selecting and filtering data from a DataFrame is the core business of a data scientist. There are many methods in Pandas to help you select or eliminate rows. The all-rounder is clearly the .loc[] indexer, which shares some similarities with boolean masking in NumPy. The first time I saw it used with square brackets instead of parentheses, I thought it was a bit weird. Between the brackets, the first item selects the rows and the second selects the columns. \n",
9+
"\n",
10+
"You can combine multiple conditions with the & operator. A few years ago, I thought this was required for selecting ranges. However, Pandas has the very nifty .between() method, which is not only shorter but also more readable. Note that .between() includes both endpoints by default, whereas strict greater-than and less-than comparisons exclude them. "
11+
]
12+
},
13+
{
14+
"cell_type": "markdown",
15+
"metadata": {},
16+
"source": [
17+
"Let's generate some random data:"
18+
]
19+
},
20+
{
21+
"cell_type": "code",
22+
"execution_count": null,
23+
"metadata": {},
24+
"outputs": [],
25+
"source": [
26+
"import numpy as np\n",
27+
"import pandas as pd\n",
28+
"\n",
29+
"start = pd.to_datetime('2021-05-24').value // 10**9\n",
30+
"end = pd.to_datetime('2021-05-25').value // 10**9\n",
31+
"n_samples = 10_000\n",
32+
"\n",
33+
"rng = np.random.default_rng()\n",
34+
"df = pd.DataFrame({\n",
35+
" 'price': rng.normal(loc=4, scale=1, size=n_samples),\n",
36+
" },\n",
37+
" index=pd.to_datetime(\n",
38+
" rng.integers(start, end, size=n_samples),\n",
39+
" unit='s',\n",
40+
" ),\n",
41+
")"
42+
]
43+
},
44+
{
45+
"cell_type": "code",
46+
"execution_count": null,
47+
"metadata": {},
48+
"outputs": [],
49+
"source": [
50+
"df.loc[ # The 'traditional' way\n",
51+
" (df.price > 1)\n",
52+
" & (df.price < 2)\n",
53+
"]"
54+
]
55+
},
56+
{
57+
"cell_type": "code",
58+
"execution_count": null,
59+
"metadata": {},
60+
"outputs": [],
61+
"source": [
62+
"# https://linkedin.com/in/dennisbakhuis\n",
63+
"df.loc[\n",
64+
" df.price.between(1, 2)\n",
65+
"]"
66+
]
67+
},
68+
{
69+
"cell_type": "markdown",
70+
"metadata": {},
71+
"source": [
72+
"For times, there is a special .between_time() method. It takes either a datetime.time object or a string, and it requires the DataFrame to have a DatetimeIndex. It is very convenient for filtering your data to a time-of-day window. Probably not something we use every day, but definitely good to know."
73+
]
74+
},
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"metadata": {},
79+
"outputs": [],
80+
"source": [
81+
"df.between_time('13:00', '14:00')"
82+
]
83+
},
84+
{
85+
"cell_type": "markdown",
86+
"metadata": {},
87+
"source": [
88+
"If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis)."
89+
]
90+
},
91+
{
92+
"cell_type": "code",
93+
"execution_count": null,
94+
"metadata": {},
95+
"outputs": [],
96+
"source": []
97+
},
98+
{
99+
"cell_type": "code",
100+
"execution_count": null,
101+
"metadata": {},
102+
"outputs": [],
103+
"source": []
104+
}
105+
],
106+
"metadata": {
107+
"kernelspec": {
108+
"display_name": "Python 3",
109+
"language": "python",
110+
"name": "python3"
111+
},
112+
"language_info": {
113+
"codemirror_mode": {
114+
"name": "ipython",
115+
"version": 3
116+
},
117+
"file_extension": ".py",
118+
"mimetype": "text/x-python",
119+
"name": "python",
120+
"nbconvert_exporter": "python",
121+
"pygments_lexer": "ipython3",
122+
"version": "3.7.7"
123+
}
124+
},
125+
"nbformat": 4,
126+
"nbformat_minor": 4
127+
}
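Note on the notebook above: the `&` cell and the `.between()` cell differ only at the boundaries, which rarely shows up with random floats. The following is a minimal sketch, not part of the commit, using hypothetical toy data (`prices`, `ts`) picked so that the endpoints are present. It assumes pandas 1.3 or later for the string values of `inclusive`.

```python
import pandas as pd

# Hypothetical toy data: the endpoints 1.0 and 2.0 are deliberately included.
prices = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5], name="price")

# Strict comparisons combined with & exclude both endpoints.
strict = prices[(prices > 1) & (prices < 2)]

# .between() includes both endpoints by default.
with_endpoints = prices[prices.between(1, 2)]

# Assumption: pandas >= 1.3, where `inclusive` accepts "both", "neither", "left", "right".
without_endpoints = prices[prices.between(1, 2, inclusive="neither")]

print(strict.tolist())             # [1.5]
print(with_endpoints.tolist())     # [1.0, 1.5, 2.0]
print(without_endpoints.tolist())  # [1.5]

# .between_time() needs a DatetimeIndex and also includes both ends by default.
ts = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime([
        "2021-05-24 12:59:59",
        "2021-05-24 13:00:00",
        "2021-05-24 14:00:00",
        "2021-05-24 14:00:01",
    ]),
)
window = ts.between_time("13:00", "14:00")
print(window["price"].tolist())    # [2.0, 3.0]
```

If the strict behaviour of the "traditional" cell is what you want, passing `inclusive="neither"` is one way to get the same rows from `.between()`.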
126 KB
Binary file not shown.
