dennisbakhuis
diff --git a/‎B_Pandas_tips/6 - Selecting a range.ipynb‎
Lines changed: 127 additions & 0 deletions b/‎B_Pandas_tips/6 - Selecting a range.ipynb‎
Lines changed: 127 additions & 0 deletions
diff --git a/‎B_Pandas_tips/Assets/Pandas_Tricks_Slides.pptx‎
126 KB b/‎B_Pandas_tips/Assets/Pandas_Tricks_Slides.pptx‎
126 KB
@@ -0,0 +1,127 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Pandas tip #6: Selecting a range\n",
+    "Selecting and filtering data from a DataFrame is the core business of a data scientist. There are many methods in Pandas to help you select or eliminate the rows. The all-rounder is clearly the .loc[] method and it share some similarities with boolean masking from Numpy. The first time a saw the method using square brackets instead of curly braces, I thought it was a bit weird. Between the brackets, the first number is the row pattern and second is the column pattern. \n",
+    "\n",
+    "You can combine multiple rules by using the & operator. A few years ago, I thought that this was required when selecting ranges, however, Pandas has the very nifty .between() method. This is not only shorter but also makes it more readable. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Lets generate some random data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "start = pd.to_datetime('2021-05-24').value // 10**9\n",
+    "end = pd.to_datetime('2021-05-25').value // 10**9\n",
+    "n_samples = 10_000\n",
+    "\n",
+    "rng = np.random.default_rng()\n",
+    "df = pd.DataFrame({\n",
+    "    'price': rng.normal(loc=4, scale=1, size=n_samples),\n",
+    "    },\n",
+    "    index= pd.to_datetime(\n",
+    "        rng.integers(start, end, size=n_samples),\n",
+    "        unit='s',\n",
+    "    ),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.loc[  # The 'traditional' way\n",
+    "    (df.price > 1)\n",
+    "    & (df.price < 2)\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# https://linkedin.com/in/dennisbakhuis\n",
+    "df.loc[\n",
+    "    df.price.between(1, 2)\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For times, there is a special .between_time() method. It takes or datetime object or a string. It is very convenient to filter your data between time slots. Probably not something we use every day but definitely good to know."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.between_time('13:00', '14:00')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}