|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Pandas tip #9: .loc[] or .query()?\n", |
| 8 | + "When working with tabular data, one of the most common tasks is filtering it. For me, in Pandas this means using .loc[], .iloc[], or .at[]. These all work great (remember that they may return either a view or a copy, so be careful with chaining), but there is another kid on the block: .query().\n", |
| 9 | + "\n", |
| 10 | + "The .query() method is a bit different: it takes a SQL-like where-clause string as its argument, whereas .loc[] and friends typically take a boolean mask. Such a mask is usually the result of small logical operations that are evaluated separately, before the selection. With .query(), evaluating the expression is combined with the selection, which can give a small performance gain, though it is only noticeable on large DataFrames.\n", |
| 11 | + "\n", |
| 12 | + "Another difference between .query() and the others is that it returns a copy and not a view. You have to be careful not to chain .loc[] calls, as they can behave unexpectedly (and you will see the famous SettingWithCopyWarning). Working with copies solves this, and therefore .query() is safe to chain.\n", |
| 13 | + "\n", |
| 14 | + "I still use .loc[] all the time, most probably because I am used to writing it, but .query() can be shorter and faster on some occasions." |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "Let's generate some random data:" |
| 22 | + ] |
| 23 | + }, |
| 24 | + { |
| 25 | + "cell_type": "code", |
| 26 | + "execution_count": null, |
| 27 | + "metadata": {}, |
| 28 | + "outputs": [], |
| 29 | + "source": [ |
| 30 | + "import numpy as np\n", |
| 31 | + "import pandas as pd\n", |
| 32 | + "\n", |
| 33 | + "categories = list('ABCD')\n", |
| 34 | + "n_samples = 100_000\n", |
| 35 | + "\n", |
| 36 | + "rng = np.random.default_rng()\n", |
| 37 | + "df = pd.DataFrame({\n", |
| 38 | + " 'client_id': rng.integers(0, 1000, size=n_samples),\n", |
| 39 | + " 'product_category': rng.choice(categories, size=n_samples),\n", |
| 40 | + "})" |
| 41 | + ] |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "metadata": {}, |
| 46 | + "source": [ |
| 47 | + "Let's time the selection using .loc[]:" |
| 48 | + ] |
| 49 | + }, |
| 50 | + { |
| 51 | + "cell_type": "code", |
| 52 | + "execution_count": null, |
| 53 | + "metadata": {}, |
| 54 | + "outputs": [], |
| 55 | + "source": [ |
| 56 | + "%%timeit\n", |
| 57 | + "df.loc[\n", |
| 58 | + " (df.product_category == 'A') | (df.product_category == 'C')\n", |
| 59 | + "]" |
| 60 | + ] |
| 61 | + }, |
| 62 | + { |
| 63 | + "cell_type": "markdown", |
| 64 | + "metadata": {}, |
| 65 | + "source": [ |
| 66 | + "And now using .query()" |
| 67 | + ] |
| 68 | + }, |
| 69 | + { |
| 70 | + "cell_type": "code", |
| 71 | + "execution_count": null, |
| 72 | + "metadata": {}, |
| 73 | + "outputs": [], |
| 74 | + "source": [ |
| 75 | + "%%timeit\n", |
| 76 | + "df.query(\"product_category == 'A' | product_category == 'C'\")" |
| 77 | + ] |
| 78 | + }, |
| 79 | + { |
| 80 | + "cell_type": "markdown", |
| 81 | + "metadata": {}, |
| 82 | + "source": [ |
| 83 | + "You can use variables in the query string either with the older @ notation or with the newer Python f-strings." |
| 84 | + ] |
| 85 | + }, |
| 86 | + { |
| 87 | + "cell_type": "code", |
| 88 | + "execution_count": null, |
| 89 | + "metadata": {}, |
| 90 | + "outputs": [], |
| 91 | + "source": [ |
| 92 | + "category_name = 'C'\n", |
| 93 | + "\n", |
| 94 | + "cat_c = df.query(\"product_category == @category_name\")\n", |
| 95 | + "cat_c = df.query(f\"product_category == '{category_name}'\")" |
| 96 | + ] |
| 97 | + }, |
| 98 | + { |
| 99 | + "cell_type": "markdown", |
| 100 | + "metadata": {}, |
| 101 | + "source": [ |
| 102 | + "If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis)." |
| 103 | + ] |
| 104 | + }, |
| 105 | + { |
| 106 | + "cell_type": "code", |
| 107 | + "execution_count": null, |
| 108 | + "metadata": {}, |
| 109 | + "outputs": [], |
| 110 | + "source": [] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "code", |
| 114 | + "execution_count": null, |
| 115 | + "metadata": {}, |
| 116 | + "outputs": [], |
| 117 | + "source": [] |
| 118 | + } |
| 119 | + ], |
| 120 | + "metadata": { |
| 121 | + "kernelspec": { |
| 122 | + "display_name": "Python 3", |
| 123 | + "language": "python", |
| 124 | + "name": "python3" |
| 125 | + }, |
| 126 | + "language_info": { |
| 127 | + "codemirror_mode": { |
| 128 | + "name": "ipython", |
| 129 | + "version": 3 |
| 130 | + }, |
| 131 | + "file_extension": ".py", |
| 132 | + "mimetype": "text/x-python", |
| 133 | + "name": "python", |
| 134 | + "nbconvert_exporter": "python", |
| 135 | + "pygments_lexer": "ipython3", |
| 136 | + "version": "3.7.7" |
| 137 | + } |
| 138 | + }, |
| 139 | + "nbformat": 4, |
| 140 | + "nbformat_minor": 4 |
| 141 | +} |