|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Pandas tip #15: Compress your data!\n", |
| 8 | + "When working with data, you often come across CSV files. The great thing about CSV files is that they are human readable. The bad thing is that they are not very space efficient. In my experience, human readability is only a marginal benefit, mostly useful when you work with a file for the first time.\n", |
| 9 | + "\n", |
| 10 | + "A nice CSV feature of Pandas is that it can store your data compressed, using the `compression` parameter of `to_csv`. Out of the box, Pandas can use zip, gzip, bz2, and xz compression when storing a DataFrame. If the file name ends in a known extension such as '.csv.gz', Pandas infers the compression automatically, so we can omit the `compression` parameter.\n", |
| 11 | + "\n", |
| 12 | + "As we are compressing the data, it takes a bit more effort to store it. Obviously, the compressed file is not human readable anymore. Under the hood, Pandas uses the gzip, bzip2, and xz libraries and streams the text lines to the compressed file. In my test, bzip2 compressed best: it reduced the file to about one third of its original size while taking about three times longer to write.\n", |
| 13 | + "\n", |
| 14 | + "Writing the data line by line is of course not very efficient, and as we lose the \"benefit\" of human readability anyhow, we could also store it as Parquet. Parquet is a columnar format that is highly optimized for reading, i.e. writing takes a bit more effort. Still, compared to a method that stores data line by line, Parquet is blazingly fast, and so far I have not seen any real downsides to using it.\n", |
| 15 | + "\n", |
| 16 | + "Parquet is by far my favorite format and I highly recommend it!" |
| 17 | + ] |
| 18 | + }, |
| 19 | + { |
| 20 | + "cell_type": "markdown", |
| 21 | + "metadata": {}, |
| 22 | + "source": [ |
| 23 | + "Let's generate some random data:" |
| 24 | + ] |
| 25 | + }, |
| 26 | + { |
| 27 | + "cell_type": "code", |
| 28 | + "execution_count": null, |
| 29 | + "metadata": {}, |
| 30 | + "outputs": [], |
| 31 | + "source": [ |
| 32 | + "import numpy as np\n", |
| 33 | + "import pandas as pd\n", |
| 34 | + "from pathlib import Path\n", |
| 35 | + "\n", |
| 36 | + "rng = np.random.default_rng(42)\n", |
| 37 | + "n_rows, n_cols = 100_000, 100\n", |
| 38 | + "\n", |
| 39 | + "df = pd.DataFrame(\n", |
| 40 | + "    rng.integers(0, 1000, size=(n_rows, n_cols)),\n", |
| 41 | + "    columns=[str(x) for x in range(n_cols)]\n", |
| 42 | + ")" |
| 43 | + ] |
| 44 | + }, |
| 45 | + { |
| 46 | + "cell_type": "markdown", |
| 47 | + "metadata": {}, |
| 48 | + "source": [ |
| 49 | + "Let's save this to a regular CSV file. First, a small helper to report file sizes:" |
| 50 | + ] |
| 51 | + }, |
| 52 | + { |
| 53 | + "cell_type": "code", |
| 54 | + "execution_count": null, |
| 55 | + "metadata": {}, |
| 56 | + "outputs": [], |
| 57 | + "source": [ |
| 58 | + "def get_filesize(f: Path) -> str:\n", |
| 59 | + "    file_size = f.stat().st_size\n", |
| 60 | + "    for unit in ['', 'K', 'M', 'G', 'T', 'P']:\n", |
| 61 | + "        if file_size < 1024 or unit == 'P':\n", |
| 62 | + "            return f\"{file_size:3.1f}{unit}B\"\n", |
| 63 | + "        file_size /= 1024" |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "code", |
| 68 | + "execution_count": null, |
| 69 | + "metadata": {}, |
| 70 | + "outputs": [], |
| 71 | + "source": [ |
| 72 | + "%%time\n", |
| 73 | + "large_file = Path('large_file.csv')\n", |
| 74 | + "df.to_csv(large_file)\n", |
| 75 | + "print(f'CSV file size: {get_filesize(large_file)}')" |
| 76 | + ] |
| 77 | + }, |
| 78 | + { |
| 79 | + "cell_type": "markdown", |
| 80 | + "metadata": {}, |
| 81 | + "source": [ |
| 82 | + "Using gzip compression:" |
| 83 | + ] |
| 84 | + }, |
| 85 | + { |
| 86 | + "cell_type": "code", |
| 87 | + "execution_count": null, |
| 88 | + "metadata": {}, |
| 89 | + "outputs": [], |
| 90 | + "source": [ |
| 91 | + "gzip_file = Path('gzipped_file.csv.gz')" |
| 92 | + ] |
| 93 | + }, |
| 94 | + { |
| 95 | + "cell_type": "code", |
| 96 | + "execution_count": null, |
| 97 | + "metadata": {}, |
| 98 | + "outputs": [], |
| 99 | + "source": [ |
| 100 | + "%%time\n", |
| 101 | + "df.to_csv(gzip_file, compression='gzip')" |
| 102 | + ] |
| 103 | + }, |
| 104 | + { |
| 105 | + "cell_type": "code", |
| 106 | + "execution_count": null, |
| 107 | + "metadata": {}, |
| 108 | + "outputs": [], |
| 109 | + "source": [ |
| 110 | + "print(f'GZ CSV file size: {get_filesize(gzip_file)}')" |
| 111 | + ] |
| 112 | + }, |
| 113 | + { |
| 114 | + "cell_type": "markdown", |
| 115 | + "metadata": {}, |
| 116 | + "source": [ |
| 117 | + "The default option for `compression` is 'infer', which detects the compression type from the file extension. Therefore, we only need to supply the .gz extension and Pandas will gzip the file automatically. The same works for the .bz2 and .xz extensions below." |
| 118 | + ] |
| 119 | + }, |
| 120 | + { |
| 121 | + "cell_type": "code", |
| 122 | + "execution_count": null, |
| 123 | + "metadata": {}, |
| 124 | + "outputs": [], |
| 125 | + "source": [ |
| 126 | + "bzip_file = Path('bzipped_file.csv.bz2')" |
| 127 | + ] |
| 128 | + }, |
| 129 | + { |
| 130 | + "cell_type": "code", |
| 131 | + "execution_count": null, |
| 132 | + "metadata": {}, |
| 133 | + "outputs": [], |
| 134 | + "source": [ |
| 135 | + "%%time\n", |
| 136 | + "df.to_csv(bzip_file)" |
| 137 | + ] |
| 138 | + }, |
| 139 | + { |
| 140 | + "cell_type": "code", |
| 141 | + "execution_count": null, |
| 142 | + "metadata": {}, |
| 143 | + "outputs": [], |
| 144 | + "source": [ |
| 145 | + "print(f'BZ2 CSV file size: {get_filesize(bzip_file)}')" |
| 146 | + ] |
| 147 | + }, |
| 148 | + { |
| 149 | + "cell_type": "code", |
| 150 | + "execution_count": null, |
| 151 | + "metadata": {}, |
| 152 | + "outputs": [], |
| 153 | + "source": [ |
| 154 | + "xzip_file = Path('xzipped_file.csv.xz')" |
| 155 | + ] |
| 156 | + }, |
| 157 | + { |
| 158 | + "cell_type": "code", |
| 159 | + "execution_count": null, |
| 160 | + "metadata": {}, |
| 161 | + "outputs": [], |
| 162 | + "source": [ |
| 163 | + "%%time\n", |
| 164 | + "df.to_csv(xzip_file)" |
| 165 | + ] |
| 166 | + }, |
| 167 | + { |
| 168 | + "cell_type": "code", |
| 169 | + "execution_count": null, |
| 170 | + "metadata": {}, |
| 171 | + "outputs": [], |
| 172 | + "source": [ |
| 173 | + "print(f'XZ CSV file size: {get_filesize(xzip_file)}')" |
| 174 | + ] |
| 175 | + }, |
| 176 | + { |
| 177 | + "cell_type": "code", |
| 178 | + "execution_count": null, |
| 179 | + "metadata": {}, |
| 180 | + "outputs": [], |
| 181 | + "source": [ |
| 182 | + "parquet_file = Path('parquet_file.parquet')" |
| 183 | + ] |
| 184 | + }, |
| 185 | + { |
| 186 | + "cell_type": "code", |
| 187 | + "execution_count": null, |
| 188 | + "metadata": {}, |
| 189 | + "outputs": [], |
| 190 | + "source": [ |
| 191 | + "%%time\n", |
| 192 | + "df.to_parquet(parquet_file)" |
| 193 | + ] |
| 194 | + }, |
| 195 | + { |
| 196 | + "cell_type": "code", |
| 197 | + "execution_count": null, |
| 198 | + "metadata": {}, |
| 199 | + "outputs": [], |
| 200 | + "source": [ |
| 201 | + "print(f'Parquet file size: {get_filesize(parquet_file)}')" |
| 202 | + ] |
| 203 | + }, |
| 204 | + { |
| 205 | + "cell_type": "markdown", |
| 206 | + "metadata": {}, |
| 207 | + "source": [ |
| 208 | + "If you have any questions, comments, or requests, feel free to [contact me on LinkedIn](https://linkedin.com/in/dennisbakhuis)." |
| 209 | + ] |
| 210 | + }, |
| 211 | + { |
| 212 | + "cell_type": "code", |
| 213 | + "execution_count": null, |
| 214 | + "metadata": {}, |
| 215 | + "outputs": [], |
| 216 | + "source": [] |
| 217 | + }, |
| 218 | + { |
| 219 | + "cell_type": "code", |
| 220 | + "execution_count": null, |
| 221 | + "metadata": {}, |
| 222 | + "outputs": [], |
| 223 | + "source": [] |
| 224 | + }, |
| 225 | + { |
| 226 | + "cell_type": "code", |
| 227 | + "execution_count": null, |
| 228 | + "metadata": {}, |
| 229 | + "outputs": [], |
| 230 | + "source": [] |
| 231 | + }, |
| 232 | + { |
| 233 | + "cell_type": "code", |
| 234 | + "execution_count": null, |
| 235 | + "metadata": {}, |
| 236 | + "outputs": [], |
| 237 | + "source": [] |
| 238 | + } |
| 239 | + ], |
| 240 | + "metadata": { |
| 241 | + "kernelspec": { |
| 242 | + "display_name": "Python 3 (ipykernel)", |
| 243 | + "language": "python", |
| 244 | + "name": "python3" |
| 245 | + }, |
| 246 | + "language_info": { |
| 247 | + "codemirror_mode": { |
| 248 | + "name": "ipython", |
| 249 | + "version": 3 |
| 250 | + }, |
| 251 | + "file_extension": ".py", |
| 252 | + "mimetype": "text/x-python", |
| 253 | + "name": "python", |
| 254 | + "nbconvert_exporter": "python", |
| 255 | + "pygments_lexer": "ipython3", |
| 256 | + "version": "3.9.5" |
| 257 | + } |
| 258 | + }, |
| 259 | + "nbformat": 4, |
| 260 | + "nbformat_minor": 4 |
| 261 | +} |
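
The notebook above compares gzip, bz2, and xz on a CSV of random integers. As a rough, pandas-free sketch of the same comparison, the snippet below compresses an in-memory CSV text with Python's standard `gzip`, `bz2`, and `lzma` modules. The data here is a hypothetical, much smaller stand-in for the notebook's DataFrame, so the ratios it prints are only indicative and will differ from what the notebook measures on disk.

```python
import bz2
import gzip
import lzma
import random

# Hypothetical stand-in for the notebook's DataFrame: a small CSV text of
# random integers in [0, 1000), built in memory so the sketch runs quickly.
random.seed(42)
n_rows, n_cols = 2_000, 20
header = ",".join(str(c) for c in range(n_cols))
rows = [
    ",".join(str(random.randrange(1000)) for _ in range(n_cols))
    for _ in range(n_rows)
]
raw = ("\n".join([header, *rows]) + "\n").encode("utf-8")

# Compress the same bytes with each standard-library codec.
compressed = {
    "gzip": gzip.compress(raw),
    "bz2": bz2.compress(raw),
    "xz": lzma.compress(raw),  # lzma writes the .xz container by default
}

# Verify that every codec restores the original bytes exactly.
decompressors = {
    "gzip": gzip.decompress,
    "bz2": bz2.decompress,
    "xz": lzma.decompress,
}
roundtrip_ok = {
    name: decompressors[name](blob) == raw for name, blob in compressed.items()
}

print(f"raw: {len(raw)} bytes")
for name, blob in compressed.items():
    print(f"{name}: {len(blob)} bytes ({len(blob) / len(raw):.0%} of raw)")
```

Timing each `compress` call (for example with `time.perf_counter`) would give the speed side of the trade-off; it is left out here because timings depend heavily on the machine.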