
Data Analysis Using Python: A Beginner’s Guide Featuring NYC Open Data¶

Part 1: Reading and Writing Files in Python¶

Mark Bauer

Table of Contents¶

Getting Started

  1. Reading In Data from a Local Folder
    • 1.1 Reading in a CSV file
    • 1.2 Reading in a JSON file
    • 1.3 Reading in an Excel file
    • 1.4 Reading in a Shapefile
    • 1.5 Reading in a GeoJSON file
  2. Reading In Data from the Web
    • 2.1 Unzipping and reading in data as CSV to a local folder
    • 2.2 Unzipping and reading in data as CSV from a local folder
    • 2.3 Unzipping and reading in data as CSV in memory
  3. Reading In Data from NYC Open Data
    • 3.1 Reading in data as CSV in static form
    • 3.2 Reading in data as JSON in static form
    • 3.3 Reading in Shapefile data
    • 3.4 Reading in data from Socrata Open Data API (SODA)
  4. Writing Out Data
    • 4.1 Writing to a CSV file
    • 4.2 Writing to an Excel (xlsx) file
    • 4.3 Writing to a JSON file
    • 4.4 Writing to a Shapefile
  5. Conclusion

Goal: In this notebook, we will review various ways to read (load) and write (save) data from NYC Open Data. Specifically, we will focus on reading our data into a pandas dataframe.

Main Library: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
import os
import urllib
import json
import requests
from io import BytesIO
from sodapy import Socrata
import zipfile
from zipfile import ZipFile
from os.path import basename
from openpyxl import Workbook
from pathlib import Path

Printing versions of Python modules and packages with watermark, an IPython magic extension.

Documentation for installing watermark: https://github.com/rasbt/watermark
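If watermark isn't already installed, it can typically be installed with pip (an assumption about your environment; adapt for conda if needed):

# install the watermark extension (run in a terminal, or prefix with ! in a notebook cell)
# pip install watermark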

In [2]:
%load_ext watermark

%watermark -v -p numpy,pandas,geopandas,matplotlib,json,requests,sodapy
Python implementation: CPython
Python version       : 3.8.13
IPython version      : 8.4.0

numpy     : 1.23.1
pandas    : 1.4.3
geopandas : 0.11.1
matplotlib: 3.5.2
json      : 2.0.9
requests  : 2.28.1
sodapy    : 2.1.1

Getting Started: Accessing the Building Footprints Dataset¶

I've provided sample data in the data folder, but you can follow these steps to access the full dataset yourself and follow along.

1. Search NYC Open Data in Google¶


2. Search "Building Footprints" in NYC Open Data search bar¶


3. Select "Building Footprints" Dataset¶


4. The Building Footprints Dataset Page¶


Dataset Link: https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh

Documentation/Metadata: https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_BuildingFootprints.md

Building Footprints Dataset Identification

Here are a few things to note about the data:

  • Purpose: This feature class is used by the NYC DOITT GIS group to maintain and distribute an accurate 'basemap' for NYC. The basemap provides the foundation upon which virtually all other geospatial data for New York is built.
  • Description: Building footprints represent the full perimeter outline of each building as viewed from directly above. Additional attribute information maintained for each feature includes: Building Identification Number (BIN); Borough, Block, and Lot information (BBL); ground elevation at building base; roof height above ground elevation; construction year; and feature type.
  • Source(s): Annually captured aerial imagery, NYC Research of DOB records, or other image resources.
  • Publication Dates: Data: 05/03/16; Last Update: Weekly; Metadata: 12/22/2016
  • Update Frequency: Features are updated daily by DoITT staff and a public release is available weekly on NYC Open Data. Every four years a citywide review is made of the building footprints and features are updated photogrammetrically.
  • Available Formats: File Geodatabase Feature Class as part of the Planimetrics geodatabase and individual shapefile on the NYC Open Data Portal
  • Use Limitations: Open Data policies and restrictions apply. See Terms of Use
  • Access Rights: Public
  • Links: https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh
  • Tags: Buildings, Building footprint, BIN, Structure

Source: https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_BuildingFootprints.md

1. Reading In Data from a Local Folder¶

Before we dive into retrieving data on NYC Open Data, let's practice reading data into a pandas dataframe from a local folder. If you'd like to see how I exported this sample data, visit the data-wrangling notebook.

In [3]:
# listing items in data folder
%ls data/
README.md                nta_shape.shx            sample-data.json
building-footprints.csv  sample-buildings.zip     sample-data.prj
nta_shape.cpg            sample-data.cpg          sample-data.shp
nta_shape.dbf            sample-data.csv          sample-data.shx
nta_shape.geojson        sample-data.dbf          sample-data.xlsx
nta_shape.prj            sample-data.geojson      unzipped-data/
nta_shape.shp            sample-data.gpkg

1.1 Reading in a CSV file¶

In [4]:
# read data as a dataframe
df = pd.read_csv('data/sample-data.csv')

# previewing first five rows in data
df.head()
Out[4]:
the_geom NAME BIN CNSTRCT_YR LSTMODDATE LSTSTATYPE DOITT_ID HEIGHTROOF FEAT_CODE GROUNDELEV SHAPE_AREA SHAPE_LEN BASE_BBL MPLUTO_BBL GEOMSOURCE
0 MULTIPOLYGON (((-73.87172426474349 40.65519420... NaN 3394834 2011.0 08/22/2017 12:00:00 AM Constructed 1250314 26.000000 2100.0 15.0 0 0 3.044521e+09 3.044521e+09 Other (Man
1 MULTIPOLYGON (((-73.86650099829305 40.74236058... NaN 4540159 2010.0 08/22/2017 12:00:00 AM Constructed 201366 28.000000 2100.0 37.0 0 0 4.018780e+09 4.018780e+09 Other (Man
2 MULTIPOLYGON (((-73.7940773567428 40.680625171... NaN 4260357 2010.0 08/10/2017 12:00:00 AM Constructed 1184712 29.000000 2100.0 20.0 0 0 4.120060e+09 4.120060e+09 Other (Man
3 MULTIPOLYGON (((-73.87805078807256 40.71475698... NaN 4540051 2010.0 08/22/2017 12:00:00 AM Constructed 1171655 28.330225 2100.0 112.0 0 0 4.030600e+09 4.030600e+09 Photogramm
4 MULTIPOLYGON (((-73.81520745135124 40.73053646... NaN 4545453 2012.0 08/22/2017 12:00:00 AM Constructed 1118502 16.640000 2100.0 74.0 0 0 4.066560e+09 4.066560e+09 Photogramm
In [5]:
rows, columns = df.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 15,498 rows and 15 columns.

Sanity check

We use the pandas .head() method to preview the first five rows of the dataframe.

We use the pandas .shape attribute to print the dimensions of the dataframe (i.e., the number of rows and columns).

We will use these two checks throughout the examples.
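As an aside (not part of the original notebook), the two checks could be bundled into a small helper function and reused instead of repeating the same lines:

# a minimal sketch of a reusable sanity-check helper
def preview(frame, n=5):
    """Print the dataframe's dimensions and return its first n rows."""
    rows, columns = frame.shape
    print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
    return frame.head(n)

# usage: preview(df)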

1.2 Reading in a JSON file¶

In [6]:
# read data as a dataframe
df = pd.read_json('data/sample-data.json')

# previewing first five rows in data
df.head()
Out[6]:
the_geom NAME BIN CNSTRCT_YR LSTMODDATE LSTSTATYPE DOITT_ID HEIGHTROOF FEAT_CODE GROUNDELEV SHAPE_AREA SHAPE_LEN BASE_BBL MPLUTO_BBL GEOMSOURCE
0 MULTIPOLYGON (((-73.87172426474349 40.65519420... None 3394834 2011 08/22/2017 12:00:00 AM Constructed 1250314 26.000000 2100.0 15.0 0 0 3.044521e+09 3.044521e+09 Other (Man
1 MULTIPOLYGON (((-73.86650099829305 40.74236058... None 4540159 2010 08/22/2017 12:00:00 AM Constructed 201366 28.000000 2100.0 37.0 0 0 4.018780e+09 4.018780e+09 Other (Man
2 MULTIPOLYGON (((-73.7940773567428 40.680625171... None 4260357 2010 08/10/2017 12:00:00 AM Constructed 1184712 29.000000 2100.0 20.0 0 0 4.120060e+09 4.120060e+09 Other (Man
3 MULTIPOLYGON (((-73.87805078807256 40.71475698... None 4540051 2010 08/22/2017 12:00:00 AM Constructed 1171655 28.330225 2100.0 112.0 0 0 4.030600e+09 4.030600e+09 Photogramm
4 MULTIPOLYGON (((-73.81520745135124 40.73053646... None 4545453 2012 08/22/2017 12:00:00 AM Constructed 1118502 16.640000 2100.0 74.0 0 0 4.066560e+09 4.066560e+09 Photogramm
In [7]:
rows, columns = df.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 15,498 rows and 15 columns.

1.3 Reading in an Excel file¶

In [8]:
# read data as a dataframe
df = pd.read_excel('data/sample-data.xlsx')

# previewing first five rows in data
df.head()
Out[8]:
the_geom NAME BIN CNSTRCT_YR LSTMODDATE LSTSTATYPE DOITT_ID HEIGHTROOF FEAT_CODE GROUNDELEV SHAPE_AREA SHAPE_LEN BASE_BBL MPLUTO_BBL GEOMSOURCE
0 MULTIPOLYGON (((-73.87172426474349 40.65519420... NaN 3394834 2011 08/22/2017 12:00:00 AM Constructed 1250314 26.000000 2100.0 15.0 0 0 3.044521e+09 3.044521e+09 Other (Man
1 MULTIPOLYGON (((-73.86650099829305 40.74236058... NaN 4540159 2010 08/22/2017 12:00:00 AM Constructed 201366 28.000000 2100.0 37.0 0 0 4.018780e+09 4.018780e+09 Other (Man
2 MULTIPOLYGON (((-73.7940773567428 40.680625171... NaN 4260357 2010 08/10/2017 12:00:00 AM Constructed 1184712 29.000000 2100.0 20.0 0 0 4.120060e+09 4.120060e+09 Other (Man
3 MULTIPOLYGON (((-73.87805078807256 40.71475698... NaN 4540051 2010 08/22/2017 12:00:00 AM Constructed 1171655 28.330225 2100.0 112.0 0 0 4.030600e+09 4.030600e+09 Photogramm
4 MULTIPOLYGON (((-73.81520745135124 40.73053646... NaN 4545453 2012 08/22/2017 12:00:00 AM Constructed 1118502 16.640000 2100.0 74.0 0 0 4.066560e+09 4.066560e+09 Photogramm
In [9]:
rows, columns = df.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 15,498 rows and 15 columns.

1.4 Reading in a Shapefile¶

In [10]:
# read data as a geodataframe
path = 'data/sample-data.shp'
gdf = gpd.read_file(path)

# previewing first five rows in data
gdf.head()
Out[10]:
base_bbl bin cnstrct_yr doitt_id feat_code geomsource groundelev heightroof date_lstmo time_lstmo lststatype mpluto_bbl name shape_area shape_len geometry
0 3044520924 3394834.0 2011.0 1250314.0 2100.0 Other (Man 15.0 26.000000 2017-08-22 00:00:00.000 Constructed 3044520924 None 0.0 0.0 POLYGON ((-73.87172 40.65519, -73.87179 40.655...
1 4018780115 4540159.0 2010.0 201366.0 2100.0 Other (Man 37.0 28.000000 2017-08-22 00:00:00.000 Constructed 4018780115 None 0.0 0.0 POLYGON ((-73.86650 40.74236, -73.86645 40.742...
2 4120060029 4260357.0 2010.0 1184712.0 2100.0 Other (Man 20.0 29.000000 2017-08-10 00:00:00.000 Constructed 4120060029 None 0.0 0.0 POLYGON ((-73.79408 40.68063, -73.79407 40.680...
3 4030600139 4540051.0 2010.0 1171655.0 2100.0 Photogramm 112.0 28.330225 2017-08-22 00:00:00.000 Constructed 4030600139 None 0.0 0.0 POLYGON ((-73.87805 40.71476, -73.87787 40.714...
4 4066560052 4545453.0 2012.0 1118502.0 2100.0 Photogramm 74.0 16.640000 2017-08-22 00:00:00.000 Constructed 4066560052 None 0.0 0.0 POLYGON ((-73.81521 40.73054, -73.81546 40.730...
In [11]:
rows, columns = gdf.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 15,498 rows and 16 columns.
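As a quick aside (not a cell in the original notebook), a geodataframe also carries the file's coordinate reference system, which is worth inspecting before doing any spatial work:

# inspecting the coordinate reference system read from the shapefile's .prj file
gdf.crs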

1.5 Reading in a GeoJSON file¶

In [12]:
# read data as a geodataframe
path = 'data/sample-data.geojson'
gdf = gpd.read_file(path)

# previewing first five rows in data
gdf.head()
Out[12]:
base_bbl bin cnstrct_yr doitt_id feat_code geomsource groundelev heightroof date_lstmo time_lstmo lststatype mpluto_bbl name shape_area shape_len geometry
0 3044520924 3394834.0 2011.0 1250314.0 2100.0 Other (Man 15.0 26.000000 2017-08-22 00:00:00 Constructed 3044520924 None 0.0 0.0 POLYGON ((-73.87172 40.65519, -73.87179 40.655...
1 4018780115 4540159.0 2010.0 201366.0 2100.0 Other (Man 37.0 28.000000 2017-08-22 00:00:00 Constructed 4018780115 None 0.0 0.0 POLYGON ((-73.86650 40.74236, -73.86645 40.742...
2 4120060029 4260357.0 2010.0 1184712.0 2100.0 Other (Man 20.0 29.000000 2017-08-10 00:00:00 Constructed 4120060029 None 0.0 0.0 POLYGON ((-73.79408 40.68063, -73.79407 40.680...
3 4030600139 4540051.0 2010.0 1171655.0 2100.0 Photogramm 112.0 28.330225 2017-08-22 00:00:00 Constructed 4030600139 None 0.0 0.0 POLYGON ((-73.87805 40.71476, -73.87787 40.714...
4 4066560052 4545453.0 2012.0 1118502.0 2100.0 Photogramm 74.0 16.640000 2017-08-22 00:00:00 Constructed 4066560052 None 0.0 0.0 POLYGON ((-73.81521 40.73054, -73.81546 40.730...
In [13]:
rows, columns = gdf.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 15,498 rows and 16 columns.

2. Reading in Data from the Web¶

Another popular dataset is NYC's PLUTO dataset. We will use this one because it comes in a zip file.

  • Description: Extensive land use and geographic data at the tax lot level in comma–separated values (CSV) file format. The PLUTO files contain more than seventy fields derived from data maintained by city agencies.
  • Dataset Link: https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page
  • Data Dictionary: https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf?v=20v1


2.1 Unzipping and reading in data as CSV to a local folder¶

We will retrieve the zip file, save and unzip it in our downloads folder, and read in the CSV.
Note: right-click the Download icon on the dataset page and copy the link address to use as the URL.

In [14]:
# download url of dataset
url = 'https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_20v1_csv.zip'

# a path to our downloads folder 
downloads_path = str(Path.home() / "Downloads")

# specify file name (the download is a zip archive; zipfile opens it regardless of the extension)
filename = 'PLUTO.gz'

# a path to our file from our downloads path
filepath = os.path.join(downloads_path, filename)

# retrieving data 
urllib.request.urlretrieve(url, filepath)
Out[14]:
('/Users/geribauer/Downloads/PLUTO.gz',
 <http.client.HTTPMessage at 0x16ca48250>)
In [15]:
# open the zip archive
items = zipfile.ZipFile(filepath)

# available files in the container
print(items.namelist())
['pluto_20v1.csv', 'PLUTODD20v1.pdf', 'PlutoReadme20v1.pdf']
In [16]:
# opening zipfile using 'with' keyword in read mode
with zipfile.ZipFile(filepath, 'r') as file:
    
    # extract all files inside the zip file
    file.extractall(downloads_path)
In [17]:
# a path to our file from our downloads path
file = 'pluto_20v1.csv'
filepath = os.path.join(downloads_path, file)
pluto_data = pd.read_csv(filepath, low_memory=False)

pluto_data.head()
Out[17]:
borough block lot cd ct2010 cb2010 schooldist council zipcode firecomp ... dcasdate zoningdate landmkdate basempdate masdate polidate edesigdate geom dcpedited notes
0 BK 834 46 307.0 106.0 2001.0 20.0 38.0 11220.0 L114 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
1 QN 4042 106 407.0 929.0 3000.0 25.0 19.0 11356.0 E297 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000007... NaN NaN
2 BK 4679 17 317.0 866.0 3002.0 18.0 41.0 11203.0 L174 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000006... NaN NaN
3 BK 7831 6 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
4 BK 7831 7 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN

5 rows × 99 columns

In [18]:
rows, columns = pluto_data.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 859,172 rows and 99 columns.

2.2 Unzipping and reading in data as CSV from a local folder¶

In [19]:
# note: I've already placed data into this zip file
filename = 'data/sample-buildings.zip'

# opening zip using 'with' keyword in read mode
with zipfile.ZipFile(filename, 'r') as file:

    # extracting all items in our zipfile
    file.extractall('data/unzipped-data')
In [20]:
# list files in this file path
%ls data/unzipped-data/
sample-data.csv
In [21]:
# read data as a dataframe
path = 'data/unzipped-data/sample-data.csv'
sample_buildings = pd.read_csv(path)

# previewing first five rows of data
sample_buildings.head()
Out[21]:
the_geom NAME BIN CNSTRCT_YR LSTMODDATE LSTSTATYPE DOITT_ID HEIGHTROOF FEAT_CODE GROUNDELEV SHAPE_AREA SHAPE_LEN BASE_BBL MPLUTO_BBL GEOMSOURCE
0 MULTIPOLYGON (((-73.87172426474349 40.65519420... NaN 3394834 2011.0 08/22/2017 12:00:00 AM Constructed 1250314 26.000000 2100.0 15.0 0 0 3.044521e+09 3.044521e+09 Other (Man
1 MULTIPOLYGON (((-73.86650099829305 40.74236058... NaN 4540159 2010.0 08/22/2017 12:00:00 AM Constructed 201366 28.000000 2100.0 37.0 0 0 4.018780e+09 4.018780e+09 Other (Man
2 MULTIPOLYGON (((-73.7940773567428 40.680625171... NaN 4260357 2010.0 08/10/2017 12:00:00 AM Constructed 1184712 29.000000 2100.0 20.0 0 0 4.120060e+09 4.120060e+09 Other (Man
3 MULTIPOLYGON (((-73.87805078807256 40.71475698... NaN 4540051 2010.0 08/22/2017 12:00:00 AM Constructed 1171655 28.330225 2100.0 112.0 0 0 4.030600e+09 4.030600e+09 Photogramm
4 MULTIPOLYGON (((-73.81520745135124 40.73053646... NaN 4545453 2012.0 08/22/2017 12:00:00 AM Constructed 1118502 16.640000 2100.0 74.0 0 0 4.066560e+09 4.066560e+09 Photogramm
In [22]:
rows, columns = sample_buildings.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 15,498 rows and 15 columns.
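As an aside, here is one way a zip file like data/sample-buildings.zip could have been created (a sketch, not a cell from the original notebook; it reuses the ZipFile and basename imports from the top of the notebook):

# writing a CSV into a new zip archive
with ZipFile('data/sample-buildings.zip', 'w') as zf:
    path = 'data/sample-data.csv'
    # arcname=basename(path) stores the file without its 'data/' folder prefix
    zf.write(path, arcname=basename(path))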

2.3 Unzipping and reading in data as CSV in memory¶

In [23]:
# our download link
url = 'https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_20v1_csv.zip'

# reading in our zipfile data in-memory
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))

# printing files in our zipfile
for item in zf.namelist():
    print("File in zip: {}".format(item))
File in zip: pluto_20v1.csv
File in zip: PLUTODD20v1.pdf
File in zip: PlutoReadme20v1.pdf
In [24]:
# read our csv data into a dataframe from our zipfile
file = 'pluto_20v1.csv'
pluto_data = pd.read_csv(zf.open(file), low_memory=False)

# previewing the first five rows of data
pluto_data.head()
Out[24]:
borough block lot cd ct2010 cb2010 schooldist council zipcode firecomp ... dcasdate zoningdate landmkdate basempdate masdate polidate edesigdate geom dcpedited notes
0 BK 834 46 307.0 106.0 2001.0 20.0 38.0 11220.0 L114 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
1 QN 4042 106 407.0 929.0 3000.0 25.0 19.0 11356.0 E297 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000007... NaN NaN
2 BK 4679 17 317.0 866.0 3002.0 18.0 41.0 11203.0 L174 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000006... NaN NaN
3 BK 7831 6 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN
4 BK 7831 7 318.0 676.0 1002.0 22.0 46.0 11234.0 L159 ... NaN NaN NaN NaN NaN NaN NaN 0106000020E61000000100000001030000000100000005... NaN NaN

5 rows × 99 columns

In [25]:
rows, columns = pluto_data.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 859,172 rows and 99 columns.

3. Reading in data from NYC Open Data¶

3.1 Reading in data as CSV in static form¶


Note:¶

The Building Footprints dataset identifier changes weekly, and so does the data API path. Click on the API Docs page and verify the current dataset identifier. If you're not working with the correct identifier, you will receive an HTTP error. Screenshots below:

Click on API Docs

Grab the updated dataset identifier

The dataset identifier is inserted into the API path below:
url = 'https://data.cityofnewyork.us/api/views/{DATASET_IDENTIFIER}/rows.csv?accessType=DOWNLOAD'
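For example, with an f-string (assuming the identifier at the time of writing, 'qb5r-6dgf', is still current):

# building the download URL from the dataset identifier
dataset_identifier = 'qb5r-6dgf'
url = f'https://data.cityofnewyork.us/api/views/{dataset_identifier}/rows.csv?accessType=DOWNLOAD'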

In [26]:
# reading in data as a url from NYC Open Data
url = 'https://data.cityofnewyork.us/api/views/qb5r-6dgf/rows.csv?accessType=DOWNLOAD'

# saving data as a pandas dataframe named 'df_csv'
df_csv = pd.read_csv(url)

# previewing the first five rows 
df_csv.head()
Out[26]:
the_geom NAME BIN CNSTRCT_YR LSTMODDATE LSTSTATYPE DOITT_ID HEIGHTROOF FEAT_CODE GROUNDELEV SHAPE_AREA SHAPE_LEN BASE_BBL MPLUTO_BBL GEOMSOURCE GLOBALID
0 MULTIPOLYGON (((-73.96664570466969 40.62599676... NaN 3170958 1925.0 08/22/2017 12:00:00 AM Constructed 96807 29.749853 2100 40.0 0 0 3065220021 3065220021 Photogramm {31298F86-3088-4F53-B3DB-71A9EFA6FA1F}
1 MULTIPOLYGON (((-74.16790202462265 40.63936048... NaN 5028452 1965.0 08/22/2017 12:00:00 AM Constructed 326368 22.630000 2100 39.0 0 0 5012640036 5012640036 Photogramm {F5F8CDA5-69E2-46F8-8F69-BA95C025B520}
2 MULTIPOLYGON (((-74.19510813278613 40.55610681... NaN 5078368 1970.0 08/22/2017 12:00:00 AM Constructed 746627 35.760000 2100 51.0 0 0 5060190091 5060190091 Photogramm {9F644794-F72C-4582-9E5E-B337E2B97068}
3 MULTIPOLYGON (((-73.96113466505085 40.57743931... NaN 3245111 1928.0 08/22/2017 12:00:00 AM Constructed 786626 37.500000 2100 6.0 0 0 3086910048 3086910048 Photogramm {F916B22D-E25B-44AE-9FA9-2A51191B9CDF}
4 MULTIPOLYGON (((-73.75421559146166 40.75591276... NaN 4161096 1950.0 08/22/2017 12:00:00 AM Constructed 746409 18.015113 2100 93.0 0 0 4075020005 4075020005 Photogramm {525F2C24-616B-4F29-98A3-8FEA5D4B1A7D}
In [27]:
rows, columns = df_csv.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 1,083,347 rows and 16 columns.

3.2 Reading in data as JSON in static form¶

I don't use this method often in my data wrangling process, but it is available and the steps are below. It's a bit of a manual process and best avoided if you can.


In [28]:
url = 'https://data.cityofnewyork.us/api/views/qb5r-6dgf/rows.json?accessType=DOWNLOAD'

# loads a json object as a python object
with urllib.request.urlopen(url) as url:
    data = json.loads(url.read().decode())
    
# identifying type of python object
type(data)    
Out[28]:
dict
In [29]:
# reviewing the dictionary's keys
data.keys()
Out[29]:
dict_keys(['meta', 'data'])
In [30]:
# looking at the type of the data key
type(data['data'])
Out[30]:
list
In [31]:
# previewing the length of items in the data list
len(data['data'])
Out[31]:
1083347
In [32]:
# previewing the first row of our data
data['data'][0]
Out[32]:
['row-bgk2-jf54-8tu8',
 '00000000-0000-0000-0336-65AE1353DFEE',
 0,
 1713361986,
 None,
 1713361986,
 None,
 '{ }',
 'MULTIPOLYGON (((-73.96664570466969 40.62599676998366, -73.96684846176461 40.625977490862574, -73.96685938726297 40.62604419372411, -73.96661621040211 40.62606731716107, -73.96660638332114 40.626007324369795, -73.96664680403327 40.626003480977275, -73.96664570466969 40.62599676998366)))',
 None,
 '3170958',
 '1925',
 '2017-08-22T00:00:00',
 'Constructed',
 '96807',
 '29.74985318',
 '2100',
 '40',
 '0.0',
 '0.0',
 '3065220021',
 '3065220021',
 'Photogramm',
 '{31298F86-3088-4F53-B3DB-71A9EFA6FA1F}']

Retrieving column names from the metadata.

In [33]:
data['meta'].keys()
Out[33]:
dict_keys(['view'])
In [34]:
keys = data['meta']['view'].keys()

for key in keys:
    print(key)
id
name
assetType
averageRating
createdAt
displayType
downloadCount
hideFromCatalog
hideFromDataJson
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
clientContext
columns
grants
metadata
owner
query
rights
tableAuthor
flags
In [35]:
# locating our columns (i.e. field names) and saving as a new variable called 'cols'
cols = data['meta']['view']['columns']
print('length of columns: {}'.format(len(cols)))

# previewing first five
cols[:5]
length of columns: 24
Out[35]:
[{'id': -1,
  'name': 'sid',
  'dataTypeName': 'meta_data',
  'fieldName': ':sid',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'id',
  'dataTypeName': 'meta_data',
  'fieldName': ':id',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'position',
  'dataTypeName': 'meta_data',
  'fieldName': ':position',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_at',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_at',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_meta',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_meta',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']}]
In [36]:
col_names = []

for col in cols:
    print(col['fieldName'])
    col_names.append(col['fieldName'])
:sid
:id
:position
:created_at
:created_meta
:updated_at
:updated_meta
:meta
the_geom
name
bin
cnstrct_yr
lstmoddate
lststatype
doitt_id
heightroof
feat_code
groundelev
shape_area
shape_len
base_bbl
mpluto_bbl
geomsource
globalid
In [37]:
# constructing a dataframe
df_json = pd.DataFrame(data['data'], columns=col_names)

df_json.head()
Out[37]:
:sid :id :position :created_at :created_meta :updated_at :updated_meta :meta the_geom name ... doitt_id heightroof feat_code groundelev shape_area shape_len base_bbl mpluto_bbl geomsource globalid
0 row-bgk2-jf54-8tu8 00000000-0000-0000-0336-65AE1353DFEE 0 1713361986 None 1713361986 None { } MULTIPOLYGON (((-73.96664570466969 40.62599676... None ... 96807 29.74985318 2100 40 0.0 0.0 3065220021 3065220021 Photogramm {31298F86-3088-4F53-B3DB-71A9EFA6FA1F}
1 row-qkid-3vfc.5kxn 00000000-0000-0000-310F-DC93101B7FB2 0 1713361986 None 1713361986 None { } MULTIPOLYGON (((-74.16790202462265 40.63936048... None ... 326368 22.63 2100 39 0.0 0.0 5012640036 5012640036 Photogramm {F5F8CDA5-69E2-46F8-8F69-BA95C025B520}
2 row-cw7t-bpix_8nr6 00000000-0000-0000-27DC-95188F2F91C9 0 1713361986 None 1713361986 None { } MULTIPOLYGON (((-74.19510813278613 40.55610681... None ... 746627 35.76 2100 51 0.0 0.0 5060190091 5060190091 Photogramm {9F644794-F72C-4582-9E5E-B337E2B97068}
3 row-xa8y-grfu.cue8 00000000-0000-0000-713B-3335FE6E4BFC 0 1713361986 None 1713361986 None { } MULTIPOLYGON (((-73.96113466505085 40.57743931... None ... 786626 37.5 2100 6 0.0 0.0 3086910048 3086910048 Photogramm {F916B22D-E25B-44AE-9FA9-2A51191B9CDF}
4 row-d98p.ghx9-25dk 00000000-0000-0000-1E0D-153CE9C006DD 0 1713361986 None 1713361986 None { } MULTIPOLYGON (((-73.75421559146166 40.75591276... None ... 746409 18.01511294 2100 93 0.0 0.0 4075020005 4075020005 Photogramm {525F2C24-616B-4F29-98A3-8FEA5D4B1A7D}

5 rows × 24 columns

In [38]:
# removing columns that start with ":"
df_json = df_json.loc[:, ~df_json.columns.str.startswith(':')]

df_json.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1083347 entries, 0 to 1083346
Data columns (total 16 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   the_geom    1083347 non-null  object
 1   name        2254 non-null     object
 2   bin         1083347 non-null  object
 3   cnstrct_yr  1072987 non-null  object
 4   lstmoddate  1083347 non-null  object
 5   lststatype  1083025 non-null  object
 6   doitt_id    1083347 non-null  object
 7   heightroof  1080700 non-null  object
 8   feat_code   1083347 non-null  object
 9   groundelev  1082783 non-null  object
 10  shape_area  1083347 non-null  object
 11  shape_len   1083347 non-null  object
 12  base_bbl    1083347 non-null  object
 13  mpluto_bbl  1083347 non-null  object
 14  geomsource  1083029 non-null  object
 15  globalid    1083347 non-null  object
dtypes: object(16)
memory usage: 132.2+ MB
In [39]:
# printing dimensions of data
rows, columns = df_json.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 1,083,347 rows and 16 columns.
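Note that every column in df_json comes back with the object dtype (see the .info() output above). If you need numeric columns, they can be converted afterwards; a small sketch using two of the columns listed above:

# converting selected object columns to numeric (invalid values become NaN)
df_json['heightroof'] = pd.to_numeric(df_json['heightroof'], errors='coerce')
df_json['groundelev'] = pd.to_numeric(df_json['groundelev'], errors='coerce')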

3.3 Reading in Shapefile data¶


In [40]:
url = 'https://data.cityofnewyork.us/api/geospatial/tqmj-j8zm?method=export&format=Shapefile'

# reading in data as a geodataframe
gdf = gpd.read_file(url)

# printing the first five rows
gdf.head()
Out[40]:
boro_code boro_name shape_area shape_leng geometry
0 5.0 Staten Island 1.623621e+09 325917.353950 MULTIPOLYGON (((-74.05051 40.56642, -74.05047 ...
1 1.0 Manhattan 6.365205e+08 357713.308162 MULTIPOLYGON (((-74.01093 40.68449, -74.01193 ...
2 2.0 Bronx 1.187175e+09 463180.579449 MULTIPOLYGON (((-73.89681 40.79581, -73.89694 ...
3 3.0 Brooklyn 1.934138e+09 728146.574928 MULTIPOLYGON (((-73.86327 40.58388, -73.86381 ...
4 4.0 Queens 3.041419e+09 888199.731385 MULTIPOLYGON (((-73.82645 40.59053, -73.82642 ...
In [41]:
rows, columns = gdf.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 5 rows and 5 columns.
In [42]:
# plotting boros
gdf.plot()
Out[42]:
<AxesSubplot:>

3.4 Reading in data from Socrata Open Data API (SODA)¶

Note: If you haven't signed up for an app token, requests are subject to throttling and there might be a 1,000-row limit. Of course, verify the current limits on the SODA documentation page.
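If you do have an app token, it can be passed as the second argument when constructing the client (a sketch; 'YOUR_APP_TOKEN' is a placeholder, not a real token):

# constructing a client with an app token instead of None (placeholder shown)
# client = Socrata('data.cityofnewyork.us', 'YOUR_APP_TOKEN')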


In [43]:
# nyc open data domain
socrata_domain = 'data.cityofnewyork.us' 

# building footprints dataset identifier
socrata_dataset_identifier = 'qb5r-6dgf' 

# The main class that interacts with the SODA API.     
client = Socrata(socrata_domain, None)

client.__dict__
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
Out[43]:
{'domain': 'data.cityofnewyork.us',
 'session': <requests.sessions.Session at 0x1712fd0a0>,
 'uri_prefix': 'https://',
 'timeout': 10}

Source: https://github.com/xmunoz/sodapy/blob/master/examples/basic_queries.ipynb

In [44]:
print(
    "Domain: {domain:}\
    \nSession: {session:}\
    \nURI Prefix: {uri_prefix:}".format(**client.__dict__)
)
Domain: data.cityofnewyork.us    
Session: <requests.sessions.Session object at 0x1712fd0a0>    
URI Prefix: https://

We are setting the limit to 1,000 rows (only a small sample of the dataset, which has more than one million rows).

In [45]:
# retrieving data as a list of dictionaries
results = client.get(socrata_dataset_identifier, limit=1000)

# creating a dataframe from our results
df_api = pd.DataFrame.from_dict(results)

# closing the client session
client.close()

# printing first five rows of data
df_api.head()
Out[45]:
the_geom bin cnstrct_yr lstmoddate lststatype doitt_id heightroof feat_code groundelev shape_area shape_len base_bbl mpluto_bbl geomsource globalid
0 {'type': 'MultiPolygon', 'coordinates': [[[[-7... 3170958 1925 2017-08-22T00:00:00.000 Constructed 96807 29.74985318 2100 40 0.0 0.0 3065220021 3065220021 Photogramm {31298F86-3088-4F53-B3DB-71A9EFA6FA1F}
1 {'type': 'MultiPolygon', 'coordinates': [[[[-7... 5028452 1965 2017-08-22T00:00:00.000 Constructed 326368 22.63 2100 39 0.0 0.0 5012640036 5012640036 Photogramm {F5F8CDA5-69E2-46F8-8F69-BA95C025B520}
2 {'type': 'MultiPolygon', 'coordinates': [[[[-7... 5078368 1970 2017-08-22T00:00:00.000 Constructed 746627 35.76 2100 51 0.0 0.0 5060190091 5060190091 Photogramm {9F644794-F72C-4582-9E5E-B337E2B97068}
3 {'type': 'MultiPolygon', 'coordinates': [[[[-7... 3245111 1928 2017-08-22T00:00:00.000 Constructed 786626 37.5 2100 6 0.0 0.0 3086910048 3086910048 Photogramm {F916B22D-E25B-44AE-9FA9-2A51191B9CDF}
4 {'type': 'MultiPolygon', 'coordinates': [[[[-7... 4161096 1950 2017-08-22T00:00:00.000 Constructed 746409 18.01511294 2100 93 0.0 0.0 4075020005 4075020005 Photogramm {525F2C24-616B-4F29-98A3-8FEA5D4B1A7D}
In [46]:
rows, columns = df_api.shape
print('This dataset has {:,} rows and {:,} columns.'.format(rows, columns))
This dataset has 1,000 rows and 15 columns.
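To retrieve more than 1,000 rows, one option (a sketch, not run here, and assuming the client has not yet been closed) is to page through the dataset with the limit and offset parameters:

# fetching the first 5,000 rows in pages of 1,000 (the numbers are illustrative)
results = []
for offset in range(0, 5000, 1000):
    results.extend(client.get(socrata_dataset_identifier, limit=1000, offset=offset))
df_pages = pd.DataFrame.from_records(results)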

Useful resources:

  • API Docs: https://dev.socrata.com/foundry/data.cityofnewyork.us/i62d-kjv8
  • Sign up for app token: https://data.cityofnewyork.us/profile/edit/developer_settings
  • Python client for the Socrata Open Data API: https://github.com/xmunoz/sodapy
  • Examples: https://github.com/xmunoz/sodapy/tree/master/examples

4. Writing Out Data¶

In [47]:
# read data as a dataframe
df = pd.read_csv('data/sample-data.csv')

# previewing first five rows in data
df.head()
Out[47]:
the_geom NAME BIN CNSTRCT_YR LSTMODDATE LSTSTATYPE DOITT_ID HEIGHTROOF FEAT_CODE GROUNDELEV SHAPE_AREA SHAPE_LEN BASE_BBL MPLUTO_BBL GEOMSOURCE GLOBALID
0 MULTIPOLYGON (((-73.87172426474349 40.65519420... NaN 3394834 2011.0 08/22/2017 12:00:00 AM Constructed 1250314 26.000000 2100 15.0 0 0 3044520924 3044520924 Other (Man {C045C815-79DB-4644-AD9D-C34AC03D1AB4}
1 MULTIPOLYGON (((-73.86650099829305 40.74236058... NaN 4540159 2010.0 08/22/2017 12:00:00 AM Constructed 201366 28.000000 2100 37.0 0 0 4018780115 4018780115 Other (Man {FDF673E7-FF92-4A7A-AF6D-C49D77343C47}
2 MULTIPOLYGON (((-73.87805078807256 40.71475698... NaN 4540051 2010.0 08/22/2017 12:00:00 AM Constructed 1171655 28.330225 2100 112.0 0 0 4030600139 4030600139 Photogramm {788E5D72-46C1-443F-8BC9-6B97F329BFED}
3 MULTIPOLYGON (((-73.81520745135124 40.73053646... NaN 4545453 2012.0 08/22/2017 12:00:00 AM Constructed 1118502 16.640000 2100 74.0 0 0 4066560052 4066560052 Photogramm {789A5A51-5B12-46DC-AE85-B06820A3225E}
4 MULTIPOLYGON (((-73.84769179857282 40.87911947... NaN 2118998 2012.0 08/22/2017 12:00:00 AM Constructed 1254551 33.000000 2100 154.0 0 0 2047220003 2047220003 Other (Man {BB58FD7B-CC22-4896-901D-F8BAFF4AC129}

4.1 Writing to a CSV file¶

In [48]:
# writing the dataframe to a csv file
df.to_csv('data/output.csv', index=False)

# listing items in data folder
%ls data/
README.md                nta_shape.shx            sample-data.gpkg
building-footprints.csv  output.csv               sample-data.json
nta_shape.cpg            sample-buildings.zip     sample-data.prj
nta_shape.dbf            sample-data.cpg          sample-data.shp
nta_shape.geojson        sample-data.csv          sample-data.shx
nta_shape.prj            sample-data.dbf          sample-data.xlsx
nta_shape.shp            sample-data.geojson      unzipped-data/

4.2 Writing to an Excel (xlsx) file¶

In [49]:
# writing the dataframe to an excel file
df.to_excel('data/output.xlsx', index=False)

# listing items in data folder
%ls data/
README.md                output.csv               sample-data.json
building-footprints.csv  output.xlsx              sample-data.prj
nta_shape.cpg            sample-buildings.zip     sample-data.shp
nta_shape.dbf            sample-data.cpg          sample-data.shx
nta_shape.geojson        sample-data.csv          sample-data.xlsx
nta_shape.prj            sample-data.dbf          unzipped-data/
nta_shape.shp            sample-data.geojson
nta_shape.shx            sample-data.gpkg

4.3 Writing to a JSON file¶

In [50]:
# writing the dataframe to a json file
df.to_json('data/output.json')

# listing items in data folder
%ls data/
README.md                output.csv               sample-data.gpkg
building-footprints.csv  output.json              sample-data.json
nta_shape.cpg            output.xlsx              sample-data.prj
nta_shape.dbf            sample-buildings.zip     sample-data.shp
nta_shape.geojson        sample-data.cpg          sample-data.shx
nta_shape.prj            sample-data.csv          sample-data.xlsx
nta_shape.shp            sample-data.dbf          unzipped-data/
nta_shape.shx            sample-data.geojson

4.4 Writing to a Shapefile¶

In [51]:
# read data as a geodataframe
gdf = gpd.read_file('data/sample-data.shp')

# previewing first five rows in data
gdf.head()
Out[51]:
base_bbl bin cnstrct_yr doitt_id feat_code geomsource groundelev heightroof date_lstmo time_lstmo lststatype mpluto_bbl name shape_area shape_len geometry
0 3044520924 3394834.0 2011.0 1250314.0 2100.0 Other (Man 15.0 26.000000 2017-08-22 00:00:00.000 Constructed 3044520924 None 0.0 0.0 POLYGON ((-73.87172 40.65519, -73.87179 40.655...
1 4018780115 4540159.0 2010.0 201366.0 2100.0 Other (Man 37.0 28.000000 2017-08-22 00:00:00.000 Constructed 4018780115 None 0.0 0.0 POLYGON ((-73.86650 40.74236, -73.86645 40.742...
2 4120060029 4260357.0 2010.0 1184712.0 2100.0 Other (Man 20.0 29.000000 2017-08-10 00:00:00.000 Constructed 4120060029 None 0.0 0.0 POLYGON ((-73.79408 40.68063, -73.79407 40.680...
3 4030600139 4540051.0 2010.0 1171655.0 2100.0 Photogramm 112.0 28.330225 2017-08-22 00:00:00.000 Constructed 4030600139 None 0.0 0.0 POLYGON ((-73.87805 40.71476, -73.87787 40.714...
4 4066560052 4545453.0 2012.0 1118502.0 2100.0 Photogramm 74.0 16.640000 2017-08-22 00:00:00.000 Constructed 4066560052 None 0.0 0.0 POLYGON ((-73.81521 40.73054, -73.81546 40.730...
In [52]:
# writing the geodataframe to a shapefile
gdf.to_file('data/output.shp')

# listing items in data folder
%ls data/
README.md                output.dbf               sample-data.geojson
building-footprints.csv  output.json              sample-data.gpkg
nta_shape.cpg            output.prj               sample-data.json
nta_shape.dbf            output.shp               sample-data.prj
nta_shape.geojson        output.shx               sample-data.shp
nta_shape.prj            output.xlsx              sample-data.shx
nta_shape.shp            sample-buildings.zip     sample-data.xlsx
nta_shape.shx            sample-data.cpg          unzipped-data/
output.cpg               sample-data.csv
output.csv               sample-data.dbf
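As an aside (not shown in the original notebook), GeoPandas can also write other geospatial formats such as GeoJSON and GeoPackage, which is presumably how data/sample-data.geojson and data/sample-data.gpkg were produced; a sketch:

# writing the geodataframe to GeoJSON and GeoPackage files
gdf.to_file('data/output.geojson', driver='GeoJSON')
gdf.to_file('data/output.gpkg', driver='GPKG')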

5. Conclusion¶

In this notebook, we reviewed various ways to read (load) and write (save) data from NYC Open Data. Specifically, we focused on reading our data into a pandas dataframe. We also went over common file formats that you might encounter: CSV, JSON, shapefiles, and zip files. In Part 2, we will focus on basic data inspection and wrangling techniques in the data analysis workflow.
