Public Datasets
Access and analyze a variety of public datasets hosted on Google Cloud Platform
Access and Analyze Data
Public Datasets on Google Cloud Platform make it easy to access and analyze data in the cloud. These datasets are hosted for free and are accessible with a variety of data warehouse and analytics software, from open-source Apache Spark to cutting-edge Google technologies such as Google BigQuery and Google Cloud Dataflow. From structured genomic or encyclopedic data to unstructured climate data, Public Datasets provide a playground for those new to big data and data analysis, and a powerful repository for skilled researchers. You can also integrate the datasets into your application to add valuable insights for your users. Whatever your use case, these datasets are freely available on GCP.
Google BigQuery Public Datasets
BigQuery hosts a variety of public datasets that can be analyzed using familiar SQL. Users can query this data directly in the BigQuery web UI or programmatically using the BigQuery REST API. These datasets are freely hosted and accessible to everyone. You can query up to 1 TB of this data per month for free. You pay only for the queries that you perform above this free quota, subject to query pricing details.
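As a rough illustration of the programmatic route, the sketch below builds (but does not send) a request for the `jobs.query` method of the BigQuery REST API using only the Python standard library. The table name `bigquery-public-data.usa_names.usa_1910_current`, its column names, the project ID, and the access token are assumptions that do not appear in the text above; substitute values from your own project and the dataset's documentation.

```python
import json
import urllib.request

# Assumption: table and column names are illustrative and not taken from this page.
QUERY = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_current`
GROUP BY name
ORDER BY total DESC
LIMIT 10
"""


def build_query_request(project_id: str, access_token: str, sql: str,
                        dry_run: bool = True) -> urllib.request.Request:
    """Build (but do not send) a jobs.query request for the BigQuery REST API."""
    url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = {
        "query": sql,
        "useLegacySql": False,  # interpret the query as standard SQL
        "dryRun": dry_run,      # validate and estimate bytes without running the query
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # "my-project" and "ACCESS_TOKEN" are placeholders, not real credentials.
    req = build_query_request("my-project", "ACCESS_TOKEN", QUERY)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the resulting object to `urllib.request.urlopen` with a valid OAuth 2.0 token would submit the query. Keeping `dry_run=True` is one way to check how many bytes a query would scan before it counts against the free quota.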
Google Genomics Public Datasets
Google collaborates with the genomics community to host select genomic data, like the 1000 Genomes Project, as a public resource. You can access these datasets through the Google Genomics API, the BigQuery web interface and open source examples.
Geo Imagery Datasets
Landsat and Sentinel satellite imagery datasets are available on Google Cloud Storage. You can use GCP to perform analysis and develop new products without needing to worry about the cost of storing the data or the time and cost required to download very large datasets.
In addition to these datasets hosted on Google Cloud Storage, a wide variety of standard Earth science raster datasets are also available in Earth Engine. Earth Engine provides a convenient web-based code editor designed to make developing complex geospatial workflows fast and easy.
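For the imagery hosted on Cloud Storage, publicly readable objects can generally be fetched over HTTPS at `https://storage.googleapis.com/<bucket>/<object>`. The helper below is a minimal sketch of building such a URL; the bucket and object names are hypothetical placeholders, since this page does not name the actual buckets.

```python
from urllib.parse import quote


def public_object_url(bucket: str, object_name: str) -> str:
    """Return the HTTPS URL for a publicly readable Cloud Storage object."""
    # Slashes in the object name are kept; other reserved characters are escaped.
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name, safe='/')}"


if __name__ == "__main__":
    # Hypothetical bucket and object names, for illustration only.
    print(public_object_url("example-public-imagery", "scenes/2016/scene 001.tif"))
```

Because the data stays in Cloud Storage, a job running on GCP can read individual objects this way instead of downloading a whole archive first.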
BigQuery Datasets
- GDELT Book Corpus
- A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).
- GitHub Data
- This public dataset contains GitHub activity data for more than 2.8 million open source GitHub repositories, more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files.
- USA Names
- A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879.
- USA Disease Surveillance
- A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
- Major League Baseball Data
- This public dataset includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016.
- Medicare Data
- This public dataset was created by the Centers for Medicare & Medicaid Services. The data summarizes the utilization and payments for procedures, services, and prescription drugs provided to Medicare beneficiaries.
- Open Images Data
- A dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
- NOAA GSOD Weather Data
- This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes global data obtained from the USAF Climatology Center. This dataset covers GSOD data between 1929 and 2016, collected from over 9000 stations.
- NOAA GHCN
- This public dataset was created by the National Oceanic and Atmospheric Administration (NOAA) and includes climate summaries from land surface stations across the globe that have been subjected to a common suite of quality assurance reviews. This dataset draws from more than 20 sources, including some data from every year since 1763.
- Hacker News
- A dataset that contains all stories and comments from Hacker News since its launch in 2006.
- NYC TLC Trips
- Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015.
Geo Imagery Datasets
- Landsat
- A satellite image dataset from the United States Geological Survey (USGS) that includes millions of multispectral images of the Earth's land surface, at resolutions between 15 and 60 meters per pixel, from 1982 through the present.
- Earth Engine datasets
- Earth Engine’s public data catalog includes a variety of standard Earth science raster datasets.
- Sentinel-2
- A satellite image dataset from the European Space Agency (ESA) that includes multispectral images of the Earth's land surface, with a resolution of 10–60 meters per pixel, from 2015 through the present.
Genomics Datasets
- 1000 Genomes
- This dataset comprises roughly 2,500 genomes from 25 populations around the world.
- Reference Genomes
- Reference genomes such as GRCh37, GRCh37lite, GRCh38, hg19, hs37d5, and b37.
- Illumina Platinum Genomes
- This dataset comprises the 17-member CEPH pedigree 1463.
- Personal Genome Project Data
- This dataset comprises roughly 180 Complete Genomics genomes.
- ICGC-TCGA DREAM Mutation Calling Challenge synthetic genomes
- This dataset comprises the three public synthetic tumor/normal pairs created for the ICGC-TCGA DREAM Mutation Calling Challenge.
- Simons Genome Diversity Project
- This dataset comprises 25 genomes from 13 diverse populations serving as the pilot project dataset for the Simons Genome Diversity Project.
- TCGA Cancer Genomics Data in the Cloud
- Open-access TCGA data including somatic mutation calls, clinical data, mRNA and miRNA expression, DNA methylation, and protein expression from 33 different tumor types.
Public Datasets Pricing
Google Cloud Public Datasets are freely accessible with a Google account. Charges may be incurred for large queries and certain use cases.
- BigQuery - Public Datasets hosted in BigQuery include up to 1 TB of free queries per month. Queries beyond 1 TB per month are subject to query pricing.
- Google Cloud Storage - Public Datasets hosted in Google Cloud Storage, like raster and Genomics data, are free to access. You pay only for GCP resources used to analyze the data, such as compute resources or additional storage you use for your own applications.
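The BigQuery free quota described above amounts to simple arithmetic: only the bytes scanned beyond 1 TB in a month are billed. The sketch below works through that calculation. It assumes 1 TB means 10^12 bytes and takes the per-TB price as a caller-supplied parameter, because this page states neither the exact unit nor a rate; check the query pricing page for both.

```python
TB = 10**12  # assumption: 1 TB read as 10^12 bytes; the pricing page defines the exact unit
FREE_BYTES_PER_MONTH = 1 * TB


def billable_bytes(bytes_used_this_month: int, query_bytes: int) -> int:
    """Bytes of a new query that fall above the monthly free quota."""
    remaining_free = max(FREE_BYTES_PER_MONTH - bytes_used_this_month, 0)
    return max(query_bytes - remaining_free, 0)


def query_cost(bytes_used_this_month: int, query_bytes: int,
               price_per_tb: float) -> float:
    """Estimated charge for a query; price_per_tb is a hypothetical input."""
    return billable_bytes(bytes_used_this_month, query_bytes) / TB * price_per_tb


if __name__ == "__main__":
    # 0.8 TB already scanned this month, new query scans 0.5 TB:
    # 0.2 TB is still free, so 0.3 TB is billable.
    print(billable_bytes(8 * 10**11, 5 * 10**11))
```

The bytes a query would scan can be estimated ahead of time (for example with a dry run) and passed in as `query_bytes`.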

