How can I do stratified sampling on BigQuery?
For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.
With #standardSQL, let's define our table and some stats over it:
WITH table AS (
SELECT *, subreddit category
FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
SELECT *, SUM(c) OVER() total
FROM (
SELECT category, COUNT(*) c
FROM table
GROUP BY 1
HAVING c>1000000)
)
In this setup, subreddit will be our category, and we only keep subreddits with more than 1,000,000 comments.
So, if we want 1% of each category in our sample:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, category, c
FROM table a
JOIN table_stats b
USING(category)
WHERE RAND()< 1/100
)
GROUP BY 2
Or let's say we want ~80,000 samples - but chosen proportionally through all categories:
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, category, c
FROM table a
JOIN table_stats b
USING(category)
WHERE RAND()< 80000/total
)
GROUP BY 2
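To see why a single RAND() threshold gives a proportionate sample, here is a small Python sketch of the arithmetic behind the two queries above. The category names and counts are made up for illustration; only the `80000/total` threshold comes from the query.

```python
# Hypothetical per-category row counts (the `c` column of table_stats).
counts = {"AskReddit": 5_000_000, "politics": 2_000_000, "nba": 1_000_000}
total = sum(counts.values())  # plays the role of SUM(c) OVER() total


def expected_samples(counts, rate):
    """Expected rows kept per category when every row passes RAND() < rate."""
    return {cat: c * rate for cat, c in counts.items()}


target = 80_000
rate = target / total  # same threshold as 80000/total in the query
expected = expected_samples(counts, rate)
for cat, n in expected.items():
    # Every category keeps the same expected fraction of its rows.
    print(cat, round(n), f"{100 * n / counts[cat]:.2f}%")
```

Because the same rate applies to every row, each category's expected share of the sample matches its share of the table. The actual counts will wobble around these expectations from run to run.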
Now, if you want to get roughly the same number of samples from each group (let's say, 20,000):
SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
SELECT id, category, c
FROM table a
JOIN table_stats b
USING(category)
WHERE RAND()< 20000/c
)
GROUP BY 2
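The per-category threshold only hits the target on average. As a rough Python sketch (the sizes are hypothetical and scaled down so it runs quickly), each row is an independent coin flip, so the kept count per category follows a binomial distribution around the target:

```python
import math
import random


def bernoulli_sample_size(c, target, rng):
    """Count rows kept when each of c rows passes rng.random() < target / c."""
    p = min(1.0, target / c)
    return sum(rng.random() < p for _ in range(c))


def spread(c, target):
    """Standard deviation of the kept-row count (binomial, n=c, p=target/c)."""
    p = min(1.0, target / c)
    return math.sqrt(c * p * (1 - p))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
target = 200            # hypothetical, scaled down from 20,000
for c in (10_000, 50_000):  # hypothetical category sizes
    kept = bernoulli_sample_size(c, target, rng)
    print(c, kept, round(spread(c, target), 1))
```

If a category has fewer rows than the target, the threshold is at least 1 and every row is kept.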
If you want exactly 20,000 elements from each category:
SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
FROM table a
JOIN table_stats b
USING(category)
GROUP BY category
)
If you want exactly 2% of each group:
SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
FROM table a
JOIN table_stats b
USING(category)
GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2
If this last approach is what you want, you might notice the query failing when you actually try to get the data out. An early LIMIT close to the largest per-category sample size (2% of the biggest category here) makes sure we don't sort more data than needed:
SELECT sample.*
FROM (
SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
FROM table a
JOIN table_stats b
USING(category)
GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
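One way to pick the early LIMIT is to work out how many rows the `off<0.02*c` filter keeps for the largest category. The sketch below does that arithmetic in Python; the function names and category sizes are hypothetical, not taken from the dataset.

```python
def rows_kept(c, percent):
    """Rows whose 0-based offset satisfies off < percent/100 * c.

    That count is ceil(percent * c / 100); integer math avoids float rounding.
    """
    return -(-c * percent // 100)


def min_early_limit(counts, percent):
    """Smallest ARRAY_AGG ... LIMIT that never truncates a category's sample."""
    return max(rows_kept(c, percent) for c in counts.values())


counts = {"a": 5_200_000, "b": 1_250_025}  # hypothetical category sizes
print(min_early_limit(counts, 2))
```

Any LIMIT at or above this value leaves the result unchanged. A smaller one would silently cut the sample for the biggest categories.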
I think the simplest way to get a proportionate stratified sample is to order the data by the categories and do an "nth" sample of the data. For a 10% sample, you want every 10th row.
This looks like:
select t.*
from (select t.*,
row_number() over (order by category, rand()) as seqnum
from t
) t
where seqnum % 10 = 1;
Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
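As an illustration of why the "nth" approach stays proportionate, here is a small Python sketch of the same idea. The rows and category sizes are invented, and `every_nth_sample` is a hypothetical helper rather than anything BigQuery provides.

```python
import random
from collections import Counter


def every_nth_sample(rows, n, rng):
    """Order rows by (category, random key), keep 1-based positions with pos % n == 1.

    Assumes n >= 2, mirroring `seqnum % 10 = 1` in the query.
    """
    ordered = sorted(rows, key=lambda r: (r["category"], rng.random()))
    return [r for seqnum, r in enumerate(ordered, start=1) if seqnum % n == 1]


rng = random.Random(42)
sizes = {"a": 57, "b": 230, "c": 4}  # hypothetical category sizes
rows = [{"category": cat, "id": i} for cat, size in sizes.items() for i in range(size)]

sample = every_nth_sample(rows, 10, rng)
per_category = Counter(r["category"] for r in sample)
for cat, size in sizes.items():
    print(cat, size, per_category[cat])
```

Since each category occupies a contiguous run in the ordering, it contributes either the floor or the ceiling of its size divided by 10. That is also why a category with fewer than 10 rows can end up with zero rows in the sample.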
If you want equal sized samples, then order within each category and just take a fixed number:
select t.*
from (select t.*,
row_number() over (partition by category order by rand()) as seqnum
from t
) t
where seqnum <= 100;
Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
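The fixed-size variant can be sketched the same way in Python: shuffle each category and keep at most k rows. Again, the data and the `capped_sample` helper are hypothetical.

```python
import random
from collections import Counter, defaultdict


def capped_sample(rows, k, rng):
    """Keep up to k randomly chosen rows per category (all rows if fewer than k)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["category"]].append(r)
    sample = []
    for members in groups.values():
        rng.shuffle(members)        # random order within the category
        sample.extend(members[:k])  # like seqnum <= k
    return sample


rng = random.Random(7)
sizes = {"a": 30, "b": 250}  # hypothetical category sizes
rows = [{"category": cat, "id": i} for cat, size in sizes.items() for i in range(size)]

picked = Counter(r["category"] for r in capped_sample(rows, 100, rng))
print(dict(picked))
```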
Both of these methods are quite handy. They can work with multiple dimensions at the same time. The first has the particularly nice feature that it also works with numeric dimensions.
seqnum is a number in both cases. The only difference is that in the first case you are (trying to) take a fixed percentage of samples per category, whereas in the second you are taking (at most) a fixed (and equal) number of samples per category, right? row_number() over (order by income) would also work with the modulo approach.
Commented Aug 27, 2020 at 0:23