How can I do stratified sampling on BigQuery?

For example, we want a 10% proportionate stratified sample that uses category_id to define the strata. Some of our tables have up to 11,000 distinct category_id values.

2 Answers

With #standardSQL, let's define our table and some stats over it:

WITH table AS (
  SELECT *, subreddit category
  FROM `fh-bigquery.reddit_comments.2018_09` a
), table_stats AS (
  SELECT *, SUM(c) OVER() total 
  FROM (
    SELECT category, COUNT(*) c 
    FROM table
    GROUP BY 1 
    HAVING c>1000000)
)

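In this setup:

  • subreddit will be our category
  • we only want subreddits with more than 1,000,000 comments
  • c is the number of rows in each category, and total is the sum of c over the categories we kept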

So, if we want 1% of each category in our sample:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 1/100 
)
GROUP BY 2

Or let's say we want about 80,000 samples in total, drawn proportionally across all categories:

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 80000/total
)
GROUP BY 2

Now, if you want roughly the same number of samples from each group (let's say, 20,000):

SELECT COUNT(*) samples, category, ROUND(100*COUNT(*)/MAX(c),2) percentage
FROM (
  SELECT id, category, c  
  FROM table a
  JOIN table_stats b
  USING(category)
  WHERE RAND()< 20000/c
)
GROUP BY 2
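
The three filters above differ only in the keep probability applied to each row: a fixed rate, n/total, or n/c. As a rough illustration outside BigQuery, the plain-Python sketch below simulates them on made-up data. The category names, the sizes, and the smaller targets (8,000 and 2,000 instead of the 80,000 and 20,000 used above) are assumptions chosen only to keep the run small.

```python
import random
from collections import Counter


def bernoulli_sample(rows, keep_probability, rng):
    """Keep each (row_id, category) row independently, like WHERE RAND() < p.

    keep_probability maps a category to p, so one filter covers a fixed
    rate, n/total, and n/c.
    """
    return [row for row in rows if rng.random() < keep_probability(row[1])]


# Hypothetical strata sizes standing in for the per-category count c.
sizes = {"a": 50_000, "b": 20_000, "c": 10_000}
rows = [(i, cat) for cat, c in sizes.items() for i in range(c)]
total = len(rows)
rng = random.Random(42)  # fixed seed so the run is repeatable


def count_by_category(sample):
    return Counter(cat for _, cat in sample)


# RAND() < 1/100: about 1% of every category.
fixed_rate = count_by_category(bernoulli_sample(rows, lambda cat: 1 / 100, rng))
# RAND() < 8000/total: about 8,000 rows overall, split proportionally.
proportional = count_by_category(bernoulli_sample(rows, lambda cat: 8_000 / total, rng))
# RAND() < 2000/c: about 2,000 rows from every category.
equal_size = count_by_category(bernoulli_sample(rows, lambda cat: 2_000 / sizes[cat], rng))

print(fixed_rate, proportional, equal_size, sep="\n")
```

Because every row is kept or dropped independently, the counts only land near their targets; they vary from run to run, which is why the later queries switch to sorting when an exact count is required.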

If you want exactly 20,000 elements from each category:

SELECT ARRAY_LENGTH(cat_samples) samples, category, ROUND(100*ARRAY_LENGTH(cat_samples)/c,2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 20000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
)

If you want exactly 2% of each group:

SELECT COUNT(*) samples, sample.category, ROUND(100*COUNT(*)/ANY_VALUE(c),2) percentage
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND()) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c
GROUP BY 2
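
To make the offset filter concrete, here is a minimal plain-Python sketch of the same idea: shuffle each group, then keep the positions whose offset is below fraction * c. It is a simulation on hypothetical data (the category names and sizes are assumptions), not BigQuery code.

```python
import random
from collections import Counter, defaultdict


def exact_fraction_sample(rows, fraction, rng):
    """Shuffle each category, then keep positions with offset < fraction * c."""
    groups = defaultdict(list)
    for row in rows:  # row = (row_id, category)
        groups[row[1]].append(row)
    sample = []
    for members in groups.values():
        rng.shuffle(members)  # plays the role of ORDER BY RAND()
        c = len(members)
        # Offsets start at 0, as with UNNEST(...) WITH OFFSET.
        sample.extend(m for off, m in enumerate(members) if off < fraction * c)
    return sample


# Hypothetical group sizes, chosen so that 2% of c is not a whole number.
sizes = {"a": 1010, "b": 255, "c": 55}
rows = [(i, cat) for cat, c in sizes.items() for i in range(c)]

sample = exact_fraction_sample(rows, 0.02, random.Random(0))
counts = Counter(cat for _, cat in sample)
print(dict(counts))
```

Unlike the RAND() < p filters, the number of rows per group here does not depend on the random draws; only which rows are picked does. When fraction * c is not a whole number, the count rounds up, because offset 0 is always kept.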

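If this last approach is what you want, you might notice the query failing when you select the actual rows instead of just counting them. An early LIMIT inside ARRAY_AGG makes sure we don't sort and keep more data than needed; it has to be at least as large as the biggest per-group sample (2% of the largest group here):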

SELECT sample.*
FROM (
  SELECT ARRAY_AGG(a ORDER BY RAND() LIMIT 105000) cat_samples, category, ANY_VALUE(c) c
  FROM table a
  JOIN table_stats b
  USING(category)
  GROUP BY category
), UNNEST(cat_samples) sample WITH OFFSET off
WHERE off<0.02*c

I think the simplest way to get a proportionate stratified sample is to order the data by category and take an "nth" sample of the data. For a 10% sample, you want every 10th row.

This looks like:

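select t.*
from (select t.*,
             -- number all rows, ordered by category and randomly within each category
             row_number() over (order by category, rand()) as seqnum
      from t
     ) t
where mod(seqnum, 10) = 1;  -- keep every 10th row (BigQuery standard SQL uses MOD(), not %)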

Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
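
A small plain-Python sketch of the nth-sample idea shows how a small category can be skipped. It is an illustration on hypothetical data (the category names and sizes are assumptions), not BigQuery code.

```python
import random
from collections import Counter


def nth_sample(rows, n, rng):
    """Order by (category, random tie-breaker) and keep rows 1, n+1, 2n+1, ..."""
    ordered = sorted(rows, key=lambda row: (row[1], rng.random()))
    return [row for seqnum, row in enumerate(ordered, start=1) if seqnum % n == 1]


# Hypothetical categories; "b" has fewer than 10 rows.
sizes = {"a": 25, "b": 3, "c": 12}
rows = [(i, cat) for cat, c in sizes.items() for i in range(c)]

sample = nth_sample(rows, 10, random.Random(0))
counts = Counter(cat for _, cat in sample)
print(dict(counts))
```

With these sizes the kept positions are 1, 11, 21 and 31. The first three fall in "a" (rows 1 to 25) and the last in "c" (rows 29 to 40), so "b" (rows 26 to 28) never appears, whatever the random order within each category is.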

If you want equal sized samples, then order within each category and just take a fixed number:

select t.*
from (select t.*,
             row_number() over (partition by category order by rand()) as seqnum
      from t
     ) t
where seqnum <= 100;

Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
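
The behaviour described in that note can be checked with a short plain-Python sketch: shuffle within each category and keep at most k rows. The data is hypothetical (category names and sizes are assumptions), and the code only simulates the logic; it is not BigQuery code.

```python
import random
from collections import Counter, defaultdict


def capped_sample(rows, k, rng):
    """Keep up to k randomly chosen rows from each category."""
    groups = defaultdict(list)
    for row in rows:  # row = (row_id, category)
        groups[row[1]].append(row)
    sample = []
    for members in groups.values():
        rng.shuffle(members)        # random order within the category
        sample.extend(members[:k])  # same effect as seqnum <= k
    return sample


# Hypothetical categories: one larger than k, one smaller.
sizes = {"a": 250, "b": 40}
rows = [(i, cat) for cat, c in sizes.items() for i in range(c)]

sample = capped_sample(rows, 100, random.Random(0))
counts = Counter(cat for _, cat in sample)
print(dict(counts))
```

Category "a" is cut down to 100 rows, while "b" contributes all 40 of its rows.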

Both of these methods are quite handy. They can work with multiple dimensions at the same time. The first has the particularly nice feature that it also works with numeric dimensions.

  • QQ - Why do you say that "The first has a particularly nice feature that it can also work with numeric dimensions as well."? seqnum is a number in both cases. The only diff is that in one case you are (trying to) take a fixed percentage of samples per category, whereas in the 2nd one you are taking (at most) a fixed (and equal) number of samples per category, right?
    – Josh
    Commented Aug 26, 2020 at 20:26
  • @Josh . . . What I mean is that an nth sample will work if you want to stratify by a numeric column; for instance, row_number() over (order by income) would also work with the modulo approach. Commented Aug 27, 2020 at 0:23
