Customers always ask us whether Kerf supports rolling up times into “time bars” or “time buckets.” The answer is yes, Kerf supports time bars, and our time bars do a lot more than you might expect. Because not everybody is familiar with time bars, in this post we’ll give a basic overview of what they are and how they work.
Motivation
Time-series data always contains a timestamp column, and this column usually does more than mere record-keeping. With traditional data, you might track when a user registers for your e-commerce website, but that time is incidental to the purpose of the site. A distinguishing quality of time-series data is that the time column matters: changing the order of the users table makes no difference, but changing the times on a column of stock prices makes a big difference.
What this means is that you’re not really interested in a pure ordered table of events. The table is only a compacted depiction of what happened. If the table were really accurate, there would be empty space between the rows, proportional to the length of time between events. This is impractical, so we timestamp the events, and the distance between them is implied.
┌─────┬───────────────────────┐
│price│time │
├─────┼───────────────────────┤
│ 20.0│2016.11.11T00:00:11.000│
│ 24.0│2016.11.11T00:00:13.000│ //+02s difference
│ 23.0│2016.11.11T00:00:19.000│ //+06s difference
│ 23.0│2016.11.11T00:00:37.000│ //+18s difference
│ 25.0│2016.11.11T00:01:31.000│ //+36s difference
└─────┴───────────────────────┘
Though ultimately we work with tables, mentally we have to think of them as timelines. Timelines have a lot of weird properties tables don’t. Our ordered table can only have one row in the first position, one row in the second, and so on. A timeline, in comparison, can have 132 events in the first hour, none in the second, 15 in the third, etc. The arrival of events is irregular. Because this is tricky to deal with, we want to turn a timeline back into an ordered table, that is, we want to make an irregular sequence of events regular. This is what time bars are for.
select bars(5s,time) from t
┌───────────────────────┐
│time │
├───────────────────────┤
│2016.11.11T00:00:10.000│
│2016.11.11T00:00:10.000│ //repeat
│2016.11.11T00:00:15.000│
│2016.11.11T00:00:35.000│ //gap
│2016.11.11T00:01:30.000│
└───────────────────────┘
Time bars are used to normalize time-series. In Kerf, the bars function floors every timestamp to the nearest multiple of the given width: five seconds in the example above. By design, this output contains a lot of repeats. The point of the repeats is that they are keys you can group on. By running aggregation functions over the grouped rows, you can find the average price over five-second intervals, or the sum of the lots traded during those times. The aggregation reduces what happened during each window into a uniform output.
select avg(price) from t group by bars(5s, time)
┌───────────────────────┬─────┐
│time │price│
├───────────────────────┼─────┤
│2016.11.11T00:00:10.000│ 22.0│
│2016.11.11T00:00:15.000│ 23.0│
│2016.11.11T00:00:35.000│ 23.0│
│2016.11.11T00:01:30.000│ 25.0│
└───────────────────────┴─────┘
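Outside Kerf, the same floor-and-group logic can be sketched in a few lines of plain Python. This is a toy illustration of the technique, not Kerf’s implementation; the data is the five-row example table above, with times reduced to seconds.

```python
from collections import defaultdict

# (price, seconds) pairs from the example table above
rows = [(20.0, 11), (24.0, 13), (23.0, 19), (23.0, 37), (25.0, 91)]

width = 5  # bar width in seconds
groups = defaultdict(list)
for price, t in rows:
    groups[t - t % width].append(price)  # floor each timestamp to its bucket

# average price per bucket, mirroring: select avg(price) from t group by bars(5s, time)
bars = {bucket: sum(ps) / len(ps) for bucket, ps in sorted(groups.items())}
print(bars)  # {10: 22.0, 15: 23.0, 35: 23.0, 90: 25.0}
```

The repeated bucket keys (11s and 13s both floor to 10) are exactly what makes the grouping work.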
Aggregation is lossy. We’re condensing many rows into one, so we’re losing data. We’ve also rounded off the times, so we’re losing data there, too. The part that concerns us here is that we’re losing accuracy in the output.
Losing accuracy is a bad thing, but it’s sometimes worth it if you get something in return. The first advantage of lossiness is that it can make data comprehensible to people: looking at quarterly performance is more revealing than looking at pages of raw ticks. This is a form of filtering, specifically downsampling, and downsampling filters are lossy methods for finding signal in large amounts of noisy data.
The second advantage of lossiness is that sometimes you need to shed data more than you need to conserve accuracy, and so you can bar your data to reduce what you have to store. This is great when the cost of disk matters more than some extra accuracy. It’s less great when insufficient technology forces you to use time bars because it otherwise can’t handle the volume of data. Kerf is designed to always handle the full volume of ticks without bars.
Absolute Bars
The standard way to create time bars is using what we’ll call absolute bars. Set to one-hour increments, an absolute time bar will roll up all the events from today at 7:00-7:59AM into a 7:00AM bucket, all the events from 8:00-8:59AM into an 8:00AM bucket, and so on. Tomorrow if events happen at 7:00AM, those would go in a separate bucket; future 7:00AM events never go into today’s 7:00AM bucket. That means that an hour bucket needs to keep track of the day, the month, and the year, even though it’s only grouping by hour. The rule is that every unit greater than the unit you’re grouping on is kept and unchanged. Units less than the one you’re grouping on will be truncated to zero. It’s easier to explain this with a picture.
┌─────┬───────────────────────┐
│price│time │
├─────┼───────────────────────┤
│ 10│2016.06.22T06:55:00.000│
│ 12│2016.06.22T07:01:00.000│
│ 11│2016.06.22T07:04:00.000│
│ 11│2016.06.22T07:22:00.000│
│ 12│2016.06.22T07:43:00.000│
│ 12│2016.06.22T09:04:00.000│
│ 13│2016.06.23T07:33:00.000│ //next day
└─────┴───────────────────────┘
┌─────┬───────────────────────┬───────────────────────┐
│price│time │timebars │
├─────┼───────────────────────┼───────────────────────┤
│ 10│2016.06.22T06:55:00.000│2016.06.22T06:00:00.000│
│ 12│2016.06.22T07:01:00.000│2016.06.22T07:00:00.000│
│ 11│2016.06.22T07:04:00.000│2016.06.22T07:00:00.000│
│ 11│2016.06.22T07:22:00.000│2016.06.22T07:00:00.000│
│ 12│2016.06.22T07:43:00.000│2016.06.22T07:00:00.000│
│ 12│2016.06.22T09:04:00.000│2016.06.22T09:00:00.000│
│ 13│2016.06.23T07:33:00.000│2016.06.23T07:00:00.000│
└─────┴───────────────────────┴───────────────────────┘
Here we’ve updated the table to show the time bars. In practice no one does this, since you can group on the result of the bars function without storing it, and that’s quicker if you’re already familiar with bars.
select avg price from t group by bars(1h,time)
┌───────────────────────┬─────┐
│time │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│ 10.0│
│2016.06.22T07:00:00.000│ 11.5│
│2016.06.22T09:00:00.000│ 12.0│
│2016.06.23T07:00:00.000│ 13.0│
└───────────────────────┴─────┘
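The keep-larger-units, zero-smaller-units rule corresponds to plain datetime truncation. A Python analogue (an illustration of the rule, not Kerf internals) might look like:

```python
from datetime import datetime

def hour_bar(ts: datetime) -> datetime:
    # Keep year/month/day/hour; zero out everything smaller.
    return ts.replace(minute=0, second=0, microsecond=0)

t1 = datetime(2016, 6, 22, 7, 43)
t2 = datetime(2016, 6, 23, 7, 33)  # next day: lands in a different bucket

print(hour_bar(t1))  # 2016-06-22 07:00:00
print(hour_bar(t2))  # 2016-06-23 07:00:00
```

Because the date is kept intact, tomorrow’s 7AM events never collapse into today’s 7AM bucket.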
Relative times can be expressed succinctly in Kerf. In all our examples, a relative time is given as the first argument to the bars function. Some common widths to group on:
1y yearly
3m quarterly
1m monthly
7d weekly
1d daily
1h hourly
5i five-minute bars ('i' since 'm' is month)
1i one-minute bars
1s one-second bars
1000n microseconds (via nanoseconds)
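For the fixed-length units (seconds through days), a relative time amounts to a width in seconds, and flooring is a single modulus. Here is a small Python sketch of that idea; floor_to and WIDTH_SECONDS are our own hypothetical helpers, not part of Kerf, and calendar units like months and years would need real calendar arithmetic rather than a modulus.

```python
# Widths in seconds for the fixed-length units ('i' is minute, as in Kerf).
# Months and years vary in length, so they are deliberately excluded.
WIDTH_SECONDS = {"s": 1, "i": 60, "h": 3600, "d": 86400}

def floor_to(epoch_seconds: int, count: int, unit: str) -> int:
    # Floor a timestamp (in seconds) to a multiple of count * unit.
    width = count * WIDTH_SECONDS[unit]
    return epoch_seconds - epoch_seconds % width

print(floor_to(1290, 5, "i"))  # 5-minute bars: 00:21:30 -> 00:20:00, i.e. 1200
print(floor_to(37, 5, "s"))    # 5-second bars: 37 -> 35
```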
Modular “Wall Clock” Bars
Many time-series exhibit some form of periodicity. Home electricity usage increases during the summer months. E-commerce websites sell more goods during the holidays. Stock trading activity decreases during lunch. A good way to isolate this trend is to stack the hours from different days on top of each other. We can call these modular bars.
The first step to generating modular bars is to think of 7AM as a time on the wall clock that recurs every day. Grouping this way collects the same hour from different days together, which lets you see how the hours of the day perform against each other over time.
┌─────┬───────────────────────┐
│price│time │
├─────┼───────────────────────┤
│ 101│2016.06.22T06:30:00.000│
│ 102│2016.06.22T12:30:00.000│
│ 103│2016.06.22T18:30:00.000│
│ 102│2016.06.23T00:30:00.000│
│ 102│2016.06.23T06:30:00.000│
│ 101│2016.06.23T12:30:00.000│
└─────┴───────────────────────┘
select time['hour'] as hour from t
┌────┐
│hour│
├────┤
│ 6│
│ 12│
│ 18│
│ 0│
│ 6│
│ 12│
└────┘
select time['day'] as day, time['hour'] as hour from t
┌───┬────┐
│day│hour│
├───┼────┤
│ 22│ 6│
│ 22│ 12│
│ 22│ 18│
│ 23│ 0│
│ 23│ 6│
│ 23│ 12│
└───┴────┘
In Kerf, indexing into a time or a time array with special keywords like ‘hour’ returns the attribute in question. With this technique you can extract
'date' //2016.06.22
'time' //10:11:00.000
'year' //2016
'month' //6
'day' //...
'hour'
'minute'
'second'
'millisecond'
'microsecond'
'nanosecond'
'week'
Note that this “drops” every time attribute larger or smaller than the attribute in question, which is how it differs from the absolute bars method from earlier: different days may produce the same output for ‘hour’. These are keys we can group on.
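Python’s datetime exposes the same attributes as properties; a rough analogue of this kind of extraction (not Kerf’s API) looks like:

```python
from datetime import datetime

ts = datetime(2016, 6, 22, 6, 30)
print(ts.hour)    # 6           ~ time['hour']
print(ts.day)     # 22          ~ time['day']
print(ts.date())  # 2016-06-22  ~ time['date']
```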
select avg(price) from t group by time['hour']
┌────┬─────┐
│time│price│
├────┼─────┤
│ 6│101.5│
│ 12│101.5│
│ 18│103.0│
│ 0│102.0│
└────┴─────┘
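The same stacking of hours from different days can be sketched in Python by grouping on the extracted hour. Again this is a toy analogue of the Kerf query, using the six-row example table from above.

```python
from collections import defaultdict
from datetime import datetime

rows = [
    (101, datetime(2016, 6, 22, 6, 30)),
    (102, datetime(2016, 6, 22, 12, 30)),
    (103, datetime(2016, 6, 22, 18, 30)),
    (102, datetime(2016, 6, 23, 0, 30)),
    (102, datetime(2016, 6, 23, 6, 30)),
    (101, datetime(2016, 6, 23, 12, 30)),
]

groups = defaultdict(list)
for price, t in rows:
    groups[t.hour].append(price)  # hours from different days stack together

avgs = {h: sum(ps) / len(ps) for h, ps in groups.items()}
print(avgs)  # {6: 101.5, 12: 101.5, 18: 103.0, 0: 102.0}
```

Swapping the key for a tuple like (t.day, t.hour) gives the compound grouping shown in the next query.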
To separate by day and hour, but not by month or year, add another element to the list of group by arguments.
select avg(price) from t group by time['day'], time['hour']
┌────┬─────┬─────┐
│time│time1│price│
├────┼─────┼─────┤
│ 22│ 6│101.0│
│ 22│ 12│102.0│
│ 22│ 18│103.0│
│ 23│ 0│102.0│
│ 23│ 6│102.0│
│ 23│ 12│101.0│
└────┴─────┴─────┘
In this limited case, the result is the same as averaging over the absolute bars.
select avg price from t group by bars(1h, time)
┌───────────────────────┬─────┐
│time │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│101.0│
│2016.06.22T12:00:00.000│102.0│
│2016.06.22T18:00:00.000│103.0│
│2016.06.23T00:00:00.000│102.0│
│2016.06.23T06:00:00.000│102.0│
│2016.06.23T12:00:00.000│101.0│
└───────────────────────┴─────┘
To contact us about Kerf, mail kevin@kerfsoftware.com