Atlas Data Structure

Kerf introduces a new data structure called an “Atlas” that we think is a helpful thing to have inside of a programming language. We’ve found it very useful to have database Tables inside of a programming language, and the Atlas concept is a natural extension of that. Be careful about jumping to conclusions, though, since an Atlas does a few special things that make it unlike what you might expect. I’ll give a preview of the major points here:

  1. Automatic indexing
  2. Responds to SQL
  3. First-class language type
  4. Columnar implementation

We didn’t spend much time talking about Atlases originally, because interest in NoSQL-style computation seemed to be on the wane. Recently, however, we’ve had a lot of requests for those sorts of applications, and so a detailed description seems in order. (If you’ve come across any interesting problems like that in your work, tell us about them.)

If a Table is a way of including a SQL table inside of the language, then an Atlas is a way of including a NoSQL store: none of the rows have to have the same columns. In other words, an Atlas is a schemaless table, or simply a list of JSON objects, with keys in the objects pointing to values. The Atlas is a generalization of the Table where the rows don’t have to be uniform. In one sense, an Atlas is a sparse Table. You could also think of a Table as a vectorized Atlas.

 

In Kerf we use the term “Map” to refer to what JSON calls an Object {"a":1}. So the term Atlas to refer to a List of Maps (with an Index) was an obvious choice. We use the special convenience syntax {{ and }} to create a table, and the related {[ and ]} convenience syntax to create an atlas. Both of these came about because they fit in well with JSON, and Kerf is a JSON (& SQL) superset.
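
To make the distinction concrete, here is a small sketch. The Table literal follows the {{ }} form used later in this piece; the Atlas literal is a hypothetical illustration of the {[ ]} form described above (the exact literal syntax may differ from what ships in Kerf):

// a Table: every row has the same columns
t: {{sym: ["IBM","AAPL","MSFT"], price: 20.0 24.0 23.0}}

// an Atlas: a list of maps, and the rows need not share keys
// (hypothetical sketch; consult the Kerf manual for the exact literal form)
a: {[{sym:"IBM", price:20.0}, {sym:"AAPL", note:"late fill"}, {qty:100}]}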

1. The first thing that makes an Atlas special is that every key is indexed invisibly and by default. This means that no matter what key you query on, your query will be optimized. Atlases let you forget about indexing entirely. In a typical NoSQL store, if you want optimized queries on a key you have to issue the equivalent of a CREATE INDEX command, wait for the index to be created, incur future performance penalties, and so on. With Atlases you don’t have to fool around with any of that. The keys are already indexed, there’s no additional performance penalty, and adding a never-before-seen key doesn’t impact performance. Effectively, you can just forget about indices entirely, even for nested keys.
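
For instance, continuing the hypothetical Atlas a sketched above, both of the following queries are index-assisted with no CREATE INDEX step, even though note and qty appear in only some of the rows (the query syntax mirrors the SQL examples later in this piece):

select * from a where sym = "AAPL"
select * from a where qty > 50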

 

2. The second thing that makes Atlases interesting is that you can use SQL on them and it will just work. Typical NoSQL stores force you to learn a proprietary query language. It’s a lot easier to be able to use the SQL that you already know, and then to have the output be what you would expect. Even though there’s no specification in SQL for how a NoSQL table should behave, most programmers have an intuitive sense of what should come back. A lot of crazy corner cases have to be handled to accommodate this method of thinking, but because that’s all taken care of on our end (the internals), you don’t have to worry about it at all.
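
As a sketch of what that looks like, here is ordinary SQL run against the hypothetical Atlas a from above. The query follows the select/group by form used throughout this piece; how the rows lacking a price key show up in the result is exactly the kind of "intuitive output" corner case described above, handled in the internals:

select avg(price) from a group by sym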

 

3. The third thing that makes Atlases special is that the data type is integrated with the language. An Atlas is a first-class data type, just like an int or an array. This means that from inside the language, you can create Atlases, query them, manipulate the results, and so on. If you’re querying an old-fashioned database, you need to slurp your results in through the tiny straw of an API. But in Kerf, the language and the database are the same, so the code you write works directly on the data. There’s no traversal between a language layer and a database layer. Because the two layers are integrated, you can perform “out-of-core” computations without putting special thought into them. It also means that there’s no performance penalty.
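
A small sketch of what that integration looks like in practice. The variable names are illustrative, and pulling a column out of a result set with ['name'] is an assumption on my part, mirroring the map-indexing examples later in this piece:

r: select price, sym from a where price > 21   // results are ordinary language values
total: sum(r['price'])                         // keep computing on them directly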

 

4. The fourth thing that makes Atlases special is that they are completely columnar, in order to optimize performance. The index is columnar, the keys are columnar, the values are columnar, everything is columnar. This helps with performance and also with serializing the object.

Because a few existing solutions for schemaless data don’t really perform as advertised, we’ve come across several companies who are suffering in silence with irregular data problems they don’t know how to solve. There is a tool that solves these problems, called an Atlas, and we package it in Kerf. If you are faced with a problem with a lot of columns, or ragged or irregular columns, or sparse data, or more than one table mixed together, or have just been disappointed in NoSQL in general, you may want to look into Kerf. Contact us for any reason at founders@kerfsoftware.com.

Time Bars

Customers always ask us whether Kerf supports rolling up times into “time bars” or “time buckets.” The answer is yes, Kerf supports time bars, and our time bars do a lot more than you might expect. Because not everybody is familiar with what time bars do, in this post we’ll give a basic overview of what time bars are and how they work.

Motivation

Time-series data always contains a timestamp column, and this column usually does more than mere record-keeping. With traditional data, you might track when a user registers for your e-commerce website, but the time when the user registers is irrelevant to the purpose of the site. A distinguishing quality of time-series data is that the time column matters. Changing the order of the users table doesn’t make a difference, but changing the times on a column of stock prices makes a big difference. Time matters here.

What this means is that you’re not really interested in a pure ordered table of events. The table is only a compacted depiction of what happened. If the table were really accurate, there would be empty space between the rows, proportional to the length of time between events. This is impractical, so we timestamp the events, and the distance between them is implied.



┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│ 20.0│2016.11.11T00:00:11.000│
│ 24.0│2016.11.11T00:00:13.000│ //+02s difference
│ 23.0│2016.11.11T00:00:19.000│ //+06s difference
│ 23.0│2016.11.11T00:00:37.000│ //+18s difference
│ 25.0│2016.11.11T00:01:31.000│ //+54s difference
└─────┴───────────────────────┘

Though ultimately we work with tables, mentally we have to think of them as timelines. Timelines have a lot of weird properties tables don’t. Our ordered table can only have one row in the first position, one row in the second, and so on. A timeline, in comparison, can have 132 events in the first hour, none in the second, 15 in the third, etc. The arrival of events is irregular. Because this is tricky to deal with, we want to turn a timeline back into an ordered table, that is, we want to make an irregular sequence of events regular. This is what time bars are for.


select bars(5s,time) from t

┌───────────────────────┐
│time                   │
├───────────────────────┤
│2016.11.11T00:00:10.000│
│2016.11.11T00:00:10.000│ //repeat
│2016.11.11T00:00:15.000│
│2016.11.11T00:00:35.000│ //gap
│2016.11.11T00:01:30.000│
└───────────────────────┘

Time bars are used to normalize time-series. In Kerf, the bars function floors each timestamp down to the nearest multiple of the interval you give it, five seconds in the example above. By design, this output contains a lot of repeats. The point of the repeats is that they are keys that you can group on. By performing aggregation functions on groups of rows, you can find the average price over five-second intervals, or the sum of the lots that were traded during those times. The aggregation reduces what happened during each window into a uniform output.

select avg(price) from t group by bars(5s, time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.11.11T00:00:10.000│ 22.0│
│2016.11.11T00:00:15.000│ 23.0│
│2016.11.11T00:00:35.000│ 23.0│
│2016.11.11T00:01:30.000│ 25.0│
└───────────────────────┴─────┘

Aggregation is lossy. We’re condensing many rows into one, so we’re losing data. We’ve also rounded off the times, so we’re losing data there, too. The part that concerns us here is that we’re losing accuracy in the output.

Losing accuracy is a bad thing, but it’s sometimes worth it if you get something in return. The first advantage of lossiness is that it can make data comprehensible to people: looking at quarterly performance is more revealing than looking at pages of raw ticks. Time bars act as a filter, specifically a downsampling filter, and downsampling filters are lossy methods for finding signal in lots of noisy data.

The second advantage of lossiness is that sometimes you need to delete data more than you need to conserve accuracy, and so you can bar your data to reduce what you have to store. This is great when the cost of disk matters more than some extra accuracy. It’s less great when insufficient technology forces you to use time bars because it can’t otherwise handle the volume of data. Kerf is designed to always handle the full volume of ticks without bars.

Absolute Bars

The standard way to create time bars is using what we’ll call absolute bars. Set to one-hour increments, an absolute time bar will roll up all the events from today at 7:00-7:59AM into a 7:00AM bucket, all the events from 8:00-8:59AM into an 8:00AM bucket, and so on. Tomorrow if events happen at 7:00AM, those would go in a separate bucket; future 7:00AM events never go into today’s 7:00AM bucket. That means that an hour bucket needs to keep track of the day, the month, and the year, even though it’s only grouping by hour. The rule is that every unit greater than the unit you’re grouping on is kept and unchanged. Units less than the one you’re grouping on will be truncated to zero. It’s easier to explain this with a picture.



┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│   10│2016.06.22T06:55:00.000│
│   12│2016.06.22T07:01:00.000│
│   11│2016.06.22T07:04:00.000│
│   11│2016.06.22T07:22:00.000│
│   12│2016.06.22T07:43:00.000│
│   12│2016.06.22T09:04:00.000│
│   13│2016.06.23T07:33:00.000│ //next day
└─────┴───────────────────────┘



┌─────┬───────────────────────┬───────────────────────┐
│price│time                   │timebars               │
├─────┼───────────────────────┼───────────────────────┤
│   10│2016.06.22T06:55:00.000│2016.06.22T06:00:00.000│
│   12│2016.06.22T07:01:00.000│2016.06.22T07:00:00.000│
│   11│2016.06.22T07:04:00.000│2016.06.22T07:00:00.000│
│   11│2016.06.22T07:22:00.000│2016.06.22T07:00:00.000│
│   12│2016.06.22T07:43:00.000│2016.06.22T07:00:00.000│
│   12│2016.06.22T09:04:00.000│2016.06.22T09:00:00.000│
│   13│2016.06.23T07:33:00.000│2016.06.23T07:00:00.000│
└─────┴───────────────────────┴───────────────────────┘

Here we’ve updated the table to show the time bars. In practice no one does this, since you can group on the result of the bars function without storing it, and that’s quicker if you’re already familiar with bars.

select avg price from t group by bars(1h,time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│ 10.0│
│2016.06.22T07:00:00.000│ 11.5│
│2016.06.22T09:00:00.000│ 12.0│
│2016.06.23T07:00:00.000│ 13.0│
└───────────────────────┴─────┘

Relative times can be expressed succinctly in Kerf. In all our examples, relative times are given as the first argument to the bars function. Some common times to group on are listed below, with a short example after the list:



1y     yearly
3m     quarterly
1m     monthly
7d     weekly
1d     daily
1h     hourly
5i     five-minute bars ('i' since 'm' is month)
1i     one-minute bars
1s     one-second bars
1000n  microseconds (via nanoseconds)
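
For example, to roll the table from earlier up into daily bars, you would use the 1d span with the same query shape as before:

select avg(price) from t group by bars(1d, time)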

Modular “Wall Clock” Bars

Many time-series exhibit some form of periodicity. Home electricity usage increases during the summer months. E-commerce websites sell more goods during the holidays. Stock trading activity decreases during lunch. A good way to isolate this trend is to stack the hours from different days on top of each other. We can call these modular bars.

The first step to generating modular bars is to consider 7AM as a time on the wall that recurs every day. Using this method you would group hours from different days together. This lets you see how the hours in the day perform against each other over time.


┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│  101│2016.06.22T06:30:00.000│
│  102│2016.06.22T12:30:00.000│
│  103│2016.06.22T18:30:00.000│
│  102│2016.06.23T00:30:00.000│
│  102│2016.06.23T06:30:00.000│
│  101│2016.06.23T12:30:00.000│
└─────┴───────────────────────┘





select time['hour'] as hour from t

┌────┐
│hour│
├────┤
│   6│
│  12│
│  18│
│   0│
│   6│
│  12│
└────┘




select time['day'] as day, time['hour'] as hour from t

┌───┬────┐
│day│hour│
├───┼────┤
│ 22│   6│
│ 22│  12│
│ 22│  18│
│ 23│   0│
│ 23│   6│
│ 23│  12│
└───┴────┘

In Kerf, indexing into a time or a time array with special keywords like ‘hour’ will return the attribute in question. With this technique you can extract:


'date'     //2016.06.22
'time'     //10:11:00.000
'year'     //2016
'month'    //6
'day'      //...
'hour'
'minute'
'second'
'millisecond'
'microsecond'
'nanosecond'
'week'

Note that this “drops” any time attribute larger or smaller than the attribute in question, so it differs from the absolute bars method from earlier: different days may produce the same output for ‘hour’. These are keys we can group on.


select avg(price) from t group by time['hour']

┌────┬─────┐
│time│price│
├────┼─────┤
│   6│101.5│
│  12│101.5│
│  18│103.0│
│   0│102.0│
└────┴─────┘

To separate by day and hour, but not year, add another element to the list of group by arguments.


select avg(price) from t group by time['day'], time['hour']

┌────┬─────┬─────┐
│time│time1│price│
├────┼─────┼─────┤
│  22│    6│101.0│
│  22│   12│102.0│
│  22│   18│103.0│
│  23│    0│102.0│
│  23│    6│102.0│
│  23│   12│101.0│
└────┴─────┴─────┘

This is the same, in this limited case, as the average on the absolute bars.


select avg price from t group by bars(1h, time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│101.0│
│2016.06.22T12:00:00.000│102.0│
│2016.06.22T18:00:00.000│103.0│
│2016.06.23T00:00:00.000│102.0│
│2016.06.23T06:00:00.000│102.0│
│2016.06.23T12:00:00.000│101.0│
└───────────────────────┴─────┘

To contact us about Kerf, mail kevin@kerfsoftware.com

An introduction to Kerf

My pals generally act impressed when I show them my noodlings in the J language. I’m pretty sure they’re impressed with the speed and power of J, because it is inarguably fast and powerful, but I’ve also always figured they saw it more as an exercise in obfuscated coding; philistines! While I can generally read my own J code, I must confess that some of the denser tacit style isn’t something I can read naturally without J’s code dissector.

I have also been at it for a while, and for a long time went on faith that this skill would come. Notation as a tool of thought is one of the most powerful ideas I’ve come across. The problem becomes talking people into adopting your notation. Building important pieces of your company around a difficult mathematical notation is a gamble which most companies are not willing to take.

Everyone knows about Arthur Whitney and K because of Kx Systems’ database KDB. Having fiddled around with KDB, and with Eric Iverson and J Software’s Jd, the mind-boggling power of these things on time series and data problems in general makes me wonder why everyone doesn’t use these tools. Then I remember the first time I looked at things like this:

 

wavg:{(+/x*y)%+/x}        // K version
wavg=: +/ .* % +/@]        NB. J version

 

Oh yeah, that’s why J and K adoption are not universal. I mean, I can read it. That doesn’t mean everyone can read it. And I certainly can understand people’s reluctance to learn how to read things like this. It’s not easy.

For the last year and a half, my partner Kevin Lawler has been trying to fix this problem. You may know of him as the author of Kona, the open source version of K3. Kevin’s latest creation is Kerf. Kerf is basically an APL that humans can read, along with one of the highest performance time series databases money can buy. I liked it so much, I quit my interesting and promising day job doing Topological Data Analysis at Ayasdi, and will be dedicating the next few years of my life to this technology.

We know the above code fragments are weighted averages, but mostly because that’s what they’re called in the verb definitions. Mischievous programmers (the types who write code in K and J) might have called them d17 or something. Kerf looks a lot more familiar.

 

function wavg(x,y) {
  sum(x*y) / sum(x)
}

 

This is cheating a bit, since K & J don’t have a sum primitive, but it begins to show the utility of organizing your code in a more familiar way. Notice that x * y is done vector-wise; no stinking loops necessary. Expressing the same thing in more primitive Kerf functions looks like this:

 

function wavg(x,y) {
  (+ fold x*y) / (+ fold x)
}

 

In J and K, the ‘/’ adverb sticks the left-hand verb between all the elements on the right-hand side. In Kerf, we call that operation “fold” (we also call adverbs “combinators,” which we think is more descriptive of what they do in Kerf; I think John Earnest came up with the term).
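
As a tiny illustration of fold on a literal vector (the + case is taken straight from the wavg definition above; the * line is by analogy):

+ fold 1 2 3      // 6, i.e. 1 + 2 + 3, the same result as sum(1 2 3)
* fold 1 2 3 4    // 24, i.e. 1 * 2 * 3 * 4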

You could also write the whole thing out in terms of for loops if you wanted to, but fold is easier to write, easier to read, and runs faster.
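
For the curious, an explicit-loop version might look roughly like the sketch below. Treat it as pseudocode: the C-style for loop and the len() length function are assumptions on my part rather than lines taken from the Kerf manual.

function wavg_loop(x, y) {
  s: 0; w: 0;
  for(i: 0; i < len(x); i: i + 1) {     // assumed loop syntax
    s: s + x[i] * y[i];                 // running weighted sum
    w: w + x[i];                        // running sum of weights
  }
  s / w
}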

There are a few surprises with Kerf. One is the assignment operator.

a: range(5);
b: repeat(5,1);
KeRF> b
b
  [1, 1, 1, 1, 1]
KeRF> a
a
  [0, 1, 2, 3, 4]

 

Seems odd. On the other hand, it looks a lot like json. In fact, you can compose things into a map in a very json-like syntax:

 

aa:{a: 1 2 3, 
    b:'a bit of data', 
    c:range(10)};

KeRF> aa['a']
aa['a']
  [1, 2, 3]

KeRF> aa
aa
  {a:[1, 2, 3], 
   b:"a bit of data", 
   c:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

 

This seems like syntax sugar, but it actually helps. For example, if I had to feed variable ‘aa’ to something that likes to digest json representations of data, it pops it out in ascii json:

 

json_from_kerf(aa)
  "{\"a\":[1,2,3],\"b\":\"a bit of data\",\"c\":[0,1,2,3,4,5,6,7,8,9]}"

 

OK, no big deal; a language that has some APL qualities and speaks json. This is pretty good, but we’d be crazy to attempt to charge money for something like this (Kerf is not open source; Kevin and I have to eat). The core technology is a clean APL that speaks json, but the thing that is worth something is the database engine. Tables in Kerf look like interned maps and are queried in the usual SQL way.

 

u: {{numbers: 19 17 32 8 2 -1 7, 
     strings: ["A","B","C","D","H","B","Q"]}}
select * from u where numbers>18

┌───────┬───────┐
│numbers│strings│
├───────┼───────┤
│     19│      A│
│     32│      C│
└───────┴───────┘

select numbers from u where strings="B"

┌───────┐
│numbers│
├───────┤
│     17│
│     -1│
└───────┘

 

Now the business with ‘:’ starts to make more sense. Since SQL is part of the language, the ‘=’ sign is busy doing equality tests, so it isn’t available for assignment. Your eyes don’t have to make out any contextual differences or look for ‘==’ versus ‘=’: everything with an ‘=’ is an equality test, and everything with a ‘:’ is setting a name somewhere.
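
A tiny example of the two side by side (reusing the table u from above; binding a value to n and then comparing against it inside a query is a mild assumption):

n: 18                               // ':' binds a name
select * from u where numbers = n   // '=' is always an equality test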

Standard joins are available with left join:

 

v:{{a: 1 2 2 3, numbers: 19 17 1 99}}
left_join(v,u,"numbers")

┌─┬───────┬───────┐
│a│numbers│strings│
├─┼───────┼───────┤
│1│     19│      A│
│2│     17│      B│
│2│      1│   null│
│3│     99│   null│
└─┴───────┴───────┘

 

For time series, a good time type is important: preferably first class, with the ability to look at nanoseconds. Asof joins are important, too.

 

qq:{{nums: range(10), 
date: 1999.01.01+ (24 * 3600000000000)  * range(10), 
strg:["a","b","c","d","e","f","g","h","i","j"]}}
vv:{{nums: 10+range(10), 
date: 1999.01.01+ (12 * 3600000000000)  * range(10), 
strg:["a","b","c","d","e","f","g","h","i","j"]}}

select nums,nums1,mavg(3,nums1),strg,strg1,date from asof_join(vv,qq,[],"date")

┌────┬─────┬────────┬────┬─────┬───────────────────────┐
│nums│nums1│nums11  │strg│strg1│date                   │
├────┼─────┼────────┼────┼─────┼───────────────────────┤
│  10│    0│     0.0│   a│    a│             1999.01.01│
│  11│    0│     0.0│   b│    a│1999.01.01T12:00:00.000│
│  12│    1│0.333333│   c│    b│             1999.01.02│
│  13│    1│0.666667│   d│    b│1999.01.02T12:00:00.000│
│  14│    2│ 1.33333│   e│    c│             1999.01.03│
│  15│    2│ 1.66667│   f│    c│1999.01.03T12:00:00.000│
│  16│    3│ 2.33333│   g│    d│             1999.01.04│
│  17│    3│ 2.66667│   h│    d│1999.01.04T12:00:00.000│
│  ⋮│    ⋮│      ⋮│   ⋮│   ⋮│                      ⋮│
└────┴─────┴────────┴────┴─────┴───────────────────────┘

 

Kerf is still young and occasionally rough around the edges, but it is quite useful as it exists now: our customers and partners think so, anyway. The only things comparable to it from an engineering standpoint are the other APL-based databases such as Kx and Jd. We think we have some obvious advantages in usability, and less obvious advantages in the intestines of Kerf. Columnar databases like Vertica and Redshift are great for some kinds of problems, but they don’t really compare: they can’t be extended in the same way that Kerf can, nor are they general-purpose programming systems, which Kerf is.

We also have a lot of crazy ideas for building out Kerf as a large-scale distributed analytics system. Kerf is already a suitable terascale database system; we think we could usefully expand out to hundreds of terabytes on data which isn’t inherently time oriented, if someone needs such a thing. There is no reason for things like Hadoop and Spark to form the basis of large-scale analytic platforms; people simply don’t know any better and make do with junk that doesn’t really work right, because it is already there.

You can download a time-limited version of Kerf from github. John Earnest has been doing some great work on the documentation as well. I’ve also set up a quick and dirty way to work with Kerf in emacs.

Keep up with Kerf at our company website:
www.kerfsoftware.com

Here is someone who outlined a nice vision of a sort of “Cloud Kerf”:
http://conceptualorigami.blogspot.com/2010/12/vector-processing-languages-future-of.html
