Kerf meets the smartgrid

Kerf is primarily designed as a tool for dealing with data streams. The most obvious streams of data which bring in customers are stock market data, but there are lots of interesting data streams in the world. Enough that people are starting to have a name for this “vertical” -aka “internet of things.” One of the “things of the internet” I’ve worked with in the past is smartmeter data. This is a stream of electrical power consumption data from individual power meters in homes, commercial sites, factories, electric car chargers and what have you. While everyone knows that markets create big streams of potentially interesting data, the data from power grids is also big and interesting. It’s big because a typical city might have millions of power meters in it, recording data at minute or quarter hour increments. It’s interesting because power companies need to know what is going on with their networks to provide uninterrupted electrical service that we all take for granted. It’s also very interesting to power companies who rely on “renewables” such as wind or solar which are not steady or controllable producers of power. If you have a power grid which relies on mostly renewables, you have all kinds of complicated operations research problems to deal with during times when it is cloudy, dark or the breezes are calm.

Problems

The data volume of “internet of things” applications can be unwieldy to manage in standard tools, but there are analytics challenges as well. A standard challenge for smartgrid stuff is doing forecasts. Such models can get really hairy. Every smart meter installation is quite different from every other smart meter. A power meter for a car charging station is vastly different from that of a school, an apartment or a house, and all of those things are different from one another. The “users” of each apartment and house have different schedules, appliances and devices. Getting this right in an automated way can pay big dividends; you can provide services to the people with the meter, and you can do more accurate forecasts on a city wide level if you understand the generating process at the individual meter level.

If you want to get something like this right in an automated fashion, it’s a fairly large project involving paying people like me to think about it for a few months. Doing all this in an automated, general and reliable way involves feature creation, feature selection algorithms, machine learning, data gathering, data cleansing, weather forecasting, wind forecasting, optimization and the kitchen sink. I’d love to eventually have the full suite of such tools to do such a project in Kerf as primitives, but we don’t know if customers would appreciate or use such things, so I’ll make do with some small steps for this blog.

Let’s say we have some simple meter data with the average power usage per installation. No temperature or weather data here, so we’ll pretend these things are not important.

KeRF> meta_table(smeter)

┌────────────┬────┬────────────┬────────────┬───────┐
│column      │type│type_name   │is_ascending│is_disk│
├────────────┼────┼────────────┼────────────┼───────┤
│       Power│  -3│float vector│           0│      1│
│        time│  -4│stamp vector│           0│      1│
│installation│   8│        enum│           0│      1│
└────────────┴────┴────────────┴────────────┴───────┘

len smeter
  21894311

A standard observation about forecasting smart grid stuff is that electricity use is periodic. People use different amounts of electricity depending on what hour of the day it is, and what month of the year it is. These are pretty good features that would go into any machine learning gizmo which would ultimately solve this problem.

A simple thing to do toward building a forecasting model for this sort of data would be to pull out some average baselines for hour of the day and month of the year, while adding some other quantities for debugging purposes:

tmp:select avg(Power) as PowerAvg,max(Power) as PowerMax,std(Power) as PowerStd,min(Power) as PowerMin from smeter where time>2014.01.01,group by time['month'],time['hour'],installation
tmp: select time as month,time1 as hour,* from tmp
tmp1:delete_keys(tmp,['time','time1'])
tmp:tmp1


┌─────┬────┬────────────┬────────┬────────┬────────┬─────────┐
│month│hour│installation│PowerAvg│PowerMax│PowerStd│PowerMin │
├─────┼────┼────────────┼────────┼────────┼────────┼─────────┤
│    9│   5│    1ce2e29f│ 113.889│ 363.697│ 77.6535│      0.0│
│    9│   6│    1ce2e29f│ 161.965│ 586.212│ 123.717│  10.7606│
│    9│   7│    1ce2e29f│  183.03│  905.73│ 186.052│      0.0│
│    9│   8│    1ce2e29f│ 252.629│ 1125.37│ 262.054│      0.0│
│    9│   9│    1ce2e29f│ 268.239│  1234.6│  281.07│  7.78113│
│    9│  10│    1ce2e29f│  270.93│ 1189.17│ 273.545│      0.0│
│    9│  11│    1ce2e29f│ 283.249│ 1186.31│ 288.657│      0.0│
│    9│  12│    1ce2e29f│ 281.222│ 1170.18│ 282.604│0.0728503│
│    9│  13│    1ce2e29f│ 284.771│ 1159.36│ 276.452│ 0.691112│
│    9│  14│    1ce2e29f│ 291.867│ 1194.55│ 280.146│  2.42096│
│    9│  15│    1ce2e29f│ 289.168│ 1342.17│  273.67│      0.0│
│    9│  16│    1ce2e29f│ 263.163│ 1081.73│ 254.826│      0.0│
│    9│  17│    1ce2e29f│ 177.601│ 905.783│ 162.548│  8.05548│
│    9│  18│    1ce2e29f│ 153.921│ 564.091│  125.28│  9.07539│
│    9│  19│    1ce2e29f│ 154.173│ 597.273│ 130.062│      0.0│
│    9│  20│    1ce2e29f│ 157.302│ 618.441│ 135.536│      0.0│
└─────┴────┴────────────┴────────┴────────┴────────┴─────────┘

The snippet of result I included from the selection gives the average power for different months and hours for each meter installation. You can see in the example above that the meter shown uses more power at noon than it does at midnight, at least for September, which is more or less what you’d expect.

You could use these averages as a sort of baseline model, either predicting each month/hour as being what they were before, or attempting to build on them in some other way. But, you can see that the peak usage is very high, as is the standard deviation, even in sample, so that’s not likely to work. Finally, a lot of the minima are 0 or very low.

Forging ahead anyway with our dumb model, perhaps joining this with our original data set might be a useful thing to do.

smeter['month']: flatten xvals select time['month'] from smeter
smeter['hour']: flatten xvals select time['hour'] from smeter

left_join(smeter,(select month,hour,installation,PowerAvg from tmp),["installation","month","hour"])

┌───────┬───────────────────────┬────────────┬─────┬────┬────────┐
│Power  │time                   │installation│month│hour│PowerAvg│
├───────┼───────────────────────┼────────────┼─────┼────┼────────┤
│53.6267│2014.09.21T05:15:00.000│    f387e7b0│    9│   5│ 101.992│
│41.4469│2014.09.21T05:30:00.000│    f387e7b0│    9│   5│ 101.992│
│86.6634│2014.09.21T05:45:00.000│    f387e7b0│    9│   5│ 101.992│
│ 91.229│2014.09.21T06:00:00.000│    f387e7b0│    9│   6│ 100.237│
│ 65.806│2014.09.21T06:15:00.000│    f387e7b0│    9│   6│ 100.237│
│57.8399│2014.09.21T06:30:00.000│    f387e7b0│    9│   6│ 100.237│
│81.6965│2014.09.21T06:45:00.000│    f387e7b0│    9│   6│ 100.237│
│47.6387│2014.09.21T07:00:00.000│    f387e7b0│    9│   7│ 103.481│
│53.9568│2014.09.21T07:15:00.000│    f387e7b0│    9│   7│ 103.481│
│211.005│2014.09.21T07:30:00.000│    f387e7b0│    9│   7│ 103.481│
│55.2504│2014.09.21T07:45:00.000│    f387e7b0│    9│   7│ 103.481│
│63.5775│2014.09.21T08:00:00.000│    f387e7b0│    9│   8│ 106.195│
│ 47.352│2014.09.21T08:15:00.000│    f387e7b0│    9│   8│ 106.195│
│81.0503│2014.09.21T08:30:00.000│    f387e7b0│    9│   8│ 106.195│
│83.0564│2014.09.21T08:45:00.000│    f387e7b0│    9│   8│ 106.195│
│65.3051│2014.09.21T09:00:00.000│    f387e7b0│    9│   9│ 111.119│
└───────┴───────────────────────┴────────────┴─────┴────┴────────┘

Well, we can see this won’t work, but how badly won’t it work? MAPE (mean absolute percentage error) is easily defined in Kerf, and unlike them other databases, you can do sensible things like drop the (numerous) zero examples from the actual Power readings.
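
For reference, the quantity being computed here is the usual

\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|

with A_t the actual Power readings and F_t the PowerAvg forecasts; wherever A_t = 0 the term blows up to infinity, which is what the little dropInf helper below throws away.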

dropInf:{[x] x[which(abs(x)!=inf)]}
def mape(x,y){
 xx:abs(dropInf((x-y)/x));
 sum(xx) / count(xx)} 

 select 100*mape(Power,PowerAvg) as MAPE from smeter where time>2014.01.01 group by installation

┌────────────┬───────┐
│installation│MAPE   │
├────────────┼───────┤
│    f387e7b0│271.342│
│    eeb1e1e1│4898.16│
│    8d525ff4│157.625│
│    9cd89c1e│ 260.53│
│    106c60a8│354.543│
│    4925364d│59.0307│
│    e74a5b3f│101.899│
│    cd1f9a75│138.086│
│            ⋮│      ⋮│
└────────────┴───────┘

This is a failure as a forecasting technique, but it illustrates a Kerf workflow with internet of things data. You can go very fast on a large database even on a crummy little laptop. You can also do better things.

Let’s say you wanted to build some linear regression models based on, say, time of day and month of year. Easily done in Kerf. You need a little helper function to add a constant column, as the standard Kerf least squares primitive doesn’t include one (it assumes beta*x = y).
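
Spelled out, the least squares primitive solves the no-intercept problem

y \approx X \beta

so laminating a column of ones onto X turns the fit into

\mathrm{PowerAvg} \approx c + \beta_1 \cdot \mathrm{month} + \beta_2 \cdot \mathrm{hour}

and the constant c comes back as the first coefficient.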

xx: xvals select month,hour from tmp where installation='6af73e77'
  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...]]

We want this matrix to have a constant laminated to the front of it, so

transpose 1 join mapright transpose xx
  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...], 
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...], 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...]]

 

Now, let’s stick it in a convenience function for pulling out C’s and betas for a linear regression model:

y: flatten xvals select PowerAvg from tmp where installation='6af73e77'
Cbeta: {[x, y] lsq((transpose 1 join mapright transpose x), y)}
Cbeta(xx,y)
  [82.8605, -1.35372, 0.302477]

 

This could be useful if we built some dummy variables. If we regress as we did in this function, we get betas tied to the raw integer encodings, which is only useful if the hour of the day or month of the year, treated as a growing function, actually means something. Probably not. You’d probably want to blow those guys out into dummy variables for month of year and hour of day. Something like this:

xx: (1+range(12))= mapright flatten xvals select month from tmp where installation = '6af73e77'
  [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
...
Cbeta(transpose xx,y)
  [71.5752, 1.78876, -1.54225, 3.20775, 3.62774, 6.81636, 26.7092, 25.2714, 29.176, 28.5876, 9.84373, -31.3805, -30.5305]
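
What this fits, roughly, is a separate level for each month on top of a constant,

\mathrm{PowerAvg} \approx c + \sum_{m=1}^{12} \beta_m \, \mathbf{1}[\mathrm{month} = m]

so the thirteen numbers above are the constant followed by twelve monthly offsets.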

Using all these parts, we can build regression models based on hour of day, month of year. We can also see patterns in the months of the year; this installation uses more power in the summer months than in winter.

Exponential smoothing is another tool we might want for examining or predicting this kind of data. Kerf doesn’t come with it as a core verb, but we just need to write the definition of exponential smoothing down along with the deconverge combinator; one of those beautiful situations where Kerf works as a clean mathematical notation:

def ewma(x,a) {
 {[z,y] (a*z)+(1-a)*y} deconverge x
}
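
Written out as a recurrence, with a the smoothing weight and x_t the raw series, this computes

s_1 = x_1, \qquad s_t = a \, s_{t-1} + (1 - a) \, x_t

and because deconverge keeps the intermediate results, you get the whole smoothed series back rather than just the final value.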

Predicting electrical load a day or week ahead is considered a difficult machine learning problem due to all the nonlinearities involved, and the fact that you’re ultimately predicting what a bunch of monkeys do with light switches. The general solution to this problem is something I’ve done in Lisp (and ported to R) using metric space and information theoretic techniques. When I was doing this, I had absurd overhead from the data storage system. Pulling the pieces of the data in and out of the database was awful. The data was too big to keep in memory all at once, because the DB system I was using was not designed for time series data. Had I some Kerf to solve the problem with, I could have just as easily ported the code from Lisp to Kerf, and the whole thing would have been ridiculously scalable and deployable on an as-is basis, with additional features such as streaming queries and live monitoring immediately available. That’s the power of a tool like Kerf. Programmable database engines allow us to do lots of fun out-of-core calculations in a high level language.

Meanwhile this data set allows us to develop lots of interesting time oriented techniques for solving difficult problems. I’ll visit it again in further blogs.

An Introduction to Combinators

One of the most powerful features of Kerf is its combinators. Kerf is a vector language. You can operate on vectors and matrices as if they are natural data units, achieving interpreter speedups over looping constructs. While everyone is used to for loops in an interpreter, interpreters do very badly on them, which is why people programming in R and Matlab are always encouraged to use the vector constructs if available. If you think about what goes on in various kinds of interpreter, you realize that there is a lot going on inside of a for loop. Depending on how the interpreter is implemented, you may have to parse each line of the loop body on every iteration; you have to evaluate test conditions, maintain state and so on.

Even if your loop is trivial and does some element-wise operation on vectors, it ends up going slower than it would as a vector operation. Interpreters need to check things, build stacks and move things around inside a for loop. For example:

a = rnorm(1000000)
sumint <- function(x) {
  x0 = 0
  for(i in 1:length(x)) {
    x0 = x0 + x[i]
  }
  x0
}
system.time(sumint(a))
   user  system elapsed 
  0.361   0.005   0.367 
system.time(sum(a))
   user  system elapsed 
  0.001   0.000   0.001

Compilers can generally optimize down to the metal, so that’s one way out. In Kerf, you can use combinators to help the interpreter avoid these problems. For example, summation amounts to putting + in between all the elements of a vector. Fold puts the function to its left in between all of the elements of the thing on the right.

timing(1)
a:rand(1000000,1.0)
+ fold a
 1ms

This is a trivial application, but it illustrates the power of the idea, and its speed in action. Tell the interpreter to do a lot of things at once, and it will be done at close to machine speed, even if there is no compiled primitive to accomplish the task. With a good set of combinators you can achieve the satori known as “No Stinking Loops.”

Sum is a trivial example (there is a primitive in Kerf for this), but fold takes any function with two inputs and sticks it in between the elements:

  function foo(a,b){ sqrt(a+b)}
foo fold range(10)
  3.51283
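
To be explicit about the evaluation order (you can check it against the unfold output below), fold builds a left-to-right nest:

\mathrm{foo}\ \mathrm{fold}\ [x_1, x_2, \ldots, x_n] = \mathrm{foo}(\ldots \mathrm{foo}(\mathrm{foo}(x_1, x_2), x_3) \ldots, x_n)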

unfold is like fold, except you keep your intermediate results around

function foo(a,b){ sqrt(a+b)}
foo unfold range(10)
  [0, 1.0, 1.73205, 2.17533, 2.48502, 2.73588, 2.95565, 3.15526, 3.33995, 3.51283]

mapdown can be used with unary functions or binary functions. While mapdown is pretty similar to applying a unary function to a list, it retains the shape of a map or table:

count mapdown   {a:1 2 3,b:1 2 }
  [3, 2]

For binary functions, it assumes the left and right data arguments have the same shape:

1 2 3 + mapdown 4 5 6
  [5, 7, 9]
 2 3 + mapdown 4 5 6
         ^
 Conformable error

Just as “fold” is a lot like “reduce,” “mapdown” is like “map” in the standard functional programming model that gets so much attention.


mapback is one of my favorite combinators, as this sort of operation is incredibly dreary to write in a for loop. Have you ever wanted to run an operation on an element and the element before it in a vector? I know I have. Calculating differences for example.

reverse range(10)
  [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
- mapback reverse range(10)
[9, -1, -1, -1, -1, -1, -1, -1, -1, -1]
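
For comparison, base R’s diff covers the subtraction case, though it drops the first element where mapback passes it through:

diff(9:0)
[1] -1 -1 -1 -1 -1 -1 -1 -1 -1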

While many languages give you primitives for calculating differences on a vector, only a few APLs give you the ability to do this in general. To see what is going on, use the binary function “join”:

join(1,2)
  [1, 2]

join mapback reverse range(10)
  [9, 
 [8, 9], 
 [7, 8], 
 [6, 7], 
 [5, 6], 
 [4, 5], 
 [3, 4], 
 [2, 3], 
 [1, 2], 
 [0, 1]]

mapright maps the binary verb and data on the left hand side to each of the values on the right hand side:

1 2 + mapright 1 2 3
  [[2, 3], 
 [3, 4], 
 [4, 5]]
1 2 join mapright 1 2 3
  [[1, 2, 1], 
 [1, 2, 2], 
 [1, 2, 3]]

mapleft, as you might expect, maps the binary verb and the data on the right hand side to each of the values on the left hand side:

1 2 + mapleft 1 2 3
  [[2, 3, 4], 
 [3, 4, 5]]
1 2 join mapleft 1 2 3
  [[1, 1, 2, 3], 
 [2, 1, 2, 3]]

Sometimes mapleft and mapright evaluate to the same thing. Remembering that the equals sign and “equals” are binary functions in Kerf, we can use mapright to create a diagonal array.

range(5) equals mapright range(5)
  [[1, 0, 0, 0, 0], 
 [0, 1, 0, 0, 0], 
 [0, 0, 1, 0, 0], 
 [0, 0, 0, 1, 0], 
 [0, 0, 0, 0, 1]]

range(5) equals mapleft range(5)
  [[1, 0, 0, 0, 0], 
 [0, 1, 0, 0, 0], 
 [0, 0, 1, 0, 0], 
 [0, 0, 0, 1, 0], 
 [0, 0, 0, 0, 1]]

“converge” is a classic combinator from functional programming which operates on unary functions. For binary functions, it is the same thing as “fold.” It has a brother function in deconverge which allows you to see the intermediate steps. There are two ways to use it.

The simplest usage of converge is to execute a function on a value, then on its result, and so on until the value no longer changes. People who don’t do numerics probably don’t run into this very often, but numerics guys are always doing this. Rather than constructing some elaborate while loop, you can just write the recursion relation down. For example, calculating the golden ratio:

{[x] 1+1/x} converge 1
  1.61803
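
Why this lands on the golden ratio: the fixed point of the map satisfies

x = 1 + \frac{1}{x} \;\Rightarrow\; x^2 - x - 1 = 0 \;\Rightarrow\; x = \frac{1 + \sqrt{5}}{2} \approx 1.61803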

“deconverge” is the same thing, except it keeps around the intermediate results in a list. This is useful for debugging purposes, but this intermediate list might be something you’re also interested in.

{[x] 1+1/x} deconverge 1 
  [1, 2.0, 1.5, 1.66667, 1.6, 1.625, 1.61538, 1.61905, 1.61765, 1.61818, 1.61798, 1.61806, 1.61803, 1.61804, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803, 1.61803]

The other way to use converge is to give it a left hand value declaring how many times to apply the function to the right hand value:

4 {[x] x * 2} converge [4,2]
  [64, 32]

deconverge applied to the same lambda function:

4 {[x] x * 2} deconverge [4,2]
  [[4, 2], 
 [8, 4], 
 [16, 8], 
 [32, 16], 
 [64, 32]]

Or, in the case of the Golden Ratio, you might be interested in just running 10 iterations to see what’s going on (and make sure it doesn’t run forever; don’t worry though, ctrl-C still works):

10 {[x] 1+1/x} deconverge 1
  [1, 2.0, 1.5, 1.66667, 1.6, 1.625, 1.61538, 1.61905, 1.61765, 1.61818, 1.61798]

Combinators are one of the things that make Kerf different. They’re the same idea as adverbs in J or K, but they represent the patterns that are most often productively used. Personally, I have always hated writing loops; it drives me bananas having to say, “OK, computer, I want to count through my list this way and do these things to it: start at index 0, then step forward by 1.” Too many ways to insert a typo in the loop, and you get sick of typing the same damn thing, over and over again. Even if they run slower than a loop, I’d want some kind of combinator thing, just to make my life easier and less error prone.

Timestamps done right

I’ve used a lot of tools meant for dealing with time series. Heck, I’ve written a few at this point. The most fundamental piece of dealing with timeseries is a timestamp type. Under the covers, a timestamp is just a number which can be indexed. Normal humans have a hard time dealing with a number that represents seconds of the epoch, or nanoseconds since whenever. Humans need to see things which look like the ISO format for timestamps.
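
To make “just a number” concrete, here is 2012-01-01 as raw seconds since the 1970 epoch in R (GMT, to keep it simple):

as.numeric(as.POSIXct("2012-01-01", tz="GMT"))
[1] 1325376000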

Very few programming languages have timestamps as a native type. Some SQLs do, but SQL isn’t a very satisfactory programming language by itself. At some point you want to pull your data into something like R or Matlab and deal with your timestamps in an environment that you can do linear regressions in. Kerf is the exception.

Consider the case where you have a bunch of 5 minute power meter readings (say, from a factory) with timestamps. You’re probably storing your data in a database somewhere, because it won’t fit into memory in R. Every time you query your data for a useful chunk, you have to parse the stamps in the chunk into a useful type; timeDate in the case of R. Because the guys who wrote R didn’t think to include a useful timestamp data type, the DB package doesn’t know about timeDate (it is an add on package), and so each timestamp for each query has to be parsed. This seems trivial, but a machine learning gizmo I built was entirely performance bound by this process. Instead of parsing the timestamps once in an efficient way into the database, and passing the timestamp type around as if it were an int or a float, you end up parsing them every time you run the forecast, and in a fairly inefficient way. I don’t know of any programming languages other than Kerf which get this right. I mean, just try it in Java.

Kerf gets around this by integrating the database with the language.

Kerf also has elegant ways of dealing with timestamps within the language itself.

Consider a timestamp in R’s timeDate. R’s add-on packages timeDate + zoo or xts are my favorite way of doing such things in R, and they’re the ones I know best, so this will be my comparison class.

require(timeDate) 
a=as.timeDate("2012-01-01")
GMT
[1] [2012-01-01]

In Kerf, we can just write the timestamp down

a:2012.01.01
  2012.01.01

A standard problem is figuring out what a date is relative to a given day. In R, you have to know that it’s basically storing seconds, so:

as.timeDate("2012-01-01") + 3600*24
GMT
[1] [2012-01-02]

In Kerf, just tell it to add a day:

2012.01.01 + 1d
  2012.01.02

This gets uglier when you have to do something more complex. Imagine you have to add a month and a day. To do this in general in R is complex and involves writing functions.
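
For instance, one workaround in base R (a sketch; there are several ways to do this, none of them pretty) is to lean on seq for the month and add the day as a plain number:

d = as.Date("2012-01-01")
seq(d, by="1 month", length.out=2)[2] + 1
[1] "2012-02-02"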

In Kerf, this is easy:

2012.01.01 + 1m1d
  2012.02.02

Same story with hours, minutes and seconds

2012.01.01 + 1m1d + 1h15i17s
  2012.02.02T01:15:17.000

And if you have to find a bunch of times which are a month, day, hour and 15 minutes and 17 seconds away from the original date, you can do a little Kerf combinator magic:

b: 2012.01.01 + (1m1d + 1h15i17s) times mapright  range(10)
  [2012.01.01, 2012.02.02T01:15:17.000, 2012.03.03T02:30:34.000, 2012.04.04T03:45:51.000, 2012.05.05T05:01:08.000, 2012.06.06T06:16:25.000, 2012.07.07T07:31:42.000, 2012.08.08T08:46:59.000, 2012.09.09T10:02:16.000, 2012.10.10T11:17:33.000]

The mapright combinator runs the verb and noun to its left on each element of the vector to its right. So you’re multiplying (1m1d + 1h15i17s) by range(10) (which is the usual [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ), then adding it to 2012.01.01.

You can’t actually do this in a simple way in R. Since there is no convenient token to add a month, you have to generate a time sequence with monthly periods. The rest is considerably less satisfying as well, since you have to add up the raw seconds yourself. In my opinion, this is vastly harder to read and maintain than the Kerf line.

b=timeSequence(from=as.timeDate("2012-01-01"),length.out=10,by="month") + (3600*24 + 3600 + 15*60 + 17) *0:9
 [2012-01-01 00:00:00] [2012-02-02 01:15:17] [2012-03-03 02:30:34] [2012-04-04 03:45:51] [2012-05-05 05:01:08] [2012-06-06 06:16:25] [2012-07-07 07:31:42] [2012-08-08 08:46:59] [2012-09-09 10:02:16] [2012-10-10 11:17:33]

This represents a considerable achievement in language design; an APL which is easier to read than a commonly used programming language for data scientists. I am not tooting my own horn here, Kevin did it.

If I wanted to know what week or second these times occur at, I can subset the implied fields in a simple way in Kerf:

b['week']
  [1, 6, 10, 15, 19, 24, 28, 33, 37, 42]
b['second']
  [0, 17, 34, 51, 8, 25, 42, 59, 16, 33]

I think the way to do this in R is with the “.endpoints” function, but it doesn’t seem to do the right thing

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS
other attached packages:
[1] xts_0.9-7         zoo_1.7-12        timeDate_3012.100

.endpoints(b, on="week")
 [1]  0  1  2  3  4  5  6  7  8  9 10
.endpoints(b, on="second")
 [1]  0  1  2  3  4  5  6  7  8  9 10

You can cast to a POSIXlt and get the second at least, but no week of year.

as.POSIXlt(b)$week
NULL
as.POSIXlt(b)$sec
 [1]  0 17 34 51  8 25 42 59 16 33

Maybe doing this using one of the other date classes, like as.Date…

 weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }

Not exactly clear, but it does work. If you have ever done things with time in R, you will have had an experience like this. I’m already reaching for a few different kinds of date and time objects in R. There are probably a dozen kinds of timestamps in R which do different subsets of things, because whoever wrote them wasn’t happy with what was available at the time. One good one is better. That way, when you have some complex problem, you don’t have to look at 10 different R manuals and add-on packages to get your problem solved.

Here’s a more complex problem. Let’s say you had a million-point timeseries with some odd periodicities and you want to find the values which occur in week 10 of the year, at second 33 of the minute.

ts:{{pwr:rand(1000000,1.0),time:(2012.01.01 + (1h15i17s times mapright  range(1000000)))}}
timing(1)
select *,time['second'] as seconds,time['week'] as weeks from ts where time['second']=33 ,time['week'] =10

┌────────┬───────────────────────┬───────┬─────┐
│pwr     │time                   │seconds│weeks│
├────────┼───────────────────────┼───────┼─────┤
│0.963167│2012.03.01T01:40:33.000│     33│   10│
│0.667559│2012.03.04T04:57:33.000│     33│   10│
│0.584127│2013.03.06T05:06:33.000│     33│   10│
│0.349303│2013.03.09T08:23:33.000│     33│   10│
│0.397669│2014.03.05T01:58:33.000│     33│   10│
│0.850102│2014.03.08T05:15:33.000│     33│   10│
│0.733821│2015.03.03T22:50:33.000│     33│   10│
│0.179552│2015.03.07T02:07:33.000│     33│   10│
│       ⋮│                      ⋮│      ⋮│    ⋮│
└────────┴───────────────────────┴───────┴─────┘
    314 ms

In R, I’m not sure how to do this … you’d have to use a function that outputs the week of year, then something like the following (which, FWIW, is fairly slow) to do the query.

require(xts)
ts=xts(runif(1000000), as.timeDate("2012-01-01") + (3600 + 15*60 + 17) *0:999999)
weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }
queryGizmo <- function(x) {
 # week of year and seconds field for each timestamp, then filter
 wks = weekGizmo(x)
 secs = as.POSIXlt(time(x))$sec
 x[(wks==10) & (secs==33)]
}
system.time(queryGizmo(ts))
   user  system elapsed 
  4.215   0.035   4.254

The way R does timestamps isn’t terrible for a language designed in the 1980s, and the profusion of time classes is to be expected from a language that has been around that long. Still, it is 2016, and there is nothing appreciably better out there other than Kerf.

Lessons for future language authors:

  1. If you query timestamps from a database, the timestamp type needs to be propagated through to the language as a first class type. If this isn’t possible to do directly, there should be some way of quickly translating between classes in the DB and classes in the language that doesn’t involve parsing a string representation.
  2. Timestamps should be a first class type in your programming language, not an add-on type as in R or Python or Java.
  3. Timestamps should have performant and intuitive ways of accessing implied fields.
  4. It would be nice if they handled nanoseconds gracefully, even though it is hard to measure nanoseconds.

 

An introduction to Kerf

My pals generally act impressed when I show them my noodlings in the J language. I’m pretty sure they’re impressed with the speed and power of J because it is inarguably fast and powerful, but I’ve also always figured they mostly saw it as an exercise in obfuscated coding; philistines! While I can generally read my own J code, I must confess some of the more dense tacit style isn’t something I can read naturally without J’s code dissector.
I have also been at it for a while, and for a long time went on faith that this skill would come. Notation as a tool of thought is one of the most powerful ideas I’ve come across. The problem becomes talking people into adopting your notation. Building important pieces of your company around a difficult mathematical notation is a gamble which most companies are not willing to take.

Everyone knows about Arthur Whitney and K because of Kx Systems’ database KDB. Having fiddled around with KDB and with Eric Iverson’s Jd from Jsoftware, the mind-boggling power of these things on time series and data problems in general makes me wonder why everyone doesn’t use these tools. Then I remember the first time I looked at things like this:

 

wavg:{(+/x*y)%+/x}        // K version
wavg=: +/ .* % +/@]        NB. J version

 

Oh yeah, that’s why J and K adoption is not universal. I mean, I can read it. That doesn’t mean everyone can read it. And I certainly can understand people’s reluctance to learn how to read things like this. It’s not easy.

For the last year and a half, my partner Kevin Lawler has been trying to fix this problem. You may know of him as the author of Kona, the open source version of K3. Kevin’s latest creation is Kerf. Kerf is basically an APL that humans can read, along with one of the highest performance time series databases money can buy. I liked it so much, I quit my interesting and promising day job doing Topological Data Analysis at Ayasdi, and will be dedicating the next few years of my life to this technology.

We know the above code fragments are weighted averages, but mostly because that’s what they’re called in the verb definitions. Mischievous programmers (the types who write code in K and J) might have called them d17 or something. Kerf looks a lot more familiar.

 

function wavg(x,y) {
  sum(x*y) / sum(x)
}
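
All three definitions compute the same quantity, the y values averaged with weights x:

\mathrm{wavg}(x, y) = \frac{\sum_i x_i y_i}{\sum_i x_i}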

 

This is cheating a bit, since K & J don’t have a sum primitive, but it begins to show the utility of organizing your code in a more familiar way. Notice that x * y is done vector-wise; no stinking loops necessary. Expressing the same thing in more primitive Kerf functions looks like this:

 

function wavg(x,y) {
  (+ fold x*y) / (+ fold x)
}

 

In J and K, the ‘/’ adverb sticks the left hand verb between all the elements on the right hand side. In Kerf, we call that operation “fold” (we also call adverbs “combinators” which we think is more descriptive for what they do in Kerf: I think John Earnest came up with the term).

You could also write the whole thing out in terms of for loops if you wanted to, but fold is easier to write, easier to read, and runs faster.

There are a few surprises with Kerf. One is the assignment operator.

a: range(5);
b: repeat(5,1);
KeRF> b
b
  [1, 1, 1, 1, 1]
KeRF> a
a
  [0, 1, 2, 3, 4]

 

Seems odd. On the other hand, it looks a lot like json. In fact, you can compose things into a map in a very json like syntax:

 

aa:{a: 1 2 3, 
    b:'a bit of data', 
    c:range(10)};

KeRF> aa['a']
aa['a']
  [1, 2, 3]

KeRF> aa
aa
  {a:[1, 2, 3], 
   b:"a bit of data", 
   c:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

 

This seems like syntax sugar, but it actually helps. For example, if I had to feed variable ‘aa’ to something that likes to digest json representations of data, it pops it out in ascii json:

 

json_from_kerf(aa)
  "{\"a\":[1,2,3],\"b\":\"a bit of data\",\"c\":[0,1,2,3,4,5,6,7,8,9]}"

 

OK, no big deal; a language that has some APL qualities which speaks json. This is pretty good, but we’d be crazy to attempt to charge money for something like this (Kerf is not open source; Kevin and I have to eat). The core technology is a clean APL that speaks json, but the thing which is worth something is the database engine. Tables in Kerf look like interned maps and are queried in the usual SQL way.

 

u: {{numbers: 19 17 32 8 2 -1 7, 
     strings: ["A","B","C","D","H","B","Q"]}}
select * from u where numbers>18

┌───────┬───────┐
│numbers│strings│
├───────┼───────┤
│     19│      A│
│     32│      C│
└───────┴───────┘

select numbers from u where strings="B"

┌───────┐
│numbers│
├───────┤
│     17│
│     -1│
└───────┘

 

Now the business with ‘:’ starts to make more sense. Since SQL is part of the language, the ‘=’ sign is busy doing equality tests rather than assignment. Now your eyes don’t have to make out any contextual differences or look for ‘==’ versus ‘=’ -everything with an ‘=’ is an equality test. Everything with a ‘:’ is setting a name somewhere.

Standard joins are available with left join:

 

v:{{a: 1 2 2 3, numbers: 19 17 1 99}}
left_join(v,u,"numbers")

┌─┬───────┬───────┐
│a│numbers│strings│
├─┼───────┼───────┤
│1│     19│      A│
│2│     17│      B│
│2│      1│   null│
│3│     99│   null│
└─┴───────┴───────┘

 

For timeseries, having a good time type is important: preferably first class, and with the ability to look at nanoseconds. So are asof joins, which match each row of one table to the most recent row of another at or before its timestamp.

 

qq:{{nums: range(10), 
date: 1999.01.01+ (24 * 3600000000000)  * range(10), 
strg:["a","b","c","d","e","f","g","h","i","j"]}}
vv:{{nums: 10+range(10), 
date: 1999.01.01+ (12 * 3600000000000)  * range(10), 
strg:["a","b","c","d","e","f","g","h","i","j"]}}

select nums,nums1,mavg(3,nums1),strg,strg1,date from asof_join(vv,qq,[],"date")

┌────┬─────┬────────┬────┬─────┬───────────────────────┐
│nums│nums1│nums11  │strg│strg1│date                   │
├────┼─────┼────────┼────┼─────┼───────────────────────┤
│  10│    0│     0.0│   a│    a│             1999.01.01│
│  11│    0│     0.0│   b│    a│1999.01.01T12:00:00.000│
│  12│    1│0.333333│   c│    b│             1999.01.02│
│  13│    1│0.666667│   d│    b│1999.01.02T12:00:00.000│
│  14│    2│ 1.33333│   e│    c│             1999.01.03│
│  15│    2│ 1.66667│   f│    c│1999.01.03T12:00:00.000│
│  16│    3│ 2.33333│   g│    d│             1999.01.04│
│  17│    3│ 2.66667│   h│    d│1999.01.04T12:00:00.000│
│  ⋮│    ⋮│      ⋮│   ⋮│   ⋮│                      ⋮│
└────┴─────┴────────┴────┴─────┴───────────────────────┘

 

Kerf is still young and occasionally rough around the edges, but it is quite useful as it exists now: our customers and partners think so anyway. The only things comparable to it from an engineering standpoint are the other APL based databases such as Kx and Jd. We think we have some obvious advantages in usability, and less obvious advantages in the intestines of Kerf. Columnar databases like Vertica and Redshift are great for some kinds of problems, but they don’t really compare: they can’t be extended in the same way that Kerf can, nor are they general purpose programming systems, which Kerf is.

We also have a lot of crazy ideas for building out Kerf as a large scale distributed analytics system. Kerf is already a suitable terascale database system; we think we could usefully expand out to hundreds of terabytes on data which isn’t inherently time oriented if someone needs such a thing. There is no reason for things like Hadoop and Spark to form the basis of large scale analytic platforms; people simply don’t know any better and make do with junk that doesn’t really work right, because it is already there.

You can download a time-limited version of Kerf from github. John Earnest has been doing some great work on the documentation as well. I’ve also set up a quick and dirty way to work with Kerf in emacs.

Keep up with Kerf at our company website:
www.kerfsoftware.com

A visionary who outlined a nice picture of a sort of “Cloud Kerf”:
http://conceptualorigami.blogspot.com/2010/12/vector-processing-languages-future-of.html
