Kevin Lawler 11:00 am on August 26, 2016

Atlas Data Structure

Kerf introduces a new data structure called an “Atlas” that we think is a helpful thing to have inside of a programming language. We’ve found it very useful to have database Tables inside of a programming language, and the Atlas concept is a natural extension of that. Be careful about jumping to conclusions, though, since an Atlas does a few special things that make it unlike what you might expect. I’ll give a preview of the major points here:

Automatic indexing
Responds to SQL
First-class language type
Columnar implementation

We didn’t spend much time talking about Atlases originally, because interest in NoSQL-style computation seemed to be on the wane. Recently, however, we’ve had a lot of requests for those sorts of applications, and so a detailed description seems in order. (If you’ve come across any interesting problems like that in your work, tell us about them.)

If a Table is a way of including a SQL table inside of the language, then an Atlas is a way of including a NoSQL store: none of the rows have to have the same columns. In other words, an Atlas is a schemaless table, or simply a list of JSON objects, with keys in the objects pointing to values. The Atlas is a generalization of the Table where the rows don’t have to be uniform. In one sense, an Atlas is a sparse Table. You could also think of a Table as a vectorized Atlas.

: A regular table

: A JSON array of objects

: Awkward tabular representation of the JSON array

In Kerf we use the term “Map” to refer to what JSON calls an Object {"a":1}. So the term Atlas to refer to a List of Maps (with an Index) was an obvious choice. We use the special convenience syntax {{ and }} to create a table, and the related {[ and ]} convenience syntax to create an atlas. Both of these came about because they fit in well with JSON, and Kerf is a JSON (& SQL) superset.

1. The first thing that makes an Atlas special is that every key is indexed invisibly and by default. This means that no matter what key you query on, your query will be optimized. Atlases let you forget about indexing entirely. In a typical NoSQL store, if you want optimized queries on a key you have to issue the equivalent of a CREATE INDEX command, wait for the index to be created, incur future performance penalties, and so on. With Atlases you don’t have to fool around with any of that. The keys are already indexed, there’s no additional performance penalty, and adding a never-before-seen key doesn’t impact performance. Effectively, you can just forget about indices entirely, even for nested keys.

: A large, (mostly) random atlas

: Already optimized selects on two keys

2. The second thing that makes Atlases interesting is that you can use SQL on them and it will just work. Typical NoSQL stores force you to learn a proprietary query language. It’s a lot easier to be able to use the SQL that you already know, and then to have the output be what you would expect. Even though there’s no specification in SQL for how a NoSQL table should behave, most programmers have an intuitive sense of what should come back. A lot of crazy corner cases have to be handled to accommodate this method of thinking, but because that’s all taken care of on our end (the internals), you don’t have to worry about it at all.

: Sample query

: Multiple renamed columns

: Nested keys

: WHERE clause

: Aggregation, GROUP BY

: Integrated language calculations

3. The third thing that make Atlases special is that the data type is integrated with the language. An Atlas is a first-class data type, just like an int or an array. This means that from inside the language, you can create Atlases, query them, manipulate the results, etc. If you’re querying an old-fashioned database, you need to slurp your results in through the tiny straw of an API. But in Kerf, the language and database are the same, so the code you write works directly on the data. There’s no traversal between a language layer and a database layer. Because the two layers are integrated, so to speak, this means that you can perform “out-of-core” computations without putting special thought into it. It also means that there’s no impact to performance.

: Intermix atlases with the language

: Store and modify selections

4. The fourth thing that makes Atlases special is that they are completely and entirely columnar in order to optimize performance. The index is columnar, the keys are columnar, the values are columnar, everything is columnar. This helps with performance and also with serializing the object.

Because a few existing solutions for schemaless data don’t really perform as advertised, we’ve come across several companies who are suffering in silence with irregular data problems they don’t know how to solve. But there is a tool that exists to solve it, called an Atlas, and we package it in Kerf. If you are faced with a problem with a lot of columns, or ragged or irregular columns, or sparse data, or more than one table mixed together, or have just been disappointed in NoSQL in general, you may want to look into Kerf. Contact us for any reason at founders@kerfsoftware.com.

Kevin Lawler 1:58 pm on June 21, 2016

Time Bars

Customers always ask us whether Kerf supports rolling up times into “time bars” or “time buckets.” The answer is yes, Kerf support time bars, and our time bars do a lot more than you might expect. Because not everybody is familiar with what time bars do, in this post we’ll give a basic overview of what time bars are and how they work.

Motivation

Time-series data always contains a timestamp column, and this column usually does more than mere record-keeping. With traditional data, you might track when a user registers for your e-commerce website, but the time when the user registers is irrelevant to the purpose of the site. A distinguishing quality of time-series data is that the time column matters. Changing the order of the users table doesn’t make a difference, but changing the times on a column of stock prices makes a big difference. Time matters here.

What this means is that you’re not really interested in a pure ordered table of events. The table is only a compacted depiction of what happened. If the table were really accurate, there would be empty space between the rows, proportional to the length of time between events. This is impractical, so we timestamp the events, and the distance between them is implied.



┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│ 20.0│2016.11.11T00:00:11.000│
│ 24.0│2016.11.11T00:00:13.000│ //+02s difference
│ 23.0│2016.11.11T00:00:19.000│ //+06s difference
│ 23.0│2016.11.11T00:00:37.000│ //+18s difference
│ 25.0│2016.11.11T00:01:31.000│ //+36s difference
└─────┴───────────────────────┘

Though ultimately we work with tables, mentally we have to think of them as timelines. Timelines have a lot of weird properties tables don’t. Our ordered table can only have one row in the first position, one row in the second, and so on. A timeline, in comparison, can have 132 events in the first hour, none in the second, 15 in the third, etc. The arrival of events is irregular. Because this is tricky to deal with, we want to turn a timeline back into an ordered table, that is, we want to make an irregular sequence of events regular. This is what time bars are for.


select bars(5s,time) from t

┌───────────────────────┐
│time                   │
├───────────────────────┤
│2016.11.11T00:00:10.000│
│2016.11.11T00:00:10.000│ //repeat
│2016.11.11T00:00:15.000│
│2016.11.11T00:00:35.000│ //gap
│2016.11.11T00:01:30.000│
└───────────────────────┘

Time bars are used to normalize time-series. In Kerf, the bars function floors all the timestamps to the nearest multiple of five minutes, say. By design, this output contains a lot of repeats. The point of the repeats is that they are keys that you can group on. By performing aggregation functions on groups of rows, you can find the average price over five minute intervals, or the sum of the lots that were traded during those times. The aggregation reduces what happened during each window into a uniform output.

select avg(price) from t group by bars(5s, time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.11.11T00:00:10.000│ 22.0│
│2016.11.11T00:00:15.000│ 23.0│
│2016.11.11T00:00:35.000│ 23.0│
│2016.11.11T00:01:30.000│ 25.0│
└───────────────────────┴─────┘

Aggregation is lossy. We’re condensing many rows into one, so we’re losing data. We’ve also rounded off the times, so we’re losing data there, too. The part that concerns us here is that we’re losing accuracy in the output.

Losing accuracy is a bad thing, but it’s sometimes worth it if you get something in return. The first advantange to lossiness is that it can make data comprehensible to people: looking at quarterly performance is more revealing than looking at pages of raw ticks. This is using a filter, in particular a downsampling filter, and downsampling filters are lossy methods for finding signal in lots of noisy data.

The second advantage to lossiness is that sometimes you need to delete data more than you need to conserve accuracy, and so you can bar your data to reduce what you have to store. This is great when the cost of disk is more important than some extra accuracy. It’s less great when insufficient technology forces you to use time bars because it otherwise can’t handle the volume of data. Kerf is designed to always handle the full volume of ticks without bars.

Absolute Bars

The standard way to create time bars is using what we’ll call absolute bars. Set to one-hour increments, an absolute time bar will roll up all the events from today at 7:00-7:59AM into a 7:00AM bucket, all the events from 8:00-8:59AM into an 8:00AM bucket, and so on. Tomorrow if events happen at 7:00AM, those would go in a separate bucket; future 7:00AM events never go into today’s 7:00AM bucket. That means that an hour bucket needs to keep track of the day, the month, and the year, even though it’s only grouping by hour. The rule is that every unit greater than the unit you’re grouping on is kept and unchanged. Units less than the one you’re grouping on will be truncated to zero. It’s easier to explain this with a picture.



┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│   10│2016.06.22T06:55:00.000│
│   12│2016.06.22T07:01:00.000│
│   11│2016.06.22T07:04:00.000│
│   11│2016.06.22T07:22:00.000│
│   12│2016.06.22T07:43:00.000│
│   12│2016.06.22T09:04:00.000│
│   13│2016.06.23T07:33:00.000│ //next day
└─────┴───────────────────────┘



┌─────┬───────────────────────┬───────────────────────┐
│price│time                   │timebars               │
├─────┼───────────────────────┼───────────────────────┤
│   10│2016.06.22T06:55:00.000│2016.06.22T06:00:00.000│
│   12│2016.06.22T07:01:00.000│2016.06.22T07:00:00.000│
│   11│2016.06.22T07:04:00.000│2016.06.22T07:00:00.000│
│   11│2016.06.22T07:22:00.000│2016.06.22T07:00:00.000│
│   12│2016.06.22T07:43:00.000│2016.06.22T07:00:00.000│
│   12│2016.06.22T09:04:00.000│2016.06.22T09:00:00.000│
│   13│2016.06.23T07:33:00.000│2016.06.23T07:00:00.000│
└─────┴───────────────────────┴───────────────────────┘

Here we’ve updated the table to show the time bars. In practice no one does this, since you can group on the result of the bars function without storing it, and that’s quicker if you’re already familiar with bars.

select avg price from t group by bars(1h,time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│ 10.0│
│2016.06.22T07:00:00.000│ 11.5│
│2016.06.22T09:00:00.000│ 12.0│
│2016.06.23T07:00:00.000│ 13.0│
└───────────────────────┴─────┘

Relative times can be expressed succinctly in Kerf. In all our examples, relative times are given as the first argument to the bars function. Some common times to want to group on:



1y     yearly
3m     quarterly
1m     monthly
7d     weekly
1d     daily
1h     hourly
5i     five-minute bars ('i' since 'm' is month)
1i     one-minute bars
1s     one-second bars
1000n  microseconds (via nanoseconds)

Modular “Wall Clock” Bars

Many time-series exhibit some form of periodicity. Home electricity usage increases during the summer months. E-commerce websites sell more goods during the holidays. Stock trading activity decreases during lunch. A good way to isolate this trend is to stack the hours from different days on top of each other. We can call these modular bars.

The first step to generating modular bars is to consider 7AM as a time on the wall that recurs every day. Using this method you would group hours from different days together. This lets you see how the hours in the day perform against each other over time.


┌─────┬───────────────────────┐
│price│time                   │
├─────┼───────────────────────┤
│  101│2016.06.22T06:30:00.000│
│  102│2016.06.22T12:30:00.000│
│  103│2016.06.22T18:30:00.000│
│  102│2016.06.23T00:30:00.000│
│  102│2016.06.23T06:30:00.000│
│  101│2016.06.23T12:30:00.000│
└─────┴───────────────────────┘




select time['hour'] as hour from t

┌────┐
│hour│
├────┤
│   6│
│  12│
│  18│
│   0│
│   6│
│  12│
└────┘




select time['day'] as day, time['hour'] as hour from t

┌───┬────┐
│day│hour│
├───┼────┤
│ 22│   6│
│ 22│  12│
│ 22│  18│
│ 23│   0│
│ 23│   6│
│ 23│  12│
└───┴────┘

In Kerf, indexing in to a time or a time array with special keywords like ‘hour’ will return the attribute in question. With this technique you can extract


'date'     //2016.06.22
'time'     //10:11:00.000
'year'     //2016
'month'    //6
'day'      //...
'hour'
'minute'
'second'
'millisecond'
'microsecond'
'nanosecond'
'week'

Note that this “drops” any time attribute larger or smaller than the attribute in question. So it differs from our absolute bars method from earlier. So different days may produce the same output for ‘hour’. These are keys we can group on.


select avg(price) from t group by time['hour']

┌────┬─────┐
│time│price│
├────┼─────┤
│   6│101.5│
│  12│101.5│
│  18│103.0│
│   0│102.0│
└────┴─────┘

To separate by day and hour, but not year, add another element to the list of group by arguments.


select avg(price) from t group by time['day'], time['hour']

┌────┬─────┬─────┐
│time│time1│price│
├────┼─────┼─────┤
│  22│    6│101.0│
│  22│   12│102.0│
│  22│   18│103.0│
│  23│    0│102.0│
│  23│    6│102.0│
│  23│   12│101.0│
└────┴─────┴─────┘

Which is the same, in this limited case, as the average on the absolute bars.


select avg price from t group by bars(1h, time)

┌───────────────────────┬─────┐
│time                   │price│
├───────────────────────┼─────┤
│2016.06.22T06:00:00.000│101.0│
│2016.06.22T12:00:00.000│102.0│
│2016.06.22T18:00:00.000│103.0│
│2016.06.23T00:00:00.000│102.0│
│2016.06.23T06:00:00.000│102.0│
│2016.06.23T12:00:00.000│101.0│
└───────────────────────┴─────┘

To contact us about Kerf, mail kevin@kerfsoftware.com

Kevin Lawler 5:58 pm on March 30, 2016

The Best Algorithm No One Knows About

Here’s a program roughly 0% of programmers know how to write: generate a list of random tweets, without duplication. You may think you know how to do this, but you almost assuredly don’t.

Say you work at Twitter and have to select just one tweet at random. This is an easy problem. Generate a random number less than the total number of tweets, and draw the tweet that corresponds somehow to your number. If you have to make a list of tweets, then, provided you don’t mind duplicates, you can repeat this same process over and over. We get a list of tweets drawn at random, possibly with some duplication. Any programmer could write this program.

The catch comes when you add the requirement that the selected tweets be unique. The whole problem changes. Once uniqueness is required, you get trapped by another, implicit, requirement, which is that we expect every tweet to be equally likely to appear.

(Stated more formally: given non-negative integers k and n with k <= n, generate a list of k distinct random numbers, each less than n. We reasonably ask that this be done in O(k) time and O(1) additional space. Finally, the distribution must be uniform.)

We started with the language of tweets, but it’s also possible to use the language of cards: deal k cards from a deck of size n, without replacement. So a poker hand like A2345 of clubs is valid for k=5 and n=52, but a hand containing AAAAA of clubs is not.

The misleading part in using the language of cards is that you don’t often consider a deck of size 2⁶⁴. What’s nice about the card formulation though is that it conveys how simple the problem statement really is. This is a fundamental problem that was open for a long time. Nobody knew how to deal cards.

The first obstacle that makes the “dealing” (without replacement) hard is that the “draw” method doesn’t work. If you draw (with replacement) over and over again, ignoring cards you’ve already picked, you can simulate a deal, but you run into problems. The main one is that eventually you’re ignoring too many cards and the algorithm doesn’t finish on time (this is the coupon collector’s problem).

There are several things to try next and none of them work. You can’t generate a random number for every card in the deck (too many cards). You can’t rearrange the deck (too much space). Once you get the basic complexity requirements down, what makes the problem slippery is keeping the balance in the distribution. You can try to play with “resizing the window” you’re looking at, but most of these methods add bias (usually, you undersample early items and oversample later ones).

Card dealing was unsolved for a long time. Finally, Jeffrey Vitter published the solution in 1984. A cleaned-up version appeared in 1987. To put this in context, public-key cryptography algorithms appeared about a decade earlier in the 1970s. Basically all other fundamental problems that sound like this were solved in the 1960s and 1950s, so we might have expected to have a solution from then. But in the 1980s, dealing cards was a research-level problem in Knuth.

The algorithm is still relatively unknown. When I came across a need for the algorithm in my work, it took a lot of detective work to even find it. There was no source code online. The algorithm is not referenced often, and when it is, it is in a highly obscure fashion. After reaching Vitter’s papers it again takes a concentrated effort to figure out what you need. This was a big surprise to me.

I think the explanation for this is that the method was originally presented as a capstone effort in the area of “reservoirs” and “tape sampling”. These days, while it’s possible you’ve worked at a place that uses tape drives for backup, chances are you’ve never programmed for a tape drive. I certainly haven’t. Based on my reading of the research, by the mid-80s subjects like tape drive sampling were considered antique. The most important algorithm appeared after people had stopped looking at that field. The only reason I know the field exists is it’s a giant section in Knuth that I’ve flipped past many times. Earlier algorithms for “card dealing” or “urn sampling” are hidden in there.

Today tape drive algorithms seem pointless, but there’s a good reason for approaching the problem from this perspective. Random access is costly on a tape drive. Regular access is costly on a tape drive. Tape drives are much less nimble than disk drives, and this enforces a strongly sequential read mentality. A sequential, long-duration read pattern is a useful way to frame the problem, as it results in a solution which is both sorted and “online”. Sorted means the dealt cards are in order, and online means you can do them one at a time, perhaps while you wait for a spindle to turn. This completist approach may be how you need to think about the problem in order to solve it in the first place.

The reason it’s nice for an algorithm to produce results in sorted order is because it’s easy to randomize the order after the fact. The reverse is not true. Sorting algorithms are linear only in restricted cases, and so the addition of an O(k*log(k)) sort will forfeit our original O(k) runtime. List shuffling algorithms are O(k) and easy to implement.

That it’s so easy to shuffle a small list is probably why the algorithm is overlooked. When most people need to deal cards they shuffle the entire deck. This works as long as the size of the deck is within reason. What happens is that once the deck is over a certain size, people resort to simple drawing algorithms, and nobody looks too closely at bias or inefficiency in the randomized application.

This is unfortunate. If you’ve ever sat through a statistics or combinatorics class, you know drawing balls from urns without replacement is a fundamental sampling application. In array languages like APL, sampling with and without replacement are implemented as first-class primitives “deal” and “draw”. It’s nice in theory to have sampling as a primitive since then you don’t have to worry about screwing it up. You do have to worry about the language implementors getting it wrong, though. I’ve talked with several people who’ve implemented deal-like primitives in the past, and Vitter’s name only comes up when I bring it up, so I assume many existing deal implementations either use slower methods, or, I guess, some kind of ad hoc method with an incorrect distribution.

We use Vitter’s algorithm in Kerf, and deal() is a first-class primitive. Kerf’s deal() is great for when banks need to sample a random collection of stock tickers, or a random collection of trades from throughout the day. Uses appear all the time.

I’ve cleaned up an older version of the C source code we use and made it available below. This source first appeared in a previous project here. A description of the algorithm and pseudocode can be found here in Vitter’s paper.

//Copyright Kevin Lawler, released under ISC License

double random_double()  //your random double function
{
  //[0,1) uniformly random double, mt19937-64.c source is fine
  //keep in mind most double-based implementations drop randomness to 52-bits
  return genrand64_real2();
}

//POTENTIAL_OPTIMIZATION_POINT: Christian Neukirchen points out we can replace exp(log(x)*y) by pow(x,y)
//POTENTIAL_OPTIMIZATION_POINT: Vitter paper points out an exponentially distributed random var can provide speed ups
//Vitter, J.S. - An Efficient Algorithm for Sequential Random Sampling - ACM Trans. Math. Software 11 (1985), 37-57.
//'a' is space allocated for the hand
//'n' is the size of the hand
//'N' is the upper bound on the random card values
void vitter(int64_t *a, int64_t n, int64_t N) //Method D
{
 

  int64_t i = 0;
  int64_t j = -1;
  int64_t t; 
  int64_t qu1 = -n + 1 + N; 
  int64_t S; 
  int64_t negalphainv = -13;
  int64_t threshold = -negalphainv*n;
  int64_t limit;

  double nreal = n;
  double Nreal = N;
  double ninv = 1.0/n;
  double nmin1inv = 1.0/(n-1);
  double Vprime = exp(log(random_double())*ninv);

  double qu1real = -nreal + 1.0 + Nreal;
  double negSreal;
  double U; 
  double X; 
  double y1; 
  double y2; 
  double top; 
  double bottom;
  

  while(n > 1 && threshold < N)
  {
    nmin1inv=1.0/(-1.0+nreal);

    while(1)
    {
      while(1)
      {
        X = Nreal * (-Vprime + 1.0);
        S = floor(X);

        if(S < qu1)
        {
          break;
        }

        Vprime = exp(log(random_double()) * ninv);
      }

      U = random_double();
      
      negSreal = -S;
      
      y1 = exp(log(U*Nreal/qu1real) * nmin1inv);
      
      Vprime = y1 * (-X / Nreal+1.0) * (qu1real / (negSreal + qu1real));
      
      if(Vprime <= 1.0)
      {
        break;
      }
      
      y2 = 1.0; 
      
      top = -1.0 + Nreal;       
      
      if(-1+n > S)
      {
        bottom = -nreal + Nreal;
        limit = -S + N;
      }
      else
      {
        bottom = -1.0 + negSreal + Nreal; 
        limit = qu1;
      }
      
      for(t = N-1; t >= limit; t--)
      {
        y2=(y2 * top) / bottom;
        top--; 
        bottom--;
      }
      
      if(Nreal / (-X + Nreal) >= y1 * exp(log(y2) * nmin1inv))
      {
        Vprime = exp(log(random_double()) * nmin1inv);
        break;
      }

      Vprime = exp(log(random_double()) * ninv);
    }

    j += S + 1;

    a[i++] = j;

    N = -S + (-1 + N); 

    Nreal = negSreal + (-1.0 + Nreal);

    n--;
    nreal--;
    ninv = nmin1inv;

    qu1 = -S + qu1;
    qu1real = negSreal + qu1real;

    threshold += negalphainv;
  }

  if(n>1)
  {
    vitter_a(a+i,n,N,j); // if i>0 then n has been decremented
  }
  else
  {
    S = floor(N * Vprime);

    j += S + 1;

    a[i++] = j;
  }

}

void vitter_a(int64_t *a, int64_t n, int64_t N, int64_t j) //Method A
{
  int64_t S, i = 0;
  double top = N - n, Nreal = N, V, quot;

  while(n >= 2)
  {
    V = random_double(); 
    S=0;
    quot=top/Nreal;

    while (quot>V)
    {
      S++; top--; Nreal--;
      quot = (quot * top)/Nreal;
    }

    j+=S+1;

    a[i++]=j;

    Nreal--;

    n--;
  }

  S = floor(round(Nreal) * random_double());

  j+=S+1;

  a[i++]=j;

}

Kevin Lawler 1:34 pm on February 22, 2016

String Interning Done Right

Refresher

The common method for interning strings breaks in fantastic ways. In Kerf, we’ve taken the old method and revised it for success in the current generation of languages.

If you’ve forgotten what string interning is from your time with Java, it’s a simple way of ensuring that any string appears only once. So for instance, if I have ten objects whose type is the string “lion”, I don’t have ten copies of the string “lion” in memory. What I have instead is a single copy of “lion” and then ten links which all point to the single copy somehow.

Most often these links are pointers (raw addresses in memory). We’ll discuss how this breaks soon. The reference copy of “lion” is never allowed to be moved, changed, or released. It’s permanent. The other implementation detail to figure out is how to keep the strings unique. The next time we create a “lion” string we need to trade our string in for a link, and what is usually done is that the reference copy is stored in a hashtable or some other deduplicating data structure. This lets us figure out that lion already exists, and when we perform the check we can walk away with our link at the same time. If the phrase is “bullmation”, and it doesn’t already exist, then we can add the initial copy to the data structure at the same time.

intern-diagram-01

One benefit of string interning is that you don’t waste space if you’re storing the same long string over and over again.

There is another benefit to string interning, which is that it converts a list of variable-length strings to a fixed-width array. An array of pointers may be indexed into easily, just as any other array. It is not possible to index into a list of variable-length strings without additional overhead.

One really bad thing about string interning is that you usually can’t get rid of any string you’ve seen. If your language is naively storing pointers, you have no way of knowing when it gets rid of the last one. This means that the more strings you come across the more memory you lose forever. Simple parsing jobs can cause languages with string interning to run out of memory.

Java most famously uses string interning, but the concept originated in early Lisp. It shows up in some form or another in most languages. Kerf’s authors learned much about string interning from K, which got it seemingly directly from Lisp, and we’ll get to that in a second.

The Java Language

Java likes string interning because it provides a shortcut method for testing string equality. To test the equality of two strings in Java you simply check whether the links are equal. Links are a small fixed size, but strings can be any length, so testing takes O(1) time instead of O(n). Java also touts that this test can be accomplished with pointers, as in “string1 == string2” instead of the more verbose “string1.equals(string2)”: this does little other than to generate trivia and serve as a further point of confusion.

Java is half-hearted in its commitment to interned strings. Performance reasons force Java to expose multiple other categories of string types. Character arrays are a basic type in Java, and character buffers are a special object. You can see why Java needs these types. With permanent default interning, any sort of sequence involving character-level appends, such as reading the contents of a book from a file, would result in a preposterous O(n²) version of an otherwise trivial technique. The multiple categories of string objects leads to a fragmented ecosystem and a real headache.

What is exasperating about Java, is that they probably shouldn’t have exposed string interning at all. The kind of string editing operation that devolves to O(n²), and pollutes the pool, is by far the most common use case for Java strings. And the kind of use case that demands interning, frequent equality testing, such as in a database, rarely appears in Java, which is not an appropriate language for that level of performance anyway. So they got it backwards. What they should’ve done is made “lion” an uninterned character array, and suffered the penalty on equality testing, which in Java applications would be unnoticeable.

(Commenters point out that Java has changed over the years, so the preceding refers only to Java as I was taught.)

The K Language

Arthur Whitney’s K series of languages makes more appropriate use of string interning. K supports both interned and non-interned strings, but correctly makes the basic string type non-interned, and reserves the interned type for special occasions.

K’s string interning uses raw pointers. K calls interned strings “symbols”—a name adopted from Lisp—and leans heavily on interning’s constant-time comparison benefits. The most common use for a symbol in K is for representing a stock ticker, such as ‘IBM’ or ‘MSFT’, and K stores long columns of these symbols. If you do a table scan, for instance, there’s a savings in doing a single pointer comparison over doing a character-by-character comparison each time.

Internally, K depends on symbols for a variety of other speedups. It uses linear-time lookup dictionaries, instead of constant hash lookups, so interning is a necessity there. Variable lookup also depends on interning, and uses the same linear-time dictionaries. Without interning the core parts of the language would be appreciably slower.

Whitney’s thinking about symbols informs much of our views on the subject, and we consider our thoughts on the matter to be the natural extension of his. Working with his symbol implementation helped us to see where to take the concept next.

Modern Problems

What’s broken in previous implementations of string interning is the global, per-process pool for immutable strings, and the use of memory pointers to reference them. So, everything.

The core issue is that process-level interning is the wrong way to think about strings. The evolving view is that the process is a single, individually-unimportant node in the graph of some inter-process operation. So focusing on the process as the end-all is the wrong way to go about it. Dynamic pointers are one of the big offenders here. But so is the global pool. Processes change from launch to launch, and while it is usually only modest overhead to rebuild an intern pool, the transitory nature of the process and the oracular nature of the pool make global interning an awkward choice. This is most noticeable when writing objects to disk or sending them across the network. As soon as the process needs to communicate the scheme breaks down. The pointer scheme makes life easier for the language implementor, at least at first, but it makes it harder for everything and everyone else involved in supporting the language.

In the old system, each process is an island. A pointer from one process is useless to another process, or to a file on disk. At a minimum this adds an extra layer of serialization and deserialization. What should you turn the pointer into (null delimited strings, string objects?)? Do you send the whole pool each time? How do you avoid storing a fixed-width column as a variable-length one?

One common and fairly bad way to send an array of interned strings across the network is to send them as a null-delimited list of variable length strings. So if you have the string “manticore” repeated hundreds of times, some systems actually transfer the entire null delimited string hundreds of times, with no deduplication. The strategy here is to replace all fixed-width pointers with variable-length null delimited strings.

Another poor technique we’ve seen is a scheme for writing a column of symbols to disk. In this case, the product writes a separate deduplicated null-terminated list of symbols for each table, that effectively represents a static copy of the intern pool. This symbol list is shared among all columns. Then the individual column is written, which consists of a fixed-width array of numeric indices into the symbol list. So the pointers have been translated into indices into this list.

When it comes time to read a column from the disk, the entire symbol list must be loaded into memory, and then the symbol column must be translated from indices into dynamic pointers. There is overhead in inserting the symbol from the list into the pool (rehashing), and overhead in translating to pointers. This translation is always necessary because you can’t use the same integer mapping from two different tables (to see why, consider a table where ‘1’ is ‘YHOO’ and another where ‘1’ is ‘SBUX’). More generally, such symbol lists cannot be shared, though this was often desirable, and they trivially fall out of sync between tables and processes.

Depending on how you want to look at it, the “global pool of pointers” method forces you to use three distinct representations (memory, disk, and network) and at least four translation layers: back and forth from memory to the disk, and back and forth from memory to the network.

Kerf’s Solution

The right way to think about interning is to reduce the scope of the intern pool. In Kerf, instead of maintaining one global pool per process, we maintain local pools on a per-object basis. We assume that the process is one of many, and we avoid the idea of a global pool. Then we get rid of dynamic pointers. Kerf indexes into object-local pools using numeric indices. This works exceptionally well for array languages, but the technique extends to regular imperative languages as well.

This technique allows objects to have the same representation in-memory, on disk, and transferred over the network. No translation layer is necessary. Objects with interned strings can now be mapped. The technique also allows Kerf to get away with using a single string type, instead of several incompatible ones.

Kerf is an array language, and to support string interning it exposes a special kind of array called an ‘Enum’. An Enum works like you would expect a regular array to work, with the exception that, internally, the array deduplicates objects. So the following list, when stored as an Enum,

['lion', 'lion', 'lion']

appears to be a regular array, but actually only stores one copy of the string ‘lion’ in an associated hashset, and then stores three integer references to inside the hashset. The pool is dynamic. You could use a hashtable or a regular array/list instead of a hashset, though those options are less suitable. When you insert an item into the enum it checks to see whether that item exists, and adds only an integer reference. Same story when you concatenate a list to the enum (except optimized), and so on. But in every instance the enum offers the same functionality as a regular array. There is no limit to the number of interned items, and Kerf’s enum support arbitrary data types, not just strings. Kerf uses enums for columns in a table, say.

intern-diagram-02

The natural objection to this is that, if you use multiple columns, you create multiple copies of the strings. In practice this objection turns out not to matter. The number of pools in which a given string may appear is relatively small (perhaps 10). The pools themselves are small (anywhere from ten entries to one million entries). The size of the pools is dominated by the number of indices referencing them (say billions of indices). Reference counted objects means that all of this matters even less anyways.

Let’s take an example. There are about 3,000 tickers traded on the NASDAQ. If we assume they’re 10 characters each (they’re less), that’s only about 30k in overhead per pool. With ten enumerated columns, that’s only 300k in overhead. A daily column may have 30M entries, a historical column may have thousands of times that. So for stock prices we’re talking starting duplication ratios of about 0.1% to thousands of times less. In practice, because the local pools can be reference counted, there may not be any duplication at all.

The second objection is that it is not immediately obvious how to reconcile two separate objects with different pools. We’ve built an entire language on top of this architecture, and we can say from experience that in practice this also turns out to be simple. These cases appear less frequently than you might think. Standard unoptimized tools handle basically all of these without any additional effort. In cases where additional effort is necessary, the architecture turns out to be the right tool for the job. Consider the example of optimizing the case where two potentially different lists of interned strings are tested for equality. You cannot simply compare the vectors of indices, because the adjoining pools might be composed differently. What you can do, however, is create an O(c) lookup table that handles the translation. Because the pools are small, the lookup tables are guaranteed to be small as well. The operation remains linear in the number of strings.

All sorts of interesting results pop out of this theory of string interning. Perhaps the most impressive is how easy it is to grade (sort) interned strings in O(n) time instead of O(n*lg n) time. I’ll skip over the details here, but effectively, you get to use a counting sort on a list of integers, which is a linear sort, and likely at the practical lower bound. Imagine that: linear sorting for strings. It works so well that I suspect there are cases where you want to enumerate a regular list before sorting it. I find this to be incredible. Other results like this one continue to appear naturally, which to us is strong evidence this is the right path for string interning.

Kerf blog

Author Archives: Kevin Lawler