27

I understand that when I have a categorical variable in a model passed to a statsmodels fit, dummy variables will automatically be generated for the categories. For example, if I have a variable 'Location' with values 'IndianOcean', 'Thailand', 'China' and 'Mars', I will get variables in my model of the form

Location[T.Thailand]

with one of the values not represented. By default the excluded value seems to be the least common one. Is there a way to specify, ideally within the model specification, which value is treated as the "base value" and excluded?

2
  • 1
    It seems that using C in the formula (as in ... + C(Location, Treatment) + ...) does the trick, but this results in some pretty ugly category names that I'd like to avoid. Commented Mar 16, 2014 at 13:12
  • 1
    I don't understand this. Do you write e.g. C(Location, 'IndianOcean') if you want 'IndianOcean' to be the reference category from the variable 'Location'? Commented Jul 30, 2014 at 12:30

4 Answers

44

You can pass a reference arg to the Treatment contrast, using syntax like

"y ~ C(Location, Treatment(reference='China'))"

http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.Treatment

If you have a better suggestion for naming conventions please file an issue with patsy.
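As a minimal sketch of the syntax above in a statsmodels fit: the toy data frame below is hypothetical (the y values are invented for illustration), and it assumes pandas and statsmodels are installed. It only shows that 'China' is the level left out of the fitted parameter names.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data; the y values are made up for illustration only.
df = pd.DataFrame({
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "Location": ["IndianOcean", "Thailand", "China", "Mars"] * 2,
})

# Treatment(reference='China') makes 'China' the excluded base level.
result = smf.ols("y ~ C(Location, Treatment(reference='China'))", data=df).fit()

# One intercept plus one dummy per non-reference level.
param_names = list(result.params.index)
print(param_names)
```

The parameter names keep the full C(...) expression as a prefix, which is the verbose naming mentioned in the comments on the question.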


6 Comments

To be explicit, the syntax is "y ~ C(Location, Treatment(reference='China'))" .
@PiotrMigdal thanks for clarifying. I wish the original answer actually included code.
"y ~ C(Location, Treatment('China'))" works as well.
@jseabold, I'm getting the following error with both of the methods above: PatsyError: Error evaluating factor: TypeError: 'Series' object is not callable. Do you have any idea why?
I am having this problem as well. "TypeError: 'Series' object is not callable"
4

If you wrap the formula string in single quotes, the argument to reference needs to be wrapped in double quotes (or the other way around); otherwise the inner quote ends the Python string early. It is a very easy mistake to make: I was using single quotes for both.

For example:

'y ~ C(Location, Treatment(reference="China"))'

is correct.

'y ~ C(Location, Treatment(reference='China'))'

is not correct.
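The quoting rule here is plain Python string syntax rather than anything specific to patsy. A small sketch, using only the standard library, of the variants that parse and the one that does not (the variable names are illustrative):

```python
# Outer and inner quotes differ, so each of these is a valid Python string.
double_outer = "y ~ C(Location, Treatment(reference='China'))"
single_outer = 'y ~ C(Location, Treatment(reference="China"))'

# Escaping the inner quotes also works and yields the same text as double_outer.
escaped = 'y ~ C(Location, Treatment(reference=\'China\'))'

# Same quote character inside and outside: the string ends right after
# "reference=", so Python rejects the line before patsy ever sees it.
broken_source = "formula = 'y ~ C(Location, Treatment(reference='China'))'"
try:
    compile(broken_source, "<example>", "exec")
    broken_is_syntax_error = False
except SyntaxError:
    broken_is_syntax_error = True

print(escaped == double_outer, broken_is_syntax_error)
```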


3

OK, maybe someone will find this helpful. I needed to set a new baseline category for the dependent variable and had no idea how to do it. I searched and found nothing, so I simply prefixed the other categories with "_". If you have three categories A, B and C, and you want your baseline to be C, you just change the labels A and B to _A and _B. It works. It appears that the baseline category is determined by sorted(), and "_" sorts after the uppercase letters.

Maybe someone knows a proper way to do it; this is not very Pythonic, ha.
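One possible alternative to renaming the labels, sketched here with patsy directly and under the assumption that the variable is used as a categorical predictor in a formula, is to pass an explicit level order through the levels argument of C(); the first level listed becomes the baseline. The data dict and the names group and order are hypothetical, and a plain dict stands in for a DataFrame.

```python
from patsy import dmatrix

# Hypothetical data with three categories.
data = {"group": ["A", "B", "C", "A", "B", "C"]}

# Why the underscore trick works: "_" sorts after uppercase letters.
print(sorted(["_A", "_B", "C"]))

# Default coding: levels are sorted, so "A" is the dropped baseline.
default_cols = dmatrix("group", data).design_info.column_names

# Explicit level order: the first level, "C", becomes the baseline.
order = ["C", "A", "B"]
reordered_cols = dmatrix("C(group, levels=order)", data).design_info.column_names

print(default_cols)
print(reordered_cols)
```

Treatment(reference=...) from the accepted answer achieves the same thing without listing every level.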


0

In Python 3.11.11 with statsmodels 0.14.4, I kept getting the error PatsyError: Error evaluating factor: TypeError: 'Series' object is not callable when trying to run "y ~ C(Location,Treatment(reference='China'))" or "y ~ C(Location,Treatment('China'))". On reflection this is probably because I unluckily had a column in the dataframe named Treatment.

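The solution, which I haven't seen online, was to reference the contrast explicitly through the patsy module (this assumes patsy has been imported in the scope where the model is built):

"y ~ C(Location, patsy.Treatment(reference='China'))"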
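A small sketch that reproduces the name clash with patsy alone, assuming the cause really is a data column called Treatment. The data dict is hypothetical, and a plain dict stands in for the DataFrame, so the wrapped TypeError mentions 'list' rather than 'Series'.

```python
import patsy

# Hypothetical data: the "Treatment" column shadows patsy's Treatment contrast,
# because names in the data are looked up before patsy's builtins.
data = {
    "Location": ["China", "Mars", "Thailand", "China", "Mars", "Thailand"],
    "Treatment": [0, 1, 0, 1, 0, 1],
}

try:
    patsy.dmatrix("C(Location, Treatment(reference='Mars'))", data)
    shadowed = False
except patsy.PatsyError as err:
    # "Treatment" resolved to the data column, which is not callable.
    shadowed = True
    print(err)

# Qualifying the name with the module avoids the clash; "patsy" is found in
# the calling scope because of the import at the top.
design = patsy.dmatrix("C(Location, patsy.Treatment(reference='Mars'))", data)
cols = design.design_info.column_names
print(cols)
```

Renaming the offending column before fitting would be another way around the clash.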

