
When working on an ordinary classification problem, one of the most important tasks is feature engineering: creating new features from the data. This guide shows different ways to create those new features from existing columns or dictionaries, so you don’t have to check Stack Overflow ever again for column creation!
This guide can be divided into four parts. If you are looking for a special case, check out where to find this case here:
-
Part 1: Basic Operations Using the operators, handling dates and ranking items.
-
Part 2: Conditions and Functions Here you can see how to create new columns with existing or user-defined functions. If you work with a large dataset and want to create columns based on conditions in an efficient way, check out number 8!
-
Part 3: Multiple Column Creation It is possible to create multiple columns in one line.
-
Part 4: Other Data Sources Last but not least, how to merge dataframes and use dictionaries for mapping values.
In the code examples, a simple dataframe is used:

Okay, here we go!
Part 1: Basic Operations
1. Using the operators
The easiest way to create new columns is by using the operators. If you want to add, subtract, multiply, divide, etcetera you can use the existing operator directly.
# multiplication with a scalar
df['netto_times_2'] = df['netto'] * 2
# subtracting two columns
df['tax'] = df['bruto'] - df['netto']
# this also works for text
df['fullname'] = df['firstname'] + ' ' + df['lastname']
df[['firstname', 'lastname', 'bruto', 'netto', 'netto_times_2', 'tax', 'fullname']].head()

2. Extracting information from a date column
Dates can contain valuable information. In our example dataframe, we can calculate the age of a person or extract the year of birth. Another option is to calculate the days since a date.
# converting the dtype
df['birthdate'] = pd.to_datetime(df.birthdate)
# extracting the year from a date
df['year_of_birth'] = df['birthdate'].dt.year
# calculating the age of a person
now = pd.to_datetime('now')
df['age'] = (now - df['birthdate']).astype('<m8[Y]').astype('int')
# calculating days since date
now = pd.to_datetime('now')
df['days_since_birth'] = df.apply(lambda row: (now - row['birthdate']).days, axis=1)
df[['birthdate', 'year_of_birth', 'age', 'days_since_birth']].head()

3. Ranking values
If you want to rank column values from 1 to n, you can use rank:
df['netto_ranked'] = df['netto'].rank(ascending=False)
df['netto_pct_ranked'] = df['netto'].rank(pct=True)
df[['netto','netto_ranked', 'netto_pct_ranked']].head()

Part 2: Conditions and Functions
4. Conditional column creation
If you have a condition you can use np.where:
df['child'] = np.where(df['age'] < 18, 1, 0)
df['male'] = np.where(df['gender'] == 'M', 1, 0)
df[['age', 'gender', 'child', 'male']].head()

5. Apply an existing function to a column
If you want to use an existing function and apply this function to a column, df.apply is your friend. E.g. if you want to transform a numerical column using the np.log1p function, you can do it in the following way:
# applying an existing function to a column
df['log1p_bruto'] = df['bruto'].apply(np.log1p)
df[['bruto', 'log1p_bruto']].head()

6. Apply and lambda combined
In the first example, we subtracted the values of the bruto and netto columns. It is easy to use basic operators, but you can also use apply combined with a lambda function:
df['tax'] = df.apply(lambda row: row.bruto - row.netto, axis=1)
df[['bruto', 'netto','tax']].head()

7. Create a custom function (and apply)
Sometimes you have multiple conditions and you want to apply a function to multiple columns at the same time. It can be done by using a custom made function, and applying this function to your dataframe. Let’s create age groups in our dataframe.
# create a function to define age groups
def age_groups(row):
if row['age'] < 18:
return 0
elif row['age'] >= 18 and row['age'] < 30:
return 1
elif row['age'] >= 30 and row['age'] < 60:
return 2
else:
return 3
# apply to dataframe, use axis=1 to apply the function to every row
df['age_groups'] = df.apply(age_groups, axis=1)
df[['age', 'age_groups']].head()

You can easily use multiple columns and multiple conditions with this way of column creation. If you want to use age and bruto income to interpret salaries:
def age_salary(row):
if row['bruto'] / row['age'] > 100:
return 'high salary'
elif row['bruto'] / row['age'] <= 100 and row['bruto'] / row['age'] > 50:
return 'medium salary'
elif row['bruto'] / row['age'] < 50 and row['bruto'] / row['age'] > 0:
return 'low salary'
else:
return 'no salary'
df['salary_age_relation'] = df.apply(age_salary, axis=1)
df.head()

8. Multiple conditions (vectorized solution)
The solution in the previous example works, but might not be the best. If you are looking for a more efficient solution (e.g. if you deal with a large dataset), you can specify your conditions in a list and use np.select:
conditions = [df['bruto'] / df['age'] > 100,
(df['bruto'] / df['age'] <= 100) & (df['bruto'] / df['age'] > 50), (df['bruto'] / df['age'] < 50) & (df['bruto'] / df['age'] > 0)]
outputs = ['high salary', 'medium salary', 'low salary']
df['salary_age_relation'] = np.select(conditions, outputs, 'no salary')
df.head()

This gives the same results as the previous code example, but with better performance.
Part 3: Multiple Column Creation
9. Create multiple columns using one function
To create a fullname column, we used basic operations (check out the first example). What if you have a fullname column, and you want to extract the first and lastname from this column? Good news, you can do this in one line using zip.
# creating the fullname column again
df['fullname'] = df['firstname'] + ' ' + df['lastname']
## method 1: define a function to split the column
# use the fullname column and split this column
def split_fullname(row):
splitted = row['fullname'].split(' ')
return splitted[0], splitted[1]
# use zip to create the columns
df['first'], df['last'] = zip(*df.apply(split_fullname, axis=1))
df[['fullname', 'first', 'last']].head()

10. Create multiple columns using one line
It is possible to create the same columns (first- and lastname) in one line, with zip, apply and lambda:
## method 2: combine zip, apply and lambda for a one line solution
df['first'], df['last'] = zip(*df.apply(lambda row: row['fullname'].split(' '), axis=1))
df[['fullname', 'first', 'last']].head()

Part 4: Other Data Sources
11. Mapping a dictionary
A regular way for column creation is to use a dictionary for mapping values. You can create this dictionary from another table or create your own. You can specify nan values in the dictionary or call fillna after the mapping for missing values.
# mapping a dictionary
genderdict = {'F': 'Female', 'M': 'Male', np.nan: 'Other'}
df['gender_meaning1'] = df['gender'].map(genderdict)
# you can also use fillna after map, this yields the same column
genderdict = {'F': 'Female', 'M': 'Male'}
df['gender_meaning2'] = df['gender'].map(genderdict).fillna('Other')
df.head()

12. Merging dataframes
When you want to combine dataframes, you can do this by merging them on a specified key. Imagine there is another dataframe about professions of some persons:

By calling merge on the original dataframe, the new columns will be added. If the dataframes have one name in common, this column is used when merging the dataframes. The other columns will be added to the original dataframe. Do not forget to specify how=’left’ if you want to keep the records from the first dataframe. You can compare this with a join in SQL.
df.merge(df_profession, how='left')
df.head()

Good luck with your Data Science tasks and in particular column creation!
Don’t forget to subscribe if you’d like to get an email whenever I publish a new article. ❤



