
Classes provide a useful way of combining data and functionality. The modularity of classes makes troubleshooting, code reuse, and problem-solving easier. For example, if your code breaks, you can point to a specific class or class method without having to sift through much other code. Because of these factors, model development naturally lends itself to object-oriented programming practices.
In this post, we will discuss how we can use object-oriented programming for reading data, splitting data for training, fitting models, and making predictions. We will be using weather data which can be found here.
Before we get started let’s import pandas:
import pandas as pd
Now, let’s define a class called Model:
class Model:
    def __init__(self, datafile = "weatherHistory.csv"):
        self.df = pd.read_csv(datafile)
The class has an ‘__init__’ method, also called a constructor, which lets us initialize data when a new instance of the class is created. We can define a new variable, ‘model_instance’, as an object (an instance of the Model class):
if __name__ == '__main__':
    model_instance = Model()
We should be able to access the data frame through the object, ‘model_instance.’ Let’s call the data frame and print the first five rows of the data:
if __name__ == '__main__':
    model_instance = Model()
    print(model_instance.df.head())

Looks good.
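If the weather CSV is not at hand, the same constructor pattern can be tried on in-memory data. The following is a minimal sketch, assuming a hypothetical class name ‘TinyModel’ and made-up sample values; it relies only on the fact that ‘pd.read_csv’ accepts file-like objects as well as file paths:

```python
import io
import pandas as pd

# Made-up sample rows that reuse the column names of the weather data
SAMPLE_CSV = """Humidity,Pressure (millibars),Temperature (C)
0.89,1015.1,9.5
0.86,1015.6,9.4
0.83,1016.0,9.3
"""

class TinyModel:
    """Hypothetical stand-in for Model, used only for illustration."""

    def __init__(self, datafile):
        # datafile may be a path or any file-like object
        self.df = pd.read_csv(datafile)

tiny = TinyModel(io.StringIO(SAMPLE_CSV))
print(tiny.df.head())
```

Because the data source is a constructor argument, the same class can be pointed at a small fixture like this when debugging.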
The next thing we can do is define a linear regression object in the initialization function. This is not to be confused with ‘model_instance’, which is an instance of our custom class ‘Model’:
from sklearn.linear_model import LinearRegression

class Model:
    def __init__(self, datafile = "weatherHistory.csv"):
        self.df = pd.read_csv(datafile)
        self.linear_reg = LinearRegression()
Again, it is worth noting that LinearRegression is a separate class from our custom ‘Model’ class and that in the line of code:
self.linear_reg = LinearRegression()
we are defining an instance of the LinearRegression class.
Now, let’s make sure we can access our linear regression object:
if __name__ == '__main__':
    model_instance = Model()
    print(model_instance.linear_reg)

The next thing we can do is define a method that lets us split our data for training and testing. The method will take a ‘test_size’ parameter, which specifies the fraction of the data held out for testing.
First, let’s import the ‘train_test_split’ function from ‘sklearn’ and let’s also import ‘NumPy’:
from sklearn.model_selection import train_test_split
import numpy as np
We will build a linear regression model which we will use to predict temperature. For simplicity, let’s use ‘Humidity’ and ‘Pressure (millibars)’ as our inputs and ‘Temperature (C)’ as our output. We define our split method as follows:
    def split(self, test_size):
        X = np.array(self.df[['Humidity', 'Pressure (millibars)']])
        y = np.array(self.df['Temperature (C)'])
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size = test_size, random_state = 42)
Next, let’s print ‘X_train’ and ‘y_train’ for inspection:
if __name__ == '__main__':
    model_instance = Model()
    model_instance.split(0.2)
    print(model_instance.X_train)
    print(model_instance.y_train)
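To see what ‘test_size’ does to the array sizes, here is a small sketch on synthetic data (the arrays are arbitrary and are not the weather data). With 10 rows and a test size of 0.2, two rows are held out for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 rows, 2 features (values are arbitrary)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 8 rows go to training and 2 rows to testing
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing ‘random_state’ makes the shuffle reproducible, so repeated runs produce the same split.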

We will now define a ‘fit’ function for our linear regression model:
    def fit(self):
        self.model = self.linear_reg.fit(self.X_train, self.y_train)
We will also define a ‘predict’ function:
    def predict(self):
        result = self.linear_reg.predict(self.X_test)
        return result
Now, let’s print our test predictions where the test size is 20% of the data:
if __name__ == '__main__':
    model_instance = Model()
    model_instance.split(0.2)
    model_instance.fit()
    print(model_instance.predict())
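One way to sanity-check the split, fit, and predict sequence is to run it on data where the answer is known. The sketch below is an illustration on synthetic data, not the weather set: the target is an exact linear function of two made-up features, so linear regression should recover the coefficients and predict the test targets almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic inputs and a noise-free linear target: y = 2*x1 + 3*x2 + 1
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
predictions = linear_reg.predict(X_test)

# The fitted parameters should be close to the ones used to build y
print(linear_reg.coef_, linear_reg.intercept_)
print(np.allclose(predictions, y_test))
```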

We can also print model performance:
if __name__ == '__main__':
    model_instance = Model()
    model_instance.split(0.2)
    model_instance.fit()
    # For regressors, score() returns R^2, not a classification accuracy
    print("R^2: ", model_instance.model.score(model_instance.X_test, model_instance.y_test))
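For scikit-learn regressors, the ‘score’ method returns the coefficient of determination, R^2, computed as one minus the ratio of the residual sum of squares to the total sum of squares. The following sketch computes the same quantity by hand with NumPy; the function name ‘r_squared’ and the sample values are hypothetical and chosen only for illustration:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)           # predictions match exactly
mean_only = r_squared(y_true, [2.5] * 4)      # always predict the mean
print(perfect, mean_only)
```

A value of 1 means the predictions match the targets exactly, and a value of 0 means the model does no better than always predicting the mean of the targets.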

We can also pass an ‘input_value’ parameter to our predict method, which will allow us to make out-of-sample predictions. If ‘None’ is passed (the default), predictions will be made on the test input. Otherwise, predictions will be made on the ‘input_value’:
    def predict(self, input_value = None):
        if input_value is None:
            result = self.linear_reg.predict(self.X_test)
        else:
            # Wrap the single observation so the input is 2D: (1, n_features)
            result = self.linear_reg.predict(np.array([input_value]))
        return result
Let’s call predict with some out-of-sample input:
if __name__ == '__main__':
    model_instance = Model()
    model_instance.split(0.2)
    model_instance.fit()
    print(model_instance.predict([.9, 1000]))
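The reason for wrapping ‘input_value’ in an extra list is that scikit-learn estimators expect a 2D array of shape (n_samples, n_features). A short sketch of the shapes involved, using the same example values:

```python
import numpy as np

input_value = [0.9, 1000]           # one observation: humidity, pressure

as_1d = np.array(input_value)       # shape (2,): a flat vector
as_2d = np.array([input_value])     # shape (1, 2): one row, two features

print(as_1d.shape, as_2d.shape)

# reshape(1, -1) is an equivalent way to build the single-row array
reshaped = as_1d.reshape(1, -1)
print(np.array_equal(reshaped, as_2d))
```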

We can also define a random forest regression model object as a Model field and run our script:
from sklearn.ensemble import RandomForestRegressor

class Model:
    def __init__(self, datafile = "weatherHistory.csv"):
        self.df = pd.read_csv(datafile)
        self.linear_reg = LinearRegression()
        self.random_forest = RandomForestRegressor()

    def split(self, test_size):
        X = np.array(self.df[['Humidity', 'Pressure (millibars)']])
        y = np.array(self.df['Temperature (C)'])
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size = test_size, random_state = 42)

    def fit(self):
        self.model = self.random_forest.fit(self.X_train, self.y_train)

    def predict(self, input_value = None):
        if input_value is None:
            # Call predict (not fit) to generate predictions
            result = self.random_forest.predict(self.X_test)
        else:
            result = self.random_forest.predict(np.array([input_value]))
        return result

if __name__ == '__main__':
    model_instance = Model()
    model_instance.split(0.2)
    model_instance.fit()
    # For regressors, score() returns R^2, not a classification accuracy
    print("R^2: ", model_instance.model.score(model_instance.X_test, model_instance.y_test))

You can easily modify the code to build support vector regression models, ‘xgboost’ models, and much more. We can further generalize our class by passing a parameter to the constructor that, when specified, chooses from a list of possible models.
The logic could look something like:
class Model:
    def __init__(self, datafile = "weatherHistory.csv", model_type = None):
        self.df = pd.read_csv(datafile)
        if model_type == 'rf':
            self.user_defined_model = RandomForestRegressor()
        else:
            self.user_defined_model = LinearRegression()
And the ‘fit’ and ‘predict’ methods are modified accordingly:
    def fit(self):
        self.model = self.user_defined_model.fit(self.X_train, self.y_train)

    def predict(self, input_value = None):
        if input_value is None:
            # Call predict (not fit) to generate predictions
            result = self.user_defined_model.predict(self.X_test)
        else:
            result = self.user_defined_model.predict(np.array([input_value]))
        return result
And we execute as follows:
if __name__ == '__main__':
    model_instance = Model(model_type = 'rf')
    model_instance.split(0.2)
    model_instance.fit()
    # For regressors, score() returns R^2, not a classification accuracy
    print("R^2: ", model_instance.model.score(model_instance.X_test, model_instance.y_test))

And if we pass ‘None’, the class falls back to linear regression and we get:
if __name__ == '__main__':
    model_instance = Model(model_type = None)
    model_instance.split(0.2)
    model_instance.fit()
    # For regressors, score() returns R^2, not a classification accuracy
    print("R^2: ", model_instance.model.score(model_instance.X_test, model_instance.y_test))
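As the list of supported models grows, the if/else chain in the constructor can be replaced by a dictionary lookup. This is one possible sketch rather than the approach used in this post; the names ‘MODEL_REGISTRY’ and ‘make_model’ and the keys ‘linear’ and ‘svr’ are hypothetical:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Hypothetical registry mapping a short key to an estimator class
MODEL_REGISTRY = {
    "linear": LinearRegression,
    "rf": RandomForestRegressor,
    "svr": SVR,
}

def make_model(model_type=None):
    """Return a new estimator; unknown or missing keys fall back to linear regression."""
    model_class = MODEL_REGISTRY.get(model_type, LinearRegression)
    return model_class()

print(type(make_model("rf")).__name__)
print(type(make_model()).__name__)
```

Inside the constructor, the assignment would then become ‘self.user_defined_model = make_model(model_type)’, and adding a new model type only requires a new dictionary entry.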

I’ll stop here but I encourage you to add additional model objects. Some interesting examples you can try are support vector machines, ‘xgboost’ regression, and ‘lightgbm’ regression models. It may also be useful to add helper methods that generate summary statistics like mean and standard deviation for any of the numerical columns. You can also define methods that help you select features by calculating statistics like correlation.
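As a starting point for the helper methods mentioned above, here is a minimal sketch using pandas. The function names ‘summarize’ and ‘correlations_with’ are hypothetical, the sample values are made up, and both functions are written as standalone functions that take a data frame; they could equally be added to the class as methods operating on ‘self.df’:

```python
import pandas as pd

def summarize(df, column):
    """Return the mean and sample standard deviation of one numeric column."""
    return {"mean": df[column].mean(), "std": df[column].std()}

def correlations_with(df, target):
    """Pearson correlation of each numeric column with the target,
    ordered from strongest to weakest absolute correlation."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up values that reuse the column names of the weather data
df = pd.DataFrame({
    "Humidity": [0.9, 0.8, 0.7, 0.6],
    "Pressure (millibars)": [1010.0, 1012.0, 1011.0, 1013.0],
    "Temperature (C)": [5.0, 10.0, 15.0, 20.0],
})

stats = summarize(df, "Temperature (C)")
ranked = correlations_with(df, "Temperature (C)")
print(stats)
print(ranked)
```

Columns with a large absolute correlation to the target are candidates for input features, although correlation only captures linear relationships.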
To recap, in this post I discussed how to build machine learning models within the object-oriented programming framework. This framework is useful for troubleshooting, problem-solving, and keeping related data (fields) and functions (methods) organized in one place. I hope you find a use for OOP in your own Data Science projects. The code from this post is available on GitHub. Thank you for reading and happy machine learning!