Pipelines: Making Your Machine Learning Workflow Better
In machine learning we have the ability to automate our workflow. In this article I'm going to help you get started, at a beginner's level, with building pipelines in ML with Python.
So, what is a pipeline?
A machine learning pipeline is used to automate a machine learning workflow. It works by chaining a sequence of data transformations together with a final model, so the whole thing can be fitted, tested and evaluated as one unit to achieve an outcome, whether positive or negative.
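To make that concrete, here is a minimal sketch of what a pipeline looks like (the step names 'scale' and 'clf' are just labels chosen for this illustration):
#minimal pipeline sketch (illustrative)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# a two-step pipeline: scale the data, then fit a classifier on the scaled data
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression())
])
# pipe.fit(X_train, y_train) runs both steps in order,
# and pipe.predict(X_test) applies the same scaling before predicting
Calling fit on the pipeline fits every step in sequence, and calling predict pushes new data through the same transformations before the final model makes its prediction.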
We are going to construct a data preparation and modeling pipeline, and once we start writing the code you'll see the bigger picture behind the explanation above.
We will be using scikit-learn, popularly imported as sklearn.
We are going to use it to automate a standard ML workflow; you can read more about it in the scikit-learn documentation.
One major problem to avoid in your work as an ML practitioner is data leakage. To avoid it we need to prepare our data properly and build a comprehensive understanding of it, and a pipeline helps us achieve this by making sure that standardization is constrained to each fold of your cross-validation procedure.
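To see why this matters, here is a small sketch comparing the leaky way of standardizing with the pipeline way (it uses the iris data that ships with scikit-learn purely for illustration):
#leaky vs. safe standardization (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# Leaky: the scaler is fit on ALL rows, so the test folds influence the training statistics
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)
# Safe: inside cross_val_score the pipeline refits the scaler on each training fold only
pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression(max_iter=1000))])
safe_scores = cross_val_score(pipe, X, y, cv=5)
On a small, well-behaved dataset the two scores may look similar, but on real projects the leaky version can report optimistic results.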
We'll build a simple pipeline that standardizes our data, then create a model which we will evaluate with leave-one-out cross-validation. After that, we will build a pipeline for feature extraction and modeling.
#Import all our packages
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
After we have imported our packages, the next step is to load our dataset.
# load data
Data = 'iris_dataset.csv'
dataframe = read_csv(Data)
array = dataframe.values
X = array[:, 0:4]   # the iris dataset has 4 feature columns
Y = array[:, 4]     # the class label is the last column
Next, we have to create our pipeline
#creating pipeline
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('lda', LinearDiscriminantAnalysis()))
model = Pipeline(estimators)
And the last step is to evaluate the pipeline we just created
#evaluate pipeline
loocv = LeaveOneOut()
results = cross_val_score(model, X, Y, cv=loocv)
print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)
Now the next part we will look at is building a pipeline for feature extraction and modeling; here too it is very important to avoid data leakage. Feature extraction can be seen as the act of reducing the number of resources required to describe a voluminous dataset when carrying out an analysis of complex data. There are 4 steps in building this pipeline:
- Feature Extraction with Principal Component Analysis
- Feature Extraction with Statistical Selection
- Feature Union
- Learn a Logistic Regression Model.
Now let's build a pipeline that will extract features and then build a model. This part of the example needs a few extra imports:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
First, we'll load our data
Data = 'iris_dataset.csv'
dataframe = read_csv(Data)
array = dataframe.values
X = array[:, 0:4]   # the iris dataset has 4 feature columns
Y = array[:, 4]     # the class label is the last column
Next we will create a feature union
#create feature union
features = []
features.append(('pca', PCA(n_components=3)))
features.append(('select_best', SelectKBest(k=2)))   # iris only has 4 features, so k must be at most 4
feature_union = FeatureUnion(features)
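A feature union runs its transformers side by side and stacks their outputs column-wise, so with 3 PCA components plus the 2 best original features every row should end up with 5 columns. A quick way to check that assumption:
#sanity check the combined features (illustrative)
combined = feature_union.fit_transform(X, Y)
print(combined.shape)   # expected: (number of rows, 5)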
Lastly, we have to create and evaluate the full pipeline.
#create pipeline
estimators = []
estimators.append(('feature_union', feature_union))
estimators.append(('logistic', LogisticRegression(max_iter=200)))   # a logistic regression model, as described above
model = Pipeline(estimators)
#evaluate pipeline
loocv = LeaveOneOut()
results = cross_val_score(model, X, Y, cv=loocv)
print(results.mean())
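As a final usage note, once the pipeline has been fitted you can drill into its named steps, for example to see which of the original columns SelectKBest kept (a small sketch relying on FeatureUnion's transformer_list attribute):
#inspect which features were selected (illustrative)
model.fit(X, Y)
union = model.named_steps['feature_union']
select_best = dict(union.transformer_list)['select_best']   # the fitted SelectKBest step
print(select_best.get_support())   # boolean mask over the original feature columns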
Conclusion
Now, that looked very easy, not because pipelines are that easy but because I chose to do the most beginner-friendly tutorial possible. This should give you a head start and an idea of what pipelines involve. If you'd like to check out more complex and advanced tutorials, have a look at the scikit-learn documentation or other resources online.