Shaping Data with Pipelines
Load in required libraries
Define which columns are numerical versus categorical (and which is the label column)
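A minimal sketch of this step, using plain Python lists. The column names below are hypothetical and stand in for whatever your dataset contains.

```python
# Hypothetical column names, for illustration only
label_column = "income"
categorical_columns = ["workclass", "education"]
numerical_columns = ["age", "hours_per_week"]

# Sanity check: a column should belong to exactly one group
all_columns = categorical_columns + numerical_columns + [label_column]
has_duplicates = len(all_columns) != len(set(all_columns))
```

Keeping these lists in one place means the later stages can be built in a loop rather than written out column by column.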
Set up stages
Index the categorical columns and perform One Hot Encoding
One Hot Encoding converts a categorical column into multiple binary columns, one for each class. (This process is similar to dummy coding.) In Spark, these binary columns are packed into a single sparse vector column.
Index the label column and perform One Hot Encoding
Note: If you are preparing the data for use in regression algorithms, there's no need to One Hot Encode the label column (since it should be numerical).
Assemble the data together as a vector
This step combines all the numerical data along with the encoded categorical data into a single vector column using the VectorAssembler transformer.
Scale features using Normalization
Set up the transformation pipeline using the stages you've created along the way
Pipeline Saving and Loading
Once your transformation pipeline has been created on your training dataset, it's a good idea to save these transformation steps for future use. For example, we can save the pipeline so that new data is transformed in exactly the same way before it is scored by a trained machine learning model. This also helps to cut down on errors when new data contains previously unseen classes (in categorical variables) or previously unused columns.
Save the transformation pipeline
Load in the transformation pipeline