About Spark MLlib
MLlib is Apache Spark's scalable machine learning library.
MLlib fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Because Spark excels at iterative computation, MLlib's high-quality, scalable algorithms can take advantage of iteration and still run fast.
Included Functionality:
ML algorithms include:
Classification: logistic regression, naive Bayes,...
Regression: generalized linear regression, survival regression,...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs),...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
ML workflow utilities include:
Feature transformations: standardization, normalization, hashing,...
ML Pipeline construction
Model evaluation and hyper-parameter tuning
ML persistence: saving and loading models and Pipelines
Other utilities include:
Distributed linear algebra: SVD, PCA,...
Statistics: summary statistics, hypothesis testing,...