Spark: Process DataFrame with Random Forest


Using the answer to "Spark 1.5.1, MLLib random forest probability", I was able to train a random forest using ml.classification.RandomForestClassifier, and to process a holdout DataFrame with the trained random forest.
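For reference, a minimal sketch of that workflow on Spark 1.5.1, assuming a local SparkContext and a toy DataFrame with a made-up "id" column; the StringIndexer step is there because the spark.ml tree classifiers expect label metadata (column names and data are illustrative, not from the original post):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vectors

// Spark 1.5-style setup (adapt to your own environment)
val sc = new SparkContext(new SparkConf().setAppName("rf-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy training DataFrame; "id" stands in for the extra columns mentioned in the question
val training = sc.parallelize(Seq(
  (1L, "no",  Vectors.dense(0.0, 1.1)),
  (2L, "yes", Vectors.dense(2.0, 1.0)),
  (3L, "no",  Vectors.dense(0.1, 1.3)),
  (4L, "yes", Vectors.dense(2.2, 0.9))
)).toDF("id", "outcome", "features")

// Index the label so the classifier gets the class-count metadata it expects
val labelIndexer = new StringIndexer()
  .setInputCol("outcome")
  .setOutputCol("label")
  .fit(training)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(50)

val model = rf.fit(labelIndexer.transform(training))

// Scoring a holdout DataFrame keeps the non-feature columns (e.g. "id")
// and appends prediction, rawPrediction and probability columns
val holdout = sc.parallelize(Seq(
  (5L, "no",  Vectors.dense(0.2, 1.2)),
  (6L, "yes", Vectors.dense(2.1, 1.1))
)).toDF("id", "outcome", "features")

model.transform(labelIndexer.transform(holdout)).show()
```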

The problem I have is saving the trained random forest so that I can process a DataFrame (with the same features as the training set) at some point in the future.

The classification example on this page uses mllib.tree.model.RandomForestModel and shows how to save the trained forest, but to the best of my understanding such a forest can only be trained on (and later used to process) an RDD of LabeledPoint. The issue is that a LabeledPoint can only contain a label (a Double) and a features Vector, so I lose the non-label/non-feature columns that I need for other purposes.
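For comparison, here is a rough sketch of that spark.mllib path, reusing the SparkContext from the previous sketch; the save/load path is purely illustrative. Note that each LabeledPoint carries nothing but a Double label and a feature Vector:

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel

// Only label + features fit in a LabeledPoint: any id/timestamp/etc. columns
// have to be dropped when building this RDD
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(0.1, 1.3)),
  LabeledPoint(1.0, Vectors.dense(2.2, 0.9))
))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 50
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val mllibModel = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Unlike the spark.ml model in 1.5.1, this model can be saved and reloaded later
mllibModel.save(sc, "/tmp/rf-model")
val reloaded = RandomForestModel.load(sc, "/tmp/rf-model")

// Scoring works on bare feature vectors, which is where the other columns get lost
val prediction = reloaded.predict(Vectors.dense(0.2, 1.2))
```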

So I guess I need a way to either save the result of ml.classification.RandomForestClassifier, or to construct a LabeledPoint RDD that can retain columns other than the label and features required by a forest trained through mllib.tree.model.RandomForestModel.
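Not part of the original post, but one common way to keep extra columns around when forced onto the RDD[LabeledPoint] API is to carry them beside the LabeledPoint in a pair RDD rather than inside it, roughly like this (reusing the reloaded model from the sketch above; the id/date keys are made up):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Keep the non-feature columns as the key of a pair RDD instead of inside the LabeledPoint
val keyed = sc.parallelize(Seq(
  ((1L, "2015-11-01"), LabeledPoint(0.0, Vectors.dense(0.0, 1.1))),
  ((2L, "2015-11-02"), LabeledPoint(1.0, Vectors.dense(2.0, 1.0)))
))

// Train on just the values...
val trainingPoints = keyed.values

// ...and score row by row so each prediction stays attached to its key columns
val scored = keyed.mapValues(lp => (lp, reloaded.predict(lp.features)))
```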

Does anyone know why both the ml and mllib libraries exist, rather than just one of them?

Many thanks for reading this question, and thanks in advance for any solutions/suggestions.

I'll re-use what's been said in the Spark programming guide:

The spark.ml package aims to provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

In Spark, the core feature is its RDDs. There is an excellent paper on the topic if you're interested; I can add the link later.

Then comes MLlib, which was an independent library at first and later got absorbed into the Spark project. Nevertheless, the machine learning algorithms in Spark are written on top of RDDs.

Then the DataFrame abstraction was added to the project, and more practical ways of building machine learning applications were needed, including Transformers, Evaluators and, most importantly, Pipelines.
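To make the Transformer / Evaluator / Pipeline vocabulary concrete, here is a small sketch in the spark.ml style, again with made-up column names and reusing the sc / sqlContext (and its implicits) from the first sketch:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical raw DataFrame: an id, a string label and two numeric feature columns
val raw = sc.parallelize(Seq(
  (1L, "no",  0.0, 1.1),
  (2L, "yes", 2.0, 1.0),
  (3L, "no",  0.1, 1.3),
  (4L, "yes", 2.2, 0.9)
)).toDF("id", "outcome", "f1", "f2")

// Label/feature preparation and the classifier composed into one Pipeline
val labelIndexer = new StringIndexer().setInputCol("outcome").setOutputCol("label")
val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features").setNumTrees(50)

val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, rf))

// The fitted PipelineModel is itself a Transformer carrying the whole chain,
// and it keeps non-feature columns such as "id" when it transforms a DataFrame
val pipelineModel = pipeline.fit(raw)
val scoredDF = pipelineModel.transform(raw)
scoredDF.select("id", "outcome", "prediction", "probability").show()

// An Evaluator condenses the predictions into a single metric
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")
println(s"f1 = ${evaluator.evaluate(scoredDF)}")
```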

A data engineer, or a scientist for that matter, doesn't need to study the underlying tech; that's the point of the abstraction.

You can use both, but you need to remember that the algorithms you use in ml are implemented in mllib and abstracted for easier usage.

