feature_selection package
Submodules
feature_selection.feature_selection module
- feature_selection.feature_selection.feature_selection(context, df_artifact, k: int = 5, min_votes: float = 0.5, label_column: str | None = None, stat_filters: list | None = None, model_filters: dict | None = None, max_scaled_scores: bool = True, sample_ratio: float | None = None, output_vector_name: float | None = None, ignore_type_errors: bool = False)
Applies the selected feature selection statistical functions or models to ‘df_artifact’.
Each statistical function or model votes for its top K features. A feature is selected if it receives at least ‘min_votes’ votes.
- Parameters:
context – the function context.
df_artifact – dataframe to pass as input.
k – number of top features to select from each statistical function or model.
min_votes – minimum number of votes (each statistical function or model casts one vote per feature it selects) needed for a feature to be selected. Can be given as a fraction of the votes (such as the default 0.5) or as an absolute number of votes.
label_column – ground-truth (y) labels.
stat_filters – statistical functions to apply to the features (from sklearn.feature_selection).
model_filters – models to use for feature evaluation. Each can be specified by model name (e.g. LinearSVC), by a formalized JSON (containing ‘CLASS’, ‘FIT’, ‘META’), or by a path to such a JSON file.
max_scaled_scores – produce feature scores table scaled with max_scaler.
sample_ratio – percentage of the dataset the user wishes to compute the feature selection process on.
output_vector_name – creates a new feature vector containing only the identified features.
ignore_type_errors – skips datatypes that are neither float nor int within the feature vector.
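The voting rule described above can be illustrated with a small standalone sketch. It is not the package's implementation: the helper name `vote_for_features`, the example scores, and the rounding rule for fractional ‘min_votes’ (values below 1 are treated as a fraction of the voters and rounded up) are all assumptions made for illustration.

```python
import math


def vote_for_features(scores_by_method, k=5, min_votes=0.5):
    """Hypothetical helper: pick features by top-k voting.

    scores_by_method maps a method name to {feature_name: score},
    where a higher score means a more useful feature.
    """
    n_methods = len(scores_by_method)
    # Assumption: min_votes < 1 is a fraction of the voters (rounded up),
    # otherwise it is an absolute number of votes.
    if min_votes < 1:
        needed = math.ceil(min_votes * n_methods)
    else:
        needed = int(min_votes)

    votes = {}
    for scores in scores_by_method.values():
        # Each method votes for its k highest-scoring features.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        for feature in top_k:
            votes[feature] = votes.get(feature, 0) + 1

    # Keep only features that collected enough votes.
    selected = sorted(f for f, n in votes.items() if n >= needed)
    return selected, votes


# Made-up scores from three voters (two statistical filters and one model).
scores = {
    "f_classif": {"age": 9.1, "income": 7.4, "height": 0.3, "zip": 0.1},
    "mutual_info": {"age": 0.42, "income": 0.05, "height": 0.31, "zip": 0.02},
    "LinearSVC": {"age": 1.3, "income": 0.9, "height": 0.2, "zip": 0.6},
}

selected, votes = vote_for_features(scores, k=2, min_votes=0.5)
print(selected)
print(votes)
```

With three voters and `min_votes=0.5`, this sketch requires two votes, so a feature must appear in the top `k` of at least two methods to be kept. Passing an integer such as `min_votes=3` would instead require agreement from all three.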