feature_selection package#

Submodules#

feature_selection.feature_selection module#

feature_selection.feature_selection.feature_selection(context, df_artifact, k: int = 5, min_votes: float = 0.5, label_column: str | None = None, stat_filters: list | None = None, model_filters: dict | None = None, max_scaled_scores: bool = True, sample_ratio: float | None = None, output_vector_name: float | None = None, ignore_type_errors: bool = False)[source]#

Applies the selected feature-selection statistical functions or models to the given ‘df_artifact’.

Each statistical function or model votes for its top K features. A feature is selected if it receives at least ‘min_votes’ votes. A minimal usage sketch follows the parameter list below.

Parameters:
  • context – the function context.

  • df_artifact – dataframe to pass as input.

  • k – number of top features to select from each statistical function or model.

  • min_votes – minimal number of votes (from the models and statistical functions) needed for a feature to be selected. Can be specified as a percentage of the votes or as an absolute number of votes.

  • label_column – name of the ground-truth (y) label column.

  • stat_filters – statistical functions to apply to the features (from sklearn.feature_selection).

  • model_filters – models to use for feature evaluation. Can be specified by model name (e.g. LinearSVC), by a formalized JSON (containing ‘CLASS’, ‘FIT’, ‘META’), or by a path to such a JSON file.

  • max_scaled_scores – produce a feature-scores table scaled with a max scaler.

  • sample_ratio – percentage of the dataset on which to run the feature-selection process.

  • output_vector_name – name of a new feature vector to create, containing only the identified features.

  • ignore_type_errors – skip columns in the feature vector whose datatype is neither float nor int.
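
The sketch below shows one way this function might be invoked as an MLRun job. The hub URI, project name, dataset URI, and column names are hypothetical placeholders, not values defined by this module; only the parameter names come from the reference above.

```python
# Hedged usage sketch: hub URI, artifact URI, and column names are assumptions.
import mlrun

fs_fn = mlrun.import_function("hub://feature_selection")  # assumed hub URI

fs_run = fs_fn.run(
    name="feature-selection",
    inputs={"df_artifact": "store://datasets/my-project/data"},  # hypothetical dataset URI
    params={
        "k": 5,                        # each filter/model votes for its top 5 features
        "min_votes": 0.5,              # keep features chosen by at least half of the voters
        "label_column": "label",       # hypothetical ground-truth column name
        "stat_filters": ["f_classif", "mutual_info_classif"],  # sklearn.feature_selection scorers (assumed names)
        "output_vector_name": "selected-features",
        "ignore_type_errors": True,
    },
    local=True,
)
```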

feature_selection.feature_selection.plot_stat(context, stat_name, stat_df)[source]#
feature_selection.feature_selection.show_values_on_bars(axs, h_v='v', space=0.4)[source]#
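
Neither helper is documented beyond its signature. The sketch below only illustrates how show_values_on_bars might be called; it assumes that axs is a matplotlib Axes (or array of Axes) holding a bar plot and that h_v selects vertical ('v') or horizontal ('h') annotation. Both assumptions are inferred from the signature rather than stated in this reference.

```python
# Illustrative sketch only: assumes show_values_on_bars annotates each bar of a
# matplotlib Axes with its value, and that h_v="v" targets vertical bars
# (inferred from the signature; not confirmed by this reference).
import matplotlib.pyplot as plt
import seaborn as sns

from feature_selection.feature_selection import show_values_on_bars

scores = {"feature_a": 0.82, "feature_b": 0.45, "feature_c": 0.31}
ax = sns.barplot(x=list(scores.keys()), y=list(scores.values()))
show_values_on_bars(ax, h_v="v", space=0.4)  # label each vertical bar with its score
plt.show()
```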

Module contents#