Examples
========

This page contains practical examples of using ``skfeaturellm`` in different scenarios.

Classification
--------------

Example applied to a classification task using the Iris plants dataset from
``sklearn.datasets``. Passing ``y`` to ``fit()`` injects dataset statistics into the
LLM prompt for richer feature suggestions.

.. code-block:: python

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    from skfeaturellm import LLMFeatureEngineer

    iris_data = load_iris(as_frame=True)
    X, y = iris_data.data, iris_data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    target_description = (
        "Classification task predicting species of iris plants "
        "(3 classes: setosa, versicolor, virginica)"
    )

    feature_descriptions = [
        {"name": "sepal length (cm)", "type": "float64", "description": "The sepal lengths in centimeters"},
        {"name": "sepal width (cm)", "type": "float64", "description": "The sepal widths in centimeters"},
        {"name": "petal length (cm)", "type": "float64", "description": "The petal lengths in centimeters"},
        {"name": "petal width (cm)", "type": "float64", "description": "The petal widths in centimeters"},
    ]

    engineer = LLMFeatureEngineer(
        problem_type="classification",
        model_name="gpt-4o",
        max_features=5,
    )

    # Fit on training data only; passing y injects dataset statistics into the prompt
    engineer.fit(
        X=X_train,
        y=y_train,
        feature_descriptions=feature_descriptions,
        target_description=target_description,
    )

    # Apply the same generated features to both splits
    X_train_transformed = engineer.transform(X_train)
    X_test_transformed = engineer.transform(X_test)

    print(X_train_transformed.columns.tolist())

Regression
----------

Example applied to a regression task using the Diabetes dataset from
``sklearn.datasets``.

.. code-block:: python

    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split

    from skfeaturellm import LLMFeatureEngineer

    diabetes_data = load_diabetes(as_frame=True)
    X, y = diabetes_data.data, diabetes_data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    target_description = (
        "Regression task predicting the quantitative measure of disease progression "
        "one year after baseline"
    )

    # All columns in this dataset are pre-normalized the same way
    norm_method = "mean centered and scaled by the standard deviation"

    feature_descriptions = [
        {"name": "age", "type": "float64", "description": f"Age in years ({norm_method})"},
        {"name": "sex", "type": "float64", "description": f"Sex of the patient ({norm_method})"},
        {"name": "bmi", "type": "float64", "description": f"Body mass index ({norm_method})"},
        {"name": "bp", "type": "float64", "description": f"Average blood pressure ({norm_method})"},
        {"name": "s1", "type": "float64", "description": f"TC, total serum cholesterol ({norm_method})"},
        {"name": "s2", "type": "float64", "description": f"LDL, low-density lipoprotein ({norm_method})"},
        {"name": "s3", "type": "float64", "description": f"HDL, high-density lipoprotein ({norm_method})"},
        {"name": "s4", "type": "float64", "description": f"TCH, total cholesterol/HDL ratio ({norm_method})"},
        {"name": "s5", "type": "float64", "description": f"s5 ltg, possibly log of serum triglycerides ({norm_method})"},
        {"name": "s6", "type": "float64", "description": f"s6 glu, blood sugar level ({norm_method})"},
    ]

    engineer = LLMFeatureEngineer(
        problem_type="regression",
        model_name="gpt-4o",
        max_features=5,
    )

    engineer.fit(
        X=X_train,
        y=y_train,
        feature_descriptions=feature_descriptions,
        target_description=target_description,
    )

    X_train_transformed = engineer.transform(X_train)
    X_test_transformed = engineer.transform(X_test)

Feature Evaluation and Selection
--------------------------------

``evaluate_features()`` scores each generated feature using mutual information
(classification) or Pearson/Spearman correlation (regression). Use the scores to keep
only the features that carry signal before training your final model.

.. code-block:: python

    # Assumes `engineer`, X_train, X_test and y_train come from the
    # classification example above (the score column is mutual information)

    # Score features on training data
    eval_result = engineer.evaluate_features(X_train, y_train, is_transformed=False)
    scores = eval_result.summary()
    print(scores)

    # Select features with a positive mutual information score
    good_features = scores[scores["mutual_information"] > 0].index.tolist()

    # Build train and test sets with only the original and selected features
    base_cols = X_train.columns.tolist()
    X_train_eng = engineer.transform(X_train)[base_cols + good_features]
    X_test_eng = engineer.transform(X_test)[base_cols + good_features]

Production Pipeline
-------------------

After evaluating and selecting features, export them to a
``FeatureEngineeringTransformer`` for use in a scikit-learn ``Pipeline``. This separates
the LLM exploration phase from deterministic production inference.

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    from skfeaturellm import LLMFeatureEngineer, FeatureEngineeringTransformer

    # --- Exploration: fit and evaluate ---
    engineer = LLMFeatureEngineer(
        problem_type="classification",
        model_name="gpt-4o",
        max_features=10,
    )
    engineer.fit(X_train, y=y_train, feature_descriptions=feature_descriptions)
    engineer.transform(X_train)  # populates generated_features

    # Evaluate and select
    eval_result = engineer.evaluate_features(X_train, y_train)
    scores = eval_result.summary()
    good_features = scores[scores["mutual_information"] > 0].index.tolist()

    # --- Production: export to a deterministic transformer ---
    transformer = engineer.to_transformer(features=good_features)

    pipeline = Pipeline([
        ("features", transformer),
        ("model", XGBClassifier()),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

Saving and Loading
------------------

Serialize the ``FeatureEngineeringTransformer`` to JSON so the LLM is never called
again. Only the transformation configs are stored, so call ``fit()`` to re-learn
stateful parameters (e.g., bin edges) on whichever data split you train on.

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from xgboost import XGBClassifier

    from skfeaturellm import FeatureEngineeringTransformer

    # Save transformation configs
    transformer.save("transformer.json")

    # Restore in a later session or on a different machine
    loaded = FeatureEngineeringTransformer.load("transformer.json")

    # Fit on training data and apply; no LLM call required
    pipeline = Pipeline([("features", loaded), ("model", XGBClassifier())])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

Notebook Tutorial
-----------------

A complete end-to-end tutorial using the Bank Loan Credit Risk dataset is available as
a Jupyter notebook in the ``examples/`` directory of the repository:

- ``01_SKFeatureLLM_Tutorial.ipynb``

The notebook covers data loading with ``kagglehub``, a baseline XGBoost model, LLM
feature engineering with dataset statistics injection, per-feature evaluation, feature
selection, and production deployment with ``FeatureEngineeringTransformer`` inside a
scikit-learn ``Pipeline``.