Examples

This page contains practical examples of using skfeaturellm in different scenarios.

Classification

This example applies skfeaturellm to a classification task using the Iris plants dataset from sklearn.datasets. Passing y to fit() injects dataset statistics into the LLM prompt, which gives the model more context for its feature suggestions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from skfeaturellm import LLMFeatureEngineer

iris_data = load_iris(as_frame=True)
X, y = iris_data.data, iris_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

target_description = (
    "Classification task predicting species of iris plants "
    "(3 classes: setosa, versicolor, virginica)"
)
feature_descriptions = [
    {"name": "sepal length (cm)", "type": "float64", "description": "The sepal lengths in centimeters"},
    {"name": "sepal width (cm)", "type": "float64", "description": "The sepal widths in centimeters"},
    {"name": "petal length (cm)", "type": "float64", "description": "The petal lengths in centimeters"},
    {"name": "petal width (cm)", "type": "float64", "description": "The petal widths in centimeters"},
]

engineer = LLMFeatureEngineer(
    problem_type="classification",
    model_name="gpt-4o",
    max_features=5,
)

# Fit on training data only — passing y injects dataset statistics into the prompt
engineer.fit(
    X=X_train,
    y=y_train,
    feature_descriptions=feature_descriptions,
    target_description=target_description,
)

X_train_transformed = engineer.transform(X_train)
X_test_transformed = engineer.transform(X_test)

print(X_train_transformed.columns.tolist())
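To picture what "injecting dataset statistics into the prompt" could mean, here is a small standalone sketch. It is not skfeaturellm's actual prompt format, which this page does not show. The function names (summarize_columns, class_balance, build_statistics_block) and the text layout are hypothetical, and the toy values stand in for real training data.

```python
from statistics import mean, stdev


def summarize_columns(columns):
    """Return one summary line per numeric column (hypothetical helper)."""
    lines = []
    for name, values in columns.items():
        lines.append(
            f"{name}: min={min(values):.2f}, max={max(values):.2f}, "
            f"mean={mean(values):.2f}, std={stdev(values):.2f}"
        )
    return lines


def class_balance(labels):
    """Return the share of each class label, ordered by label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


def build_statistics_block(columns, labels):
    """Assemble a plain-text block that could be appended to a prompt."""
    parts = ["Feature statistics:"]
    parts += [f"- {line}" for line in summarize_columns(columns)]
    parts.append("Target class proportions:")
    parts += [
        f"- class {label}: {share:.2f}"
        for label, share in class_balance(labels).items()
    ]
    return "\n".join(parts)


# Toy data standing in for two Iris columns and their labels
toy_columns = {
    "petal length (cm)": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal width (cm)": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
}
toy_labels = [0, 0, 1, 1, 2, 2]

stats_text = build_statistics_block(toy_columns, toy_labels)
print(stats_text)
```

A summary like this tells the model about value ranges and class balance without sending any raw rows.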

Regression

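This example applies skfeaturellm to a regression task using the Diabetes dataset from sklearn.datasets. The features in this dataset are already normalized, and the feature descriptions say so explicitly so that the LLM can take it into account.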

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from skfeaturellm import LLMFeatureEngineer

diabetes_data = load_diabetes(as_frame=True)
X, y = diabetes_data.data, diabetes_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

target_description = (
    "Regression task predicting the quantitative measure of disease progression "
    "one year after baseline"
)
norm_method = "mean centered and scaled by the standard deviation times the square root of n_samples"
feature_descriptions = [
    {"name": "age", "type": "float64", "description": f"Age in years ({norm_method})"},
    {"name": "sex", "type": "float64", "description": f"Sex of the patient ({norm_method})"},
    {"name": "bmi", "type": "float64", "description": f"Body mass index ({norm_method})"},
    {"name": "bp", "type": "float64", "description": f"Average blood pressure ({norm_method})"},
    {"name": "s1", "type": "float64", "description": f"TC, total serum cholesterol ({norm_method})"},
    {"name": "s2", "type": "float64", "description": f"LDL, low-density lipoprotein ({norm_method})"},
    {"name": "s3", "type": "float64", "description": f"HDL, high-density lipoprotein ({norm_method})"},
    {"name": "s4", "type": "float64", "description": f"TCH, total cholesterol/HDL ratio ({norm_method})"},
    {"name": "s5", "type": "float64", "description": f"LTG, possibly log of serum triglycerides level ({norm_method})"},
    {"name": "s6", "type": "float64", "description": f"GLU, blood sugar level ({norm_method})"},
]

engineer = LLMFeatureEngineer(
    problem_type="regression",
    model_name="gpt-4o",
    max_features=5,
)

engineer.fit(
    X=X_train,
    y=y_train,
    feature_descriptions=feature_descriptions,
    target_description=target_description,
)

X_train_transformed = engineer.transform(X_train)
X_test_transformed = engineer.transform(X_test)

Feature Evaluation and Selection

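evaluate_features() scores each generated feature using mutual information (classification) or Pearson/Spearman correlation (regression). Use the scores to keep only the features that show a measurable relationship with the target before training your final model. The snippet below assumes engineer was fitted for a classification task, as in the Classification example, so that the summary contains a mutual_information column.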

# Score features on training data
eval_result = engineer.evaluate_features(X_train, y_train, is_transformed=False)
print(eval_result.summary())

# Select features with a positive mutual information score
scores = eval_result.summary()
good_features = scores[scores["mutual_information"] > 0].index.tolist()

# Build train and test sets with only the selected features
base_cols = X_train.columns.tolist()
X_train_eng = engineer.transform(X_train)[base_cols + good_features]
X_test_eng = engineer.transform(X_test)[base_cols + good_features]
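The kinds of score named above can be illustrated without the library. This is a minimal pure-Python sketch, not skfeaturellm's implementation: the function names are hypothetical, and the mutual information estimate only handles discrete (already binned) features. For continuous features the library may well use a different estimator.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mutual_information(feature, target):
    """Mutual information in nats between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    count_f, count_t = Counter(feature), Counter(target)
    mi = 0.0
    for (f, t), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log(p_joint / ((count_f[f] / n) * (count_t[t] / n)))
    return mi


# Regression-style scores: a monotone but non-linear relationship
x = [1, 2, 3, 4, 5]
y_squared = [v ** 2 for v in x]
print(pearson(x, y_squared), spearman(x, y_squared))

# Classification-style scores: one informative and one uninformative feature
target = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]
uninformative = ["a", "b", "a", "b"]
print(mutual_information(informative, target))
print(mutual_information(uninformative, target))
```

Spearman is exactly 1 for any strictly increasing relationship, while Pearson drops below 1 when the relationship is not linear. In this toy data the uninformative feature is exactly independent of the target and scores zero, which is the case that a threshold of mutual_information > 0 is meant to filter out. On real data, estimated scores for uninformative features may be small but non-zero.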

Production Pipeline

After evaluating and selecting features, export them to a FeatureEngineeringTransformer for use in a scikit-learn Pipeline. This separates the LLM exploration phase from deterministic production inference.

from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skfeaturellm import LLMFeatureEngineer, FeatureEngineeringTransformer

# --- Exploration: fit and evaluate ---
# Reuses X_train, X_test, y_train, and feature_descriptions from the Classification example
engineer = LLMFeatureEngineer(
    problem_type="classification",
    model_name="gpt-4o",
    max_features=10,
)
engineer.fit(X_train, y=y_train, feature_descriptions=feature_descriptions)
engineer.transform(X_train)  # populates generated_features

# Evaluate and select
eval_result = engineer.evaluate_features(X_train, y_train)
scores = eval_result.summary()
good_features = scores[scores["mutual_information"] > 0].index.tolist()

# --- Production: export to a deterministic transformer ---
transformer = engineer.to_transformer(features=good_features)

pipeline = Pipeline([
    ("features", transformer),
    ("model", XGBClassifier()),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
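The split between exploration and production can be pictured with a standalone sketch in which feature definitions are plain data. Everything here is hypothetical and unrelated to the internals of FeatureEngineeringTransformer: the spec format, the OPERATIONS table, and the helper names are assumptions made for illustration.

```python
import operator

# Hypothetical operation table: the only code a production step would need
OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}

# Candidate features as an exploration phase might propose them (made-up specs)
candidate_features = [
    {"name": "petal_area", "op": "mul",
     "columns": ["petal length (cm)", "petal width (cm)"]},
    {"name": "sepal_ratio", "op": "div",
     "columns": ["sepal length (cm)", "sepal width (cm)"]},
    {"name": "length_gap", "op": "sub",
     "columns": ["sepal length (cm)", "petal length (cm)"]},
]


def select_features(candidates, keep_names):
    """Keep only the specs whose names passed evaluation."""
    return [spec for spec in candidates if spec["name"] in keep_names]


def apply_features(row, specs):
    """Apply each spec to one row; same input always gives same output."""
    out = dict(row)
    for spec in specs:
        left, right = (row[column] for column in spec["columns"])
        out[spec["name"]] = OPERATIONS[spec["op"]](left, right)
    return out


selected = select_features(candidate_features, {"petal_area", "sepal_ratio"})

row = {
    "sepal length (cm)": 5.0,
    "sepal width (cm)": 2.5,
    "petal length (cm)": 1.5,
    "petal width (cm)": 0.2,
}
engineered = apply_features(row, selected)
print(sorted(engineered))
```

Once the selected specs are fixed, applying them involves no model call and no randomness. That is the property the exported transformer is described as providing.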

Saving and Loading

Serialize the FeatureEngineeringTransformer to JSON so that the LLM never needs to be called again. Only the transformation configs are stored, so call fit() after loading to re-learn stateful parameters (e.g., bin edges) on whichever data split you train on.

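from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from skfeaturellm import FeatureEngineeringTransformer

# Save transformation configs (transformer is the object exported in Production Pipeline)
transformer.save("transformer.json")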

# Restore in a later session or on a different machine
loaded = FeatureEngineeringTransformer.load("transformer.json")

# Fit on training data and apply — no LLM call required
pipeline = Pipeline([("features", loaded), ("model", XGBClassifier())])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
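Here is a standalone sketch of why a transformer restored from config-only JSON has to be fitted again. The BinningStep class, its config keys, and its equal-width binning rule are hypothetical. They illustrate the general pattern and do not describe how FeatureEngineeringTransformer stores or learns its parameters.

```python
import json


class BinningStep:
    """Hypothetical stateful step: equal-width bins learned at fit time."""

    def __init__(self, column, n_bins):
        self.column = column
        self.n_bins = n_bins
        self.edges_ = None  # learned state, deliberately left out of the config

    def to_config(self):
        # Only the recipe is serialized, never the learned edges
        return {"type": "bin", "column": self.column, "n_bins": self.n_bins}

    @classmethod
    def from_config(cls, config):
        return cls(config["column"], config["n_bins"])

    def fit(self, rows):
        values = [row[self.column] for row in rows]
        low, high = min(values), max(values)
        width = (high - low) / self.n_bins
        # Interior edges only: n_bins - 1 cut points between low and high
        self.edges_ = [low + width * i for i in range(1, self.n_bins)]
        return self

    def transform(self, rows):
        if self.edges_ is None:
            raise RuntimeError("call fit() before transform()")
        out = []
        for row in rows:
            # Bin index = number of edges at or below the value
            bin_index = sum(row[self.column] >= edge for edge in self.edges_)
            out.append({**row, f"{self.column}_bin": bin_index})
        return out


train_rows = [{"age": 20.0}, {"age": 35.0}, {"age": 50.0}, {"age": 80.0}]

step = BinningStep("age", n_bins=3).fit(train_rows)
payload = json.dumps(step.to_config())

# The restored step carries the recipe but no learned state
restored = BinningStep.from_config(json.loads(payload))
state_before_fit = restored.edges_

restored.fit(train_rows)
binned = restored.transform(train_rows)
print(payload, restored.edges_, [row["age_bin"] for row in binned])
```

Keeping learned state out of the file keeps the JSON small and independent of any particular dataset. The cost is one extra fit() call after loading.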

Notebook Tutorial

A complete end-to-end tutorial using the Bank Loan Credit Risk dataset is available as a Jupyter notebook in the examples/ directory of the repository.

The notebook covers data loading with kagglehub, a baseline XGBoost model, LLM feature engineering with dataset statistics injection, per-feature evaluation, feature selection, and production deployment with FeatureEngineeringTransformer inside a scikit-learn Pipeline.