Get Started

The following information is designed to get users up and running with skfeaturellm quickly. For more detailed information, see the links in each of the subsections.

Installation

skfeaturellm currently supports:

environments with python version 3.10, 3.11, or 3.12.
operating systems Mac OS X, Unix-like OS, Windows 8.1 and higher
installation via PyPI

Please see the Installation guide for step-by-step instructions on the package installation.

Key Concepts

skfeaturellm is a Python library that brings the power of Large Language Models (LLMs) to feature engineering for tabular data, wrapped in a familiar scikit-learn–style API. The library is model-agnostic: it works with any LLM available from LangChain (OpenAI, Anthropic, etc.). It leverages LLMs’ capabilities to automatically generate and implement meaningful features for your machine learning tasks.

Quickstart

The code snippets below introduce skfeaturellm’s core workflow. Both examples follow the same pattern:

Split data into train and test sets.
Call fit() on the training set only, passing y so that dataset statistics are injected into the LLM prompt.
Call transform() on each split independently.

For an automated multi-round generate → select → feedback loop, use fit_selective() instead of fit(). See the User Guide for details.

Note

Always fit on training data only to avoid leaking test-set information into the LLM prompt.

Classification

Example applied to a classification task. The example uses the Iris plants dataset from sklearn.datasets.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from skfeaturellm import LLMFeatureEngineer

iris_data = load_iris(as_frame=True)
X, y = iris_data.data, iris_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

target_description = (
    "Classification task predicting species of iris plants "
    "(3 classes: setosa, versicolor, virginica)"
)
feature_descriptions = [
    {"name": "sepal length (cm)", "type": "float64", "description": "The sepal lengths in centimeters"},
    {"name": "sepal width (cm)", "type": "float64", "description": "The sepal widths in centimeters"},
    {"name": "petal length (cm)", "type": "float64", "description": "The petal lengths in centimeters"},
    {"name": "petal width (cm)", "type": "float64", "description": "The petal widths in centimeters"},
]

llm_feature_engineer = LLMFeatureEngineer(
    problem_type="classification",
    model_name="gpt-4o",
    max_features=5,
)

# Fit on training data — passing y injects dataset statistics into the LLM prompt
llm_feature_engineer.fit(
    X=X_train,
    y=y_train,
    feature_descriptions=feature_descriptions,
    target_description=target_description,
)

# Transform train and test independently
X_train_transformed = llm_feature_engineer.transform(X_train)
X_test_transformed = llm_feature_engineer.transform(X_test)

print(X_train_transformed.columns.tolist())

Regression

Example applied to a regression task. The example uses the Diabetes dataset from sklearn.datasets.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from skfeaturellm import LLMFeatureEngineer

diabetes_data = load_diabetes(as_frame=True)
X, y = diabetes_data.data, diabetes_data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

target_description = (
    "Regression task predicting the quantitative measure of disease progression "
    "one year after baseline"
)
norm_method = "mean centered and scaled by the standard deviation"
feature_descriptions = [
    {"name": "age", "type": "float64", "description": f"Age in years ({norm_method})"},
    {"name": "sex", "type": "float64", "description": f"Sex of the patient ({norm_method})"},
    {"name": "bmi", "type": "float64", "description": f"Body mass index ({norm_method})"},
    {"name": "bp", "type": "float64", "description": f"Average blood pressure ({norm_method})"},
    {"name": "s1", "type": "float64", "description": f"TC, total serum cholesterol ({norm_method})"},
    {"name": "s2", "type": "float64", "description": f"LDL, low-density lipoprotein ({norm_method})"},
    {"name": "s3", "type": "float64", "description": f"HDL, high-density lipoprotein ({norm_method})"},
    {"name": "s4", "type": "float64", "description": f"TCH, total cholesterol/HDL ratio ({norm_method})"},
    {"name": "s5", "type": "float64", "description": f"s5 ltg, possibly log of serum triglycerides ({norm_method})"},
    {"name": "s6", "type": "float64", "description": f"s6 glu, blood sugar level ({norm_method})"},
]

llm_feature_engineer = LLMFeatureEngineer(
    problem_type="regression",
    model_name="gpt-4o",
    max_features=5,
)

# Fit on training data — passing y injects dataset statistics into the LLM prompt
llm_feature_engineer.fit(
    X=X_train,
    y=y_train,
    feature_descriptions=feature_descriptions,
    target_description=target_description,
)

# Transform train and test independently
X_train_transformed = llm_feature_engineer.transform(X_train)
X_test_transformed = llm_feature_engineer.transform(X_test)

print(X_train_transformed.columns.tolist())