Get Started
The following information is designed to get users up and running with skfeaturellm quickly. For more detailed information, see the links in each of the subsections.
Installation
skfeaturellm currently supports:
environments with Python 3.10, 3.11, or 3.12
operating systems: macOS (Mac OS X), Unix-like systems, and Windows 8.1 or higher
installation via PyPI
Please see the Installation guide for step-by-step instructions on the package installation.
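Before installing, it can help to confirm that the active interpreter falls within the supported range listed above. The following is a minimal sketch using only the standard library; the helper name `is_supported_python` is hypothetical and not part of skfeaturellm.

```python
import sys

# Supported (major, minor) pairs, taken from the list above.
SUPPORTED_VERSIONS = {(3, 10), (3, 11), (3, 12)}


def is_supported_python(version_info=None):
    """Return True if the (major, minor) pair is in SUPPORTED_VERSIONS.

    Hypothetical helper for illustration; not part of skfeaturellm.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```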
Key Concepts
skfeaturellm is a Python library that brings the power of Large Language Models (LLMs) to feature engineering for tabular data, wrapped in a familiar scikit-learn–style API. The library is model-agnostic: it works with any LLM available from LangChain (OpenAI, Anthropic, etc.). It leverages LLMs’ capabilities to automatically generate and implement meaningful features for your machine learning tasks.
Quickstart
The code snippets below introduce skfeaturellm’s core workflow. Both examples follow the same pattern:
1. Split the data into train and test sets.
2. Call fit() on the training set only, passing y so that dataset statistics are injected into the LLM prompt.
3. Call transform() on each split independently.
For an automated multi-round generate → select → feedback loop, use fit_selective() instead of fit(). See the User Guide for details.
Note
Always fit on training data only to avoid leaking test-set information into the LLM prompt.
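The fit/transform split described above can be illustrated without any LLM call. The sketch below uses a hypothetical `MeanCenterer` class (a stand-in, not part of skfeaturellm) that learns a statistic in `fit()` and reuses it in `transform()`, so the test split never influences what was learned.

```python
class MeanCenterer:
    """Toy transformer: learns the mean in fit(), subtracts it in transform().

    Hypothetical stand-in used only to show the fit-on-train pattern.
    """

    def fit(self, values):
        # The statistic is computed from the data passed to fit() only.
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        # The stored training mean is reused for any split.
        return [v - self.mean_ for v in values]


train = [1.0, 2.0, 3.0]
test = [10.0]

centerer = MeanCenterer().fit(train)        # fit on the training split only
train_centered = centerer.transform(train)  # transform each split independently
test_centered = centerer.transform(test)
```

Had the centerer been fitted on all four values, the learned mean would be 4.0 instead of 2.0, which is the kind of leakage the note warns about.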
Classification
This example applies skfeaturellm to a classification task, using the Iris plants dataset from sklearn.datasets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from skfeaturellm import LLMFeatureEngineer
iris_data = load_iris(as_frame=True)
X, y = iris_data.data, iris_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
target_description = (
    "Classification task predicting species of iris plants "
    "(3 classes: setosa, versicolor, virginica)"
)
feature_descriptions = [
    {"name": "sepal length (cm)", "type": "float64", "description": "The sepal lengths in centimeters"},
    {"name": "sepal width (cm)", "type": "float64", "description": "The sepal widths in centimeters"},
    {"name": "petal length (cm)", "type": "float64", "description": "The petal lengths in centimeters"},
    {"name": "petal width (cm)", "type": "float64", "description": "The petal widths in centimeters"},
]
llm_feature_engineer = LLMFeatureEngineer(
    problem_type="classification",
    model_name="gpt-4o",
    max_features=5,
)
# Fit on training data only; passing y injects dataset statistics into the LLM prompt
llm_feature_engineer.fit(
    X=X_train,
    y=y_train,
    feature_descriptions=feature_descriptions,
    target_description=target_description,
)
# Transform train and test independently
X_train_transformed = llm_feature_engineer.transform(X_train)
X_test_transformed = llm_feature_engineer.transform(X_test)
print(X_train_transformed.columns.tolist())
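To see only the columns that were added, rather than the full column list, the names before and after the transform can be compared. This sketch assumes that `transform()` returns a DataFrame that keeps the original columns alongside the generated ones; that behavior and the helper name `added_columns` are assumptions, and the feature name in the example is made up.

```python
def added_columns(before, after):
    """Return the names in `after` that are not in `before`, preserving order.

    Hypothetical helper for illustration; not part of skfeaturellm.
    """
    seen = set(before)
    return [name for name in after if name not in seen]


# Illustrative column lists; "petal_area" is a made-up generated feature.
original = ["sepal length (cm)", "petal length (cm)", "petal width (cm)"]
transformed = original + ["petal_area"]
print(added_columns(original, transformed))
```

With the objects from the example above, the equivalent call would be `added_columns(X_train.columns, X_train_transformed.columns)`.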
Regression
This example applies skfeaturellm to a regression task, using the Diabetes dataset from sklearn.datasets.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from skfeaturellm import LLMFeatureEngineer
diabetes_data = load_diabetes(as_frame=True)
X, y = diabetes_data.data, diabetes_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
target_description = (
    "Regression task predicting the quantitative measure of disease progression "
    "one year after baseline"
)
norm_method = "mean centered and scaled by the standard deviation"
feature_descriptions = [
    {"name": "age", "type": "float64", "description": f"Age in years ({norm_method})"},
    {"name": "sex", "type": "float64", "description": f"Sex of the patient ({norm_method})"},
    {"name": "bmi", "type": "float64", "description": f"Body mass index ({norm_method})"},
    {"name": "bp", "type": "float64", "description": f"Average blood pressure ({norm_method})"},
    {"name": "s1", "type": "float64", "description": f"TC, total serum cholesterol ({norm_method})"},
    {"name": "s2", "type": "float64", "description": f"LDL, low-density lipoprotein ({norm_method})"},
    {"name": "s3", "type": "float64", "description": f"HDL, high-density lipoprotein ({norm_method})"},
    {"name": "s4", "type": "float64", "description": f"TCH, total cholesterol/HDL ratio ({norm_method})"},
    {"name": "s5", "type": "float64", "description": f"s5 ltg, possibly log of serum triglycerides ({norm_method})"},
    {"name": "s6", "type": "float64", "description": f"s6 glu, blood sugar level ({norm_method})"},
]
llm_feature_engineer = LLMFeatureEngineer(
    problem_type="regression",
    model_name="gpt-4o",
    max_features=5,
)
# Fit on training data only; passing y injects dataset statistics into the LLM prompt
llm_feature_engineer.fit(
    X=X_train,
    y=y_train,
    feature_descriptions=feature_descriptions,
    target_description=target_description,
)
# Transform train and test independently
X_train_transformed = llm_feature_engineer.transform(X_train)
X_test_transformed = llm_feature_engineer.transform(X_test)
print(X_train_transformed.columns.tolist())
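Both examples write `feature_descriptions` by hand, so a misspelled column name is easy to introduce. A small check like the one below can catch that before any LLM call is made. It is only a sketch: the helper `check_feature_descriptions` is hypothetical, the required keys are inferred from the examples above rather than from the library's documentation, and whether every column needs a description is likewise an assumption.

```python
# Keys used by every entry in the examples above (inferred, not an official schema).
REQUIRED_KEYS = {"name", "type", "description"}


def check_feature_descriptions(feature_descriptions, columns):
    """Return a list of human-readable problems; the list is empty if none are found.

    Hypothetical helper for illustration; not part of skfeaturellm.
    """
    problems = []
    known = set(columns)
    for i, entry in enumerate(feature_descriptions):
        # Flag entries that lack one of the keys seen in the examples.
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i} is missing keys: {sorted(missing)}")
        # Flag names that do not match any column in the data.
        name = entry.get("name")
        if name is not None and name not in known:
            problems.append(f"entry {i} names unknown column: {name!r}")
    # Flag columns that no entry describes.
    described = {entry.get("name") for entry in feature_descriptions}
    for column in columns:
        if column not in described:
            problems.append(f"column {column!r} has no description")
    return problems
```

With the objects from either example, the call would be `check_feature_descriptions(feature_descriptions, X_train.columns)`; an empty list means the names and keys line up.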