Get Started
===========

The following information is designed to get users up and running with ``skfeaturellm`` quickly. For more detailed information, see the links in each of the subsections.

Installation
~~~~~~~~~~~~

``skfeaturellm`` currently supports:

- environments with Python version 3.10, 3.11, or 3.12
- macOS, Unix-like operating systems, and Windows 8.1 or higher
- installation via PyPI

Please see the :doc:`installation` guide for step-by-step instructions on installing the package.

Key Concepts
~~~~~~~~~~~~

``skfeaturellm`` is a Python library that brings the power of Large Language Models (LLMs) to feature engineering for tabular data, wrapped in a familiar scikit-learn-style API. The library is **model-agnostic**: it works with any LLM available through LangChain (OpenAI, Anthropic, etc.). It uses the LLM to automatically generate and implement meaningful features for your machine learning tasks.

Quickstart
~~~~~~~~~~

The code snippets below introduce ``skfeaturellm``'s core workflow. Both examples follow the same pattern:

1. Split data into train and test sets.
2. Call ``fit()`` on the training set only, passing ``y`` so that dataset statistics are injected into the LLM prompt.
3. Call ``transform()`` on each split independently.

For an automated multi-round generate → select → feedback loop, use ``fit_selective()`` instead of ``fit()``. See the :doc:`user_guide` for details.

.. note::

   Always fit on training data only to avoid leaking test-set information into the LLM prompt.

Classification
--------------

Example applied to a classification task. The example uses the Iris plants dataset from ``sklearn.datasets``.

.. code-block:: python

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    from skfeaturellm import LLMFeatureEngineer

    iris_data = load_iris(as_frame=True)
    X, y = iris_data.data, iris_data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    target_description = (
        "Classification task predicting species of iris plants "
        "(3 classes: setosa, versicolor, virginica)"
    )

    feature_descriptions = [
        {"name": "sepal length (cm)", "type": "float64", "description": "The sepal lengths in centimeters"},
        {"name": "sepal width (cm)", "type": "float64", "description": "The sepal widths in centimeters"},
        {"name": "petal length (cm)", "type": "float64", "description": "The petal lengths in centimeters"},
        {"name": "petal width (cm)", "type": "float64", "description": "The petal widths in centimeters"},
    ]

    llm_feature_engineer = LLMFeatureEngineer(
        problem_type="classification",
        model_name="gpt-4o",
        max_features=5,
    )

    # Fit on training data; passing y injects dataset statistics into the LLM prompt
    llm_feature_engineer.fit(
        X=X_train,
        y=y_train,
        feature_descriptions=feature_descriptions,
        target_description=target_description,
    )

    # Transform train and test independently
    X_train_transformed = llm_feature_engineer.transform(X_train)
    X_test_transformed = llm_feature_engineer.transform(X_test)

    print(X_train_transformed.columns.tolist())

Regression
----------

Example applied to a regression task. The example uses the Diabetes dataset from ``sklearn.datasets``.

.. code-block:: python

    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split

    from skfeaturellm import LLMFeatureEngineer

    diabetes_data = load_diabetes(as_frame=True)
    X, y = diabetes_data.data, diabetes_data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    target_description = (
        "Regression task predicting the quantitative measure of disease progression "
        "one year after baseline"
    )

    norm_method = "mean centered and scaled by the standard deviation"

    feature_descriptions = [
        {"name": "age", "type": "float64", "description": f"Age in years ({norm_method})"},
        {"name": "sex", "type": "float64", "description": f"Sex of the patient ({norm_method})"},
        {"name": "bmi", "type": "float64", "description": f"Body mass index ({norm_method})"},
        {"name": "bp", "type": "float64", "description": f"Average blood pressure ({norm_method})"},
        {"name": "s1", "type": "float64", "description": f"TC, total serum cholesterol ({norm_method})"},
        {"name": "s2", "type": "float64", "description": f"LDL, low-density lipoprotein ({norm_method})"},
        {"name": "s3", "type": "float64", "description": f"HDL, high-density lipoprotein ({norm_method})"},
        {"name": "s4", "type": "float64", "description": f"TCH, total cholesterol/HDL ratio ({norm_method})"},
        {"name": "s5", "type": "float64", "description": f"s5 ltg, possibly log of serum triglycerides ({norm_method})"},
        {"name": "s6", "type": "float64", "description": f"s6 glu, blood sugar level ({norm_method})"},
    ]

    llm_feature_engineer = LLMFeatureEngineer(
        problem_type="regression",
        model_name="gpt-4o",
        max_features=5,
    )

    # Fit on training data; passing y injects dataset statistics into the LLM prompt
    llm_feature_engineer.fit(
        X=X_train,
        y=y_train,
        feature_descriptions=feature_descriptions,
        target_description=target_description,
    )

    # Transform train and test independently
    X_train_transformed = llm_feature_engineer.transform(X_train)
    X_test_transformed = llm_feature_engineer.transform(X_test)

    print(X_train_transformed.columns.tolist())
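
Building feature descriptions
-----------------------------

Both examples write ``feature_descriptions`` by hand as a list of dictionaries with ``name``, ``type``, and ``description`` keys. For wider tables it may be convenient to prefill the names and types from the DataFrame and supply only the descriptions. The sketch below is one way to do that; ``build_feature_descriptions`` is a hypothetical helper written for this page, not part of ``skfeaturellm``, and it assumes the three-key format shown in the examples is all that ``fit()`` needs.

.. code-block:: python

    import pandas as pd


    def build_feature_descriptions(df, descriptions=None):
        """Return one {"name", "type", "description"} dict per column of df.

        Hypothetical helper (not part of skfeaturellm). Columns without an
        entry in `descriptions` fall back to their own name as description.
        """
        descriptions = descriptions or {}
        return [
            {
                "name": str(column),
                "type": str(dtype),  # e.g. "float64", taken from the DataFrame
                "description": descriptions.get(column, str(column)),
            }
            for column, dtype in df.dtypes.items()
        ]


    # Small illustrative frame; in practice pass X_train from the examples above.
    X_demo = pd.DataFrame(
        {"sepal length (cm)": [5.1, 4.9], "sepal width (cm)": [3.5, 3.0]}
    )

    feature_descriptions = build_feature_descriptions(
        X_demo,
        descriptions={"sepal length (cm)": "The sepal lengths in centimeters"},
    )

    print(feature_descriptions)

The resulting list has the same shape as the hand-written ones above. Richer descriptions still tend to be worth writing for the columns that matter most, since they are the main context the LLM receives about each feature.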