Text vectorization

Overview

LLMs can be used solely for data preprocessing by embedding a chunk of text of arbitrary length to a fixed-dimensional vector, that can be further used with virtually any model (e.g. classification, regression, clustering, etc.).

Example 1: Embedding the text

from skllm.models.gpt.vectorization import GPTVectorizer

vectorizer = GPTVectorizer(batch_size=2)
X = vectorizer.fit_transform(["This is a text", "This is another text"])

Example 2: Combining the vectorizer with the XGBoost classifier in a scikit-learn pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)
clf.fit(X_train, y_train_encoded)
yh = clf.predict(X_test)

API Reference

The following API reference only lists the parameters needed for the initialization of the estimator. The remaining methods follow the syntax of a scikit-learn transformer.

GPTVectorizer

from skllm.models.gpt.vectorization import GPTVectorizer

Parameter	Type	Description
`model`	`str`	Model to use, by default "text-embedding-3-small".
`batch_size`	`int`	Number of samples per request, by default 1.
`key`	`Optional[str]`	Estimator-specific API key; if None, retrieved from the global config, by default None.
`org`	`Optional[str]`	Estimator-specific ORG key; if None, retrieved from the global config, by default None.