Tagging

Named Entity Recognition

Overview

Warning

Named Entity Recognition is an experimental feature and may be subject to instability; the API and/or functionality could change.

Named Entity Recognition is the process of locating and classifying named entities in a provided text.

Currently, Scikit-LLM provides a single NER estimator, Explainable NER, which works only with the GPT family of models.

Example usage:

from skllm.models.gpt.tagging.ner import GPTExplainableNER as NER

entities = {
  "PERSON": "A name of an individual.",
  "ORGANIZATION": "A name of a company.",
  "DATE": "A specific time reference."
}

data = [
  "Tim Cook announced new Apple products in San Francisco on June 3, 2022.",
  "Elon Musk visited the Tesla factory in Austin on January 10, 2021.",
  "Mark Zuckerberg introduced Facebook Metaverse in Silicon Valley on May 5, 2023."
]

ner = NER(entities=entities, display_predictions=True)
tagged = ner.fit_transform(data)

The model tags the entities and provides a short reasoning behind each choice. If display_predictions is set to True, the model outputs are parsed automatically and presented in a human-readable way: each entity is highlighted, and its explanation is displayed when hovering over the entity.

Example output:

[Rendered output] A legend lists the entity types (PERSON, ORGANIZATION, DATE), and each sentence is shown with its recognized entities highlighted, e.g. "Tim Cook" (PERSON), "Apple" (ORGANIZATION), and "June 3, 2022" (DATE) in the first sentence.

The display_predictions functionality works both in Jupyter notebooks and in plain Python scripts. When used outside Jupyter, an HTML page is auto-generated and opened in a new browser window.

Sparse vs Dense NER

We distinguish between two modes of generating predictions: sparse and dense.

In dense mode, the model produces the complete (tagged) output directly, while in sparse mode it produces only a list of entities, which is then mapped back onto the text via regular expressions.
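
To illustrate the idea, the sketch below shows how a sparse output (a plain list of entities) might be mapped back onto the source text with regular expressions. This is a simplified, hypothetical illustration of the mechanism, not the library's actual implementation:

import re

text = "Tim Cook announced new Apple products in San Francisco on June 3, 2022."

# Hypothetical sparse output: a list of (entity text, entity label) pairs.
sparse_entities = [
    ("Tim Cook", "PERSON"),
    ("Apple", "ORGANIZATION"),
    ("June 3, 2022", "DATE"),
]

# Locate each predicted entity in the text via regex and record its span.
tagged_spans = []
for entity_text, label in sparse_entities:
    for match in re.finditer(re.escape(entity_text), text):
        tagged_spans.append((match.start(), match.end(), label))

print(tagged_spans)
# [(0, 8, 'PERSON'), (23, 28, 'ORGANIZATION'), (59, 71, 'DATE')]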

In most scenarios, sparse mode is preferable for the following reasons:

  • lower number of output tokens (cheaper to use);
  • strict validation: it is guaranteed that the output is invertible and contains only the specified entities;
  • higher accuracy, especially with smaller models.

Dense mode should only be used when the following conditions are met:

  • a larger model is used (e.g. gpt-4);
  • the text is expected to contain multiple (distinct) instances of lexically ambiguous words.

For example, in the sentence "Apple is the favorite fruit of the CEO of Apple", the two occurrences of the word "Apple" should be classified as different entities, which is only possible in dense mode.
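
Dense mode can be enabled by setting sparse_output to False at initialization. The snippet below is a minimal sketch reusing the entities and data defined in the example above; the larger model is chosen per the recommendation above:

from skllm.models.gpt.tagging.ner import GPTExplainableNER as NER

# Dense mode: the model returns the fully tagged text directly.
ner = NER(entities=entities, sparse_output=False, model="gpt-4", display_predictions=True)
tagged = ner.fit_transform(data)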

API Reference

The following API reference only lists the parameters needed for the initialization of the estimator. The remaining methods follow the syntax of a scikit-learn transformer.

GPTExplainableNER

from skllm.models.gpt.tagging.ner import GPTExplainableNER
Parameter | Type | Description
entities | dict | A dictionary of entities to recognize, with keys as uppercase entity names and values as descriptions.
display_predictions | Optional[bool] | Determines whether to display predictions, by default False.
sparse_output | Optional[bool] | Determines whether to generate a sparse representation of the predictions, by default True.
model | Optional[str] | A model to use, by default "gpt-4o".
key | Optional[str] | Estimator-specific API key; if None, retrieved from the global config, by default None.
org | Optional[str] | Estimator-specific ORG key; if None, retrieved from the global config, by default None.
num_workers | Optional[int] | Number of workers (threads) to use, by default 1.
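
As a minimal sketch, the estimator can be initialized with all documented parameters set explicitly (the key and org values below are placeholders):

from skllm.models.gpt.tagging.ner import GPTExplainableNER

ner = GPTExplainableNER(
    entities={"PERSON": "A name of an individual."},
    display_predictions=False,   # default
    sparse_output=True,          # default; set to False for dense mode
    model="gpt-4o",              # default model
    key="<YOUR_OPENAI_KEY>",     # placeholder; falls back to the global config if None
    org="<YOUR_OPENAI_ORG>",     # placeholder; falls back to the global config if None
    num_workers=1,               # default
)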