spaCy Pipeline

Introducing the spaCy Pipeline

  • When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object
  • The Doc is then processed in several different steps
  • This is referred to as the processing pipeline
  • The pipeline typically consists of a part-of-speech tagger, a dependency parser, and an entity recognizer
  • Each pipeline component returns the processed Doc, which is then passed on to the next component
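
The hand-off described above can be sketched in plain Python. This is a toy illustration of the idea, not spaCy's implementation: ToyDoc, ToyToken, the three toy_* functions, and run_pipeline are all hypothetical names invented for this example, and the tagging rule is a deliberately crude stand-in.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyToken:
    text: str
    tag_: str = ""  # filled in by the toy tagger


@dataclass
class ToyDoc:
    tokens: List[ToyToken]
    ents: List[str] = field(default_factory=list)  # filled in by the toy entity step


def toy_tokenizer(text: str) -> ToyDoc:
    # First step: raw text in, Doc-like container out (whitespace split only).
    return ToyDoc(tokens=[ToyToken(word) for word in text.split()])


def toy_tagger(doc: ToyDoc) -> ToyDoc:
    # Crude stand-in rule: capitalized words become PROPN, everything else X.
    for token in doc.tokens:
        token.tag_ = "PROPN" if token.text[:1].isupper() else "X"
    return doc


def toy_entity_recognizer(doc: ToyDoc) -> ToyDoc:
    # Reads the tags written by the previous component, so order matters.
    doc.ents = [t.text for t in doc.tokens if t.tag_ == "PROPN"]
    return doc


def run_pipeline(text: str, components: List[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    # Tokenize once, then pass the same document through each component in turn.
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


doc = run_pipeline("alice visited Berlin", [toy_tagger, toy_entity_recognizer])
print([(t.text, t.tag_) for t in doc.tokens])
print(doc.ents)
```

Each component takes the document and returns it, which is why components can be chained in a simple loop and why a later step can rely on annotations written by an earlier one.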

Describing the Pipeline

  • The Tokenizer component separates raw text into tokens
  • The Tagger component assigns part-of-speech tags
  • The DependencyParser component assigns dependency labels
  • The EntityRecognizer component detects and labels named entities
  • The TextCategorizer component assigns document labels

Name       Component          Creates
tokenizer  Tokenizer          Doc
tagger     Tagger             Doc[i].tag_
parser     DependencyParser   Doc[i].dep_
ner        EntityRecognizer   Doc.ents, Doc[i].ent_type_
textcat    TextCategorizer    Doc.cats
