What is a Token?
- A Token represents a word, punctuation symbol, whitespace, etc.
- Each Token is a piece of the input text held in a Doc object, with its string values encoded as hash values
- Linguistic annotations are available as Token attributes
- To get the readable string form of an attribute, append an underscore _ to its name (for example, lemma_ instead of lemma)
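The difference between the hash and the readable form of an attribute can be sketched as follows. This is a minimal illustration, assuming spaCy is installed; it uses a blank English pipeline (tokenizer only) so no trained model is required, and the sample sentence is only an example.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")
doc = nlp("Wow! Spacy is a great tool.")

token = doc[2]

# Without the underscore, the attribute is an integer hash.
orth_hash = token.orth
# With the underscore, the same attribute is the readable string.
orth_text = token.orth_

# The vocabulary's string store maps the hash back to the string.
resolved = nlp.vocab.strings[orth_hash]

print(orth_hash, orth_text, resolved)
```

The same pattern applies to lemma / lemma_, pos / pos_, and the other annotated attributes once a trained pipeline has filled them in.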
More on Token Attributes
- text: The input text content
- lemma_: Base form of the token
- pos_: Generic (coarse-grained) part-of-speech tag
- tag_: Specific (fine-grained) part-of-speech tag
- dep_: Dependency relation
- shape_: Orthographic features of the token
- is_alpha: Does the token consist of alphabetic characters?
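As a rough sketch of how these attributes can be inspected for every token, the loop below collects a few of them into rows. It assumes spaCy is installed and uses a blank English pipeline, which only tokenizes; attributes such as lemma_, pos_, tag_ and dep_ need a trained pipeline, so only tokenizer-level attributes are shown here.

```python
import spacy

# Assumption: a blank English pipeline; it tokenizes but does not tag or parse.
nlp = spacy.blank("en")
doc = nlp("Wow! Spacy is a great tool.")

# Collect (text, shape_, is_alpha) for each token in the Doc.
rows = [(token.text, token.shape_, token.is_alpha) for token in doc]

for text, shape, is_alpha in rows:
    print(f"{text!r:10} shape={shape!r:8} is_alpha={is_alpha}")
```

Note that is_alpha is False for punctuation such as "!" and True for purely alphabetic tokens such as "Wow".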
Sample Code
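>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")  # assumption: any trained English pipeline that is installed
>>> doc = nlp("Wow! Spacy is a great tool and I'm wanting to learn more. Please, teach me, sir.")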
# Text of token
>>> doc[10].text
'wanting'
# Lemma of token
>>> doc[10].lemma
7597692042947428029 # some hash value
>>> doc[10].lemma_
'want'
# Generic POS of token
>>> doc[10].pos_
'VERB'
# Specific POS of token
>>> doc[10].tag_
'VBG'
# Dependency relation of token (the label depends on the loaded model's parse)
>>> doc[10].dep_
'ROOT'
# Shape of token
>>> doc[10].shape_
'xxxx'