Describing Tokenization
- Tokenization is the task of splitting input text into meaningful segments called tokens
- The input to the tokenizer is a unicode text
- The output is a Doc object
- To construct a Doc object, we need a Vocab object, a sequence of word strings, and optionally a sequence of booleans indicating whether each word is followed by a space (as shown in the sketch below)
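For example, a Doc can be built directly from these pieces without running the tokenizer at all. A minimal sketch using spaCy's Doc and Vocab constructors; the words and spaces below are made up for illustration:

from spacy.tokens import Doc
from spacy.vocab import Vocab

# Illustrative words and spaces; spaces[i] says whether words[i] is
# followed by a space in the original text.
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]

doc = Doc(Vocab(), words=words, spaces=spaces)
print([token.text for token in doc])  # ['Hello', ',', 'world', '!']
print(doc.text)                       # Hello, world!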
The Algorithm for Tokenization
- The Tokenizer receives some raw text.
- Iterate over whitespace-separated substrings.
- Check whether we have an explicitly defined rule for this substring. If we do, use the rule and skip the remaining steps.
- Otherwise, try consuming a prefix. If we consume a prefix, return to the rule check so that explicit rules always get priority.
- Otherwise, try consuming a suffix. If we consume a suffix, return to the rule check.
- Otherwise, look for a special case or a whole-token match (e.g. a URL or emoticon); if one matches, the remaining substring is consumed as the token(s) the rule defines.
- Otherwise, try consuming infixes (e.g. hyphens) and split the substring into tokens on all infixes.
- Once we get to this step without any consumption or discovered rule, handle the substring as a single token (see the worked example after this list).
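To watch these steps fire on real input, spaCy's Tokenizer.explain method reports which rule or pattern produced each token. A minimal sketch, assuming spaCy is installed and that a blank English pipeline carries the default tokenization rules; the rule names in the comments are indicative, not exact:

import spacy

nlp = spacy.blank("en")   # blank pipeline: just the default English tokenizer
text = "Let's go to N.Y.!"

# Tokenizer.explain returns (rule, token_text) pairs describing which
# rule or pattern produced each token.
for rule, token_text in nlp.tokenizer.explain(text):
    print(f"{token_text!r:10} <- {rule}")

# Roughly: "Let" and "'s" come from a special-case rule, "N.Y." from a
# special case as well, and "!" is peeled off as a suffix.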
spaCy Pseudocode for Tokenization
def tokenizer_pseudo_code(text, special_cases, prefix_search,
                          suffix_search, infix_finditer, token_match):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Peel off prefixes and suffixes, re-checking special cases
            # each time so that explicit rules always get priority.
            while (prefix_search(substring) or
                   suffix_search(substring)):
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ''
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            # Whatever remains: special case, whole-token match, infixes,
            # or a single token.
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ''
            elif token_match(substring):
                tokens.append(substring)
                substring = ''
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    # Append the text before the infix, then the infix itself.
                    tokens.append(substring[offset:match.start()])
                    tokens.append(substring[match.start():match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ''
            elif substring:
                tokens.append(substring)
                substring = ''
        # Suffixes were peeled off right to left, so restore their order.
        tokens.extend(reversed(suffixes))
    return tokens
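One way to sanity-check the pseudocode above is to drive it with a toy special case and a few toy regexes. These patterns are made up for illustration and are far simpler than spaCy's real prefix, suffix, and infix rules:

import re

special_cases = {":)": [":)"]}                        # explicit rule: keep the emoticon whole
prefix_search = re.compile(r'^[\["\'(]').search       # opening punctuation
suffix_search = re.compile(r'[\]"\')\.,!?]$').search  # closing punctuation
infix_finditer = re.compile(r'-').finditer            # split on hyphens
token_match = lambda s: None                          # no whole-token patterns (e.g. URLs)

print(tokenizer_pseudo_code("hello-world. :)", special_cases, prefix_search,
                            suffix_search, infix_finditer, token_match))
# ['hello', '-', 'world', '.', ':)']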