Motivating Rule-Based Matching

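The Matcher class operates over Token objects: each rule describes a sequence of tokens by their attributes
The PhraseMatcher class operates over Doc objects: each rule is an already tokenized phrase to look for
Rules can therefore refer to token annotations (Matcher) or to large terminology lists (PhraseMatcher)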
For example, we may use the Matcher class to find a combination of three tokens:
- A token whose lowercase form matches hello
- A token whose is_punct flag is set to True
- A token whose lowercase form matches world
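
The three-token rule above can be sketched with the Matcher class as follows. This is a minimal illustration, not the only way to write it: it assumes spaCy 2.2.2 or later (for the list form of add), uses a blank English pipeline because the pattern needs only tokenizer-level attributes, and the rule name HELLO_WORLD and the sample sentence are made up for the example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough, because LOWER and
# IS_PUNCT are lexical attributes set by the tokenizer alone.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token, in order: "hello", any punctuation mark, "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]

# "HELLO_WORLD" is a hypothetical rule name chosen for this sketch.
# The list-of-patterns form of add() assumes spaCy 2.2.2 or later.
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")

# Each match is (match_id, start, end); end is exclusive.
matches = [
    (nlp.vocab.strings[match_id], start, end)
    for match_id, start, end in matcher(doc)
]
matched_texts = [doc[start:end].text for _, start, end in matches]
print(matches)
print(matched_texts)
```

Only the first "Hello, world" is matched here; the second occurrence has no punctuation token between the two words, so it does not satisfy the middle part of the rule.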

Sample Code

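>>> import spacy
>>> from spacy.matcher import PhraseMatcher

# Load a pipeline (the model name is an assumption; any English pipeline works)
>>> nlp = spacy.load('en_core_web_sm')
>>> doc = nlp('Barack Obama defends his healthcare reforms')

# Initialize and add rules
# (spaCy v2 signature; in spaCy v3 pass the patterns as a list instead,
#  e.g. matcher.add('OBAMA', [nlp('Barack Obama')]))
>>> matcher = PhraseMatcher(nlp.vocab)
>>> matcher.add('OBAMA', None, nlp('Barack Obama'))
>>> matcher.add('HEALTH', None,
...             nlp('health care reform'),
...             nlp('healthcare reforms'))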

# Find matches: each is (match_id, start, end), with end exclusive
>>> matcher(doc)
[(7732777389095836264, 0, 2), (3161894980173008574, 4, 6)]

# Look up the rule name behind each match id
>>> nlp.vocab.strings[7732777389095836264]
'OBAMA'
>>> nlp.vocab.strings[3161894980173008574]
'HEALTH'
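
For a larger terminology list, the same idea can be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: the terms list and the TERMS rule name are hypothetical, a blank English pipeline stands in for a trained one, and the list form of add assumes spaCy 2.2.2 or later. Patterns are built with nlp.make_doc, which only tokenizes and so avoids running the full pipeline on every term.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a tokenizer-only pipeline is enough for exact phrase matching.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

# Hypothetical terminology list; in practice this could hold many entries.
terms = ["Barack Obama", "healthcare reforms", "health care reform"]

# make_doc() only tokenizes, which is cheaper than calling nlp() per term.
patterns = [nlp.make_doc(term) for term in terms]

# "TERMS" is a made-up rule name; list form assumes spaCy 2.2.2 or later.
matcher.add("TERMS", patterns)

doc = nlp("Barack Obama defends his healthcare reforms")

# Sort by start offset so the result order does not depend on the matcher.
found = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1])
]
print(found)
```

Resolving the match id through nlp.vocab.strings and slicing the Doc with the start and end offsets gives the rule name and the matched text in one step.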
