14  Sequence Matching

Author

Patrick J. Burns

Published

August 26, 2024

14.1 Sequence matching with LatinCy

We can use LatinCy annotations as the basis for matching spans of tokens using spaCy’s Matcher. This includes basic Token attributes like ORTH, TEXT, NORM, and LOWER as well as those annotated by the LatinCy pipeline like LEMMA, POS, TAG, MORPH, DEP, and ENT_TYPE. There are also more general attributes like IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, and LIKE_NUM, among others; the full list can be found in the spaCy Matcher documentation. Moreover, there are many operators, quantifiers, and other options that can be used to create increasingly complex patterns. The combinatorial possibilities of the Matcher are frankly enormous, so this should be considered only a most basic introduction to what is possible.
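To orient ourselves before working with the Livy text, here is a minimal sketch of the pattern syntax. Since IS_TITLE and LIKE_NUM are lexical attributes, no trained components are needed and a blank multi-language pipeline suffices; the sentence below is invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

# IS_TITLE and LIKE_NUM are lexical attributes, so a blank
# multi-language pipeline is enough for this sketch.
nlp = spacy.blank("xx")
matcher = Matcher(nlp.vocab)

# Two title-cased tokens in a row, e.g. a Roman name...
matcher.add("title_pair", [[{"IS_TITLE": True}, {"IS_TITLE": True}]])
# ...and any number-like token
matcher.add("numeral", [[{"LIKE_NUM": True}]])

doc = nlp("Titus Livius scripsit libros 142")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

The same loop over `(match_id, start, end)` triples is used throughout this chapter with the full LatinCy pipeline.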

Here we run through a quick example based on finding variations of res and then pattern-matching extensions of res publica, using the Praefatio to Livy’s Ab urbe condita.

# Imports & setup

import spacy
from pprint import pprint
from tabulate import tabulate

nlp = spacy.load('la_core_web_lg')

with open('livy_praefatio.txt') as f:
    text = f.read() 

doc = nlp(text)
print(doc[:100])
Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim nec satis scio nec , si sciam , dicere ausim , quippe qui cum ueterem tum uolgatam esse rem uideam , dum noui semper scriptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem uetustatem superaturos credunt . Utcumque erit , iuuabit tamen rerum gestarum memoriae principis terrarum populi pro uirili parte et ipsum consuluisse ; et si in tanta scriptorum turba mea fama in obscuro sit , nobilitate ac magnitudine eorum me qui nomini officient meo consoler . Res est praeterea et immensi operis ,

The Matcher is initialized with the Vocab object from our loaded pipeline.

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
print(matcher)
<spacy.matcher.matcher.Matcher object at 0x109be6ef0>

We use the Matcher by adding patterns, quite sensibly using the add method. With the add method, we assign a match_id “name” for the pattern along with the patterns themselves. The patterns are lists of lists of dictionaries; those dictionaries are arranged sequentially by the token sequences we want to match, where the dictionary keys are the attributes and the dictionary values are our specific values to be matched for that attribute. So, in the example below, we are looking for any span of tokens in the provided Doc where the TEXT attribute matches—and matches exactly—the string “res”. The Matcher returns the match_id as well as the start and end indices for each matched span.

pattern = [{'TEXT': 'res'}]
matcher.add('res_tokens', [pattern])

matches = matcher(doc)

matches_data = []

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matches_data.append((string_id, start, end, span.text))

print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))
Match ID      Start    End  Matched text
----------  -------  -----  --------------
res_tokens        8      9  res
res_tokens      422    423  res

Since we will be running many patterns below, we wrap the matching and table-printing steps in two helper functions.

# Helper functions

def pattern2matches(pattern_name, pattern):
    matcher = Matcher(nlp.vocab)
    matcher.add(pattern_name, [pattern])
    matches = matcher(doc)
    matches_data = []
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        matches_data.append((string_id, start, end, span.text))

    return matches_data

def tabulate_matches(pattern_name, pattern):
    matches_data = pattern2matches(pattern_name, pattern)
    print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))

Note that the matches above on TEXT are case-sensitive. We can widen our search for res by using the LOWER attribute…

pattern = [{'LOWER': 'res'}]
tabulate_matches('res_uncased', pattern)
Match ID       Start    End  Matched text
-----------  -------  -----  --------------
res_uncased        8      9  res
res_uncased       93     94  Res
res_uncased      422    423  res

Extending this logic even further, we can widen the search again by matching not only the token “Res”/“res” but all tokens for which the LatinCy lemmatizers have assigned the lemma “res”. This is done by using the LEMMA attribute.

pattern = [{'LEMMA': 'res'}]
tabulate_matches('res_lemma', pattern)
Match ID      Start    End  Matched text
----------  -------  -----  --------------
res_lemma         8      9  res
res_lemma        30     31  rem
res_lemma        39     40  rebus
res_lemma        57     58  rerum
res_lemma        93     94  Res
res_lemma       213    214  rerum
res_lemma       378    379  rerum
res_lemma       397    398  rei
res_lemma       422    423  res
res_lemma       461    462  rerum
res_lemma       504    505  rei

So far, all of our patterns have included only a single token. We can extend our search to include multiple sequential tokens by adding more dictionaries to the list of dictionaries. In the example below, we are looking for any span where the first token has the lemma “res” and is followed by any token with the POS of “NOUN”.

pattern = [{'LEMMA': 'res'}, {'POS': 'NOUN'}]
tabulate_matches('res_lemma_noun', pattern)
Match ID          Start    End  Matched text
--------------  -------  -----  --------------
res_lemma_noun        8     10  res populi

By contrast, we can return spans where the token with the lemma “res” is followed by a token whose POS is not “NOUN”. The “!” operator negates the token pattern it is attached to.

pattern = [{'LEMMA': 'res'}, {'POS': 'NOUN', "OP": "!"}]
tabulate_matches('res_lemma_noun_not', pattern)
Match ID              Start    End  Matched text
------------------  -------  -----  --------------
res_lemma_noun_not       30     32  rem uideam
res_lemma_noun_not       39     41  rebus certius
res_lemma_noun_not       57     59  rerum gestarum
res_lemma_noun_not       93     95  Res est
res_lemma_noun_not      213    215  rerum gestarum
res_lemma_noun_not      378    380  rerum salubre
res_lemma_noun_not      397    399  rei publicae
res_lemma_noun_not      422    424  res publica
res_lemma_noun_not      461    463  rerum minus
res_lemma_noun_not      504    506  rei absint

The Matcher allows for “fuzzy” matching based on Levenshtein distance (details in documentation, linked below)…

pattern = [{'LEMMA': 'res'}, {"TEXT": {"FUZZY": "public"}}]
tabulate_matches('res_public_fuzzy', pattern)
Match ID            Start    End  Matched text
----------------  -------  -----  --------------
res_public_fuzzy      397    399  rei publicae
res_public_fuzzy      422    424  res publica

Of course, you could search for res publica more directly with a two-lemma pattern…

pattern = [{'LEMMA': 'res'}, {'LEMMA': 'publicus'}]
tabulate_matches('res_publica_lemmas', pattern)
Match ID              Start    End  Matched text
------------------  -------  -----  --------------
res_publica_lemmas      397    399  rei publicae
res_publica_lemmas      422    424  res publica
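Quantifier operators extend this kind of pattern further. As a sketch, OP "?" makes a token optional, so a single pattern can cover res publica with or without an intervening word. A blank pipeline suffices here since only the lexical LOWER attribute is used; the phrase “res uere publica” is invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Only the lexical LOWER attribute is used, so a blank pipeline is enough.
nlp = spacy.blank("xx")
matcher = Matcher(nlp.vocab)

# "?" makes the middle token optional (match zero or one times)
pattern = [{"LOWER": "res"}, {"LOWER": "uere", "OP": "?"}, {"LOWER": "publica"}]
matcher.add("res_publica_opt", [pattern])

for text in ["res publica", "res uere publica"]:
    doc = nlp(text)
    for _, start, end in matcher(doc):
        print(doc[start:end].text)
```

The other quantifiers ("+", "*", and the curly-brace forms) work the same way, applying to the single token dictionary they are attached to.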

Patterns are not limited to text search; that is, they can be based entirely on annotations. Here is a list of all NOUN-ADJ sequences in Livy’s “Praefatio”…

pattern = [{"POS": "NOUN"}, {"POS": "ADJ"}]
tabulate_matches('noun_adjs', pattern)
Match ID      Start    End  Matched text
----------  -------  -----  -------------------
noun_adjs         9     11  populi Romani
noun_adjs        39     41  rebus certius
noun_adjs        46     48  arte rudem
noun_adjs       127    129  origines proxima
noun_adjs       130    132  originibus minus
noun_adjs       206    208  urbem poeticis
noun_adjs       236    238  urbium augustiora
noun_adjs       259    261  populo Romano
noun_adjs       275    277  gentes humanae
noun_adjs       378    380  rerum salubre
noun_adjs       397    399  rei publicae
noun_adjs       405    407  inceptu foedum
noun_adjs       422    424  res publica
noun_adjs       430    432  exemplis ditior
noun_adjs       536    538  successus prosperos

We can also produce a list of alliterative pairs, though this takes some creative regexing…

matcher = Matcher(nlp.vocab)

for letter in "abcdefghijklmnopqrstuvwxyz":
    pattern = [{"LOWER": {"REGEX": rf'\b{letter}.+?\b'}, "OP": "{2,}"}]
    matcher.add('alliterative_pairs', [pattern])

matches = matcher(doc)

matches_data = []

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    matches_data.append((string_id, start, end, span.text))

print(tabulate(matches_data, headers=['Match ID', 'Start', 'End', 'Matched text']))
Match ID              Start    End  Matched text
------------------  -------  -----  --------------------------
alliterative_pairs        3      5  sim si
alliterative_pairs       13     15  satis scio
alliterative_pairs       17     19  si sciam
alliterative_pairs       23     25  quippe qui
alliterative_pairs       35     37  semper scriptores
alliterative_pairs       41     43  aliquid allaturos
alliterative_pairs       62     64  populi pro
alliterative_pairs      102    104  supra septingentesimum
alliterative_pairs      142    144  pridem praeualentis
alliterative_pairs      142    145  pridem praeualentis populi
alliterative_pairs      143    145  praeualentis populi
alliterative_pairs      155    157  praemium petam
alliterative_pairs      204    206  conditam condendamue
alliterative_pairs      278    280  aequo animo
alliterative_pairs      290    292  animaduersa aut
alliterative_pairs      292    294  existimata erunt
alliterative_pairs      346    348  magis magis
alliterative_pairs      364    366  nostra nec
alliterative_pairs      367    369  pati possumus
alliterative_pairs      367    370  pati possumus peruentum
alliterative_pairs      368    370  possumus peruentum
alliterative_pairs      387    389  in inlustri
alliterative_pairs      394    396  tibi tuae
alliterative_pairs      480    482  pereundi perdendi
alliterative_pairs      515    517  deorum dearum
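The mechanics of that pattern can be sketched in isolation: the REGEX attribute is tested against each token’s text individually, and the curly-brace quantifier “{2,}” (available from spaCy v3.5) requires two or more consecutive matching tokens. Note that, as in the table above, overlapping sub-runs are all returned. The blank pipeline and isolated phrase below are for illustration only.

```python
import spacy
from spacy.matcher import Matcher

# REGEX is applied per token; OP "{2,}" (spaCy >= 3.5) requires two
# or more consecutive tokens matching the pattern.
nlp = spacy.blank("xx")
matcher = Matcher(nlp.vocab)
matcher.add("p_run", [[{"LOWER": {"REGEX": r"^p"}, "OP": "{2,}"}]])

doc = nlp("pati possumus peruentum est")
for _, start, end in matcher(doc):
    print(doc[start:end].text)
```

This is why the table above reports both “pati possumus” and the longer “pati possumus peruentum”: every qualifying sub-run of two or more tokens is a separate match.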

References

spaCy Matcher