We can use LatinCy annotations as the basis for matching spans of tokens using spaCy’s Matcher. This includes basic Token attributes like ORTH, TEXT, NORM, and LOWER as well as those annotated by the LatinCy pipeline like LEMMA, POS, TAG, MORPH, DEP, and ENT_TYPE. There are also more general attributes like IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, and LIKE_NUM, among others. The full list can be found here. Moreover, there are many operators, quantifiers, and other operations that can be used to create increasingly complex patterns. The number of possible pattern combinations for the Matcher is enormous, so this should be considered only a basic introduction to what is possible.
Here we run through a quick example using the Praefatio to Livy’s Ab urbe condita: first we find variations of res, and then we extend the patterns to match res publica.
Facturusne operae pretium sim si a primordio urbis res populi Romani perscripserim nec satis scio nec , si sciam , dicere ausim , quippe qui cum ueterem tum uolgatam esse rem uideam , dum noui semper scriptores aut in rebus certius aliquid allaturos se aut scribendi arte rudem uetustatem superaturos credunt . Utcumque erit , iuuabit tamen rerum gestarum memoriae principis terrarum populi pro uirili parte et ipsum consuluisse ; et si in tanta scriptorum turba mea fama in obscuro sit , nobilitate ac magnitudine eorum me qui nomini officient meo consoler . Res est praeterea et immensi operis ,
The Matcher is initialized with the Vocab object from our loaded pipeline.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
print(matcher)
<spacy.matcher.matcher.Matcher object at 0x109be6ef0>
We use the Matcher by adding patterns with the add method. With the add method, we assign a match_id “name” for the pattern and supply the patterns themselves. The patterns are lists of lists of dictionaries: each inner list describes one token sequence, with one dictionary per token in order, where the dictionary keys are the attributes and the dictionary values are the specific terms to be matched for that attribute. So, in the example below, we are looking for any span of tokens in the provided Doc where the text attribute matches, and matches exactly, the string “res”. The Matcher returns the match_id as well as the start and end indices for each matched span.
Match ID Start End Matched text
----------- ------- ----- --------------
res_uncased 8 9 res
res_uncased 93 94 Res
res_uncased 422 423 res
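The table above carries the match ID res_uncased and includes the capitalized “Res”, which suggests a case-insensitive pattern on LOWER rather than an exact match on TEXT. A minimal sketch of both variants follows. It makes several assumptions: `spacy.blank("xx")` is only a stand-in for the loaded LatinCy pipeline, the Doc is built by hand from two short fragments of the passage (so the indices differ from the table), and the pattern name res_exact is hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank multi-language pipeline stands in for the LatinCy `nlp`.
nlp = spacy.blank("xx")
# Two fragments from the passage, joined by hand for illustration only.
doc = Doc(nlp.vocab, words=["res", "populi", "Romani", "Res", "est", "praeterea"])

matcher = Matcher(nlp.vocab)
# Exact match on the token text: only lowercase "res" qualifies.
matcher.add("res_exact", [[{"TEXT": "res"}]])
# Match on the lowercased form: "Res" and "res" both qualify.
matcher.add("res_uncased", [[{"LOWER": "res"}]])

results = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the string store maps it back to the name we assigned.
    results.append((nlp.vocab.strings[match_id], start, end, doc[start:end].text))

for row in sorted(results):
    print(row)
```

With the real pipeline, `doc = nlp(text)` would replace the hand-built Doc and the rest would stay the same.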
Extending this logic even further, we can widen the search again by matching not only the token “Res”/“res” but all tokens for which the LatinCy lemmatizers have assigned the lemma “res”. This is done by using the LEMMA attribute.
Match ID Start End Matched text
---------- ------- ----- --------------
res_lemma 8 9 res
res_lemma 30 31 rem
res_lemma 39 40 rebus
res_lemma 57 58 rerum
res_lemma 93 94 Res
res_lemma 213 214 rerum
res_lemma 378 379 rerum
res_lemma 397 398 rei
res_lemma 422 423 res
res_lemma 461 462 rerum
res_lemma 504 505 rei
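A lemma-based pattern of this kind might look like the following sketch. The lemmas here are assigned by hand to a few fragments of the passage and are an assumption standing in for LatinCy’s own lemmatizer output; `spacy.blank("xx")` likewise stands in for the loaded pipeline.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank pipeline and hand-assigned lemmas replace LatinCy output.
nlp = spacy.blank("xx")
words = ["res", "populi", "Romani", "rem", "uideam", "in", "rebus", "rerum", "gestarum"]
lemmas = ["res", "populus", "Romanus", "res", "uideo", "in", "res", "res", "gero"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
# One dictionary, so one token per match: any token whose lemma is "res".
matcher.add("res_lemma", [[{"LEMMA": "res"}]])

lemma_hits = sorted((start, doc[start:end].text) for _, start, end in matcher(doc))
for start, text in lemma_hits:
    print(start, text)
```

Because the pattern tests the lemma rather than the surface form, inflected forms such as rem, rebus, and rerum are all caught by a single rule.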
So far, all of our patterns have included only a single token. We can extend our search to include multiple sequential tokens by adding more dictionaries to the list of dictionaries. In the example below, we are looking for any span where the first token has the lemma “res” and is followed by any token with the POS of “NOUN”.
Match ID Start End Matched text
------------------ ------- ----- --------------
res_lemma_noun_not 8 10 res populi
res_lemma_noun_not 30 32 rem uideam
res_lemma_noun_not 39 41 rebus certius
res_lemma_noun_not 57 59 rerum gestarum
res_lemma_noun_not 93 95 Res est
res_lemma_noun_not 213 215 rerum gestarum
res_lemma_noun_not 378 380 rerum salubre
res_lemma_noun_not 397 399 rei publicae
res_lemma_noun_not 422 424 res publica
res_lemma_noun_not 461 463 rerum minus
res_lemma_noun_not 504 506 rei absint
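The table above also lists pairs whose second token is not a noun (rem uideam, Res est), and its match ID ends in _not, so the pattern that produced it may have differed from the description. The sketch below follows the description as written and adds a negated variant with NOT_IN for comparison. The annotations are hand-assigned stand-ins for LatinCy output, and both pattern names are hypothetical.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: hand-assigned lemmas and POS tags replace LatinCy annotations.
nlp = spacy.blank("xx")
words = ["res", "populi", "Romani", "rem", "uideam"]
lemmas = ["res", "populus", "Romanus", "res", "uideo"]
pos = ["NOUN", "NOUN", "ADJ", "NOUN", "VERB"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas, pos=pos)

matcher = Matcher(nlp.vocab)
# Two dictionaries, so two consecutive tokens: lemma "res", then a NOUN.
matcher.add("res_lemma_noun", [[{"LEMMA": "res"}, {"POS": "NOUN"}]])
# Negated variant: lemma "res", then any token whose POS is not NOUN.
matcher.add("res_lemma_not_noun", [[{"LEMMA": "res"}, {"POS": {"NOT_IN": ["NOUN"]}}]])

pair_hits = {
    (nlp.vocab.strings[match_id], start, end, doc[start:end].text)
    for match_id, start, end in matcher(doc)
}
for row in sorted(pair_hits):
    print(row)
```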
The Matcher allows for “fuzzy” matching based on Levenshtein distance (details in documentation, linked below)…
Match ID Start End Matched text
---------------- ------- ----- --------------
res_public_fuzzy 397 399 rei publicae
res_public_fuzzy 422 424 res publica
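A fuzzy pattern along these lines can be sketched as follows, assuming a spaCy version that supports the FUZZY operator (3.5 or later). The target string "publica" and the hand-built Doc are assumptions for illustration; the original pattern is not shown and may have used a different string or attribute.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: hand-assigned lemmas stand in for LatinCy annotations.
nlp = spacy.blank("xx")
words = ["rei", "publicae", "res", "publica", "res", "populi"]
lemmas = ["res", "publicus", "res", "publicus", "res", "populus"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
# Second token: lowercased form within a small edit distance of "publica".
# By default FUZZY allows at least two edits, so "publicae" (one edit) passes
# while "populi" (five edits away) does not.
matcher.add("res_public_fuzzy", [[{"LEMMA": "res"}, {"LOWER": {"FUZZY": "publica"}}]])

fuzzy_hits = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
for row in fuzzy_hits:
    print(row)
```

Variants such as FUZZY1 or FUZZY2 set the maximum edit distance explicitly when the default is too permissive.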
Of course, you could search for res publica more directly with a two-lemma pattern…
Match ID Start End Matched text
------------------ ------- ----- --------------
res_publica_lemmas 397 399 rei publicae
res_publica_lemmas 422 424 res publica
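A two-lemma pattern could be written as in this sketch. It assumes that the lemmatizer assigns “publicus” to publica and publicae; that lemma, like the rest of the hand-built Doc, is an illustration rather than verified LatinCy output.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: hand-assigned lemmas; "publicus" is a guess at LatinCy's lemma.
nlp = spacy.blank("xx")
words = ["rei", "publicae", "res", "publica", "res", "populi"]
lemmas = ["res", "publicus", "res", "publicus", "res", "populus"]
doc = Doc(nlp.vocab, words=words, lemmas=lemmas)

matcher = Matcher(nlp.vocab)
# Both tokens are constrained by lemma, so any inflected pairing is matched.
matcher.add("res_publica_lemmas", [[{"LEMMA": "res"}, {"LEMMA": "publicus"}]])

lemma_pairs = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
for row in lemma_pairs:
    print(row)
```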
Patterns are not limited to text search; they can be based entirely on annotations. Here is a list of all NOUN-ADJ sequences in Livy’s “Praefatio”…
Match ID Start End Matched text
---------- ------- ----- -------------------
noun_adjs 9 11 populi Romani
noun_adjs 39 41 rebus certius
noun_adjs 46 48 arte rudem
noun_adjs 127 129 origines proxima
noun_adjs 130 132 originibus minus
noun_adjs 206 208 urbem poeticis
noun_adjs 236 238 urbium augustiora
noun_adjs 259 261 populo Romano
noun_adjs 275 277 gentes humanae
noun_adjs 378 380 rerum salubre
noun_adjs 397 399 rei publicae
noun_adjs 405 407 inceptu foedum
noun_adjs 422 424 res publica
noun_adjs 430 432 exemplis ditior
noun_adjs 536 538 successus prosperos
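An annotation-only pattern like this needs no text constraint at all, as the following sketch shows. It uses the opening clause of the passage with POS tags assigned by hand; those tags are an assumption standing in for the LatinCy tagger, and `spacy.blank("xx")` stands in for the loaded pipeline.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: POS tags below are hand-assigned, not produced by LatinCy.
nlp = spacy.blank("xx")
words = ["Facturusne", "operae", "pretium", "sim", "si", "a",
         "primordio", "urbis", "res", "populi", "Romani", "perscripserim"]
pos = ["VERB", "NOUN", "NOUN", "AUX", "SCONJ", "ADP",
       "NOUN", "NOUN", "NOUN", "NOUN", "ADJ", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
# A NOUN immediately followed by an ADJ, whatever the words are.
matcher.add("noun_adjs", [[{"POS": "NOUN"}, {"POS": "ADJ"}]])

noun_adj_hits = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
for row in noun_adj_hits:
    print(row)
```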
As well as a list of all alliterative patterns, though with some creative regexing…
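One way such a search might be set up is sketched here; it is an assumption about the approach, not the original pattern. Because a Matcher pattern constrains each token independently, the sketch generates one two-token pattern per letter, each using REGEX to require that both tokens begin with that letter. The Doc is built by hand from the opening words of the passage.

```python
import string

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: a blank pipeline stands in for the loaded LatinCy `nlp`.
nlp = spacy.blank("xx")
words = ["Facturusne", "operae", "pretium", "sim", "si", "a", "primordio", "urbis",
         "res", "populi", "Romani", "perscripserim", "nec", "satis", "scio", "nec"]
doc = Doc(nlp.vocab, words=words)

matcher = Matcher(nlp.vocab)
patterns = []
for letter in string.ascii_lowercase:
    # Both tokens must start with the same letter (tested on the lowercased form).
    starts_with = {"LOWER": {"REGEX": f"^{letter}"}}
    patterns.append([starts_with, starts_with])
matcher.add("alliteration", patterns)

allit_hits = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
for row in allit_hits:
    print(row)
```

This only catches adjacent pairs; longer runs could be handled by adding an "OP": "+" quantifier to the second token and passing greedy="LONGEST" to add.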