Taking a glance at SpaCy for Natural Language Processing

Introduction

SpaCy is an open source library that I hadn't tried out before writing this blog post. Sometimes it is good to look at a tool with beginner's eyes; it gives a fair assessment of how easy it is to use.

I do creative writing in my spare time so when I saw that SpaCy does Natural Language Processing I thought it might be useful for analysing my work and getting some insights.

This post covers a few of the SpaCy features that I found useful.

How to install

SpaCy can be installed as a python library. Let's assume that we use poetry to install our dependencies:

poetry add spacy

Note: At the moment spacy requires python <3.13,>=3.9.
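
If you don't use poetry, a plain pip install works as well (a minimal sketch, assuming you are inside a virtual environment):

pip install spacy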

How to prime

I am going to start with a brief introduction to Natural Language Processing (NLP) because it's useful in this context. NLP is a subfield of Artificial Intelligence that involves analysing and understanding natural language. Its aim is to produce insights based on patterns in the text.

Some examples of insights that NLP can produce are word and phrase frequencies, sentence boundaries, parts of speech, and named entities.

NLP produces these insights by running models over the text, and there are different models for different languages. For this post I used the en_core_web_sm model, which is the default small English model at this point in time.

Download the model:

poetry run python -m spacy download en_core_web_sm

Note: If you use a system-wide python installation, or if you are in an activated virtual environment, you can just run: python -m spacy download en_core_web_sm

We should get output like this, which prompts us to do the next step:

 Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')

Let's load the model from the python REPL (the interactive python interpreter):

>>>
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>>

Done. Now we are ready to explore the spacy features.
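
Before moving on, it can be useful to check which processing components the loaded pipeline contains, because some features (like the sentence boundaries we will need later) depend on specific components being present. A minimal sketch, assuming the nlp object from above; the exact list depends on the model and spacy version:

>>> nlp.pipe_names  # names of the pipeline components, e.g. tagger, parser, ner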

How to start processing text

So what is the nlp object that we just constructed above? Let's explore it in the REPL:

>>> nlp
<spacy.lang.en.English object at 0x116a058b0>
>>>
>>> help(nlp)

A snippet of the output returned by the help command looks like this:

Help on English in module spacy.lang.en object:

class English(spacy.language.Language)
...
Methods inherited from spacy.language.Language:
...
 |  __call__(self, text: Union[str, spacy.tokens.doc.Doc], *, disable: Iterable[str] = [], component_cfg: Optional[Dict[str, Dict[str, Any]]] = None) -> spacy.tokens.doc.Doc
 |      Apply the pipeline to some text. The text can span multiple sentences,
 |      and can contain arbitrary whitespace. Alignment into the original string
 |      is preserved.
 |      ...
 |      RETURNS (Doc): A container for accessing the annotations.
 |
 |      DOCS: https://spacy.io/api/language#call

So at this point, even by skimming through the help provided on this object, we can make out that it is callable and that it accepts some text. So let's call it and see what we get:

>>> blog_doc = nlp("This is a post about the spacy library usage. Let's analyse the text in the post.")
>>>
>>> type(blog_doc)
<class 'spacy.tokens.doc.Doc'>

So we get a Doc object. What can we do with it?

>>> help(blog_doc)

The help output gives us further clues:

Help on Doc object:

class Doc(builtins.object)
 |  Doc(Vocab vocab, words=None, spaces=None, user_data=None, *, tags=None, pos=None, morphs=None, lemmas=None, heads=None, deps=None, sent_starts=None, ents=None)
 |  A sequence of Token objects. Access sentences and named entities, export
 |      annotations to numpy arrays, losslessly serialize to compressed binary
 |      strings. The `Doc` object holds an array of `TokenC` structs. The
 |      Python-level `Token` and `Span` objects are views of this array, i.e.
 |      they don't own the data themselves.
 |
 |      EXAMPLE:
 |          Construction 1
 |          >>> doc = nlp(u'Some text')
 |
 |          Construction 2
 |          >>> from spacy.tokens import Doc
 |          >>> doc = Doc(nlp.vocab, words=["hello", "world", "!"], spaces=[True, False, False])
 |
 |      DOCS: https://spacy.io/api/doc

Ok, so the Doc object contains a sequence of tokens. Let's explore that:

>>> [token.text for token in blog_doc]
['This', 'is', 'a', 'post', 'about', 'the', 'spacy', 'library', 'usage', '.', 'Let', "'s", 'analyse', 'the', 'text', 'in', 'the', 'post', '.']

So, just as easily as that, we can split the text into words and punctuation marks. What useful operations can we do on these tokens?

Let's say we want to:

  • count the frequency of the words, to see how much we are repeating ourselves
  • filter out the punctuation
  • count the number of sentences

If we explore a token object, we get more information about what we can do with it:

>>> dir(blog_doc[0])
[ ...'is_alpha', ...'is_digit', ... 'is_punct'...'is_stop'...]

There is a lot more functionality that tokens offer; I just picked out a few attributes. Let's count the word frequency:

>>> words = [
...     token.text for token in blog_doc
...     if not token.is_punct
... ]
>>>
>>> words
['This', 'is', 'a', 'post', 'about', 'the', 'spacy', 'library', 'usage', 'Let', "'s", 'analyse', 'the', 'text', 'in', 'the', 'post']
>>>
>>> from collections import Counter
>>> print(Counter(words).most_common(2))
[('the', 3), ('post', 2)]
>>>
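
Note that this count is case sensitive, so 'This' and 'this' would be counted as different words. A small variation of the sketch above, using the token.lower_ attribute (the lowercase form of the token text) to normalise the words before counting:

>>> words = [
...     token.lower_ for token in blog_doc
...     if not token.is_punct
... ]
>>> print(Counter(words).most_common(2))
[('the', 3), ('post', 2)]

For this short example the result happens to be the same, but on longer texts the difference can matter.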

When we do analysis on large texts we probably want to remove the words that are necessary for the sentences to make sense grammatically, but which are not significant in themselves. Those words are called "stop words" in NLP. A few examples in English would be: but, and, then, the.

>>> words = [
...     token.text for token in blog_doc
...     if not token.is_stop and not token.is_punct
... ]
>>> words
['post', 'spacy', 'library', 'usage', 'Let', 'analyse', 'text', 'post']
>>>
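
With the stop words and punctuation stripped out, repeating the frequency count gives a better picture of how much we are actually repeating ourselves. A small sketch reusing the filtered words list and the Counter import from above:

>>> print(Counter(words).most_common(2))
[('post', 2), ('spacy', 1)]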

Now let's look at sentences:

>>>
>>> sentences = list(blog_doc.sents)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "spacy/tokens/doc.pyx", line 926, in sents
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: `nlp.add_pipe('sentencizer')`. Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting `doc[i].is_sent_start`.

Ok, so this doesn't work out of the box but the instructions are pretty clear:

>>> nlp.add_pipe('sentencizer')
>>> blog_doc = nlp("This is a post about the spacy library usage. Let's analyse the text in the post.")
>>> sentences = blog_doc.sents
>>> [sentence for sentence in sentences]
[This is a post about the spacy library usage., Let's analyse the text in the post.]

Note: We need to call the nlp Language object on our text again so that the newly added sentencizer is applied.
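
One of the goals listed earlier was counting the sentences. Now that the sentence boundaries are set, this is a one-liner (blog_doc.sents is a generator, so we turn it into a list first):

>>> len(list(blog_doc.sents))
2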

Further thoughts

  • exploring a library step by step, by using the python REPL, is a very quick way to discover the library capabilities
  • SpaCy has a lot of dependencies, so you need to be aware of that if you want to package it and deploy it
  • SpaCy has a lot more functionality. Here are some of the things it can do: part-of-speech tagging, named entity recognition, dependency parsing, lemmatisation, and similarity comparisons with word vectors.
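
As a small taste of that extra functionality, here is a hedged sketch of named entity recognition, assuming the en_core_web_sm pipeline (which ships with an ner component) is loaded as before. The example sentence is made up, and the exact entities and labels you get back depend on the model version:

>>> doc = nlp("Jane moved from London to Berlin in 2020.")
>>> [(ent.text, ent.label_) for ent in doc.ents]  # entity spans with labels such as PERSON, GPE, DATE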