A quick tutorial on training NER models for the prose library
In this post, we’ll learn how to teach the prose library to recognize a completely new entity label called PRODUCT. This label will represent various brand names such as “Windows 10”.
To do this, we’ll be using an annotated data set produced by Prodigy, which is “an annotation tool powered by active learning.”
The first step is to convert Prodigy’s output into a format that prose can understand. After an annotation session, Prodigy produces a JSON Lines file containing annotations (in our case, 1,800 in total) in the following format:
The only keys we’re interested in are the text and its spans, which we need to populate the data structures required to train our model. More specifically, we need to turn our JSON Lines file into a slice of the structures that prose trains on.
Since Prodigy’s output and our expected input are so similar, this is fairly straightforward:
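A minimal sketch of such a conversion, using only the standard library, might look like the following. TrainingExample and LabeledSpan are hypothetical stand-ins for the structures prose expects, not the library's own types, and ReadJSONLines is a name chosen for this example. The sketch also keeps the record's answer value so that rejected suggestions can be told apart from accepted ones; treat that as an assumption about how the annotations are used.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// LabeledSpan and TrainingExample are hypothetical stand-ins for the
// training structures; they are NOT the prose library's own types.
type LabeledSpan struct {
	Start int    `json:"start"`
	End   int    `json:"end"`
	Label string `json:"label"`
}

type TrainingExample struct {
	Text   string
	Spans  []LabeledSpan
	Accept bool // true when the annotator accepted the suggested span
}

// prodigyRecord holds the keys we read from each line of the file.
type prodigyRecord struct {
	Text   string        `json:"text"`
	Spans  []LabeledSpan `json:"spans"`
	Answer string        `json:"answer"`
}

// ReadJSONLines decodes one record per non-empty line and converts it
// into a TrainingExample.
func ReadJSONLines(data string) ([]TrainingExample, error) {
	var examples []TrainingExample
	sc := bufio.NewScanner(strings.NewReader(data))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var rec prodigyRecord
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			return nil, fmt.Errorf("decoding record %d: %w", len(examples)+1, err)
		}
		examples = append(examples, TrainingExample{
			Text:   rec.Text,
			Spans:  rec.Spans,
			Accept: rec.Answer == "accept",
		})
	}
	return examples, sc.Err()
}

// sampleLines is made-up data: one accepted and one rejected suggestion.
const sampleLines = `{"text": "I just upgraded to Windows 10.", "spans": [{"start": 19, "end": 29, "label": "PRODUCT"}], "answer": "accept"}
{"text": "The windows were open all day.", "spans": [{"start": 4, "end": 11, "label": "PRODUCT"}], "answer": "reject"}
`

func main() {
	examples, err := ReadJSONLines(sampleLines)
	if err != nil {
		panic(err)
	}
	for _, ex := range examples {
		fmt.Printf("accept=%t spans=%d text=%q\n", ex.Accept, len(ex.Spans), ex.Text)
	}
}
```

Because the field names line up almost one-to-one, the conversion is little more than a decode loop; for a real file, the string argument could be replaced by the file's contents.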
Training the Model
Now that we have our data ready to go, all we need to do is train and test a model.
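The evaluation side of this step can be sketched independently of the library. In the example below, the model is abstracted as a plain predict function, so no prose calls are shown or assumed; Example, Split, and Accuracy are hypothetical names, the 80/20 split ratio is arbitrary, and the toy predictor is just a substring check standing in for a trained model.

```go
package main

import (
	"fmt"
	"strings"
)

// Example is a hypothetical labeled example: a text plus whether the
// annotator confirmed that it contains a PRODUCT entity.
type Example struct {
	Text       string
	HasProduct bool
}

// Split puts the first trainFrac of the data in the training set and
// the remainder in the test set.
func Split(data []Example, trainFrac float64) (train, test []Example) {
	cut := int(float64(len(data)) * trainFrac)
	return data[:cut], data[cut:]
}

// Accuracy returns the fraction of test examples for which predict
// agrees with the annotation.
func Accuracy(test []Example, predict func(string) bool) float64 {
	if len(test) == 0 {
		return 0
	}
	correct := 0
	for _, ex := range test {
		if predict(ex.Text) == ex.HasProduct {
			correct++
		}
	}
	return float64(correct) / float64(len(test))
}

func main() {
	// Made-up data for illustration only.
	data := []Example{
		{"I just upgraded to Windows 10.", true},
		{"The windows were open all day.", false},
		{"Windows 10 keeps restarting.", true},
		{"She cleaned the kitchen.", false},
		{"Is Windows 10 worth it?", true},
	}
	_, test := Split(data, 0.8)

	// Stand-in for a trained model: a naive substring check.
	predict := func(text string) bool {
		return strings.Contains(text, "Windows 10")
	}
	fmt.Printf("Correct (%%): %f\n", Accuracy(test, predict))
}
```

Holding out part of the annotations this way keeps the reported score honest: the model is only judged on examples it did not see during training.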
Here’s the result of running the full script (which can be downloaded here):
$ time go run model.go
Correct (%): 0.822222
75.24s user 0.54s system 58.845 total
In approximately 60 minutes (counting the time you’ll need to spend annotating data with Prodigy), we trained and saved a new NER model that achieved an accuracy of 82.2% on our testing data.
We can now use this model by loading it from disk:
As you can see, prose correctly labeled Windows 10 with the newly trained PRODUCT label.
While this is an exciting step for the library, we see it as merely the beginning of the kind of NLP functionality we’d like to bring to Go. If you’d like to get involved, head over to the GitHub repository (stars are also highly appreciated!).