Methods for creating training data for SpaCy models?

Issue

I recently began an NLP journey using SpaCy, and I have ~5,500 strings that I want to label. For the first 100, I did this using a spreadsheet with custom columns, which was then run through a script to generate Python dictionaries. In the sheet, I have stored the string, label type, and label value. The script then works out the position of the label value within the string.
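
For illustration, here is a minimal sketch of what such a script might look like. It is not the asker's actual code: the function name `rows_to_training_data` and the row layout `(text, label_type, label_value)` are assumptions, and the output mimics the `(text, {"entities": [(start, end, label)]})` tuples used in spaCy's v2-era training examples.

```python
def rows_to_training_data(rows):
    """Turn (text, label_type, label_value) rows into spaCy-style examples.

    Hypothetical helper: the row layout is an assumption about the spreadsheet.
    """
    entities_by_text = {}
    for text, label_type, label_value in rows:
        # str.find returns the FIRST match only, so a value that occurs
        # more than once in the string will always map to its first position.
        start = text.find(label_value)
        if start == -1:
            raise ValueError(f"{label_value!r} not found in {text!r}")
        end = start + len(label_value)
        entities_by_text.setdefault(text, []).append((start, end, label_type))
    # One example per distinct string, with entities sorted by start offset.
    return [
        (text, {"entities": sorted(spans)})
        for text, spans in entities_by_text.items()
    ]


rows = [
    ("Apple is opening an office in London", "ORG", "Apple"),
    ("Apple is opening an office in London", "GPE", "London"),
]
train_data = rows_to_training_data(rows)
```

The first-match lookup shows one way this approach can go wrong: if the label value appears twice in the string, the script cannot tell which occurrence was meant.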

It’s rather time-consuming to produce training data in this way, and it’s open to error.

Are there any tools available to assist with this? I literally just need the ability to highlight a substring, and then choose the label type. I could build it myself, but I feel it may already exist.

Solution

I’m one of the maintainers of spaCy and we’ve actually been thinking about this problem a lot! So we’ve built Prodigy, an annotation tool that integrates with spaCy and puts the model in the loop to help you train and evaluate models faster. It’s currently in beta, but you can sign up for a free invite.

Prodigy takes a slightly different approach to the click-drag-highlight-select concept of other annotation tools. It uses the model in the loop to suggest annotations with the most relevant gradient for training, and only asks you for simple binary feedback: accept or reject. This lets you move through examples quickly. As you annotate, the model in the loop is updated, and its predictions will influence what Prodigy asks next.

This works especially well if you’re looking to improve existing entity types present in your spaCy model, or if you’re working with a large corpus of example text you want to use for annotation.

If you’re looking for a tool more specifically for highlighting and annotating spans of text, you should also check out Brat. I’m not 100% sure what the output looks like, but you should definitely be able to convert it to spaCy’s training format. There’s also a trainable version of the displaCy ENT visualizer, developed by someone from the community.
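
As a rough sketch of that conversion: Brat stores annotations in a standoff `.ann` file, where each text-bound annotation is a tab-separated line such as `T1<TAB>ORG 0 5<TAB>Apple`. The helper below (the name `brat_to_spacy` is made up for this example) reads those lines and produces a `(text, {"entities": [(start, end, label)]})` tuple, assuming that is the training format you are targeting. It skips relations, events, and discontinuous spans, so treat it as a starting point, not a complete converter.

```python
def brat_to_spacy(text, ann):
    """Convert Brat standoff annotations to one spaCy-style training example.

    Hypothetical helper: `text` is the content of the .txt file and `ann`
    is the content of the matching .ann file.
    """
    entities = []
    for line in ann.splitlines():
        # Only text-bound annotations (IDs starting with "T") carry spans;
        # relations (R), events (E), attributes (A) and notes (#) are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        type_and_offsets = parts[1]
        # Discontinuous spans look like "ORG 0 5;10 15"; skipped in this sketch.
        if ";" in type_and_offsets:
            continue
        label, start, end = type_and_offsets.split(" ")
        entities.append((int(start), int(end), label))
    return (text, {"entities": sorted(entities)})


text = "Apple is opening an office in London"
ann = "T1\tORG 0 5\tApple\nT2\tGPE 30 36\tLondon\nR1\tLocated Arg1:T1 Arg2:T2"
example = brat_to_spacy(text, ann)
```

Brat's offsets are character positions with an exclusive end, which matches the entity offsets spaCy expects, so no adjustment is needed in the simple case.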

Answered By – Ines Montani

Answer Checked By – Clifford M. (AngularFixing Volunteer)
