Named Entity Recognition: Splitting data into test and train sets


When fitting a named entity recognition model, is it important to make sure that the entities in your training data do not repeat in your testing data? For example, suppose we have a relatively small data set and the goal is to identify person names. Say we have 300 unique person names, but we would like to generalize our extraction to future data that may contain person names not among those 300. When we split the data into training and testing sets, is it important to make sure that none of the 300 unique names appears in both the training set and the testing set?


It is important that your test set contains entities that are not in the training set, so you can check that your model is generalizing. Usually, though, you should have enough data and enough distinct values that a random split gives you a decent mix without explicitly checking for it.
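With a small data set like the one in the question, it may still be worth measuring how many test entities the model has never seen. The following is a minimal sketch, not part of the original answer. It assumes each example is a (text, list of person names) pair; the helper names `random_split` and `unseen_entity_rate` and the toy data are hypothetical and only for illustration.

```python
import random


def random_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def unseen_entity_rate(train, test):
    """Fraction of distinct test entities that never occur in training."""
    train_names = {name for _, names in train for name in names}
    test_names = {name for _, names in test for name in names}
    if not test_names:
        return 0.0
    return len(test_names - train_names) / len(test_names)


# Toy data (an assumption for illustration): 100 sentences, 30 distinct names.
examples = [
    (f"Sentence mentioning person_{i % 30}", [f"person_{i % 30}"])
    for i in range(100)
]

train, test = random_split(examples)
rate = unseen_entity_rate(train, test)
print(f"{rate:.0%} of distinct test names are unseen in training")
```

If the rate is close to zero, the test score mostly reflects memorized names. In that case one option is to report a separate score on the unseen names only, or to hold out a group of names entirely.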

Answered By – polm23

Answer Checked By – Katrina (AngularFixing Volunteer)
