Understand the key metrics of review analysis and how they are calculated

The NER API extracts meaningful keywords (entities), such as people, organizations, locations, and other specific entity types, from natural-language text.

Performance Metrics

NER uses 3 metrics to measure model performance.

1. Precision

Precision measures how many of the model's predictions for an entity are correct. It is the ratio of correct predictions to all predictions: TP / (TP + FP).

2. Recall

Recall measures the model's ability to find the entities of a category. It is the ratio of correct predictions to the number of true samples of that category: TP / (TP + FN).

3. F1 Score

Precision and recall show different aspects of our model. In some situations, we may want a model with excellent precision and not care much about recall, or vice versa. A single metric is handy when we want a model with good precision and good recall at the same time. The F1 score, defined below, is one of the most widely used metrics for that purpose:

F1 Score = 2 * Precision * Recall / (Precision + Recall)
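
To illustrate why a single combined metric is useful, the short sketch below applies the formula above to a model with high precision but low recall. The function name `f1_score` and the input values are illustrative assumptions, not part of the NER API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: very precise model that misses most entities.
precision, recall = 0.9, 0.1

f1 = f1_score(precision, recall)
arithmetic_mean = (precision + recall) / 2

# F1 stays close to the weaker of the two values,
# while a plain average would hide the poor recall.
print(f"F1 = {f1:.2f}, arithmetic mean = {arithmetic_mean:.2f}")
```

Because F1 is a harmonic mean, it is high only when both precision and recall are high.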

In NER, accuracy is not used as a performance indicator, so it is displayed as N/A. There are two reasons for this:

  1. Data Imbalance: In most NER datasets, the number of non-entity tokens (e.g., general words) is significantly higher than entity tokens. As a result, a model can achieve high accuracy by simply predicting that all tokens are non-entities. Therefore, accuracy does not properly reflect the model's performance due to data imbalance.

  2. Importance of Classification Errors: In NER, it is crucial to accurately classify the types of entities. For example, misclassifying a person's name as an organization name is a significant error. However, accuracy only evaluates whether the predictions match the correct answers, without considering the types of entities. This may lead to overlooking important classification errors.
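
The data-imbalance point can be seen in a small sketch. The token labels below are made up for illustration (95 non-entity tokens marked "O" and 5 entity tokens), and the "model" simply predicts non-entity for every token.

```python
# Hypothetical token-level labels; "O" marks a non-entity token.
gold = ["O"] * 95 + ["PERSON"] * 3 + ["ORG"] * 2
predicted = ["O"] * 100  # a model that never predicts any entity

# Accuracy: share of tokens whose predicted label matches the gold label.
correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)

# Recall over entity tokens only: share of gold entity tokens that were found.
entity_tokens = sum(g != "O" for g in gold)
found = sum(g != "O" and g == p for g, p in zip(gold, predicted))
entity_recall = found / entity_tokens

print(f"accuracy = {accuracy:.0%}, entity recall = {entity_recall:.0%}")
```

The accuracy is high even though the model finds no entities at all, which is why precision, recall, and F1 are reported instead.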

How to Calculate Metric

When NER training finishes successfully, it triggers the evaluation step on the test set selected by the user. As an example, we chose two phrases for the model evaluation.

The Metrics Details page below [Image 2] shows the performance of our model for each entity.

We clipped only two entities, both of which can be found in the training set. As you can see on the dashboard, phrase A142 has 6 TECHNOLOGY entities and phrase A128 has 3 COUNTRY entities. These numbers are also shown on the Metrics Details page in the Sample column.

On the Inference tab page, you can check the inference result of our model.

We can build the confusion matrix of the TECHNOLOGY entity using only phrase A142 [Image 4 below], because the TECHNOLOGY entity occurs only in phrase A142, both among the true entities and in the inference result. The inference result of A142 contains three TECHNOLOGY entities, and all of them are correct. However, three TECHNOLOGY entities are missing from the result. Therefore, there are three true positives, three false negatives, and no false positives.
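
As a rough sketch of how such counts could be derived, the code below compares a set of true entities with a set of predicted entities. The (start, end, label) spans are hypothetical placeholders, since the actual annotations of A142 are not reproduced here, and exact span matching is an assumption rather than the documented matching rule.

```python
# Hypothetical (start, end, label) spans standing in for phrase A142.
gold = {
    (0, 5, "TECHNOLOGY"), (10, 18, "TECHNOLOGY"), (22, 30, "TECHNOLOGY"),
    (35, 41, "TECHNOLOGY"), (50, 58, "TECHNOLOGY"), (63, 70, "TECHNOLOGY"),
}
predicted = {
    (0, 5, "TECHNOLOGY"), (10, 18, "TECHNOLOGY"), (22, 30, "TECHNOLOGY"),
}


def confusion_counts(gold, predicted, label):
    """Return (TP, FP, FN) for one entity label, assuming exact span matches."""
    gold_label = {span for span in gold if span[2] == label}
    pred_label = {span for span in predicted if span[2] == label}
    tp = len(gold_label & pred_label)   # predicted and correct
    fp = len(pred_label - gold_label)   # predicted but not a true entity
    fn = len(gold_label - pred_label)   # true entity that was missed
    return tp, fp, fn


tp, fp, fn = confusion_counts(gold, predicted, "TECHNOLOGY")
print(f"TP={tp}, FP={fp}, FN={fn}")
```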

First, we can calculate the performance metrics for each entity as below.

We can calculate precision, recall, and F1-score from the above confusion matrix and the equations from the previous sections.

  • Precision = TP / (TP + FP) = 3 / (3 + 0) * 100 = 100%

  • Recall = TP / (TP + FN) = 3 / (3 + 3) * 100 = 50%

  • F1-score = 2 * Precision * Recall / (Precision + Recall) = 2 * 100 * 50 / (100 + 50) = 66.7%

In the same way, we can build the confusion matrix of the COUNTRY entity. The inference result contains one COUNTRY entity, and it is correct. However, the result misses two COUNTRY entities. This means that the precision of the COUNTRY entity is 100%, but the recall is only 33.3%.
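
The entity-level arithmetic for both entities can be checked with a short sketch. The helper name `entity_metrics` is an assumption made for illustration; the counts are the ones described in the text (TECHNOLOGY: 3 correct and 3 missed; COUNTRY: 1 correct and 2 missed, with no incorrect predictions for either).

```python
def entity_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 (in percent) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision * 100, "recall": recall * 100, "f1": f1 * 100}


technology = entity_metrics(tp=3, fp=0, fn=3)
country = entity_metrics(tp=1, fp=0, fn=2)

for name, m in (("TECHNOLOGY", technology), ("COUNTRY", country)):
    print(f"{name}: precision={m['precision']:.1f}% "
          f"recall={m['recall']:.1f}% f1={m['f1']:.1f}%")
```

For COUNTRY, the same F1 formula gives 50%.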

Using these entity-level metrics, we can calculate the model-level metrics as the weighted average of each metric, where the weight of an entity is its support (the number of true samples of that entity). For example, the recall of the model is

Recall = (Recall_TECHNOLOGY * Support_TECHNOLOGY + Recall_COUNTRY * Support_COUNTRY) / Support_total = (50 * 6 + 33.3 * 3) / (6 + 3) = 44.4%

You can easily get the precision and F1 score of the model in the same way.
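
A minimal sketch of the weighted average for all three metrics follows. The dictionary layout and the function name `weighted_average` are illustrative assumptions; the COUNTRY F1 of 50.0 is not stated above but follows from applying the F1 formula to 100% precision and 33.3% recall.

```python
# Entity-level metrics in percent, with the number of true samples as support.
per_entity = {
    "TECHNOLOGY": {"precision": 100.0, "recall": 50.0, "f1": 66.7, "support": 6},
    "COUNTRY": {"precision": 100.0, "recall": 33.3, "f1": 50.0, "support": 3},
}


def weighted_average(per_entity: dict, metric: str) -> float:
    """Average one metric over entities, weighting each entity by its support."""
    total_support = sum(m["support"] for m in per_entity.values())
    weighted_sum = sum(m[metric] * m["support"] for m in per_entity.values())
    return weighted_sum / total_support


for metric in ("precision", "recall", "f1"):
    print(f"{metric}: {weighted_average(per_entity, metric):.1f}%")
```

Entities with more samples contribute more to the model-level value, so the model recall sits closer to the TECHNOLOGY recall than to the COUNTRY recall.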

Entity-level Evaluation

For the results of recall and precision metrics, the following interpretations are possible.
