Review Analysis

Understand the key metrics of review analysis and how they are calculated

The Review Analysis API identifies the sentiment (label) of a given phrase for each category (type).

Sentiment labels can be either numeric or qualitative.

Example 1) -2, -1, 0, 1, 2
Example 2) Negative, Neutral, Positive

Performance Metrics

Review Analysis uses 4 metrics to measure model performance.

  • Precision

  • Recall

  • F1-score

  • Accuracy

Basic Concepts

To measure the model performance of Review Analysis, we compare the model's predicted results with the ground truth via a confusion matrix.

1. Ground Truth

Ground truth is the reference point against which model predictions are compared or assessed. It refers to the correct, verified labels or outcomes that a model aims to predict.

You can set one of three labels, POSITIVE, NEUTRAL, or NEGATIVE, for each TYPE category as below.

If you don't specify a label for a type, the label is filled in as NONE, though the NONE label does not appear in the performance metric table.
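
As a rough sketch (the TYPE names and labels below are made up for illustration and are not taken from the actual example data), the ground truth for a single review can be thought of as a mapping from each TYPE category to its label, with unspecified types filled in as NONE:

```python
# Hypothetical TYPE categories and the labels annotated for one review.
# Any type without an annotation is filled in with "NONE".
TYPES = ["Type 1", "Type 2", "Type 3"]

annotated = {"Type 1": "POSITIVE", "Type 3": "NEGATIVE"}

ground_truth = {t: annotated.get(t, "NONE") for t in TYPES}
print(ground_truth)
# {'Type 1': 'POSITIVE', 'Type 2': 'NONE', 'Type 3': 'NEGATIVE'}
```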

Let's assume that the trained model returned the following inference results for the test sample data.

2. Predicted Result

The predicted result refers to the inference output produced when the test data is processed by the trained model.

From the tables of ground truth and predicted result, we can make a confusion matrix for each category.

3. Confusion Matrix

For example, the confusion matrix table for Type 3 is as follows.

Using the confusion matrices, we can easily calculate the traditional performance metrics shown in [Table 4]: precision, recall, F1-score, and accuracy.
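
The tables themselves are shown as images above, but the Type 3 confusion matrix can be reconstructed from the numbers used in the calculations below. This is a sketch, assuming rows are predicted labels and columns are ground-truth labels, which matches the row and column sums quoted later on this page:

```python
import numpy as np

LABELS = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NONE"]

# Rows: predicted label, columns: ground-truth label.
# Values reconstructed from the precision/recall examples on this page.
confusion = np.array([
    [2, 0, 2, 0],   # predicted POSITIVE
    [0, 0, 0, 0],   # predicted NEUTRAL
    [0, 0, 0, 0],   # predicted NEGATIVE
    [0, 0, 0, 3],   # predicted NONE
])

support = confusion.sum(axis=0)  # ground-truth samples per label (column sums)
print(dict(zip(LABELS, support.tolist())))
# {'POSITIVE': 2, 'NEUTRAL': 0, 'NEGATIVE': 2, 'NONE': 3}
```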

How to Calculate Metric

We will cover how to calculate the metric values on the 'Model History' page and in 'Metrics Details' one by one.

First, we can check the metrics for each TYPE category as below. These metrics can be viewed by pressing the 'Details' button for each model on the 'Model History' page.

1. Precision

Precision is a metric that measures how accurate the model's predictions are. It is calculated as the ratio of correct predictions for a label to all predictions of that label, i.e. the sum of a single row. In this example, the precision of the POSITIVE label is

100 * TRUE POSITIVE / POSITIVE PREDICTION = 100 * 2 / (2 + 0 + 2 + 0) = 50%

You can confirm this value if you choose the POSITIVE label from the drop-down menu at the top left corner of the 'Metrics Details' window.

The precision for NEUTRAL is 0 because its support (the number of ground-truth samples) is 0. The precision for NEGATIVE is also zero, because the count of correct NEGATIVE predictions is zero. However, the precision of NONE is 100%.

Using the values, we calculate a weighted average of precision for all labels:

(50 * 2 + 0 * 0 + 0 * 2 + 100 * 3) / (2 + 0 + 2 + 3) = 57.1%
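
As a quick sanity check, the same per-label precisions and the 57.1% weighted average can be reproduced from the reconstructed Type 3 confusion matrix introduced above (rows assumed to be predictions, columns ground truth):

```python
import numpy as np

LABELS = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NONE"]
# Confusion matrix reconstructed from this page's examples (rows = predictions).
confusion = np.array([
    [2, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
])

row_sums = confusion.sum(axis=1)   # predictions per label
support = confusion.sum(axis=0)    # ground-truth samples per label
correct = np.diag(confusion)       # correct predictions per label

# Precision per label (%), taking 0 when the label was never predicted.
precision = np.where(row_sums > 0, 100 * correct / np.maximum(row_sums, 1), 0.0)
print(dict(zip(LABELS, precision.round(1).tolist())))
# {'POSITIVE': 50.0, 'NEUTRAL': 0.0, 'NEGATIVE': 0.0, 'NONE': 100.0}

# Support-weighted average over all labels.
weighted = (precision * support).sum() / support.sum()
print(round(float(weighted), 1))   # 57.1
```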

Second, we can check the overall metrics for each model. The metric values on the 'Model History' page are weighted averages over categories. As you can see, all categories have the same support, so we can take a simple numeric average of them.

For example, the overall precision of the model in this example is

(57.1 + 52.4 + 42.9) / 3 = 50.8%
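
In code, this is just a plain average of the per-category weighted precisions (57.1, 52.4, and 42.9 are the values read from each category's 'Details' view in this example):

```python
# Weighted precision per TYPE category, as shown in the 'Details' view.
per_category_precision = [57.1, 52.4, 42.9]

overall_precision = sum(per_category_precision) / len(per_category_precision)
print(round(overall_precision, 1))  # 50.8
```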

2. Recall

Recall measures the ability of a model to find all samples of a given label. We can calculate this value as the ratio of the number of correct predictions for that label to the number of ground-truth samples of that label, i.e. the sum of one column. In this example, the recall of the POSITIVE label is

100 * TRUE POSITIVE / POSITIVE LABELS = 100 * 2 / (2 + 0 + 0 + 0) = 100%

The recall for NEUTRAL is 0 because there is no support for this label. The recall for NEGATIVE is 0, too, but the reason is not the same: the model failed to find any NEGATIVE label. On the other hand, the recall for NONE is 100%.

Using these results, the weighted average of recall for all labels is

(100 * 2 + 0 * 0 + 0 * 2 + 100 * 3) / (2 + 0 + 2 + 3) = 71.4%
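
The per-label recalls and the 71.4% weighted average follow from the columns of the same reconstructed confusion matrix:

```python
import numpy as np

LABELS = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NONE"]
# Confusion matrix reconstructed from this page's examples (rows = predictions).
confusion = np.array([
    [2, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
])

support = confusion.sum(axis=0)  # ground-truth samples per label (column sums)
correct = np.diag(confusion)     # correct predictions per label

# Recall per label (%), taking 0 when the label has no ground-truth samples.
recall = np.where(support > 0, 100 * correct / np.maximum(support, 1), 0.0)
print(dict(zip(LABELS, recall.round(1).tolist())))
# {'POSITIVE': 100.0, 'NEUTRAL': 0.0, 'NEGATIVE': 0.0, 'NONE': 100.0}

# Support-weighted average over all labels.
weighted = (recall * support).sum() / support.sum()
print(round(float(weighted), 1))  # 71.4
```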

3. F1 Score

Precision and recall represent different aspects of a model. In some situations we may care more about precision than recall, or vice versa. If you want a model with both good precision and good recall, a single combined metric is useful. The F1 score, defined as follows, is one of the most popular metrics for this purpose.

F1 = 2 * precision * recall / (precision + recall)

In this example, the F1-score of the POSITIVE label is

2 * 50 * 100 / (50 + 100) = 66.7%

The F1 scores for NEUTRAL and NEGATIVE are 0, but the F1 score for NONE is 100%.

Using these results, the weighted average of the F1 score for all labels is

(66.7 * 2 + 0 * 0 + 0 * 2 + 100 * 3) / (2 + 0 + 2 + 3) = 61.9%
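
Combining the per-label precision and recall values computed above reproduces these F1 figures:

```python
LABELS = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NONE"]
precision = [50.0, 0.0, 0.0, 100.0]   # per-label precision (%), from the section above
recall    = [100.0, 0.0, 0.0, 100.0]  # per-label recall (%), from the section above
support   = [2, 0, 2, 3]              # ground-truth samples per label

# F1 per label, taking 0 when precision + recall is 0.
f1 = [2 * p * r / (p + r) if (p + r) > 0 else 0.0 for p, r in zip(precision, recall)]
print([round(x, 1) for x in f1])      # [66.7, 0.0, 0.0, 100.0]

# Support-weighted average over all labels.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
print(round(weighted_f1, 1))          # 61.9
```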

4. Accuracy

We can calculate accuracy, which is the ratio of the number of correct predictions to the number of all samples. If you select a specific label from the menu in the top left corner, you will notice that no accuracy value appears; accuracy is not defined for a single label. However, if you select 'ALL', you can see the accuracy value.

In this example, the accuracy value for the Type 3 category is

100 * (2 + 0 + 0 + 3) / 7 = 71.4%
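
Accuracy is simply the diagonal (correct predictions) of the reconstructed confusion matrix divided by the total number of samples:

```python
import numpy as np

# Confusion matrix reconstructed from this page's examples (rows = predictions).
confusion = np.array([
    [2, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 3],
])

# Sum of the diagonal (correct predictions) over all samples.
accuracy = 100 * np.trace(confusion) / confusion.sum()
print(round(float(accuracy), 1))  # 71.4
```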

Inference Basis and Confidence Level

There is a "View details" icon located to the right of the review sentence column in the data area of the dashboard and inference pages.

When you click on this icon, a layer window pops up in which you can check the highlighted phrases that the model extracted as evidence for its inference in each category, as well as the confidence value of the inference.

Confidence is a measure of how reliable the model's prediction is. It is a value between 0 and 1, shown to two decimal places; the higher the value, the more confident the inference can be considered.

Interpreting the category-based inference evidence and confidence for the example sentence above, the score for the "fit" category in the second row was calculated as 1 based on the following phrase in the sentence:

  • Phrase serving as evidence: "I'm happy that the fit is good."

  • Confidence: 0.84 (84%)

Note that if the number of test data samples for the TYPE category is relatively small, the confidence value for the category-based inference may be slightly lower.
