feat: add confusion matrix with precision, recall, and F1 score #14318
Sagargupta16 wants to merge 2 commits into TheAlgorithms:master from
Conversation
Add classification evaluation metrics:
- confusion_matrix: binary and multiclass support
- precision: TP / (TP + FP)
- recall (sensitivity): TP / (TP + FN)
- f1_score: harmonic mean of precision and recall

All functions include doctests.
for more information, see https://pre-commit.ci
Pull request overview
Adds core classification evaluation utilities to machine_learning/, complementing the existing regression-focused metrics by providing confusion-matrix-based scoring.
Changes:
- Introduces a `confusion_matrix()` implementation supporting binary and multiclass labels.
- Adds binary/one-vs-rest `precision()`, `recall()`, and `f1_score()` metrics (via `positive_label`).
- Includes doctest examples and a `__main__` doctest runner for the new module.
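Based on the fragments quoted in this review, the core of the new `confusion_matrix()` likely looks roughly like the following sketch (the exact signature and helper names in the PR may differ):

```python
import numpy as np


def confusion_matrix(actual: list, predicted: list) -> np.ndarray:
    # Collect every label seen in either list, in a stable sorted order
    labels = sorted(set(actual) | set(predicted))
    class_to_index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    matrix = np.zeros((n, n), dtype=int)
    # Rows are actual classes, columns are predicted classes
    for a, p in zip(actual, predicted):
        matrix[class_to_index[a]][class_to_index[p]] += 1
    return matrix
```

For binary labels `[0, 1]` this yields the familiar 2x2 layout with true negatives at `[0][0]` and true positives at `[1][1]`.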
```python
matrix = np.zeros((n, n), dtype=int)
for a, p in zip(actual, predicted):
    matrix[class_to_index[a]][class_to_index[p]] += 1
```
zip(actual, predicted) will silently drop extra items when the input lists have different lengths, producing an incorrect confusion matrix without any error. Add an explicit length check up-front (and raise ValueError) so mismatched inputs fail fast (similar to other ML metric functions in this repo).
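A minimal validation helper along the lines this comment suggests (the name `validate_lengths` is hypothetical, not from the PR; each metric function would call it first):

```python
def validate_lengths(actual: list, predicted: list) -> None:
    # Fail fast on mismatched inputs instead of letting zip() silently
    # truncate the longer list and produce an incorrect metric.
    if len(actual) != len(predicted):
        raise ValueError(
            f"Length mismatch: {len(actual)} actual vs "
            f"{len(predicted)} predicted labels"
        )
```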
```python
tp = sum(
    1
    for a, p in zip(actual, predicted)
    if a == positive_label and p == positive_label
)
fp = sum(
    1
    for a, p in zip(actual, predicted)
    if a != positive_label and p == positive_label
)
return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```
precision() iterates with zip(actual, predicted), so if the two inputs differ in length the computation is silently truncated. Consider validating equal lengths (and raising ValueError) before computing TP/FP so callers can’t get an incorrect metric without noticing.
```python
tp = sum(
    1
    for a, p in zip(actual, predicted)
    if a == positive_label and p == positive_label
)
fn = sum(
    1
    for a, p in zip(actual, predicted)
    if a == positive_label and p != positive_label
)
return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```
recall() has the same silent-truncation issue as precision() due to zip(actual, predicted). Add an explicit length check (raise ValueError) before computing TP/FN.
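For context, a self-contained sketch of how `f1_score` combines these TP/FP/FN counts as the harmonic mean of precision and recall. This mirrors the formulas stated in the PR description, not necessarily the PR's exact implementation:

```python
def f1_score(actual: list, predicted: list, positive_label=1) -> float:
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive_label == p)
    fp = sum(
        1
        for a, p in zip(actual, predicted)
        if a != positive_label and p == positive_label
    )
    fn = sum(
        1
        for a, p in zip(actual, predicted)
        if a == positive_label and p != positive_label
    )
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # Harmonic mean of precision and recall; 0.0 when both are zero
    return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
```

Note this sketch has the same silent-truncation issue flagged above; a real fix would validate lengths before counting.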
```python
    return matrix


def precision(actual: list, predicted: list, positive_label: int = 1) -> float:
```
The positive_label: int = 1 type hint is overly restrictive: class labels are often strings or other hashable types, and precision/recall/f1_score work as long as positive_label is comparable to items in actual/predicted. Consider loosening the annotation (e.g., a TypeVar/Hashable) to avoid misleading API contracts and type checker errors.
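One way to loosen the annotation as this comment suggests, using `collections.abc.Hashable` (a sketch, not the PR's actual code):

```python
from collections.abc import Hashable


def precision(actual: list, predicted: list, positive_label: Hashable = 1) -> float:
    # positive_label may be any hashable label (int, str, ...), not only int
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive_label == p)
    fp = sum(
        1
        for a, p in zip(actual, predicted)
        if a != positive_label and p == positive_label
    )
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```

A `TypeVar` bound to `Hashable` would work as well if the checker should tie the label type to the element type of `actual`/`predicted`.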
```python
def precision(actual: list, predicted: list, positive_label: int = 1) -> float:
    """
    Calculate precision: TP / (TP + FP).

    Args:
        actual: List of actual class labels.
        predicted: List of predicted class labels.
        positive_label: The label considered as positive class.
```
precision, recall, and f1_score are implemented as binary (or one-vs-rest via positive_label) metrics, but the docstrings don’t state this and could be interpreted as multiclass-averaged metrics. Clarify the behavior in the docstrings (and optionally add a doctest example showing one-vs-rest usage for a multiclass label set).
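A one-vs-rest usage example of the kind this comment asks for, assuming a `recall` implementation matching the snippet above (multiclass string labels, with one label treated as the positive class and all others as negative):

```python
def recall(actual: list, predicted: list, positive_label=1) -> float:
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive_label == p)
    fn = sum(
        1
        for a, p in zip(actual, predicted)
        if a == positive_label and p != positive_label
    )
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# One-vs-rest: treat "b" as positive, every other label as negative.
# "b" appears twice in actual; one is predicted correctly -> recall 0.5
print(recall(["a", "b", "c", "b"], ["a", "b", "b", "c"], positive_label="b"))
```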
@copilot open a new pull request to apply changes based on the comments in this thread
Describe your change:
Added classification evaluation metrics to `machine_learning/`:
- `confusion_matrix`: Binary and multiclass support
- `precision`: TP / (TP + FP)
- `recall`: TP / (TP + FN)
- `f1_score`: Harmonic mean of precision and recall

The existing `scoring_functions.py` only has regression metrics (MAE, MSE, RMSE). Classification metrics were missing.

Checklist: