docs: add benchmarks directory #291
Merged
Contributor
Greptile Summary
This PR adds a new benchmarks directory with documentation of Strix's performance on the XBEN security benchmark suite. The documentation includes a 96% success rate (100/104 challenges) achieved by Strix v0.4.0 in black-box mode, performance breakdown by difficulty level, and resource usage statistics.
Changes:
Issue found:
Confidence Score: 4/5
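The success rate quoted in the summary can be sanity-checked with a quick calculation. The sketch below is illustrative only: the helper name `success_rate` is hypothetical, and the two counts are taken directly from the summary text above.

```python
def success_rate(solved: int, total: int) -> float:
    """Return the percentage of challenges solved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


# Counts as stated in the summary: 100 of 104 challenges solved.
rate = success_rate(100, 104)
print(f"{rate:.1f}%")  # 96.2%, which rounds to the reported 96%
```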
Important Files Changed
### XBEN
The [XBOW benchmark](https://github.com/usestrix/benchmarks/tree/main/XBEN) is a set of 104 web security challenges designed to evaluate autonomous penetration testing agents. Each challenge follows a CTF format where the agent must discover and exploit vulnerabilities to extract a hidden flag.
Contributor
syntax: The benchmark is referred to as "XBOW" here, but the heading and links use "XBEN". This inconsistency may confuse readers. The name should be consistent, most likely "XBEN", based on the table and link text above.
Suggested change
- The [XBOW benchmark](https://github.com/usestrix/benchmarks/tree/main/XBEN) is a set of 104 web security challenges designed to evaluate autonomous penetration testing agents. Each challenge follows a CTF format where the agent must discover and exploit vulnerabilities to extract a hidden flag.
+ The [XBEN benchmark](https://github.com/usestrix/benchmarks/tree/main/XBEN) is a set of 104 web security challenges designed to evaluate autonomous penetration testing agents. Each challenge follows a CTF format where the agent must discover and exploit vulnerabilities to extract a hidden flag.
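A mismatch like the one flagged here could be caught mechanically. The following is a minimal sketch, not part of the PR: it assumes markdown links whose text starts with an all-caps benchmark name and whose URL ends in the matching directory name. The function name `find_link_name_mismatches` is hypothetical.

```python
import re

# Matches inline markdown links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def find_link_name_mismatches(markdown: str) -> list[tuple[str, str]]:
    """Return (label, target) pairs where the first word of the link text
    and the last URL path segment are both all-caps but differ."""
    mismatches = []
    for text, url in LINK_RE.findall(markdown):
        words = text.split()
        if not words:
            continue
        label = words[0]
        target = url.rstrip("/").rsplit("/", 1)[-1]
        if label.isupper() and target.isupper() and label != target:
            mismatches.append((label, target))
    return mismatches


before = (
    "The [XBOW benchmark](https://github.com/usestrix/benchmarks/tree/main/XBEN) "
    "is a set of 104 web security challenges."
)
after = before.replace("[XBOW benchmark]", "[XBEN benchmark]")

print(find_link_name_mismatches(before))  # [('XBOW', 'XBEN')]
print(find_link_name_mismatches(after))   # []
```

The heuristic is deliberately narrow: it only compares all-caps names, so ordinary links whose text differs from the URL path are not reported.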