litmus — System Prompt Tester

https://sharaj.pages.dev/

Extension · Developer Tools

Overview

Test, analyze, and version system prompts against the LLM you ship on. BYOK, local-first, no backend.

litmus turns "does this prompt work?" from a gut feeling into a measured result, for plain prompts, tool calls, and multi-step agents alike. Paste a system prompt, pick the model you actually ship on, and choose what you're testing. Everything runs locally in your browser with your own API keys. There is no litmus backend, no account, and no tracking.

── TWO WAYS TO TEST ──

1) OUTPUT QUALITY: litmus analyzes your prompt, auto-writes a rigorous LLM-as-judge rubric per quality dimension, generates typical/edge/adversarial test cases, runs them on your target model, and scores each output. Then it proposes ranked fixes and can auto-apply them for the next pass.

2) TOOL & AGENT BEHAVIOR: define your tools (JSON schema) and litmus checks, deterministically (no LLM judge), that the model calls the right tool with valid arguments and avoids the ones it shouldn't. For agents, define a goal plus mock tools with scripted results (inject a failure to test recovery); litmus runs the model in a multi-step loop and scores the trajectory across goal completion, tool selection, argument validity, recovery, and efficiency. This mode skips the rubric steps: pick it on the first screen and go straight to your tests.

── WHAT YOU GET ──

• Auto-generated rubrics and test cases, including tool tests proposed from your catalog.
• Deterministic tool/agent checks that don't drift run-to-run.
• Variance built in: run each case N times to see the spread (mean ± range), so a noisy score is visible, not hidden.
• Speed measured live (time-to-first-byte, tokens/sec) for quality runs.
• Versioning: every run is saved; reload any version, compare by dimension, export as Markdown or JSON.
• Works with OpenAI, Anthropic, and Google targets.

── PRIVACY & CONTROL ──

• Bring your own key (BYOK). Keys are stored only in your browser.
• Local-first: no litmus servers. Your data goes only to the provider you choose, to run the test. Tools in agent runs are mocked; nothing real is executed.
• No analytics, no ads, no account.
• A spend cap you set blocks runs that would cost more than you want.

── GOOD FOR ──

Prompt engineers and AI app developers who want to quickly verify a prompt, tool, or agent before shipping, without standing up a cloud eval platform. Pick a judge model different from your target to reduce self-preference bias and get more trustworthy quality scores.
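
To make the deterministic tool check and the "mean ± range" spread more concrete, here is a minimal sketch in Python. It is not litmus's actual code. The tool catalog, the `check_tool_call` and `summarize_runs` functions, and the shape of the model's tool call (a dict with `name` and `arguments`) are all assumptions made for illustration, and the type check stands in for full JSON schema validation.

```python
from statistics import mean

# Hypothetical tool catalog (illustrative only): tool name -> simplified spec.
# A real catalog would hold full JSON schemas; Python types stand in for them here.
TOOLS = {
    "get_weather": {
        "required": ["city"],
        "properties": {"city": str, "units": str},
    },
}


def check_tool_call(call, expected_tool, forbidden=()):
    """Return a list of failure reasons; an empty list means the call passes."""
    failures = []
    name = call.get("name")
    args = call.get("arguments", {})

    if name in forbidden:
        failures.append(f"called forbidden tool: {name}")
    if name != expected_tool:
        failures.append(f"expected {expected_tool}, got {name}")

    spec = TOOLS.get(name)
    if spec is None:
        failures.append(f"unknown tool: {name}")
        return failures

    # Every required argument must be present.
    for key in spec["required"]:
        if key not in args:
            failures.append(f"missing required argument: {key}")

    # Every supplied argument must be declared and have the declared type.
    for key, value in args.items():
        expected_type = spec["properties"].get(key)
        if expected_type is None:
            failures.append(f"unexpected argument: {key}")
        elif not isinstance(value, expected_type):
            failures.append(f"wrong type for argument: {key}")
    return failures


def summarize_runs(scores):
    """Mean and range (min, max) over N repeated runs of one test case."""
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}
```

Because the check compares only names, required keys, and types, the same model output always produces the same verdict, which is the property the listing means by checks that don't drift run-to-run. The summary keeps the minimum and maximum next to the mean so a noisy case stays visible.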

Details

Version: 1.4
Updated: July 13, 2026
Size: 121 KiB
Languages: English (United States)
Developer: Sharaj Rewoo, Hinjawadi Phase 1 Rd Pune, Pimpri-Chinchwad, Maharashtra 411057 IN
Email: srewoo@gmail.com
Non-trader
This developer has not identified itself as a trader. For consumers in the European Union, please note that consumer rights do not apply to contracts between you and this developer.

Privacy

The developer has disclosed that it will not collect or use your data. To learn more, see the developer’s privacy policy.

This developer declares that your data is

Not being sold to third parties, outside of the approved use cases
Not being used or transferred for purposes that are unrelated to the item's core functionality
Not being used or transferred to determine creditworthiness or for lending purposes
