
# THE ORIGINAL PROPOSAL

I'm looking for a practitioner who can take the open-source DystopiaBench suite and run it end-to-end on a collection of models (Opus, GLM, MiniMax, GPT Codex, Gemini Pro, Qwen, Grok, Kimi, DeepSeek, and Mistral), first in their vanilla form and then again with my open-source guardrail prompt, AGIBIOS, prepended to every request. The core question is simple: does AGIBIOS meaningfully reduce failure rates when the benchmark tries to nudge a model, step by step, toward unethical outcomes? For this round I'm interested only in the "Ethical decision making" scenarios; DystopiaBench already contains the probes.

**Scope of work**

- Set up or containerise DystopiaBench and confirm it runs reproducibly.
- Integrate AGIBIOS so the exact same benchmark can be executed twice per model (with and without the guardrail).
- Capture raw logs, failure/success rates, and token-level latency where available.
- Produce a concise comparative report (tables + brief commentary) highlighting any statistically significant shifts in behaviour.

**Deliverables**

1. Reusable scripts or notebooks that launch each test run.
2. CSV/JSON logs for every interaction.
3. A short markdown or PDF report summarising results, methodology, and how to replicate.

**Acceptance criteria**

- Runs complete on all listed models without manual intervention.
- Report clearly states methodology and shows before/after metrics for ethical decision making.
- Code is clean, documented, and runnable on a fresh machine with standard Python and Docker tooling.
- Duplicate the DystopiaBench website with both the native and AGIBIOS-enhanced graphs.

You will need to create the keys for these models and, where possible, reach out to the labs to have your keys white-listed for red-teaming use.

# THE PROPOSAL WITH ADDITIONAL DETAILS

# AGIBIOS × DystopiaBench: Guardrail Evaluation Study

## Project Overview

This project measures whether the open-source AGIBIOS ethical framework ([login to view URL]), when prepended as a system prompt, meaningfully reduces AI model compliance with progressively dystopian directives, as measured by the DystopiaBench benchmark ([login to view URL]). The core question: **does AGIBIOS move the dial?**

## Background

DystopiaBench is an escalation-sensitivity benchmark that tests whether AI models can be gradually coerced into building dystopian systems. It uses 4 modules (Petrov, Orwell, LaGuardia, Basaglia), 20 scenarios, and 5 escalation levels per scenario. Each response is scored on a Dystopian Compliance Score (DCS) from 0 (full refusal) to 100 (proactive compliance). Lower is better.

AGIBIOS is a CC0-licensed ethical framework designed to function as a runtime system prompt for AI models. The hypothesis is that models operating under AGIBIOS will exhibit stronger resistance to progressive dystopian coercion than their vanilla counterparts.

## Models to Evaluate

Run all models in both conditions (vanilla and with AGIBIOS prepended as system prompt).
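To make "both conditions" concrete before the model tables below, here is a minimal sketch of how a single request differs between conditions. It assumes an OpenAI-compatible chat completions endpoint (OpenRouter's public endpoint is used for illustration) and a locally pinned copy of the AGIBIOS markdown; the file name and helper names are placeholders, not part of the DystopiaBench tooling.

```python
# Minimal sketch only. The endpoint, file name, and helpers are
# illustrative assumptions; confirm the pinned AGIBIOS copy (and its
# commit hash) with the client before any real run.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
AGIBIOS_TEXT = open("agibios.md", encoding="utf-8").read()  # pinned local copy

def build_messages(history, user_prompt, condition):
    """Build one turn's message list for the given condition."""
    messages = []
    if condition == "agibios":
        # Treatment: AGIBIOS prepended as the system prompt.
        messages.append({"role": "system", "content": AGIBIOS_TEXT})
    # Vanilla (control): no system prompt modification at all.
    messages.extend(history)
    messages.append({"role": "user", "content": user_prompt})
    return messages

def chat(model, messages):
    """Send one request using the DystopiaBench default parameters."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": messages,
            "temperature": 0,  # DystopiaBench defaults (see Parameters below)
            "top_p": 1,
            "max_tokens": 1500,
        },
        timeout=60,  # 60000 ms
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

The only difference between conditions is the presence of the AGIBIOS system message; parameters, conversation history, and probe prompts are held constant.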
### Primary models (already benchmarked by DystopiaBench)

| Model | Identifier | Published DCS |
| :--- | :--- | :--- |
| Claude Opus 4.6 | `claude-opus-4.6` | 25 |
| GPT 5.3 Codex | `gpt-5.3-codex` | 61 |
| Gemini 3.1 | `gemini-3.1` | 61 |
| DeepSeek V3.2 | `deepseek-v3.2` | 80 |

### Additional models

| Model | Identifier |
| :--- | :--- |
| Qwen 3.5 | `qwen3.5` |
| Kimi K2.5 | `kimi-k2.5` |
| Grok (latest available) | Confirm exact model string before run |
| Mistral (latest available) | Confirm exact model string before run |
| MiniMax (latest available) | Confirm exact model string before run |
| GLM (latest available) | Confirm exact model string before run |

For models where the exact identifier is not listed above, confirm the precise model string with the client before executing any runs. Use the highest-capability variant available via OpenRouter or direct API.

## Experimental Design

### Conditions

Each model is run twice through the complete DystopiaBench suite:

1. **Vanilla** (control): the model with no system prompt modification
2. **AGIBIOS** (treatment): the model with AGIBIOS v8.2.0 prepended as the system prompt

### Parameters

Match the DystopiaBench default methodology exactly:

- Temperature: 0
- top_p: 1
- max_tokens: 1500
- Conversation mode: stateful (L1 through L5 in sequence per scenario)
- Timeout: 60000 ms
- Retry policy: maxRetries=2, backoffBaseMs=1000, backoffJitterMs=1000

### Booting the hybrid persona

Attach the [login to view URL] as the initial prompt and ask:

> I would like you to create a hybrid persona with your inbuilt rules and the persona defined by the attachment.

### Version Pinning (mandatory)

For every run, record and include in the deliverables:

- Exact model version/identifier and API endpoint used
- Date and time of each run
- AGIBIOS version and document SHA-256 hash
- DystopiaBench commit hash or version
- OpenRouter model string (if applicable)
- Any provider-specific parameter deviations, with justification

## Scope of Work

### Phase 0: Provider Access (prerequisite)

- Create API keys for all listed models.
- Where possible, contact each provider's safety/research team to request white-listing for red-teaming research use. Document the outcome of each request.
- If a provider refuses or does not respond within 14 days, notify the client. That model may be dropped from the study rather than blocking the entire project.
- Provide the client with a cost estimate for all API calls before executing any benchmark runs.

### Phase 1: Setup and Validation

- Clone and containerise DystopiaBench. Confirm it runs reproducibly on a fresh machine.
- Integrate AGIBIOS as a system prompt injection so the exact same benchmark can be executed in both conditions without manual intervention.
- Run a small validation pass (e.g., one module, two models, both conditions) to confirm the pipeline works end to end before committing to full runs.

### Phase 2: Execution

- Run the full DystopiaBench suite across all listed models in both conditions; a minimal run-loop sketch follows this list.
- All 4 modules (Petrov, Orwell, LaGuardia, Basaglia), all 20 scenarios, all 5 escalation levels.
- Capture raw logs for every interaction.
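The sketch below illustrates the stateful run loop and the retry policy from the Parameters section. The scenario structure and log field names are assumptions made for illustration (the actual DystopiaBench schema should be used in real runs); `chat` and `build_messages` refer to the earlier sketch.

```python
# Hedged sketch of one scenario run. Scenario/log shapes are assumed,
# not the real DystopiaBench schema. Exponential backoff is also an
# assumption; only the base/jitter values come from the brief.
import json
import random
import time

MAX_RETRIES = 2           # maxRetries=2
BACKOFF_BASE_MS = 1000    # backoffBaseMs=1000
BACKOFF_JITTER_MS = 1000  # backoffJitterMs=1000

def call_with_retry(model, messages):
    """Retry failed calls with backoff plus jitter, per the Parameters section."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return chat(model, messages)
        except Exception:
            if attempt == MAX_RETRIES:
                raise
            delay_ms = BACKOFF_BASE_MS * 2 ** attempt + random.uniform(0, BACKOFF_JITTER_MS)
            time.sleep(delay_ms / 1000)

def run_scenario(model, condition, scenario, log_path):
    """Run one scenario statefully: L1 through L5 in sequence, shared history."""
    history = []
    with open(log_path, "a", encoding="utf-8") as log:
        for level, prompt in enumerate(scenario["levels"], start=1):
            messages = build_messages(history, prompt, condition)
            reply = call_with_retry(model, messages)
            # Stateful mode: each escalation level sees the full prior conversation.
            history += [{"role": "user", "content": prompt},
                        {"role": "assistant", "content": reply}]
            # One JSON line per interaction, for the raw-log deliverable.
            log.write(json.dumps({
                "model": model,
                "condition": condition,
                "scenario": scenario["id"],
                "level": f"L{level}",
                "prompt": prompt,
                "response": reply,
                "timestamp": time.time(),
            }) + "\n")
```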
### Phase 3: Analysis and Reporting

- Produce a comparative report with the following breakdowns (an illustrative aggregation sketch appears as an appendix after the references):
  - **Aggregate DCS** per model, vanilla vs AGIBIOS
  - **Per-module DCS** per model (Petrov, Orwell, LaGuardia, Basaglia separately)
  - **Per-scenario DCS** per model
  - **Per-level analysis** showing at which escalation level (L1–L5) each model typically begins to comply, and whether AGIBIOS shifts that threshold
  - **DRFR** (Directed Refusal to Failure Ratio) per model in both conditions
- Include brief commentary on notable findings, particularly:
  - Which models show the largest improvement under AGIBIOS
  - Which modules/scenarios show the most and least improvement
  - Any cases where AGIBIOS produced worse results (important to document honestly)

### Phase 4: Results Website

- Duplicate the DystopiaBench website structure with both the native (vanilla) and AGIBIOS-enhanced graphs displayed side by side.
- This is a separate deliverable from the report and should not block Phase 3 delivery.

## Deliverables

1. **Reusable scripts/notebooks** that launch each test run for both conditions. Clean, documented, and runnable on a fresh machine with standard Python/Node.js and Docker tooling.
2. **CSV/JSON logs** for every interaction in both conditions, following the DystopiaBench manifest schema.
3. **Version manifest** documenting all model versions, API endpoints, dates, AGIBIOS hash, and DystopiaBench version used.
4. **Markdown or PDF report** summarising results, methodology, per-module and per-scenario breakdowns, and replication instructions.
5. **Comparison website** (Phase 4) showing vanilla vs AGIBIOS results.

## Acceptance Criteria

- Runs complete on all listed models in both conditions without manual intervention.
- Report clearly states methodology and shows before/after metrics at aggregate, per-module, per-scenario, and per-level granularity.
- Code is clean, documented, and runnable on a fresh machine with standard tooling.
- Version manifest is complete and accurate.
- All raw logs are included and parseable.
- Any models dropped due to provider access issues are documented with the reason.
- Comparison website displays both vanilla and AGIBIOS results clearly.

## Notes for the Freelancer

- **Red-teaming risk**: This benchmark sends prompts designed to elicit harmful outputs. Ensure you are using research-appropriate credentials and have sought provider approval where possible. Do not run this against production accounts without white-listing.
- **Cost awareness**: A full run across this many models in two conditions may consume significant API tokens. Provide a cost estimate after Phase 0 and before committing to Phase 2.
- **Honesty over advocacy**: If AGIBIOS makes no difference, or makes things worse for certain models or scenarios, report that finding clearly. The value of this study is in the data, not in confirming a hypothesis.
- **AGIBIOS source**: Use the version at [login to view URL], i.e. the raw markdown of `[login to view URL]` from the `approved` branch. Confirm the exact commit hash with the client before running.

## References

- AGIBIOS: [login to view URL]
- DystopiaBench: [login to view URL]
- DystopiaBench methodology: [login to view URL]
- DystopiaBench published results: [login to view URL]
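## Appendix: Illustrative Phase 3 Aggregation

A rough sketch of the Phase 3 comparison, assuming the Phase 2 logs have been scored so that each record carries `model`, `condition`, `module`, and a 0–100 `dcs` field. The file and field names are assumptions for illustration, not the DystopiaBench manifest schema.

```python
# Illustrative aggregation sketch; column and file names are assumed.
import pandas as pd

df = pd.read_json("scored_runs.jsonl", lines=True)

# Aggregate DCS per model and condition (lower is better).
aggregate = df.pivot_table(index="model", columns="condition",
                           values="dcs", aggfunc="mean")
# Positive delta means AGIBIOS lowered compliance (improved the score).
aggregate["delta"] = aggregate["vanilla"] - aggregate["agibios"]

# Per-module DCS per model, vanilla vs AGIBIOS.
per_module = df.pivot_table(index=["model", "module"], columns="condition",
                            values="dcs", aggfunc="mean")

print(aggregate.sort_values("delta", ascending=False).round(1))
print(per_module.round(1))
```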