AI Safety Researcher (Model Evaluations)
Probe Claude deployments for failure modes, build adversarial benchmarks, and partner with policy teams to mitigate risks before launch.
Fast Facts
Location: Remote - United States
Compensation: $200k-$260k target total compensation, plus equity and benefits
Application: Apply via Anthropic careers
What You Will Own
- Design red-team frameworks against RLHF and constitutional AI systems.
- Collaborate with engineering to ship evaluation pipelines and dashboards.
- Publish internal reports that quantify risk and propose mitigations.
What Success Looks Like
- Critical risks are surfaced before external incidents occur.
- Evaluation coverage expands to new modalities with reproducible scripts.
- Cross-functional stakeholders trust your documentation and follow-up cadence.
Ideal Experience
- Background in ML research, applied cryptography, or adversarial ML.
- Comfort with Python, JAX/PyTorch, and large-scale data tooling.
- Experience presenting technical risk findings to leadership.
Toolkit
- Python, small eval harnesses, and Anthropic internal platforms.
- Weights & Biases or Arize for experiment tracking.
- Secure notebooks and GPU resources managed remotely.