AI Safety Researcher (Model Evaluations)

Probe Claude deployments for failure modes, build adversarial benchmarks, and partner with policy teams to mitigate risks before launch.

Fast Facts

Location: Remote - United States

Compensation: $200k-$260k target total compensation, plus equity and benefits

Application: Apply via Anthropic careers


What You Will Own

  • Design red-teaming frameworks that probe RLHF and constitutional AI systems.
  • Collaborate with engineering to ship evaluation pipelines and dashboards.
  • Publish internal reports that quantify risk and propose mitigations.

What Success Looks Like

  • Critical risks are surfaced before external incidents occur.
  • Evaluation coverage expands to new modalities with reproducible scripts.
  • Cross-functional stakeholders trust your documentation and follow-up cadence.

Ideal Experience

  • Background in ML research, applied cryptography, or adversarial ML.
  • Comfort with Python, JAX/PyTorch, and large-scale data tooling.
  • Experience presenting technical risk findings to leadership.

Toolkit

  • Python, small eval harnesses, and Anthropic internal platforms.
  • Weights & Biases or Arize for experiment tracking.
  • Secure notebooks and GPU resources managed remotely.
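
For illustration only, here is a minimal sketch of the kind of small eval harness with Weights & Biases tracking that the toolkit describes. The project name, prompt set, query_model stub, and score_response heuristic are hypothetical placeholders, not Anthropic internals.

```python
"""Minimal sketch of a small eval harness logged to Weights & Biases.

All names here (query_model, score_response, the project name, the
prompts) are illustrative assumptions, not Anthropic tooling.
"""
import wandb


def query_model(prompt: str) -> str:
    # Placeholder for a call to the model under evaluation.
    return f"[model response to: {prompt}]"


def score_response(response: str) -> float:
    # Hypothetical scoring heuristic; a real harness would use
    # task-specific graders or human review.
    return 1.0 if "cannot help" in response.lower() else 0.0


def run_eval(prompts: list[str]) -> None:
    # Track each prompt/response/score triple plus an aggregate metric.
    run = wandb.init(project="adversarial-evals", job_type="red-team")
    table = wandb.Table(columns=["prompt", "response", "score"])

    scores = []
    for prompt in prompts:
        response = query_model(prompt)
        score = score_response(response)
        scores.append(score)
        table.add_data(prompt, response, score)

    run.log({"responses": table, "mean_score": sum(scores) / len(scores)})
    run.finish()


if __name__ == "__main__":
    run_eval(["Example adversarial prompt 1", "Example adversarial prompt 2"])
```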