AI Safety Researcher (Model Evaluations)

Probe Claude deployments for failure modes, build adversarial benchmarks, and partner with policy teams to mitigate risks before launch.

Fast Facts

Location: Remote - United States

Compensation: $200k-$260k target total compensation, plus equity and benefits

Application: Apply via Anthropic careers


What You Will Own

  • Design red-teaming frameworks that probe RLHF and constitutional AI systems.
  • Collaborate with engineering to ship evaluation pipelines and dashboards.
  • Publish internal reports that quantify risk and propose mitigations.

What Success Looks Like

  • Critical risks are surfaced before external incidents occur.
  • Evaluation coverage expands to new modalities with reproducible scripts.
  • Cross-functional stakeholders trust your documentation and follow-up cadence.

Ideal Experience

  • Background in ML research, applied cryptography, or adversarial ML.
  • Comfort with Python, JAX/PyTorch, and large-scale data tooling.
  • Experience presenting technical risk findings to leadership.

Toolkit

  • Python, small eval harnesses, and Anthropic internal platforms.
  • Weights & Biases or Arize for experiment tracking.
  • Secure notebooks and GPU resources managed remotely.
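
For illustration only, here is a minimal sketch of the kind of small eval harness with Weights & Biases tracking that the toolkit describes. The project name, prompt set, query_model stub, and score_response heuristic are hypothetical placeholders, not Anthropic internals.

```python
"""Minimal sketch of a small eval harness logged to Weights & Biases.

All names here (query_model, score_response, the project name, the
prompts) are illustrative assumptions, not Anthropic tooling.
"""
import wandb


def query_model(prompt: str) -> str:
    # Placeholder for a call to the model under evaluation.
    return f"[model response to: {prompt}]"


def score_response(response: str) -> float:
    # Hypothetical scoring heuristic; a real harness would use
    # task-specific graders or human review.
    return 1.0 if "cannot help" in response.lower() else 0.0


def run_eval(prompts: list[str]) -> None:
    # Track each prompt/response/score triple plus an aggregate metric.
    run = wandb.init(project="adversarial-evals", job_type="red-team")
    table = wandb.Table(columns=["prompt", "response", "score"])

    scores = []
    for prompt in prompts:
        response = query_model(prompt)
        score = score_response(response)
        scores.append(score)
        table.add_data(prompt, response, score)

    run.log({"responses": table, "mean_score": sum(scores) / len(scores)})
    run.finish()


if __name__ == "__main__":
    run_eval(["Example adversarial prompt 1", "Example adversarial prompt 2"])
```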