Research Papers
What I'm reading
Foundational
Attention Is All You Need
Vaswani et al., 2017 — Introduced the transformer architecture, which replaces recurrence with self-attention and underlies most modern NLP models and LLMs.
Adversarial ML
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou et al., 2023 — Demonstrates an automated method for generating adversarial suffixes that jailbreak aligned LLMs and transfer across models.
AI Safety
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Hubinger et al., 2024 — Shows that backdoor behaviors can persist through standard safety fine-tuning methods.