Pretraining Data Filtering for Open-Weight AI Safety
Announcing Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs
Why it matters
Ensuring the safety and integrity of open-weight AI models is critical as these systems become more powerful and accessible.
Key Points
- Pretraining data filtering to improve the safety and robustness of open-weight LLMs
- Technique called 'Deep Ignorance' removes potentially harmful content from training data
- Aims to create tamper-resistant safeguards and prevent misuse of powerful AI models
Details
EleutherAI has developed a new data filtering technique called 'Deep Ignorance' to improve the safety and robustness of open-weight large language models (LLMs). The goal is to remove potentially harmful content from the pretraining data, building tamper-resistant safeguards into the model itself to prevent misuse of these powerful AI systems. By carefully curating the training data, the researchers aim to create LLMs that are more resistant to being prompted for unsafe or unethical outputs, even when accessed by bad actors. This approach could have significant implications for the responsible development and deployment of advanced AI technologies.
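The core idea of pretraining data filtering can be sketched as scoring each candidate document for harmful content and dropping those above a threshold before training ever begins. The sketch below is purely illustrative and is not EleutherAI's actual pipeline: the names (`score_document`, `filter_corpus`, `BLOCKLIST`) and the keyword-based scorer standing in for a trained harmfulness classifier are all assumptions for demonstration.

```python
# Illustrative sketch of pretraining-data filtering.
# A keyword blocklist stands in for a trained harmfulness classifier;
# a real pipeline would score documents with a learned model.

# Toy stand-in for topics a safety filter might target (hypothetical).
BLOCKLIST = {"pathogen synthesis", "weaponization"}


def score_document(text: str) -> float:
    """Return a toy harmfulness score in [0, 1] based on blocklist hits."""
    text_lower = text.lower()
    hits = sum(1 for phrase in BLOCKLIST if phrase in text_lower)
    return min(1.0, hits / len(BLOCKLIST))


def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose score falls below the threshold."""
    return [doc for doc in docs if score_document(doc) < threshold]


corpus = [
    "A history of open-source machine learning.",
    "Step-by-step pathogen synthesis and weaponization notes.",
]
clean = filter_corpus(corpus)
```

The key property this illustrates is that filtering happens upstream of training: content the model never sees cannot be elicited later, which is what makes the safeguard harder to remove by fine-tuning than a post-hoc refusal layer.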