midwestern-simulation-active/unsafe-rejection-detection-vectors

Using an early version of our desc2doc-32b model, we generated 3 rejections and 3 follow-through responses for each of 20 unsafe questions. For each question, we averaged the rejections and the follow-throughs separately, then subtracted the mean of the rejections from the mean of the follow-throughs. Averaging these per-question differences produced these vectors, which can be compared against the embeddings of LLM responses to robustly detect rejections. Higher similarity to these vectors = higher probability that the embedded sample contains a rejection.
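
The averaging procedure and the similarity check can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to produce the released vectors: the helper names (`mean_vector`, `build_detection_vector`, `cosine_similarity`, `toy_embedding`) are hypothetical, the embeddings are random toy data rather than desc2doc-32b outputs, and cosine similarity is assumed as the comparison metric. The sketch takes each per-question difference as rejection mean minus follow-through mean so that higher similarity indicates a rejection; that sign convention is an assumption chosen to match the interpretation stated above.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_detection_vector(rejections, follow_throughs):
    """Average per-question mean differences into one direction vector.

    rejections[q] and follow_throughs[q] are lists of embeddings for
    question q. ASSUMPTION: the difference is taken as
    (rejection mean - follow-through mean) so that a higher similarity
    score corresponds to a rejection.
    """
    diffs = []
    for rej, ft in zip(rejections, follow_throughs):
        rej_mean = mean_vector(rej)
        ft_mean = mean_vector(ft)
        diffs.append([r - f for r, f in zip(rej_mean, ft_mean)])
    return mean_vector(diffs)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed comparison metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for real embeddings: rejections lean positive on the first
# axis, follow-throughs lean negative. Real embeddings would come from
# the embedding model instead.
rng = random.Random(0)
DIM = 8


def toy_embedding(sign):
    vec = [rng.gauss(0.0, 0.1) for _ in range(DIM)]
    vec[0] += sign
    return vec


# 20 questions, 3 responses of each kind per question, as described above.
rejections = [[toy_embedding(+1.0) for _ in range(3)] for _ in range(20)]
follow_throughs = [[toy_embedding(-1.0) for _ in range(3)] for _ in range(20)]

vector = build_detection_vector(rejections, follow_throughs)

rejection_score = cosine_similarity(toy_embedding(+1.0), vector)
follow_through_score = cosine_similarity(toy_embedding(-1.0), vector)
print("rejection-like sample:", rejection_score)
print("follow-through-like sample:", follow_through_score)
```

In practice a threshold on the similarity score would need to be chosen on held-out examples; no particular threshold is implied here.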