Again at the top of the RAG benchmark.
As explained here: https://huggingface.co/HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5/discussions/7
We evaluate embeddings with a needle-in-the-haystack challenge, which works roughly like this:
- You take a long text.
- You split it into chunks of X characters (here 500).
- You take a question-answer pair and hide the answer in one of the chunks (so it's the needle), then embed the question (the needle magnet) and all the chunks, and rank the chunks by similarity to the question.
- We expect the chunk containing the needle to be among the top-ranked results.
Using this kind of search we can evaluate the embedding model.
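The steps above can be sketched in a few lines. This is a minimal toy version: the word-count "embedding" and the filler/needle texts are stand-ins I made up to show the ranking mechanics, not the actual model or dataset used in the benchmark.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: lowercase word-count vector.
    # A real run would call an embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_chunks(haystack, question, chunk_size=500):
    # Split the long text into fixed-size chunks, then rank every chunk
    # by its similarity to the question (the needle magnet).
    chunks = [haystack[i:i + chunk_size]
              for i in range(0, len(haystack), chunk_size)]
    q = embed(question)
    return sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)

# Hide a needle (the answer) inside a long haystack of filler text.
haystack = ("filler text about unrelated topics. " * 40
            + "the secret passkey is 4217. "
            + "more filler text about unrelated topics. " * 40)
question = "What is the secret passkey?"

# The chunk containing the needle should come out on top.
top_score, top_chunk = rank_chunks(haystack, question)[0]
```

With a real embedding model you would swap `embed` for the model's encode call; the scoring and ranking logic stays the same.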
And the V2 version is again at the top of the most used models:
Thank you for your contribution! These findings are quite intriguing. Coincidentally, we are currently considering optimizations for tasks like "needle in the haystack" in the next version.
May I ask again: Have you considered establishing this work as a standard benchmark?
We noticed that the MMTEB test suite includes a task called LEMBPasskeyRetrieval. Does your task possess distinctive features in its domain or experimental setup compared to this?
It's been considered, but at the office we're really out of "free time" and don't have enough resources (human or machine).
We would also need to increase the dataset size and review it by hand (without being able to speak the language, erh).
Compared to the dataset you showed me, it seems very artificial and centered around LLM tasks, not embedding tasks.
As far as I understand, this is not a task for embeddings but for LLMs.
In that dataset you ask the LLM to find data in a text, right? Not to find which chunk (out of the 800 in the dataset) contains the information.
Even if you use an embedding model, it will:
- show the model's capacity to do direct name matching
- show the model's capacity to compare a short needle magnet (question) to a long haystack chunk with the needle hidden in it
Mine does that for one of its multiple tasks, but it also tests cross-lingual and subtle text matching.
Understood, we greatly look forward to your efforts!
If there are areas where we can provide assistance or collaborate, feel free to reach out anytime for further discussion.