Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
m-ric 
posted an update Sep 12, 2024
Post
795
𝗘𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗻𝗴 𝘆𝗼𝘂𝗿 𝗛𝗧𝗠𝗟 𝘄𝗲𝗯𝗽𝗮𝗴𝗲𝘀 𝘁𝗼 𝗺𝗮𝗿𝗸𝗱𝗼𝘄𝗻 𝗶𝘀 𝗻𝗼𝘄 𝗽𝗼𝘀𝘀𝗶𝗯𝗹𝗲 𝗲𝗻𝗱-𝘁𝗼-𝗲𝗻𝗱 𝘄𝗶𝘁𝗵 𝗮 𝘀𝗶𝗺𝗽𝗹𝗲 𝗟𝗟𝗠! 👏

Jina just released Reader-LM, that handles the whole pipeline of extracting markdown from HTML webpages.

A while ago, Jina had released a completely code-based deterministic program to do this extraction, based on some heuristics : e.g., “if the text is in a <p> tag, keep it, but if it’s hidden behind another, remove it”.

🤔 But they received complaints from readers: some found it too detailed, other not enough, depending on the pages.

➡️ So they decided, 𝗺𝗮𝘆𝗯𝗲 𝗵𝗲𝘂𝗿𝗶𝘀𝘁𝗶𝗰𝘀 𝘄𝗲𝗿𝗲 𝗻𝗼𝘁 𝗲𝗻𝗼𝘂𝗴𝗵: 𝗶𝗻𝘀𝘁𝗲𝗮𝗱, 𝘁𝗵𝗲𝘆 𝘁𝗿𝗶𝗲𝗱 𝘁𝗼 𝘁𝗿𝗮𝗶𝗻 𝗮 𝗟𝗟𝗠 𝘁𝗼 𝗱𝗼 𝘁𝗵𝗲 𝗰𝗼𝗺𝗽𝗹𝗲𝘁𝗲 𝗲𝘅𝘁𝗿𝗮𝗰𝘁𝗶𝗼𝗻. This LLM does not need to be very strong,but it should handle a very long context: it’s a challenging, “shallow-but-wide” architecture.

𝗧𝗲𝗰𝗵𝗻𝗶𝗰𝗮𝗹 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
2️⃣ models: Reader-LM-0.5B and 1.5B
⚙️ Two stages of training: first, short and simple HTML to get the basics, then ramp up to longer and harder HTML up to 128k tokens
🔎 Use contrastive search for decoding: this empirically reduces “repeating output” issues
➡️ Their models beat much larger models at HTML extraction 🔥
🤗 Weights available on HF (sadly cc-by-nc license): jinaai/reader-lm-1.5b

HF Space to try it out: maxiw/HTML-to-Markdown

In this post