Let's pipe some 𝗱𝗮𝘁𝗮 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝘄𝗲𝗯 into our vector database, shall we?🤠
With 𝐢𝐧𝐠𝐞𝐬𝐭-𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 𝐯𝟏.𝟑.𝟎 (https://github.com/AstraBert/ingest-anything) you can now start from plain URLs, scrape their content, extract the text, chunk it and load it into your favorite LlamaIndex-compatible database!🕸️
You can do it thanks to 𝗰𝗿𝗮𝘄𝗹𝗲𝗲 by Apify, an open-source crawling library for Python and JavaScript that handles all the data flow from the web. ingest-anything then combines it with 𝗕𝗲𝗮𝘂𝘁𝗶𝗳𝘂𝗹𝗦𝗼𝘂𝗽, 𝗣𝗱𝗳𝗜𝘁𝗗𝗼𝘄𝗻 and 𝗣𝘆𝗠𝘂𝗣𝗗𝗙 to scrape HTML files, convert them to PDF and extract the text - hassle-free!😸
Check the attached code snippet if you're curious about how to get started🎬
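For a feel of what "extract the text and chunk it" means in practice, here is a tiny standard-library sketch of those two steps. To be clear, this is not the snippet attached to the post and not the ingest-anything API: `TextExtractor`, `html_to_text` and `chunk_words` are hypothetical names made up for illustration, and the word-window chunking is a stand-in for the Chonkie chunkers the library relies on. Fetching pages (crawlee's job) and writing to a vector database are left out on purpose.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of `size` words; neighbors share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    page = "<html><body><h1>Hello</h1><p>Some text from the web.</p></body></html>"
    for chunk in chunk_words(html_to_text(page), size=4, overlap=1):
        print(chunk)
```

The overlap keeps a few words shared between neighboring chunks, so a sentence cut at a chunk boundary still shows up with some context on both sides.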
PS: Don't tell anybody, but this release also has another gem... It supports OpenAI models for agentic chunking, following the new releases of Chonkie🦛✨
If you don't want to miss out on the new features, leave us a little star on GitHub ➡️ https://github.com/AstraBert/ingest-anything
And join our Discord community! ➡️ https://discord.gg/kDqHNjks