Offline Wikipedia API: an easy-to-use offline API that serves up full-text Wikipedia articles.
Cross-Posting from Reddit
This project is an answer to a previous question I had about the easiest route to offline Wikipedia RAG. After mulling over the responses, txtai jumped out to me as the most straightforward.
Since by default that dataset only returns the first paragraph of each article (for speed), I combined it with the same author's full Wikipedia text dump dataset, then packaged it all into a tidy little microservice-style API so that I could use it with Wilmer.
Features:
- Utilizes txtai to search for the closest matching article to your query, then uses that result to grab the full text of the article for you.
- This stands up an API with 3 endpoints:
- An endpoint that takes in a title and returns the full-text Wikipedia article
- An endpoint that takes in a query and responds with matching articles (defaults to top 1 article but can pass in other values)
- An endpoint that takes in a query and responds with the first paragraph of matching articles (the default that the txtai-wikipedia dataset returns)
- It's zippy for what it does. On my Windows computer, once first-run indexing is complete, it returns responses in about 2 seconds or less.
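To make the three endpoints above concrete, here is a minimal client sketch. The base URL, port, endpoint paths, and parameter names are all illustrative assumptions, not the project's actual routes; check the repo's README for the real ones.

```python
# Hypothetical client for the three endpoints described above.
# BASE_URL, paths, and parameter names are assumptions for illustration.
import urllib.parse

BASE_URL = "http://localhost:5728"  # hypothetical host/port

def article_url(title: str) -> str:
    """Endpoint 1: full article text by exact title (path is an assumption)."""
    return f"{BASE_URL}/articles/{urllib.parse.quote(title)}"

def search_url(query: str, top: int = 1) -> str:
    """Endpoint 2: full text of the closest-matching articles.
    'top' mirrors the 'defaults to top 1 article' behavior described above."""
    params = urllib.parse.urlencode({"query": query, "top": top})
    return f"{BASE_URL}/search?{params}"

def summary_url(query: str, top: int = 1) -> str:
    """Endpoint 3: first paragraph only, the txtai-wikipedia default."""
    params = urllib.parse.urlencode({"query": query, "top": top})
    return f"{BASE_URL}/summaries?{params}"
```

From there it's one HTTP GET per call (e.g. with `requests.get(search_url("Airships in the interwar period"))`), which is the whole appeal: no library to integrate, just URLs.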
IMPORTANT: This will need two datasets stored within the project folder, totaling around 60GB. txtai also uses a small model, I think, so some inference will take place within the API.
Additionally, the first time you start the API it will take about 10-15 minutes depending on your computer, as it indexes the titles of the articles to speed up getting results later.
I have tested on Windows and macOS, and the API works fine for me on both. However, there's a git-related issue on macOS that I outlined in the OneClick script: if you're using the Xcode-provided git, you'll need to manually pull down the datasets.
Link: https://github.com/SomeOddCodeGuy/OfflineWikipediaTextApi/
Why Make it?
- A lot of people want offline Wikipedia RAG capability, and even though davidmezzetti gave us a really easy-to-use solution with txtai, most of us were too lazy to actually do anything about it lol. Making this an API means no library to integrate and no code to write. Just call the API.
- I needed this for Wilmer. My long-time goal for the factual workflow was to be able to RAG against Wikipedia offline, and I finally finished that feature today. I was always frustrated that I couldn't trust the factual responses of my AI assistant, so now I have a solution.
Example Usage: How does WilmerAI utilize it?
Below is an example of an assistant being powered by WilmerAI, using the smallmodeltemplate user and running Gemma-27b.
- Node 1: Asked Gemma-27b to look at the last 10 messages and write out what it thought the user was saying:
- Gemma-27b Response: "The last message is asking about the role of airships within the context of the interwar period in aviation history. While the previous messages focused on fixed-wing aircraft, the user now wants to understand how airships fit into the overall picture of transportation and technological advancements during that era."
- Node 2: Considering the response from node 1, asked Gemma-27b to generate a query:
- Gemma-27b Response: "Airships in the interwar period"
- Node 3: Wikipedia Offline Search Article found: https://en.wikipedia.org/wiki/U.S._Army_airships
- Node 4: Gemma-27b responds to the user with the article included in its context for RAG.
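The four-node flow above can be sketched as plain orchestration code. This is not Wilmer's actual implementation; the function names, prompt wording, and the idea of injecting the LLM and wiki-search calls as callables are all my assumptions for illustration.

```python
# Illustrative sketch of the four-node factual workflow (not Wilmer's code).
# `llm` and `wiki_search` are injected callables: prompt in, text out.
from typing import Callable, List

def factual_workflow(messages: List[str],
                     llm: Callable[[str], str],
                     wiki_search: Callable[[str], str]) -> str:
    # Node 1: summarize what the user is asking, from the last 10 messages.
    intent = llm("Summarize what the user is asking:\n" + "\n".join(messages[-10:]))
    # Node 2: turn that summary into a short Wikipedia search query.
    query = llm("Write a short Wikipedia search query for:\n" + intent)
    # Node 3: fetch the best-matching full-text article from the offline API.
    article = wiki_search(query)
    # Node 4: answer the user with the article included as RAG context.
    return llm(f"Context:\n{article}\n\nAnswer the user:\n{messages[-1]}")
```

Keeping the LLM and the article lookup as swappable callables is what lets the same workflow run against any local model and any search backend.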
Anyhow, hope y'all enjoy!