How to Train your AI Part 1
6/26/2025

by Trey Clark

Link (Sign-up required. The bot will be down for a bit unless requested; please send me a message on LinkedIn or here.)
Introduction
AI is the new frontier not only for the tech industry, but for the world. It is being used to power things like Q&A chatbots, automation, unbiased decision making, and even medical advice. This is a facet of tech that will not go out of fashion anytime soon. This article describes my first experience with handling my own Large Language Model (LLM): some things I got right, some I got wrong, the solutions I found, and so much more.
The Beginning
To clear the air, AI absolutely will not take over the earth (as of now). What AI can do is reason at a level approaching humans through a predefined and biased algorithm. For an LLM to be able to infer and give data, it must have some knowledge base that it pulls from. We have seen large companies such as OpenAI, Google, and Alibaba scour the internet with web crawlers to get this information to feed to their LLMs.
Web crawlers (scrapers) are programs that visit a webpage and "crawl" through the HTML to find specific information. Each crawler has a determined destination and a set of instructions that tell it how to crawl particular webpages. Most websites employ some sort of dynamic rendering for their pages, so once you figure out the pattern a website uses, a crawler can visit any number of web pages, retrieve data, and send it back to the user. This is the primary method of collecting data for AI.
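To make the idea concrete, here is a minimal sketch in Python, assuming a hypothetical site whose pages follow a simple numeric URL pattern; the URL and the <title> lookup are stand-ins, not the real target used later in this article.

```python
from urllib.request import urlopen

BASE_URL = "https://example.com/entries/{:03d}.html"  # hypothetical URL pattern

def crawl(page_count: int) -> list[str]:
    """Visit each page in the pattern and pull the <title> text out of it."""
    results = []
    for i in range(1, page_count + 1):
        with urlopen(BASE_URL.format(i), timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="ignore")
        # "Crawl" through the HTML for the one piece of information we want.
        start = html.find("<title>")
        end = html.find("</title>")
        if start != -1 and end != -1:
            results.append(html[start + len("<title>"):end].strip())
    return results
```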
Web Crawlers sound like they do more harm than good.
In some cases, that is a correct assumption. People who deploy thousands of crawlers typically have the resources to handle a large amount of data and are actively trying to collect it. Their crawlers can bombard ill-equipped websites with requests, often slowing responses for clients and even going as far as shutting down services. These types of crawlers aren't beneficial to anyone but the company that uses them.
On the other hand, another use for crawlers is to assist with Search Engine Optimization (SEO). Crawlers can systematically browse the web to identify and collect data about websites, including their content, structure, and backlinks, essentially providing a comprehensive view of a website's relevance to search queries. This data can then be used to improve a website's ranking in search engine results.
My AI's Purpose
When training an LLM, it's best that you give it a purpose, or really, something it specializes in. Some options are Q&A, Data Visualization, Image Generation, or Code Generation. This particular model is going to specialize in Q&A. So, we need a specific set of data to help the LLM generate and infer knowledge on a topic. What topic?
Pokemon.
This AI chatbot's purpose is to act as a resource for all things related to Pokemon: Pokedex entries, stats, attacks, locations, items, etc. This information will be recalled using data that has been fed to the LLM so that it may answer questions that hopefully have data associated with them. Which Pokemon learn Thunderbolt? What type of attack is Feint Attack? What does Stealth Rock do?
Setting up the dataset
LLMs need a lot of data. And I mean a lot. They use the data as points of inference that help generate accurate responses. The more data a model has access to, the more correct and helpful the response. So, the question arises... how do I get the data?
The immediate answer is to just type it. But that's thousands of keystrokes. Not efficient, and it would probably destroy any feeling left in my hands.
Next would be to use someone else's dataset. However, finding datasets that are unique to the use case is extremely difficult. Plus, someone else's formatted data comes with nuances that haven't been accounted for.
Lastly, an option would be to collect and format the data with personal resources. What could be used? Web crawlers! Find a resource that has all of the information needed and harvest it for the LLM. (This has very little negative impact when it's a small project; it becomes a problem when hundreds of crawlers are making thousands of requests.) This part of the project made only one request per page across 440 pages.
The library of choice was BeautifulSoup4, an intermediate Python web-scraping library with a lot of features out of the box. It focuses on targeting HTML nodes via their attributes. From those nodes, we can grab the data and format it in a way that's easy to parse.
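As a rough illustration of what that looks like, here is a hedged sketch using requests and BeautifulSoup4. The tag names and class attributes are hypothetical stand-ins, not the actual selectors used for this project.

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> dict:
    """Fetch one page and pull data out of specific HTML nodes."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Target HTML nodes via their attributes, then extract their text.
    name_node = soup.find("td", attrs={"class": "entry-name"})  # hypothetical selector
    stat_nodes = soup.find_all("td", attrs={"class": "stat"})   # hypothetical selector

    return {
        "name": name_node.get_text(strip=True) if name_node else None,
        "stats": [node.get_text(strip=True) for node in stat_nodes],
    }
```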
I chose to grab data from Serebii, a website known for its vast amount of Pokemon information. The time-consuming part of this was actually setting up the crawler. The structure of each of the pages is similar, which makes the crawler's job easy; the hard part is finding a pattern to the pages that the crawler can use. Once this was figured out, the returned data was formatted into JSON (following the JSON:API convention). Using Python's built-in json library, I was able to place all of those objects in a neatly formatted document to be consumed by the LLM.
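Here is a small sketch of that formatting step. The attribute names and output file name are assumptions for illustration; the real dataset's fields may differ.

```python
import json

def to_jsonapi(entries: list[dict]) -> dict:
    """Wrap each scraped entry in a JSON:API-style resource object."""
    return {
        "data": [
            {"type": "pokemon", "id": str(i), "attributes": entry}
            for i, entry in enumerate(entries, start=1)
        ]
    }

# Place every object into one neatly formatted document for the LLM to consume.
entries = [{"name": "Bulbasaur", "stats": ["45", "49", "49"]}]  # example entry
with open("pokemon_dataset.json", "w", encoding="utf-8") as f:
    json.dump(to_jsonapi(entries), f, indent=2)
```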
How to Train Your AI
PEFT, LoRA, and RAG represent distinct, yet increasingly important, techniques for adapting pre-trained LLMs to specific tasks and datasets. PEFT, or Parameter-Efficient Fine-Tuning, allows for efficient adaptation of a base model with significantly fewer trainable parameters, mitigating computational cost and the need for massive datasets. LoRA, or Low-Rank Adaptation, is a popular PEFT method that trains only a small set of added low-rank weights, creating a more lightweight and adaptable version of the model while retaining its core capabilities. RAG, or Retrieval-Augmented Generation, tackles the challenge of grounding LLMs in real-world knowledge by retrieving relevant documents or data snippets, essentially providing the model with the context it needs to generate more accurate and informed responses. Each method offers a different path toward personalized and targeted applications.
My personal machine is a MacBook Air M2 base model. Unfortunately, at the time of writing, a lot of pre-built packages do not offer Apple Silicon support; most only support Intel or AMD chipsets. I learned this after configuring an Unsloth instance to fine-tune.
So, for this project, I chose to use RAG.
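To show the shape of the retrieval half of RAG, here is a deliberately naive sketch that scores documents by keyword overlap with the question. A real pipeline would typically use embeddings and a vector store instead, and the file name below is just the hypothetical output of the earlier formatting step.

```python
import json
import re

def tokenize(text: str) -> set[str]:
    # Lowercase and split into alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    # Score every document by how many terms it shares with the question.
    query_terms = tokenize(question)
    scored = [(len(query_terms & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

# Load the scraped dataset and flatten each entry into a plain-text "document".
with open("pokemon_dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)
documents = [json.dumps(item["attributes"]) for item in dataset["data"]]

question = "What does Stealth Rock do?"
context = retrieve(question, documents)

# The retrieved snippets are stuffed into the prompt so the LLM answers
# from the dataset instead of guessing.
prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
```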
Continue on to part 2!