Language Models do better when they're focused.
One strategy is to pass a relevant subset (chunk) of your full data. There are many ways to chunk text.
This is a tool for understanding different chunking/splitting strategies.
Language Models have context windows: the length of text they can process in a single pass.
Although context windows are getting larger, language models have been shown to perform better on tasks when they are given less (but more relevant) information.
But which relevant subset of data do you pick? This is easy when a human is doing it by hand, but it turns out to be difficult to instruct a computer to do.
One common way to do this is by chunking, or subsetting, your large data into smaller pieces. In order to do this you need to pick a chunking strategy.
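The simplest chunking strategy is to cut the text into fixed-size pieces by character count. A minimal sketch (a hypothetical helper, not this tool's actual implementation):

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    # Step through the text chunk_size characters at a time;
    # the final chunk may be shorter than chunk_size.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = chunk_text("Language Models do better when they're focused.", 20)
```

Note that this naive splitter happily cuts words and sentences in half, which is exactly the behavior the visualization above lets you inspect.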
Pick different chunking strategies above to see how they impact the text; add your own text if you'd like.
You'll see different colors that represent different chunks. This could be chunk 1. This could be chunk 2, sometimes a chunk will change in the middle of a sentence (this isn't great). If any chunks have overlapping text, those will appear in orange.
Chunk Size: The length (in characters) of your resulting chunks
Chunk Overlap (Green): The amount of overlap, or crossover, that sequential chunks share
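Chunk overlap means the tail of one chunk is repeated at the head of the next, so context isn't lost at chunk boundaries. A character-based sketch (hypothetical; this tool locks overlap below 50% of chunk size, while the check here only requires it to be smaller than the chunk size):

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a chunk_size window over text, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk size")
    step = chunk_size - chunk_overlap
    # Each chunk starts `step` characters after the previous one, so the last
    # chunk_overlap characters of a chunk reappear at the start of the next.
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

For example, with a chunk size of 4 and an overlap of 2, each chunk shares its last two characters with the next one: those shared regions are what the orange highlighting above shows.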
Notes:
- Text splitters trim trailing whitespace for the js, python, and markdown splitters, which is why the text jumps around.
- Overlap is locked at <50% of chunk size.
- Simple Analytics (privacy friendly) is used to understand my hosting bill.
For implementations of text splitters, see LangChain (py, js) & LlamaIndex (py, js)
MIT License, Open Sourced, PRs Welcome
Made with ❤️ by Greg Kamradt