ChunkViz v0.1

Want to learn more about AI Engineering Patterns? Join me on Twitter or subscribe to my newsletter.


Language Models do better when they're focused.

One strategy is to pass a relevant subset (chunk) of your full data. There are many ways to chunk text.

This is a tool to understand different chunking/splitting strategies.

What's going on here?

Language Models have context windows. This is the length of text that they can process in a single pass.
Although context lengths are getting larger, it has been shown that language models perform better on tasks when they are given less (but more relevant) information.

But which relevant subset of data do you pick? This is easy when a human is doing it by hand, but it turns out to be difficult to instruct a computer to do it.

One common way to do this is by chunking, or subsetting, your large data into smaller pieces. To do this you need to pick a chunking strategy.
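To make the idea concrete, here is a minimal sketch of the simplest possible strategy: cutting text into fixed-size pieces by character count. The function name `split_by_characters` is made up for illustration; this is not how ChunkViz, LangChain, or Llama Index implement their splitters.

```python
def split_by_characters(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters.

    Hypothetical helper for illustration only; it ignores word and
    sentence boundaries entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text chunk_size characters at a time.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


text = "Language Models do better when they're focused."
chunks = split_by_characters(text, chunk_size=20)
for number, chunk in enumerate(chunks, start=1):
    print(number, repr(chunk))
```

Because this naive approach only counts characters, a chunk can end in the middle of a word or sentence. That is one reason to compare it with other strategies.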

Pick different chunking strategies above to see how they impact the text, and add your own text if you'd like.

You'll see different colors that represent different chunks. This could be chunk 1. This could be chunk 2. Sometimes a chunk will change in the middle of a sentence (this isn't great). If any chunks have overlapping text, those will appear in orange.

Chunk Size: The length (in characters) of your end chunks

Chunk Overlap (Green): The amount of text (in characters) that sequential chunks share
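As a rough sketch of how these two settings interact, the example below slides a window of `chunk_size` characters across the text and moves it forward by `chunk_size - chunk_overlap` each time. The function name `split_with_overlap` is hypothetical, and the check that overlap stays under half the chunk size is my assumption about how to mirror this tool's limit. Real splitters may behave differently.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character chunks where neighbors share chunk_overlap characters.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: keep overlap below 50% of the chunk size.
    if not 0 <= chunk_overlap < chunk_size / 2:
        raise ValueError("chunk_overlap must be >= 0 and < 50% of chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij"
overlapped = split_with_overlap(sample, chunk_size=4, chunk_overlap=1)
print(overlapped)  # ['abcd', 'defg', 'ghij']
```

Here each chunk starts with the last character of the previous one ("d" and "g"), which is the shared text the overlap setting controls.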

Notes: *The js, python, and markdown text splitters trim the whitespace at the ends of chunks, which is why the text jumps around. *Overlap is locked at <50% of chunk size. *Simple analytics (privacy friendly) are used to understand my hosting bill.

For implementations of text splitters, view LangChain (py, js) & Llama Index (py, js)

MIT License, Open Source, PRs Welcome

Made with ❤️ by Greg Kamradt