Meaning has shape.

An LLM only knows what it memorized. RAG gives it eyes.
Hands-on session · 3rd year AIML

From LLMs to RAG

Why your chatbot lies, and how we make it tell the truth.
Your speaker
Srikanth Doddi

Architect · Digital Engineering  ·  OSI Digital, Hyderabad

I design & build AI/ML, full-stack and cloud systems, and lead the team that ships them across R&D, delivery and presales.

9 years building

AI / ML & GenAI

RAG, LLMs, embeddings, vision; architected end-to-end

Full-stack

React · Angular · Node · Python / FastAPI

Cloud & platform

AWS · Azure · GCP · IaC · Kubernetes

Lead · R&D · Presales

~20 engineers, delivery & solutioning

Education
B.Tech · Computer Science & Engineering · Sri Vasavi Engineering College
M.Tech · Data Science & AI · BITS Pilani
Stack I live in: Python · PyTorch · LangChain · RAG · Vector DBs · FastAPI · React · Node.js · AWS · Azure · GCP · Docker · Kubernetes · Terraform
By the time we're done

You'll know why your chatbot lies, and you'll have watched the fix get built.

That's the promise. Here's how we get there.
How the next 3 hours run

One hour to understand. Two to build.

1

How GenAI broke through

2

Where it is now

3

Why LLMs lie

4

RAG: chunk → embed → retrieve → generate

5

Where RAG breaks

6

We build one, live

How we all ended up here

In two months, AI went from research to everywhere.

ChatGPT reached 100 million users faster than any app in history.

5 days
to its first 1 million users
~2 months
to 100 million users
Time to 100 million users (shorter is faster)
ChatGPT      ~2 months
TikTok       9 months
Instagram    2.5 years
Sources: UBS / Reuters / Sensor Tower estimates, widely reported. Figures approximate, refresh if you cite live numbers.
Why it broke through

The tech wasn't new. The access was.

Transformers shipped in 2017. The capability built quietly for years. What changed in 2022 wasn't invention, it was access. Three things converged:

Actually capable

Good enough to feel like magic, not a toy: it held a conversation and wrote code.

Zero friction

Free, in a browser, no install, no manual. You just typed a sentence.

Conversational

Plain language in, plain language out. Your grandmother could use it, and did.

Where GenAI is right now · mid-2026

From "can it generate?" to "can it do the work?"

Agentic AI · Likely

Models plan, call tools, and recover from failure, not just chat. ~40% of enterprise apps embedding agents by end of 2026.

RAG went enterprise · Certain

From a trick to governed "truth systems" with cited, auditable answers. Today's topic.

Multimodal by default · Likely

Text, image, audio and documents in one workflow.

Small + on-device · Likely

Efficient models run locally; frontier models reserved for hard reasoning.

Synthesis of mid-2026 trend reports (IBM, MIT SMR, vendor analyses). REFRESH the day before, model names and numbers move fast.
The problem

LLMs are confident liars.

An LLM only knows its training data, frozen at a cutoff. Ask about your data or recent events and it won't refuse. It guesses, fluently, and sounds just as sure as when it's right.

LIVE, DO THIS ON STAGE, DON'T DESCRIBE IT:

Ask a plain LLM something it cannot know: a detail from your college's own handbook, or a very recent event after its cutoff. Watch it answer confidently and wrongly. That failure is the entire reason RAG exists.

You've learned embeddings. Here's what they're FOR

Meaning becomes geometry.

Every sentence is a point; similar meaning lands close. Embed a question and the nearest points are the passages most likely to answer it. The green question drops in and beams shoot to its neighbours: that is retrieval, live.

Legend: attendance cluster · fees cluster · the question
The fix · four moves

RAG: give the model an open book before it answers.

1

Chunk

Split docs into passages
~200-500 tokens each, with overlap
2

Embed

Text becomes a vector
Each chunk becomes ~384-3072 floats
3

Retrieve

Find nearest vectors
Top-k (k=3-5) by cosine
4

Generate

Answer from context
Answer from them, not memory

Closed-book vs open-book. Steps 1-2 happen once (indexing). Steps 3-4 happen on every question. Next: watch each step happen.

Step 1 · Chunk

Slice the document into passages.

One long document is too big to search. We cut it into bite-sized passages, with a little overlap so a sentence cut at one seam still appears whole in the neighbouring passage.

Legend: passage A · passage B · passage C · overlap
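The overlap idea above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it counts whitespace-separated words as a rough stand-in for tokens, and the function name `chunk_words` is hypothetical.

```python
def chunk_words(text, size=400, overlap=50):
    """Split text into passages of `size` words; neighbours share `overlap` words.

    Assumption: words approximate tokens. A real pipeline would count tokens
    with the embedding model's own tokenizer.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap              # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks


# Tiny demo: 10 "words", windows of 4, 1 word shared at each seam.
demo = " ".join(f"w{i}" for i in range(10))
passages = chunk_words(demo, size=4, overlap=1)
for p in passages:
    print(p)
```

Each passage starts with the last word of the previous one, which is the seam overlap in the picture.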
Inside Step 2 · before embedding

First, the text becomes tokens.

minimum → 5021 · attendance → 9314 · to → 1037 · sit → 4490 · exams → 8829
→ [ 384 floats ]

The model never sees letters, only these token IDs. They flow through the network, and a final pooling step collapses the whole sequence into one vector. (Long or rare words split into sub-word pieces.)
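To make "text becomes token IDs" concrete, here is a toy tokenizer. It uses greedy longest-match sub-word splitting in the WordPiece style, which is one common scheme and not necessarily the one your embedding model uses. The five IDs are the ones on the slide; the `condon` / `##ation` entries and their IDs are made up to show a sub-word split.

```python
# Toy vocabulary. The first five IDs mirror the slide; the rest are hypothetical.
VOCAB = {
    "minimum": 5021, "attendance": 9314, "to": 1037, "sit": 4490, "exams": 8829,
    "condon": 7001, "##ation": 7002,   # made-up sub-word pieces
    "[UNK]": 0,
}


def tokenize_word(word):
    """Greedy longest-match: take the longest known piece, then continue."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # marks a continuation piece
            if piece in VOCAB:
                match = piece
                break
            end -= 1
        if match is None:              # nothing fits: whole word is unknown
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def encode(text):
    """Text in, list of token IDs out."""
    return [VOCAB[p] for w in text.lower().split() for p in tokenize_word(w)]


print(encode("minimum attendance to sit exams"))
print(tokenize_word("condonation"))
```

The second call shows a rare word falling apart into two known pieces instead of one unknown token.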

Step 2 · Embed

Each passage becomes a point in space.

An embedding model reads a passage and emits a long list of numbers, a vector. That vector is a single point. Passages about the same thing land in the same neighbourhood.

Legend: passage · its 384 numbers · the point it becomes
Step 2 · continued

A vector is just numbers. Stack them in a DB.

chunk_07 → [ 0.00 0.00 0.00 0.00 0.00 0.00 … ] × 384

Each passage becomes a fixed list of floats, a point in 384-dimensional space. Every passage is one row, and the rows pour into the vector database: stored, indexed, ready to search in milliseconds.

Legend: passage in · its numbers · stored & indexed
Step 3 · Retrieve

Embed the question. Grab the nearest points.

The question becomes a point too. We measure distance to every stored point and pull the closest k: those are the passages most likely to hold the answer. The vector DB does this in milliseconds.

Legend: question · top-k retrieved
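A brute-force version of "store the rows, grab the nearest k" fits in a small class. This is a sketch under assumptions: `TinyVectorStore` is a hypothetical name, the 3-number vectors are hand-written stand-ins for real 384-float embeddings, and a real vector DB would use an approximate index rather than scoring every row.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinyVectorStore:
    """In-memory store: one row per passage, searched by brute force."""

    def __init__(self):
        self.rows = []  # each row: (chunk_id, text, vector)

    def add(self, chunk_id, text, vector):
        self.rows.append((chunk_id, text, vector))

    def top_k(self, query_vec, k=4):
        scored = [(cosine(query_vec, vec), cid, text) for cid, text, vec in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)  # best score first
        return scored[:k]


store = TinyVectorStore()
store.add("chunk_01", "Attendance rule", [0.9, 0.1, 0.0])
store.add("chunk_02", "Fee deadlines", [0.1, 0.9, 0.0])
store.add("chunk_03", "Condonation rule", [0.8, 0.2, 0.1])

hits = store.top_k([1.0, 0.0, 0.0], k=2)   # a query pointing along the first axis
for score, cid, text in hits:
    print(f"{cid}  {score:.3f}  {text}")
```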
Step 4 · Generate

Hand the passages to the LLM. Answer from them.

The retrieved passages and the question flow into the model as context. It writes the answer from that text, grounded, specific, and able to cite its source, instead of guessing from memory.

Legend: retrieved context · the model · grounded answer
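The generate step is mostly string assembly. The sketch below shows one way to number the retrieved passages so the model can cite them; the prompt wording, the `build_prompt` name and the source labels are illustrative assumptions, and the `[XX]%` placeholder is deliberately left unfilled.

```python
def build_prompt(question, hits):
    """Assemble a grounded prompt.

    hits: list of (source, text) pairs, best match first.
    """
    context = "\n\n".join(
        f"[{i}] ({source}) {text}" for i, (source, text) in enumerate(hits, start=1)
    )
    return (
        "Answer using ONLY the numbered context below. "
        "Cite the passage number you used. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


hits = [
    ("exam-rules.pdf", "Students need at least [XX]% attendance to sit semester exams."),
    ("exam-rules.pdf", "Condonation may be granted on medical grounds."),
]
prompt = build_prompt("What's the minimum attendance to sit for semester exams?", hits)
print(prompt)
```

The returned string is what you would pass to whichever LLM client you use.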
The whole system

Two pipelines, one shared memory.

Indexing fills the Vector DB once. Every question embeds, reads the nearest passages from that same store, and the LLM answers. The model is never retrained.

It's smaller than you think

The entire retrieve-and-answer loop.

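# Pseudocode: split, embed, index.top_k and llm stand for whatever
# chunker, embedding model, vector store and LLM client you plug in.

# 1. INDEX (once)
chunks = split(docs, size=400, overlap=50)
index  = embed(chunks)        # -> vectors, stored alongside their text

# 2. ANSWER (every question)
def answer(q):
    qv      = embed(q)
    hits    = index.top_k(qv, k=4)   # cosine; returns the passage texts
    context = "\n".join(hits)
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {q}")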

No fine-tuning

The model is never retrained; you only change the prompt.

No GPU

Embed a few docs + call an API, runs on Colab's free tier.

Swap any piece

Chunk size, embed model, k, LLM: the shape stays the same.
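If you want to run the loop with no API key at all, here is a toy version where every piece is a stand-in. Assumptions, loudly: `embed` is a bag-of-words counter instead of a real embedding model, `Index` is a hypothetical brute-force store, and the LLM is passed in as a plain function so the example can simply echo the prompt.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)   # Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class Index:
    """1. INDEX (once): embed every chunk and keep it next to its text."""

    def __init__(self, chunks):
        self.rows = [(chunk, embed(chunk)) for chunk in chunks]

    def top_k(self, qv, k=4):
        ranked = sorted(self.rows, key=lambda row: cosine(qv, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def answer(q, index, llm, k=4):
    """2. ANSWER (every question): embed, retrieve, prompt."""
    hits = index.top_k(embed(q), k=k)
    context = "\n".join(hits)
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {q}")


chunks = [
    "Attendance below the minimum bars a student from semester exams.",
    "Hostel fees are due before the semester starts.",
    "Scholarship forms are collected by the department office.",
]
index = Index(chunks)
question = "What attendance is needed for semester exams?"
prompt = answer(question, index, llm=lambda p: p, k=1)   # echo "LLM" shows the prompt
print(prompt)
```

Swapping `embed` for a real model and the lambda for a real client keeps the same shape, which is the point of the slide.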

What grounding actually buys you

Same question. Same model. One has the page.

Q: "What's the minimum attendance to be allowed to sit for semester exams?"
Plain LLM · closed book
Fluent, plausible, and just a generic guess. It doesn't know YOUR college's rule, the condonation cutoff, or the exceptions, and nothing in the answer warns you of that.
RAG bot · open book
Grounded in your actual rulebook. Specific, and it cites the page, so a student can verify it on the spot.
Replace [XX]% and the source filename with your college's real attendance rule before the talk, so a student can't catch a wrong number.
RAG isn't the only knob

Prompt it, show it, retrieve it, or retrain it.

Zero-shot

Just ask. Fast and free, but ungrounded.

prompt: [ Q ] · effort ▁▁▁
Few-shot

A few examples in the prompt, teaches format & style.

prompt: [ ex · ex · Q ] · effort ▂▁▁
RAG

Retrieve facts into the prompt, grounds in your data.

prompt: [ doc · doc · Q ] · effort ▃▃▁
Fine-tune

Retrain the weights. Powerful, but expensive.

model: Δθ · effort ▇▇▇

Few-shot teaches HOW to answer. RAG supplies WHAT to answer from. Often you stack them.

First, the vocabulary

Chunk, embedding, vector, what's what?

CHUNK
"minimum attendance to sit exams"
A slice of text, what a human reads.
→ embedding model →
VECTOR (the embedding)
[ 0.21 -0.83 0.44 0.07 … ]
A list of numbers, one point in space.

A chunk is the words. A vector is the numbers. The embedding model turns one into the other, and people usually say "embedding" to mean the vector it produces.

The math, made simple

How a chunk becomes one vector.

1
min · attend · to · sit · exam
Split into tokens
2
Each token → a vector; attention mixes in its neighbours
3
Average them → the chunk's one vector

Every token starts as a generic vector. Attention rewrites each one using the words around it, so "bank" by "river" ≠ "bank" by "money". Averaging collapses the whole passage into a single point, trained so similar meanings land in similar directions.

Zoom in · one token, the actual numbers

A token's vector is looked up, not computed.

"attendance"
↓ find its ID
9314
EMBEDDING TABLE · 30,522 tokens × 384 numbers (learned in training)
ID      token         vector (first values)
9312    exam           0.44   0.10  −0.33   0.51 …
9313    fees          −0.20   0.71   0.05  −0.12 …
9314    attendance     0.12  −0.45   0.88   0.03 …
9315    leave          0.33  −0.12   0.40   0.62 …
grab row 9314 → attendance's vector
[ 0.12  −0.45  0.88  0.03 … −0.21 ] (384 numbers)

The starting vector isn't calculated, it's looked up. The model learned one row of numbers per token during training. (Formally: a one-hot token ID × the embedding matrix selects exactly that row.) Attention then rewrites those numbers using the surrounding words.
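The "one-hot × matrix selects a row" remark is easy to check by hand. The sketch below uses a 4-row table holding only the first four values of each row from the slide, so it is a cut-down illustration, not a real 30,522 × 384 table.

```python
# Four rows, first four values each, copied from the slide's table.
ROWS = [
    [0.44, 0.10, -0.33, 0.51],    # exam
    [-0.20, 0.71, 0.05, -0.12],   # fees
    [0.12, -0.45, 0.88, 0.03],    # attendance
    [0.33, -0.12, 0.40, 0.62],    # leave
]
TOKEN_TO_ROW = {"exam": 0, "fees": 1, "attendance": 2, "leave": 3}


def lookup(token):
    """What the model really does: index straight into the table."""
    return ROWS[TOKEN_TO_ROW[token]]


def one_hot_times_matrix(token):
    """The formal view: a one-hot row vector multiplied by the matrix."""
    one_hot = [1.0 if i == TOKEN_TO_ROW[token] else 0.0 for i in range(len(ROWS))]
    return [
        sum(one_hot[i] * ROWS[i][j] for i in range(len(ROWS)))
        for j in range(len(ROWS[0]))
    ]


print(lookup("attendance"))
print(one_hot_times_matrix("attendance"))
```

Both calls print the same four numbers; the lookup is just the cheap way to do the multiplication.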

The actual formulas

From tokens to one vector, the math.

① Attention, mix in context
softmax( Q Kᵀ / √d ) · V
each token's vector becomes a weighted blend of all the others
② Mean-pool, collapse to one
v = ( h₁ + h₂ + … + hₙ ) / n
average the n contextual token vectors
③ Normalize, unit length
v̂ = v / ‖v‖
so cosine similarity becomes a plain dot product
WORKED EXAMPLE · 2-D for clarity
h₁ = [ 0.2,  0.8 ]
h₂ = [ 0.4,  0.6 ]
h₃ = [ 0.6,  0.4 ]
pool → v = [ (0.2+0.4+0.6)/3, (0.8+0.6+0.4)/3 ] = [ 0.40, 0.60 ]
‖v‖ = √(0.40² + 0.60²) = √0.52 ≈ 0.721
v̂ = [ 0.40/0.721, 0.60/0.721 ] ≈ [ 0.55, 0.83 ]
That's the chunk's final embedding. Many real embedding models do exactly this, just in 384+ dimensions.
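The worked example can be replayed in plain Python. The three 2-D vectors are the ones above; the function names are just illustrative.

```python
import math


def mean_pool(vectors):
    """Average the token vectors column by column."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


h = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]   # h1, h2, h3 from the worked example
v = mean_pool(h)
v_hat = l2_normalize(v)

print([round(x, 2) for x in v])       # the pooled vector
print([round(x, 2) for x in v_hat])   # the unit-length embedding
```

Rounded to two places, the printed values match the hand calculation.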
How we measure "close"

Closeness = cosine similarity.

cos θ = (A · B) / (‖A‖ ‖B‖)

It's just the angle between two vectors. Same direction → +1.0 (same meaning). Right angle → 0 (unrelated). Opposite → −1. Retrieval scores every passage this way and keeps the top few.

Legend: the question · a passage · the angle θ
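The three landmark values, and the "normalise first, then it's just a dot product" shortcut, can be checked directly. The vectors here are arbitrary 2-D examples chosen so the arithmetic is easy to follow.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """cos θ = (A · B) / (‖A‖ ‖B‖)"""
    return dot(a, b) / (norm(a) * norm(b))


same = cosine([1, 0], [2, 0])        # same direction, different length
right_angle = cosine([1, 0], [0, 3])
opposite = cosine([1, 0], [-1, 0])
print(same, right_angle, opposite)

# After normalising to unit length, the plain dot product gives the same score.
a, b = [3.0, 4.0], [4.0, 3.0]
a_hat = [x / norm(a) for x in a]
b_hat = [x / norm(b) for x in b]
print(round(cosine(a, b), 2), round(dot(a_hat, b_hat), 2))
```

Length drops out of the score, which is why only direction carries meaning here.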
⚠ The insight most tutorials skip

RAG fails at retrieval, not generation.

Passages in the index: attendance % · medical leave · fees due · re-eval · scholarship · hostel · exam dates · condonation · grading
Query: "minimum attendance to sit exams" → retrieval lights the wrong passages; the right one (condonation) barely makes the cut.

Chunking

One chunk holds the whole exam-rules page: attendance, fees, re-eval, all 5 topics. The question needs 1; the other 4 are noise that drowns the match.

Retrieval

"attendance leave" and "medical leave" embed close together. Top-k pulls the medical-leave rule for an attendance question. The right passage is never seen.

Ranking

The right passage is in the index but ranks #7 when k=4. It existed, but it didn't make the cut. Raising k would catch it but adds noise: that's the tradeoff.

Fixing the #1 failure

Chunking: how you cut matters most.

Fixed-size

Every N tokens. Dead simple, but it slices sentences across the cut.

By sentence

Break on sentence boundaries. Clean, but chunk sizes vary a lot.

Semantic

Cut where the topic shifts. Each chunk = one coherent idea. Best recall.

Recursive

Paragraphs → sentences → words, until it fits. A solid default.

Bad chunks are the #1 cause of bad retrieval. Start with recursive; switch to semantic if recall lags. Always overlap ~10-15%.
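Here is a minimal sketch of the recursive idea: try the coarsest separator first and only fall back to a finer one when a piece is still too long. It is a simplified illustration, not any library's splitter. It counts words instead of tokens, adds no overlap, and drops the separator at a cut (so a sentence may lose its trailing full stop).

```python
def recursive_split(text, max_words=50, separators=("\n\n", ". ", " ")):
    """Paragraphs -> sentences -> words, until every chunk fits max_words."""
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:                       # nothing left to try: hard cut
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:                     # separator not present, go finer
        return recursive_split(text, max_words, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate.split()) <= max_words:
            current = candidate              # keep packing pieces together
            continue
        if current:
            chunks.append(current.strip())
        if len(piece.split()) > max_words:   # piece alone is too big: recurse
            chunks.extend(recursive_split(piece, max_words, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current.strip())
    return chunks


text = (
    "Attendance rule one. Attendance rule two.\n\n"
    "Fees are due in June. Late fees apply after that."
)
chunks = recursive_split(text, max_words=6)
for c in chunks:
    print(repr(c))
```

The first paragraph fits whole; the second is too long, so it is split at the sentence boundary instead.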

Better retrieval

Hybrid search, then re-rank.

The question splits two ways: vector search for meaning (the smooth stream) and keyword search for exact terms, names, codes, IDs that embeddings blur (the blocky stream). Both pour into one candidate pool, then a cross-encoder re-ranks it down to the top few.

Legend: Vector · meaning | Keyword · exact | Top-4 re-ranked
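The two-streams-then-re-rank flow can be sketched without any models. Assumptions: the two hit lists are hand-written, and `overlap_score` is a crude word-overlap stand-in for a cross-encoder, used only so the example runs; in practice you would plug a real re-ranking model into `score_fn`.

```python
def hybrid_candidates(vector_hits, keyword_hits):
    """Pour both ranked id lists into one pool, dropping duplicates."""
    seen, pool = set(), []
    for chunk_id in list(vector_hits) + list(keyword_hits):
        if chunk_id not in seen:
            seen.add(chunk_id)
            pool.append(chunk_id)
    return pool


def rerank(question, pool, passages, score_fn, top_n=4):
    """Score each (question, passage) pair and keep the best top_n ids."""
    ranked = sorted(pool, key=lambda cid: score_fn(question, passages[cid]), reverse=True)
    return ranked[:top_n]


def overlap_score(question, passage):
    # Stand-in for a cross-encoder: counts shared lowercase words.
    return len(set(question.lower().split()) & set(passage.lower().split()))


passages = {
    "c1": "minimum attendance to sit exams",
    "c2": "medical leave rules",
    "c3": "course code CS301 exam schedule",
    "c4": "hostel fees",
}
vector_hits = ["c2", "c1"]    # what a meaning-based search might return
keyword_hits = ["c3", "c1"]   # what an exact-term search for "CS301" might return

pool = hybrid_candidates(vector_hits, keyword_hits)
best = rerank("minimum attendance for CS301 exams", pool, passages, overlap_score, top_n=2)
print(pool, best)
```

The keyword stream is what brings in `c3`: an exact code like CS301 is the kind of term embeddings tend to blur.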
Did it actually work?

You can't improve what you don't measure.

Recall@k
88%

Of your test questions, how often is the right passage in the top-k retrieved?

Faithfulness
94%

Is every claim in the answer actually supported by the retrieved text? Catches hallucination.

Answer relevance
91%

Does the answer address what was asked, not just sound related?

Keep a golden set, fixed question → expected-answer pairs, and re-run it on every change. Add human spot-checks. Numbers catch regressions before users do.
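Recall@k over a golden set is a few lines once you have question → expected-passage pairs. The sketch below fakes the retriever with canned results so it runs on its own; the ids and questions are placeholders, and one entry is set up to sit at rank #7 like the ranking failure described earlier.

```python
def recall_at_k(golden, retrieve, k=4):
    """Fraction of golden questions whose expected passage is in the top-k.

    golden:   list of (question, expected_chunk_id)
    retrieve: function (question, k) -> list of chunk ids, best first
    """
    found = sum(1 for question, expected in golden if expected in retrieve(question, k))
    return found / len(golden)


# Canned rankings standing in for a real retriever.
CANNED = {
    "q1": ["c1", "c2", "c5", "c6"],
    "q2": ["c2", "c3", "c4", "c5", "c6", "c8", "c7"],   # right passage at rank 7
    "q3": ["c3"],
    "q4": ["c1", "c9"],
}


def retrieve(question, k):
    return CANNED[question][:k]


golden = [("q1", "c1"), ("q2", "c7"), ("q3", "c3"), ("q4", "c9")]

print(recall_at_k(golden, retrieve, k=4))
print(recall_at_k(golden, retrieve, k=7))
```

Re-running this after every chunking or embedding change is the regression check the slide is asking for.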

Beyond the demo

What it takes to run RAG in production.

🎯 Retrieval quality

  • Smart chunking, semantic splits, right size + overlap
  • Pick the embedding model for your domain
  • Re-rank the top-k with a cross-encoder
  • Hybrid search: keyword + vector
  • Measure recall@k, did we fetch the right passage?

⚙️ Operations

  • Latency, cache embeddings & answers
  • Cost, batch calls, right-size the model
  • Freshness, re-index when documents change
  • Monitor queries, retrieved chunks, answers
  • Scale the vector store (ANN index, sharding)

🛡 Trust & safety

  • Always cite sources, keep answers verifiable
  • Refuse when no good context is found
  • Respect document access control & PII
  • Guard against prompt injection hidden in docs
  • Human eval on a golden question set
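"Refuse when no good context is found" can be as simple as a score threshold in front of the LLM call. This is a sketch under assumptions: the 0.30 cutoff is an arbitrary illustrative value you would tune on your golden set, and `llm` is any function that takes a prompt string.

```python
REFUSAL = "I couldn't find that in the documents I was given."


def answer_or_refuse(question, scored_hits, llm, min_score=0.30):
    """Only call the LLM if at least one passage clears the threshold.

    scored_hits: list of (similarity, passage_text) pairs.
    min_score:   hypothetical cutoff; tune it against real queries.
    """
    good = [text for score, text in scored_hits if score >= min_score]
    if not good:
        return REFUSAL            # no usable context: do not let the model guess
    context = "\n".join(good)
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {question}")


def echo_llm(prompt):
    return prompt                 # stand-in so we can inspect what would be sent


weak = [(0.12, "Hostel fees are due in June."), (0.08, "Club timings.")]
strong = [(0.71, "Attendance rule text."), (0.10, "Club timings.")]

print(answer_or_refuse("What is the dress code?", weak, echo_llm))
print(answer_or_refuse("What is the attendance rule?", strong, echo_llm))
```

Low-scoring passages are also kept out of the prompt in the second call, so they cannot add noise.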
Now we build

A RAG bot, built live from an empty notebook.

Our document set

A bot that answers questions over a document the base model can't know, so you can SEE grounding work and verify the answers yourselves.

Default pick: your college's exam-rules / handbook PDF (relatable, the model genuinely doesn't know it, and the room can fact-check it live). Easy swaps: the AIML syllabus, a club's docs, or any PDF you bring.

exam-rules.pdf

The stack

Google Colab · no installs
Groq · fast, free LLM
Embedding model · text → vectors
Vector store · in-memory
~60 lines of Python · total
VERIFY before the talk: Groq free-tier limits + that Colab/Groq are reachable on the venue network.
How you take part

Three ways in: pick the one your setup allows.

1

Everyone

Watch it built live, call out the next step. No device needed.

2

On your phone

Scan the QR for the hosted bot. Try to trip it up.

3

On a laptop

Open the Colab notebook with your Groq key, run as we go.

Didn't set up a Groq key? No problem, you'll leave with the notebook and a link to do it at home.

Your turn

Questions, then we build it together.

Ask anything · Chunking, embeddings, cost, or your own use-case
Open the notebook · Colab + your Groq key, we run it live
Build with me · Bring a PDF, we ground a bot on it
Slides & notebook: meaninghasshape.srikanthdoddi.com