Meaning has shape.

An LLM only knows what it memorized. RAG gives it eyes.

Hands-on session · 3rd year AIML

From LLMs to RAG

Why your chatbot lies, and how we make it tell the truth.

Srikanth Doddi · Architect, Digital Engineering · OSI Digital

Your speaker

Srikanth Doddi

Architect · Digital Engineering · OSI Digital, Hyderabad

I design & build AI/ML, full-stack and cloud systems, and lead the team that ships them across R&D, delivery and presales.

9 years building

AI / ML & GenAI

RAG, LLMs, embeddings, vision; architected end-to-end

Full-stack

React · Angular · Node · Python / FastAPI

Cloud & platform

AWS · Azure · GCP · IaC · Kubernetes

Lead · R&D · Presales

~20 engineers, delivery & solutioning

Education

B.Tech · Computer Science & Engineering · Sri Vasavi Engineering College

M.Tech · Data Science & AI · BITS Pilani

Stack I live in: Python · PyTorch · LangChain · RAG · Vector DBs · FastAPI · React · Node.js · AWS · Azure · GCP · Docker · Kubernetes · Terraform

By the time we're done

You'll know why your chatbot lies, and you'll have watched the fix get built.

That's the promise. Here's how we get there.

How the next 3 hours run

One hour to understand. Two to build.

1

How GenAI broke through

2

Where it is now

3

Why LLMs lie

4

RAG: chunk → embed → retrieve → generate

5

Where RAG breaks

6

We build one, live

How we all ended up here

In two months, AI went from research to everywhere.

ChatGPT reached 100 million users faster than any consumer app before it.

5 days

to its first 1 million users

~2 months

to 100 million users

Time to 100 million users, shorter is faster

ChatGPT: ~2 months
TikTok: 9 months
Instagram: 2.5 years

Sources: UBS / Reuters / Sensor Tower estimates, widely reported. Figures are approximate; refresh them if you cite live numbers.

Why it broke through

The tech wasn't new. The access was.

Transformers shipped in 2017. The capability built quietly for years. What changed in 2022 wasn't invention, it was access. Three things converged:

Actually capable

Good enough to feel like magic, not a toy: it held a conversation and wrote code.

Zero friction

Free, in a browser, no install, no manual. You just typed a sentence.

Conversational

Plain language in, plain language out. Your grandmother could use it, and did.

Where GenAI is right now · mid-2026

From "can it generate?" to "can it do the work?"

Agentic AI · Likely

Models plan, call tools, and recover from failure, not just chat. Forecast: ~40% of enterprise apps will embed agents by end of 2026.

RAG went enterprise · Certain

From a trick to governed "truth systems" with cited, auditable answers. Today's topic.

Multimodal by default · Likely

Text, image, audio and documents in one workflow.

Small + on-device · Likely

Efficient models run locally; frontier models reserved for hard reasoning.

Synthesis of mid-2026 trend reports (IBM, MIT SMR, vendor analyses). REFRESH the day before: model names and numbers move fast.

The problem

LLMs are confident liars.

An LLM only knows its training data, frozen at a cutoff. Ask about your data or recent events and it won't refuse. It guesses, fluently, and sounds just as sure as when it's right.

LIVE, DO THIS ON STAGE, DON'T DESCRIBE IT:

Ask a plain LLM something it cannot know: a detail from your college's own handbook, or a very recent event after its cutoff. Watch it answer confidently and wrongly. That failure is the entire reason RAG exists.

You've learned embeddings. Here's what they're FOR.

Meaning becomes geometry.

Every sentence is a point; similar meaning lands close. Embed a question and the nearest points are the passages most likely to answer it. In the animation, the green question drops in and beams shoot to its neighbours: that is retrieval, live.

Legend: attendance cluster · fees cluster · the question

The fix · four moves

RAG: give the model an open book before it answers.

1

Chunk

Split docs into passages

~200-500 tokens each, with overlap

2

Embed

Text becomes a vector

Each chunk becomes ~384-3072 floats

3

Retrieve

Find nearest vectors

Top-k (k=3-5) by cosine

4

Generate

Answer from context

From the retrieved passages, not memory

Closed-book vs open-book. Steps 1-2 happen once (indexing). Steps 3-4 happen on every question. Next: watch each step happen.

Step 1 · Chunk

Slice the document into passages.

One long document is too big to search. We cut it into bite-sized passages, with a little overlap so a sentence cut at one seam still appears whole in the neighbouring passage.

Legend: passage A · passage B · passage C · overlap
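The slicing above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def chunk_tokens(tokens, size=400, overlap=50):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap              # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                      # this window already reached the end
    return chunks

# Words stand in for tokens here (an assumption for readability).
words = "a student must hold the minimum attendance to sit the semester exams".split()
passages = chunk_tokens(words, size=6, overlap=2)
for p in passages:
    print(" ".join(p))
```

Each passage starts with the last two words of the one before it, which is the overlap the legend shows.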

Inside Step 2 · before embedding

First, the text becomes tokens.

minimum → 5021 · attendance → 9314 · to → 1037 · sit → 4490 · exams → 8829

→ [ 384 floats ]

The model never sees letters, only these token IDs. They flow through the network, and a final pooling step collapses the whole sequence into one vector. (Long or rare words split into sub-word pieces.)
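To make the sub-word remark concrete, here is a toy greedy longest-match splitter in the WordPiece style. The vocabulary and its IDs are invented for illustration and do not match the IDs on the slide or any real tokenizer.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                   # try a shorter piece
        if piece is None:
            return ["[UNK]"]           # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary: piece -> token ID.
vocab = {"attendance": 0, "exam": 1, "con": 2, "##don": 3, "##ation": 4, "[UNK]": 5}

common = wordpiece("attendance", vocab)    # a known word stays whole
rare = wordpiece("condonation", vocab)     # a rare word splits into pieces
rare_ids = [vocab[p] for p in rare]
print(common, rare, rare_ids)
```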

Step 2 · Embed

Each passage becomes a point in space.

An embedding model reads a passage and emits a long list of numbers, a vector. That vector is a single point. Passages about the same thing land in the same neighbourhood.

Legend: passage · its 384 numbers · the point it becomes

Step 2 · continued

A vector is just numbers. Stack them in a DB.

chunk_07 → [ 0.00 0.00 0.00 0.00 0.00 0.00 … ] × 384

Each passage becomes a fixed list of floats, a point in 384-dimensional space. Every passage is one row, and the rows pour into the vector database: stored, indexed, ready to search in milliseconds.

Legend: passage in · its numbers · stored & indexed

Step 3 · Retrieve

Embed the question. Grab the nearest points.

The question becomes a point too. We measure distance to every stored point and pull the closest k, those are the passages most likely to hold the answer. The vector DB does this in milliseconds.

Legend: question · top-k retrieved
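A brute-force version of this step fits in a few lines of NumPy. It is a sketch under assumptions: the 2-D vectors are made up, `top_k` is a hypothetical helper, and a real vector database would typically use an approximate nearest-neighbour index rather than scanning every row.

```python
import numpy as np

def top_k(query_vec, passage_vecs, k=4):
    """Return (indices, scores) of the k passages most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = p @ q                     # cosine similarity, one score per passage
    order = np.argsort(-scores)[:k]    # highest score first
    return order, scores[order]

# Made-up 2-D vectors; real embeddings have hundreds of dimensions.
passage_vecs = np.array([
    [0.9, 0.1],   # attendance rule
    [0.8, 0.3],   # condonation rule
    [0.1, 0.9],   # fee schedule
    [0.0, 1.0],   # hostel fees
])
query_vec = np.array([1.0, 0.2])       # "minimum attendance to sit exams"

idx, scores = top_k(query_vec, passage_vecs, k=2)
print(idx, np.round(scores, 3))
```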

Step 4 · Generate

Hand the passages to the LLM. Answer from them.

The retrieved passages and the question flow into the model as context. It writes the answer from that text, grounded, specific, and able to cite its source, instead of guessing from memory.

retrieved context → the model → grounded answer

✓ RAG bot · grounded answer

The whole system

Two pipelines, one shared memory.

Indexing fills the Vector DB once. Every question is embedded, the nearest passages are read from that same store, and the LLM answers. The model is never retrained.

It's smaller than you think

The entire retrieve-and-answer loop.

# 1. INDEX (once)
chunks = split(docs, size=400, overlap=50)
index  = embed(chunks)        # -> vectors

# 2. ANSWER (every question)
def answer(q):
    qv      = embed(q)
    hits    = index.top_k(qv, k=4)   # cosine
    context = "\n".join(hits)
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {q}")
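For readers who want to run the loop without any API key, here is a self-contained toy version. Everything in it is a stand-in chosen for illustration: `ToyIndex` embeds with a plain bag-of-words count instead of a learned embedding model, and `fake_llm` is a hypothetical placeholder for the real LLM call.

```python
import numpy as np

def tokenize(text):
    words = (w.strip(".,?!:;\"'()").lower() for w in text.split())
    return [w for w in words if w]

class ToyIndex:
    """In-memory index; bag-of-words counts stand in for a real embedding model."""
    def __init__(self, chunks):
        self.chunks = chunks
        vocab = sorted({w for c in chunks for w in tokenize(c)})
        self.word_to_col = {w: i for i, w in enumerate(vocab)}
        self.vectors = np.array([self.embed(c) for c in chunks])

    def embed(self, text):
        vec = np.zeros(len(self.word_to_col))
        for w in tokenize(text):
            if w in self.word_to_col:
                vec[self.word_to_col[w]] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm else vec   # unit length, so dot = cosine

    def top_k(self, query, k=4):
        scores = self.vectors @ self.embed(query)
        return [self.chunks[i] for i in np.argsort(-scores)[:k]]

def fake_llm(prompt):
    """Hypothetical stand-in for an LLM API call."""
    return f"[LLM would answer from a {len(prompt)}-character prompt]"

def answer(index, question, llm=fake_llm, k=2):
    hits = index.top_k(question, k=k)
    context = "\n".join(f"[{i + 1}] {h}" for i, h in enumerate(hits))
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {question}"), hits

chunks = [
    "Students need the minimum attendance to sit the semester exams.",
    "Tuition fees are due before the first week of the semester.",
    "Hostel rooms are allotted after the fees are paid.",
]
index = ToyIndex(chunks)
reply, hits = answer(index, "What is the minimum attendance to sit exams?")
print(hits[0])
print(reply)
```

Swapping `embed` for a real embedding model and `fake_llm` for a real API call leaves `answer` unchanged.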

No fine-tuning

The model is never retrained; you only change the prompt.

No GPU

Embed a few docs + call an API, runs on Colab's free tier.

Swap any piece

Chunk size, embedding model, k, LLM: the shape stays the same.

What grounding actually buys you

Same question. Same model. One has the page.

Q: "What's the minimum attendance to be allowed to sit for semester exams?"

✕ Plain LLM · closed book

Fluent, plausible, and just a generic guess. It doesn't know YOUR college's rule, the condonation cutoff, or the exceptions, and nothing in the answer warns you of that.

✓ RAG bot · open book

Grounded in your actual rulebook. Specific, and it cites the page, so a student can verify it on the spot.

Replace [XX]% and the source filename with your college's real attendance rule before the talk, so a student can't catch a wrong number.

RAG isn't the only knob

Prompt it, show it, retrieve it, or retrain it.

Zero-shot

Just ask. Fast and free, but ungrounded.

Diagram: prompt [ Q ] · effort ▁▁▁

Few-shot

A few examples in the prompt, teaches format & style.

Diagram: prompt [ ex · Q ] · effort ▂▁▁

RAG

Retrieve facts into the prompt, grounds in your data.

Diagram: prompt [ doc · Q ] · effort ▃▃▁

Fine-tune

Retrain the weights. Powerful, but expensive.

Diagram: model [ Δθ ] · effort ▇▇▇

Few-shot teaches HOW to answer. RAG supplies WHAT to answer from. Often you stack them.

First, the vocabulary

Chunk, embedding, vector: what's what?

CHUNK

"minimum attendance to sit exams"

A slice of text, what a human reads.

→ embedding model →

VECTOR (the embedding)

[ 0.21 -0.83 0.44 0.07 … ]

A list of numbers, one point in space.

A chunk is the words. A vector is the numbers. An embedding model turns one into the other, and people usually say "embedding" to mean the vector it produces.

The math, made simple

How a chunk becomes one vector.

1

min · attend · to · sit · exam

Split into tokens

→

2

Each token → a vector; attention mixes in its neighbours

→

3

Average them → the chunk's one vector

Every token starts as a generic vector. Attention rewrites each one using the words around it, so "bank" by "river" ≠ "bank" by "money". Averaging collapses the whole passage into a single point, trained so similar meanings land in similar directions.

Zoom in · one token, the actual numbers

A token's vector is looked up, not computed.

"attendance"

↓ find its ID

9314

EMBEDDING TABLE · 30,522 tokens × 384 numbers (learned in training)

ID · token · vector
9312 · exam · 0.44 0.10 −0.33 0.51 …
9313 · fees · −0.20 0.71 0.05 −0.12 …
9314 · attendance · 0.12 −0.45 0.88 0.03 …
9315 · leave · 0.33 −0.12 0.40 0.62 …

grab row 9314 →

attendance's vector

[ 0.12 −0.45 0.88 0.03 … −0.21 ]

384 numbers

The starting vector isn't calculated, it's looked up. The model learned one row of numbers per token during training. (Formally: a one-hot token ID × the embedding matrix selects exactly that row.) Attention then rewrites those numbers using the surrounding words.
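The lookup and its formal one-hot version can be checked side by side. This is a small sketch with toy sizes and random numbers standing in for the learned table; only the equality of the two routes is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4                       # toy sizes; the slide's table is 30,522 x 384
table = rng.normal(size=(vocab_size, dim))    # random stand-in for learned weights

token_id = 7                                  # arbitrary toy ID
looked_up = table[token_id]                   # the lookup: grab one row

one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0
multiplied = one_hot @ table                  # the formal version: one-hot x matrix

print(np.allclose(looked_up, multiplied))
```

Both routes give the same row, which is why real implementations skip the multiplication and index the table directly.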

The actual formulas

From tokens to one vector, the math.

① Attention, mix in context

softmax( Q Kᵀ / √d ) · V

each token's vector becomes a weighted blend of all the others

② Mean-pool, collapse to one

v = ( h₁ + h₂ + … + hₙ ) / n

average the n contextual token vectors

③ Normalize, unit length

v̂ = v / ‖v‖

so cosine similarity becomes a plain dot product

WORKED EXAMPLE · 2-D for clarity

h₁ = [ 0.2, 0.8 ]

h₂ = [ 0.4, 0.6 ]

h₃ = [ 0.6, 0.4 ]

pool → v = [ (0.2+0.4+0.6)/3, (0.8+0.6+0.4)/3 ] = [ 0.40, 0.60 ]

‖v‖ = √(0.40² + 0.60²) = √0.52 ≈ 0.72

v̂ = [ 0.40/0.72, 0.60/0.72 ] ≈ [ 0.55, 0.83 ]

That's the chunk's final embedding. Models that mean-pool do exactly this, just in 384+ dimensions.
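The three formulas and the worked example can be reproduced in NumPy. One simplification is assumed: the same matrix `H` is used as Q, K and V, whereas a real model first multiplies by learned projection matrices.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)     # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # each row sums to 1
    return weights @ V                        # weighted blend of the value vectors

def mean_pool_normalize(H):
    v = H.mean(axis=0)                        # average the token vectors
    return v / np.linalg.norm(v)              # scale to unit length

# h1, h2, h3 from the worked example.
H = np.array([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])

v_hat = mean_pool_normalize(H)
print(np.round(v_hat, 2))

# Simplification: H plays the role of Q, K and V (no learned projections).
mixed = attention(H, H, H)
print(np.round(mixed, 3))
```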

How we measure "close"

Closeness = cosine similarity.

cos θ = (A · B) / (‖A‖ ‖B‖)

It's just the angle between two vectors. Same direction → +1.0 (same meaning). Right angle → 0 (unrelated). Opposite → −1. Retrieval scores every passage this way and keeps the top few.

Legend: the question · a passage · the angle θ
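The three landmark values are easy to confirm numerically. The vectors below are arbitrary examples, and the last lines check that unit-length vectors reduce cosine similarity to a plain dot product.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])
same = cosine(a, 3 * a)                       # same direction, different length
right_angle = cosine(a, np.array([-2.0, 1.0]))
opposite = cosine(a, -a)
print(round(same, 6), round(right_angle, 6), round(opposite, 6))

# With unit-length vectors, the dot product alone gives the cosine.
b = np.array([3.0, 1.0])
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(np.dot(a_hat, b_hat), cosine(a, b)))
```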

⚠ The insight most tutorials skip

RAG fails at retrieval, not generation.

Passages in the index: attendance % · medical leave · fees due · re-eval · scholarship · hostel · exam dates · condonation · grading

Query: "minimum attendance to sit exams" → retrieval lights up the wrong passages; the right one (condonation) barely makes the cut.

Chunking

One chunk holds the whole exam-rules page: attendance, fees, re-eval, all 5 topics. The question needs 1; the other 4 are noise that drowns the match.

Retrieval

"attendance leave" and "medical leave" embed close together. Top-k pulls the medical-leave rule for an attendance question. The right passage is never seen.

Ranking

Right passage is in the index, but ranks #7 when k=4. It existed, it just didn't make the cut. Raising k adds noise; that's the tradeoff.

Fixing the #1 failure

Chunking: how you cut matters most.

Fixed-size

Every N tokens. Dead simple, but it slices sentences across the cut.

By sentence

Break on sentence boundaries. Clean, but chunk sizes vary a lot.

Semantic

Cut where the topic shifts. Each chunk = one coherent idea. Often the best recall.

Recursive

Paragraphs → sentences → words, until it fits. A solid default.

Bad chunks are the #1 cause of bad retrieval. Start with recursive; switch to semantic if recall lags. Always overlap ~10-15%.
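A recursive splitter can be sketched as follows. It is an illustration of the paragraphs → sentences → words idea, not any particular library's splitter: the separator patterns, the word-count limit and the helper names are assumptions, and overlap is left out to keep it short.

```python
import re

# Coarsest first: blank-line paragraphs, then sentence ends, then single words.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]

def merge_small(pieces, max_words):
    """Greedily glue neighbouring pieces back together while they still fit."""
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate.split()) <= max_words:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

def recursive_split(text, max_words=40, separators=SEPARATORS):
    """Split on the coarsest separator; recurse only into pieces still too long."""
    if len(text.split()) <= max_words or not separators:
        return [text.strip()] if text.strip() else []
    pieces = []
    for part in re.split(separators[0], text):
        pieces.extend(recursive_split(part, max_words, separators[1:]))
    return merge_small(pieces, max_words)

text = ("A student needs the minimum attendance to sit the semester exams. "
        "Shortfalls may be condoned on medical grounds.\n\n"
        "Fees are due before the first week. Late payment attracts a fine.")
chunks = recursive_split(text, max_words=12)
for c in chunks:
    print(len(c.split()), "|", c)
```

The second paragraph fits the limit and stays whole; the first is too long, so it falls back to sentence boundaries.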

Better retrieval

Hybrid search, then re-rank.

The question splits two ways: vector search for meaning (the smooth stream) and keyword search for exact terms, names, codes, IDs that embeddings blur (the blocky stream). Both pour into one candidate pool, then a cross-encoder re-ranks it down to the top few.

Legend: Vector · meaning | Keyword · exact | Top-4 re-ranked
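The two-stream flow can be sketched with toy pieces. Several things here are stand-ins: `vector_hits` is a hard-coded pretend result of vector search, the passages are invented, and `toy_score` (word-set overlap) only marks the place where a real cross-encoder would score each question-passage pair.

```python
def keyword_search(query, passages, k):
    """Rank passages by how many query terms they contain exactly."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), pid)
              for pid, text in passages.items()]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [pid for _, pid in scored[:k]]

def hybrid_candidates(vector_hits, keyword_hits):
    """Union of both result lists, order-preserving and de-duplicated."""
    return list(dict.fromkeys(vector_hits + keyword_hits))

def rerank(query, candidates, passages, score_fn, top_n=4):
    """Re-score every candidate against the query and keep the best few."""
    ranked = sorted(candidates, key=lambda pid: score_fn(query, passages[pid]),
                    reverse=True)
    return ranked[:top_n]

def toy_score(query, text):
    """Stand-in for a cross-encoder: Jaccard overlap of the two word sets."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)

passages = {
    "p1": "CS301 re-evaluation requests close on day 10",
    "p2": "students may ask for answer scripts to be checked again",
    "p3": "hostel fee is paid per semester",
}
query = "CS301 re-evaluation deadline"

vector_hits = ["p2", "p3"]                    # pretend output of vector search
keyword_hits = keyword_search(query, passages, k=2)
pool = hybrid_candidates(vector_hits, keyword_hits)
top = rerank(query, pool, passages, toy_score, top_n=2)
print(pool, top)
```

The exact course code is what the keyword stream catches here; the pretend vector results alone would have missed `p1`.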

Did it actually work?

You can't improve what you don't measure.

Recall@k

88%

Of your test questions, how often is the right passage in the top-k retrieved?

Faithfulness

94%

Is every claim in the answer actually supported by the retrieved text? Catches hallucination.

Answer relevance

91%

Does the answer address what was asked, not just sound related?

Keep a golden set, fixed question → expected-answer pairs, and re-run it on every change. Add human spot-checks. Numbers catch regressions before users do.
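Recall@k over a golden set is a short loop. The sketch assumes each golden entry pairs a question with the ID of the passage that should be retrieved (a retrieval-side variant of the question → expected-answer pairs above); the questions, IDs and canned results are invented.

```python
def recall_at_k(golden, retrieve, k=4):
    """Fraction of questions whose expected passage ID appears in the top-k."""
    found = sum(1 for question, expected_id in golden
                if expected_id in retrieve(question)[:k])
    return found / len(golden)

# Canned retrieval results standing in for a real retriever (best hit first).
results = {
    "q1": ["a", "b", "c"],
    "q2": ["d", "e", "f"],
    "q3": ["x", "y", "w"],
    "q4": ["g", "h", "i"],
}
golden = [("q1", "a"), ("q2", "f"), ("q3", "z"), ("q4", "h")]

for k in (1, 2, 3):
    print(k, recall_at_k(golden, results.get, k=k))
```

Re-running the same golden set after every change to chunking, the embedding model or k shows whether retrieval got better or worse.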

Beyond the demo

What it takes to run RAG in production.

🎯 Retrieval quality

Smart chunking: semantic splits, right size + overlap
Pick the embedding model for your domain
Re-rank the top-k with a cross-encoder
Hybrid search: keyword + vector
Measure recall@k: did we fetch the right passage?

⚙️ Operations

Latency: cache embeddings & answers
Cost: batch calls, right-size the model
Freshness: re-index when documents change
Monitor queries, retrieved chunks, answers
Scale the vector store (ANN index, sharding)

🛡 Trust & safety

Always cite sources; keep answers verifiable
Refuse when no good context is found
Respect document access control & PII
Guard against prompt injection hidden in docs
Human eval on a golden question set
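Two of the trust items, citing sources and refusing without good context, can be handled before the LLM is ever called. This is a sketch only: the 0.35 similarity threshold, the refusal wording and the helper names are assumptions to tune on your own golden set.

```python
REFUSAL = "I couldn't find this in the documents I have."

def select_context(scored_hits, min_score=0.35, max_passages=4):
    """Keep only hits whose similarity clears the threshold, best first."""
    good = [(score, text) for score, text in scored_hits if score >= min_score]
    good.sort(key=lambda hit: hit[0], reverse=True)
    return good[:max_passages]

def build_prompt(question, scored_hits, min_score=0.35):
    """Return a prompt with numbered sources, or None when nothing is good enough."""
    context = select_context(scored_hits, min_score)
    if not context:
        return None                            # caller replies with REFUSAL instead
    cited = "\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(context))
    return ("Answer using ONLY the sources below and cite them by number.\n"
            f"{cited}\n\nQ: {question}")

hits = [(0.80, "attendance rule"), (0.20, "hostel fees"), (0.50, "condonation rule")]
prompt = build_prompt("What is the minimum attendance?", hits)
print(prompt)

weak = build_prompt("Who won the match?", [(0.10, "hostel fees")])
print(weak if weak is not None else REFUSAL)
```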

Now we build

A RAG bot, built live from an empty notebook.

Our document set

A bot that answers questions over a document the base model can't know, so you can SEE grounding work and verify the answers yourselves.

Default pick: your college's exam-rules / handbook PDF (relatable, the model genuinely doesn't know it, and the room can fact-check it live). Easy swaps: the AIML syllabus, a club's docs, or any PDF you bring.

exam-rules.pdf

The stack

Google Colab · no installs

Groq · fast, free LLM

Embedding model · text → vectors

Vector store · in-memory

~60 lines of Python · total

VERIFY before the talk: Groq free-tier limits + that Colab/Groq are reachable on the venue network.

How you take part

Three ways in: pick the one your setup allows.

1

Everyone

Watch it built live, call out the next step. No device needed.

2

On your phone

Scan the QR for the hosted bot. Try to trip it up.

3

On a laptop

Open the Colab notebook with your Groq key, run as we go.

Didn't set up a Groq key? No problem, you'll leave with the notebook and a link to do it at home.

Your turn

Questions, then we build it together.

Ask anything · Chunking, embeddings, cost, or your own use-case

Open the notebook · Colab + your Groq key, we run it live

Build with me · Bring a PDF, we ground a bot on it

Slides & notebook: meaninghasshape.srikanthdoddi.com