I design & build AI/ML, full-stack and cloud systems, and lead the team that ships them across R&D, delivery and presales.
RAG, LLMs, embeddings, vision; architected end-to-end
React · Angular · Node · Python / FastAPI
AWS · Azure · GCP · IaC · Kubernetes
~20 engineers, delivery & solutioning
ChatGPT reached 100 million users in about two months, faster than any consumer app before it.
Transformers shipped in 2017. The capability built quietly for years. What changed in 2022 wasn't invention, it was access. Three things converged:
Good enough to feel like magic, not a toy: it held a conversation and wrote code.
Free, in a browser, no install, no manual. You just typed a sentence.
Plain language in, plain language out. Your grandmother could use it, and did.
Models plan, call tools, and recover from failure, not just chat. Forecasts put agents inside ~40% of enterprise apps by the end of 2026.
From a trick to governed "truth systems" with cited, auditable answers. Today's topic.
Text, image, audio and documents in one workflow.
Efficient models run locally; frontier models reserved for hard reasoning.
It only knows its training data, frozen at a cutoff. Ask about your data or recent events and it won't refuse. It guesses, fluently, and sounds just as sure as when it's right.
Ask a plain LLM something it cannot know: a detail from your college's own handbook, or a very recent event after its cutoff. Watch it answer confidently and wrongly. That failure is the entire reason RAG exists.
Every sentence is a point; similar meaning lands close. Embed a question and the nearest points are the passages most likely to answer it. The green question drops in and beams shoot to its neighbours: that is retrieval, live.
Closed-book vs open-book. Steps 1-2 happen once (indexing). Steps 3-4 happen on every question. Next: watch each step happen.
One long document is too big to search as a single unit. We cut it into bite-sized passages, with a little overlap so that text cut at a seam still appears whole in the neighbouring chunk.
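A minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words rather than model tokens to stay dependency-free, and `chunk_words` is a name made up for this illustration.

```python
def chunk_words(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` words; neighbours share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `size=4, overlap=1`, the last word of each chunk is repeated as the first word of the next, which is the "seam" protection described above.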
The model never sees letters, only these token IDs. They flow through the network, and a final pooling step collapses the whole sequence into one vector. (Long or rare words split into sub-word pieces.)
An embedding model reads a passage and emits a long list of numbers, a vector. That vector is a single point. Passages about the same thing land in the same neighbourhood.
Each passage becomes a fixed list of floats, a point in 384-dimensional space. Every passage is one row, and the rows pour into the vector database: stored, indexed, ready to search in milliseconds.
The question becomes a point too. We measure distance to every stored point and pull the closest k. Those are the passages most likely to hold the answer. The vector DB does this in milliseconds.
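The "closest k" step can be sketched as a brute-force scan. This assumes plain Python lists as vectors and Euclidean distance; `nearest_k` is a hypothetical helper, and a real vector database typically uses an approximate index instead of checking every point.

```python
import heapq
import math

def nearest_k(query, points, k=4):
    """Return the indices of the k stored points closest to `query`, nearest first."""
    def dist(i):
        return math.dist(query, points[i])  # Euclidean distance
    return heapq.nsmallest(k, range(len(points)), key=dist)

# Made-up 2-D points standing in for passage vectors.
points = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
closest = nearest_k([1.0, 0.0], points, k=2)
```

For unit-length vectors, ranking by Euclidean distance gives the same order as ranking by cosine similarity, so the two views of "closest" agree.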
The retrieved passages and the question flow into the model as context. It writes the answer from that text, grounded, specific, and able to cite its source, instead of guessing from memory.
Indexing fills the Vector DB once. Every question embeds, reads the nearest passages from that same store, and the LLM answers. The model is never retrained.
# 1. INDEX (once)
chunks = split(docs, size=400, overlap=50)
index = embed(chunks)   # -> vectors

# 2. ANSWER (every question)
def answer(q):
    qv = embed(q)
    hits = index.top_k(qv, k=4)   # cosine
    context = "\n".join(hits)
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {q}")
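The pseudocode above can be made runnable with stand-ins. In this sketch `embed` is a toy bag-of-words counter, not a real embedding model, `llm` is any callable you pass in, and the sample passages are invented. Only the shape of the pipeline is the point.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    # 1. INDEX (once): keep each chunk next to its vector.
    return [(chunk, embed(chunk)) for chunk in chunks]

def top_k(index, qv, k=4):
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def answer(q, index, llm, k=4):
    # 2. ANSWER (every question): embed, retrieve, then prompt the model.
    hits = top_k(index, embed(q), k)
    context = "\n".join(hits)
    return llm(f"Answer using ONLY:\n{context}\n\nQ: {q}")

# Invented sample passages; the "LLM" here just echoes the prompt it receives.
index = build_index([
    "Minimum attendance is 75 percent per course.",
    "Re-evaluation fees are due within ten days.",
    "The library closes at 9 pm.",
])
prompt = answer("What is the minimum attendance?", index, llm=lambda p: p, k=1)
```

Passing the model in as a callable keeps the sketch testable without an API key; swapping in a real client changes one argument.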
The model is never retrained; you only change the prompt.
Embed a few docs + call an API, runs on Colab's free tier.
Swap the chunk size, embedding model, k, or LLM: the shape stays the same.
Just ask. Fast and free, but ungrounded.
A few examples in the prompt, teaches format & style.
Retrieve facts into the prompt, grounds in your data.
Retrain the weights. Powerful, but expensive.
Few-shot teaches HOW to answer. RAG supplies WHAT to answer from. Often you stack them.
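Stacking the two might look like the sketch below: examples set the format, retrieved passages supply the facts. The prompt wording and the `build_prompt` name are illustrative assumptions, not a fixed recipe.

```python
def build_prompt(question, passages, examples):
    """Combine few-shot examples (HOW to answer) with retrieved passages (WHAT to answer from)."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"Follow the style of these examples:\n{shots}\n\n"
        f"Answer using ONLY the context below and cite passage numbers:\n{context}\n\n"
        f"Q: {question}\nA:"
    )

# Invented example data.
prompt = build_prompt(
    question="When is the fee due?",
    passages=["Fees are due by 10 June."],
    examples=[("Who approves leave?", "The class advisor. (source 1)")],
)
```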
A chunk is the words. A vector is the numbers. The embedding model turns one into the other, and people usually say "embedding" to mean the vector it produces.
Every token starts as a generic vector. Attention rewrites each one using the words around it, so "bank" by "river" ≠ "bank" by "money". Averaging the token vectors collapses the whole passage into a single point, and the model is trained so that similar meanings land in similar directions.
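The averaging step, often called mean pooling, is small enough to write out. This sketch assumes the per-token vectors are already computed and adds unit-length normalisation, which is a common choice but an assumption here.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one passage vector, then scale it to unit length."""
    dim = len(token_vectors[0])
    mean = [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean

# Two made-up token vectors collapse into one point.
passage_vector = mean_pool([[1.0, 0.0], [0.0, 1.0]])
```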
The starting vector isn't calculated, it's looked up. The model learned one row of numbers per token during training. (Formally: a one-hot token ID × the embedding matrix selects exactly that row.) Attention then rewrites those numbers using the surrounding words.
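A tiny check that the lookup and the one-hot multiplication really are the same operation. The matrix values are made up; a trained model has one learned row per token in its vocabulary.

```python
EMBEDDING_MATRIX = [  # one row per token ID; values are invented for the demo
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.5, 0.9],
]

def lookup(token_id):
    """What the model does in practice: index straight into the table."""
    return EMBEDDING_MATRIX[token_id]

def one_hot_times_matrix(token_id):
    """The formal version: a one-hot row vector multiplied by the embedding matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(EMBEDDING_MATRIX))]
    dim = len(EMBEDDING_MATRIX[0])
    return [
        sum(one_hot[i] * EMBEDDING_MATRIX[i][d] for i in range(len(one_hot)))
        for d in range(dim)
    ]
```

Every term except the selected row is multiplied by zero, which is why implementations skip the multiplication and index directly.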
It's just the cosine of the angle between two vectors. Same direction → +1.0 (same meaning). Right angle → 0 (unrelated). Opposite → −1. Retrieval scores every passage this way and keeps the top few.
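The three cases can be checked with a few lines of plain Python. The vectors are arbitrary examples chosen to point the same way, at a right angle, and in opposite directions.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction, different length
right_angle = cosine_similarity([1.0, 0.0], [0.0, 3.0]) # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # reversed direction
```

Note that the first pair differs in length but scores approximately 1: only direction matters, not magnitude.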
One chunk holds the whole exam-rules page: attendance, fees, re-eval, all 5 topics. The question needs 1; the other 4 are noise that drowns the match.
"attendance leave" and "medical leave" embed close together. Top-k pulls the medical-leave rule for an attendance question. The right passage is never seen.
Right passage is in the index, but ranks #7 when k=4. It existed, it just didn't make the cut. Raising k adds noise; that's the tradeoff.
Every N tokens. Dead simple, but it slices sentences across the cut.
Break on sentence boundaries. Clean, but chunk sizes vary a lot.
Cut where the topic shifts. Each chunk = one coherent idea. Often the best recall.
Paragraphs → sentences → words, until it fits. A solid default.
Bad chunks are the #1 cause of bad retrieval. Start with recursive; switch to semantic if recall lags. Always overlap ~10-15%.
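A rough sketch of the recursive strategy: split on paragraphs first, and fall back to sentences, then words, only for pieces that are still too long. It measures size in characters and leaves out overlap to stay short; `recursive_split` and the separator list are assumptions for illustration, not a specific library's implementation.

```python
import re

SEPARATORS = ["\n\n", r"(?<=[.!?])\s+", r"\s+"]  # paragraphs, sentences, words

def recursive_split(text, max_chars=200, level=0):
    """Split on the coarsest separator first; recurse only into pieces that are too long."""
    text = text.strip()
    if len(text) <= max_chars or level >= len(SEPARATORS):
        return [text] if text else []  # a single oversized word is returned as-is
    pieces = [p.strip() for p in re.split(SEPARATORS[level], text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, max_chars, level + 1))
        elif not current:
            current = piece
        elif len(current) + 1 + len(piece) <= max_chars:
            current = current + " " + piece  # pack small neighbours together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because small neighbours are packed together, chunk sizes stay closer to the limit than with a pure sentence split, which is part of why this works as a default.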
The question splits two ways: vector search for meaning (the smooth stream) and keyword search for exact terms, names, codes, IDs that embeddings blur (the blocky stream). Both pour into one candidate pool, then a cross-encoder re-ranks it down to the top few.
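The two-stream flow can be sketched with the three components passed in as functions. `vector_search`, `keyword_search` and `rerank_score` are all hypothetical callables here; in a real system the last one would wrap a cross-encoder model.

```python
def hybrid_retrieve(question, vector_search, keyword_search, rerank_score,
                    pool_size=20, top_n=4):
    """Merge candidates from both searches, then let a re-ranker pick the final few."""
    pool = []
    for passage in vector_search(question, pool_size) + keyword_search(question, pool_size):
        if passage not in pool:  # a passage found by both streams is kept once
            pool.append(passage)
    ranked = sorted(pool, key=lambda p: rerank_score(question, p), reverse=True)
    return ranked[:top_n]

# Fake components with made-up scores, just to show the flow.
scores = {"A": 0.1, "B": 0.9, "C": 0.5, "D": 0.7}
result = hybrid_retrieve(
    "q",
    vector_search=lambda q, n: ["A", "B", "C"][:n],
    keyword_search=lambda q, n: ["C", "D"][:n],
    rerank_score=lambda q, p: scores[p],
    top_n=2,
)
```

The pool is deliberately larger than the final cut: the cheap searches cast a wide net, and the slower re-ranker only has to score a few dozen candidates.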
Of your test questions, how often is the right passage in the top-k retrieved?
Is every claim in the answer actually supported by the retrieved text? Catches hallucination.
Does the answer address what was asked, not just sound related?
Keep a golden set, fixed question → expected-answer pairs, and re-run it on every change. Add human spot-checks. Numbers catch regressions before users do.
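The first metric, how often the right passage lands in the top-k, is easy to automate. This sketch assumes each golden-set entry records the ID of the passage that should be retrieved, and that `retrieve(question, k)` returns a list of passage IDs; both are assumptions about how you store your test data.

```python
def hit_rate_at_k(golden_set, retrieve, k=4):
    """Fraction of questions whose expected passage appears in the top-k results."""
    if not golden_set:
        raise ValueError("golden_set is empty")
    hits = sum(
        1 for question, expected_id in golden_set
        if expected_id in retrieve(question, k)
    )
    return hits / len(golden_set)

# Fake rankings: for q2 the right passage exists but sits at rank 7.
rankings = {
    "q1": ["p1", "p2", "p3"],
    "q2": ["a", "b", "c", "d", "e", "f", "p2"],
}
golden = [("q1", "p1"), ("q2", "p2")]
fake_retrieve = lambda question, k: rankings[question][:k]
at_4 = hit_rate_at_k(golden, fake_retrieve, k=4)
at_7 = hit_rate_at_k(golden, fake_retrieve, k=7)
```

Re-running this on every change to chunking, embedding model or k turns "retrieval feels worse" into a number you can compare.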
A bot that answers questions over a document the base model can't know, so you can SEE grounding work and verify the answers yourselves.
Default pick: your college's exam-rules / handbook PDF (relatable, the model genuinely doesn't know it, and the room can fact-check it live). Easy swaps: the AIML syllabus, a club's docs, or any PDF you bring.
Watch it built live, call out the next step. No device needed.
Scan the QR for the hosted bot. Try to trip it up.
Open the Colab notebook with your Groq key, run as we go.
Didn't set up a Groq key? No problem, you'll leave with the notebook and a link to do it at home.