Intro to LLMs

Domain experts can make models better before model weights change.

This course teaches the mechanics behind that claim. Start with a visible neural network, move into GPT-2, then use cost, evals, context, and prompt optimization to decide when training is worth it.

Thesis

A content designer, PM, support lead, salesperson, or workflow owner often knows more about the task than the person who can train the model. University makes that expertise legible: define the task, inspect outputs, write rubrics, shape context, and only then escalate to training.

rung 1: capture

rung 2: rubric

rung 3: prompt

rung 4: structure

rung 5: route

rung 6: SFT

rung 7: serve

Order

The course now starts with Neural Network Playground before GPT-2. First, learners see features, weights, training, overfitting, and model size. Then Transformer Explainer shows how language becomes tokens, embeddings, attention, logits, and sampled text.

Curriculum

Why domain experts matter

Understand how model behavior can improve before anyone changes weights.

Domain experts can shape context, evals, rubrics, examples, and task interfaces before training starts.
Technical teams often reach for temperature settings, raw benchmarks, or a different model before they have gathered evidence about the task itself.
Product judgment belongs in the rubric: define success, read outputs, write examples, and decide what a good answer means.

What machine learning is

Separate hand-written rules from functions learned from examples.

Data, labels, features, loss functions, train/test splits, and held-out evaluation.
Use Neural Network Playground to see how data transforms create learnable patterns.
Why models generalize, why they overfit, and why product teams need evaluation.
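A minimal sketch of why held-out evaluation matters, assuming a toy one-feature dataset and two hypothetical learners (`fit_memorizer` and `fit_threshold` are illustrative names, not part of the course materials). One learner memorizes its training examples; the other learns a single cut-off from them. Only the held-out split reveals the difference.

```python
import random

def make_data(n, seed=0):
    """Toy dataset: one feature x between 0 and 10, label is True when x > 5."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, x > 5.0) for x in xs]

def split(rows, test_fraction=0.25):
    """Hold out the last part of the data; the learner never sees it."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def accuracy(predict, rows):
    return sum(predict(x) == label for x, label in rows) / len(rows)

def fit_memorizer(train):
    """Overfit on purpose: store every training answer, guess False otherwise."""
    table = dict(train)
    return lambda x: table.get(x, False)

def fit_threshold(train):
    """Learn one cut-off from examples by trying each training x as the threshold."""
    best = max((x for x, _ in train),
               key=lambda t: accuracy(lambda x: x > t, train))
    return lambda x: x > best

train, test = split(make_data(200))
memorizer = fit_memorizer(train)
threshold = fit_threshold(train)

for name, model in [("memorizer", memorizer), ("threshold", threshold)]:
    print(name,
          "train:", round(accuracy(model, train), 2),
          "held-out:", round(accuracy(model, test), 2))
```

Both learners look perfect on the training split. The memorizer falls back to guessing on unseen inputs, which is the pattern a product team would miss without a held-out set.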

How a neural network works

Build an intuition for weights, neurons, layers, parameters, and training.

A weight is the strength of a connection; a parameter count is the number of learned knobs.
Training is the process of reducing error, not a model realizing it is wrong.
Local minima, hyperparameters, and capability jumps: why small networks can struggle before finding a better strategy.
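To make "training is the process of reducing error" concrete, here is a small sketch with one weight and no hidden layers. The data, learning rate, and step count are assumptions chosen for illustration; the point is only that each step nudges the weight in the direction that lowers the loss.

```python
def train(xs, ys, lr=0.01, steps=200):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0                      # the single learned knob (one parameter)
    losses = []
    n = len(xs)
    for _ in range(steps):
        errors = [w * x - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Derivative of the mean squared error with respect to w.
        grad = sum(2 * e * x for e, x in zip(errors, xs)) / n
        w -= lr * grad           # step against the gradient to reduce error
        losses.append(loss)
    return w, losses

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]       # generated by the hidden rule y = 3x
w, losses = train(xs, ys)
print(f"learned w = {w:.4f}")
print(f"first loss = {losses[0]:.2f}, last loss = {losses[-1]:.2e}")
```

Nothing in the loop "knows" the answer is 3. The weight ends up there because that value minimizes the measured error.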

Scaling laws and cost

Connect model size, data, training time, hardware, latency, and unit economics.

Bigger models, more data, and longer training usually help, but each gain costs more than the last.
Orders of magnitude: why a million, billion, and trillion are not just bigger versions of the same number.
Why input, output, cached tokens, memory bandwidth, and model size show up as product cost.

Tokens and embeddings

Understand why tokens are vocabulary units, not meaning by themselves.

Tokenization turns text into pieces the model can process; those pieces do not inherently carry semantic meaning.
Embeddings let the model place tokens in a learned space where relationships can emerge.
Why tokenizer changes affect cost, quality, prompt behavior, and weird failures like letter counting.
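The letter-counting failure is easier to see with a toy tokenizer. This sketch uses a hypothetical hand-made vocabulary and greedy longest-match splitting; production tokenizers learn their vocabularies from data and use different algorithms, so treat this as an illustration of the idea rather than how any real tokenizer works.

```python
import string

# Hypothetical vocabulary: a few multi-letter pieces plus single characters
# as a fallback. Real vocabularies hold tens of thousands of learned pieces.
PIECES = ["straw", "berry", "token", "ization"] + list(string.ascii_lowercase + " ")
VOCAB = {piece: token_id for token_id, piece in enumerate(PIECES)}

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of text into vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return pieces

def encode(text, vocab=VOCAB):
    """Map text to the integer IDs the model actually receives."""
    return [vocab[piece] for piece in tokenize(text, vocab)]

word = "strawberry"
print(tokenize(word))    # ['straw', 'berry']
print(encode(word))      # [0, 1]
print(word.count("r"))   # 3
```

The model receives the two IDs, not ten letters. Counting the letter "r" requires information that the token IDs do not expose directly.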

Transformers and generation

Trace one prompt through embeddings, attention, logits, and next-token prediction.

Use Transformer Explainer to watch GPT-2 process tokens inside the browser.
Self-attention is where token relationships begin shaping meaning; causal masking prevents peeking at future tokens.
Temperature, top-k, and top-p change sampling behavior, but the real goal is raising the probability of the right next token.
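The difference between sampling controls and context can be sketched with a softmax over made-up logits. The token names and logit values below are hypothetical; top-p is omitted to keep the example short.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

tokens = ["refund", "apology", "banana"]
vague_logits = [1.0, 1.2, 0.8]       # hypothetical: prompt gives the model little to go on
grounded_logits = [4.0, 1.2, 0.8]    # hypothetical: context now states the refund policy

for label, logits in [("vague", vague_logits), ("grounded", grounded_logits)]:
    for temperature in (0.2, 1.0, 1.5):
        probs = softmax(logits, temperature)
        best = tokens[probs.index(max(probs))]
        print(f"{label:>8} T={temperature}: most likely = {best}, p = {max(probs):.2f}")
```

Temperature changes how sharply the favorite is preferred, but never which token is the favorite. Only the change in logits, here standing in for better context, moves "refund" to the top.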

Optimization without retraining

Use context and evaluation before paying for model training.

Context engineering can move an answer from unlikely to obvious without changing weights.
Prompt search, examples, schemas, tool descriptions, routing, and response constraints are all cheaper than retraining.
A good prompt optimizer runs candidates against rubrics and held-out tasks instead of relying on taste alone.
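The scoring loop behind a prompt optimizer can be sketched in a few lines. Everything here is an assumption made for illustration: `fake_model` is a hypothetical stand-in for a real LLM call so the example runs offline, and the rubric, candidate prompts, and tasks are invented. It is not the implementation of any particular optimizer.

```python
def fake_model(prompt, task):
    """Hypothetical stand-in for an LLM call, so the sketch runs offline."""
    text = task
    if "concise" not in prompt:
        text = "We wanted to let you know that " + text + " and we hope this is helpful"
    if "call to action" in prompt:
        text += " Tap to view."
    return text

# Each rubric item is a name plus a check that returns True or False.
RUBRIC = [
    ("fits on a lock screen", lambda out: len(out) <= 60),
    ("ends with a call to action", lambda out: out.endswith("Tap to view.")),
]

def score(prompt, tasks, model=fake_model):
    """Average fraction of rubric checks passed across all tasks."""
    total = 0.0
    for task in tasks:
        out = model(prompt, task)
        total += sum(check(out) for _, check in RUBRIC) / len(RUBRIC)
    return total / len(tasks)

CANDIDATES = [
    "Write a notification.",
    "Write a concise notification.",
    "Write a concise notification that ends with a call to action.",
]
DEV_TASKS = ["Your order shipped.", "Your invoice is ready."]
HELD_OUT_TASKS = ["Your password was changed."]   # never used to pick the winner

winner = max(CANDIDATES, key=lambda p: score(p, DEV_TASKS))
for prompt in CANDIDATES:
    print(f"{score(prompt, DEV_TASKS):.2f}  {prompt}")
print("winner:", winner)
print("held-out score:", score(winner, HELD_OUT_TASKS))
```

The winner is chosen on the development tasks and then checked on tasks it was not selected against. That second check is what separates evidence from taste.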

Post-training and specialist models

Know when to move from prompt optimization to SFT, adapters, rewards, and routing.

SFT teaches imitation from good traces; adapters can change a smaller set of weights than full retraining.
Teacher-student distillation can move expensive frontier behavior into a cheaper specialist route.
Serving validation matters: a higher score must also survive latency, cost, monitoring, and rollback.
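To see why adapters touch far fewer weights, consider a low-rank adapter (LoRA is one widely used adapter technique; the curriculum does not name a specific one, so this is an assumption). Instead of updating a full weight matrix, it trains two small matrices whose product has the same shape. The layer size and rank below are hypothetical.

```python
def full_update_params(d_in, d_out):
    """Trainable values when updating one full weight matrix."""
    return d_in * d_out

def low_rank_adapter_params(d_in, d_out, rank):
    """A rank-r adapter trains two small matrices: (d_out x r) and (r x d_in)."""
    return rank * (d_in + d_out)

d = 4096   # hypothetical layer width
full = full_update_params(d, d)
adapter = low_rank_adapter_params(d, d, rank=8)

print(f"full matrix: {full:,} trainable values")
print(f"rank-8 adapter: {adapter:,} trainable values")
print(f"adapter share: {adapter / full:.2%}")
```

For this one layer the adapter trains well under one percent of the values a full update would. That is the reason adapters sit on a cheaper rung than full retraining.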

The Understudy loop

Capture repeated LLM work, evaluate it, optimize it, and only then climb the training ladder.

Capture source prompts, model outputs, rubrics, and examples while keeping sensitive data local by default.
Generate prompt families, score them, preserve winners, and inspect failure cases.
Escalate only when the cheaper rung stops working: prompt, structure, route, SFT, RL, then serving validation.
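The escalation rule can be expressed as a short loop. This is a sketch of the decision logic only, under stated assumptions: `evaluate` is a hypothetical callable that returns a held-out rubric score for a rung, and the scores in the example are invented. It is not how the Understudy agent is implemented.

```python
LADDER = ["prompt", "structure", "route", "SFT", "RL"]

def climb(evaluate, target, ladder=LADDER):
    """Try rungs from cheapest to most expensive; stop at the first that meets the target."""
    tried = []
    for rung in ladder:
        result = evaluate(rung)
        tried.append((rung, result))
        if result >= target:
            return rung, tried       # good enough: do not pay for the next rung
    return None, tried               # nothing met the target

# Hypothetical held-out scores for one task.
example_scores = {"prompt": 0.62, "structure": 0.81, "route": 0.84, "SFT": 0.90, "RL": 0.93}

chosen, tried = climb(example_scores.get, target=0.80)
print("chosen rung:", chosen)
print("rungs evaluated:", [rung for rung, _ in tried])
```

In this example the loop stops at "structure" and never pays for SFT or RL. Whichever rung wins would still need to pass serving validation before it ships.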

Demos

first demo

Neural Network Playground

Train a small neural network in the browser while changing datasets, features, layers, activations, and regularization.

open demo

second demo

Transformer Explainer

Trace tokens through embeddings, attention, transformer blocks, logits, and sampling.

open lab

first-party demo

Tokens, context, and temperature

Type a support prompt, inspect token chunks, move temperature, and see why context changes the distribution before sampling does.

open demo

first-party demo

Scaling intuition

Build intuition for orders of magnitude, model size, sequence length, and why scaling wins get expensive quickly.

open demo

first-party demo

Embedding neighborhoods

Click a hand-built embedding space, compare nearest neighbors, and see why relationships can behave like directions.

open demo
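A hand-built embedding space small enough to read shows the same two ideas: nearest neighbors and relationships as directions. The two-dimensional vectors below are invented for illustration (roughly "royalty" and "gender" axes); learned embeddings have hundreds or thousands of dimensions and no labeled axes.

```python
import math

# Hypothetical hand-built vectors: [royalty, gender].
EMBEDDINGS = {
    "king":  [1.0, 0.3],
    "queen": [1.0, -0.3],
    "man":   [0.1, 0.3],
    "woman": [0.1, -0.3],
    "apple": [-1.0, 0.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(vector, exclude=()):
    """Word whose embedding points in the most similar direction."""
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    return max(candidates, key=lambda word: cosine(vector, EMBEDDINGS[word]))

def analogy(a, b, c):
    """Follow the direction from b to a, starting at c (for example king - man + woman)."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    return nearest(target, exclude=(a, b, c))

print(nearest(EMBEDDINGS["king"], exclude=("king",)))
print(analogy("king", "man", "woman"))
```

Both lookups land on "queen" here because the vectors were built to make that happen. In a trained model, similar structure can emerge from data rather than design.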

first-party demo

KV cache conveyor

Watch a repeated prefix become cached keys and values, then compare fresh context with cached continuation.

open demo
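The saving from a KV cache can be sketched with a single attention head in plain Python. The projection matrices and token vectors are hypothetical toy values. The sketch compares recomputing keys and values for the whole sequence at every step against storing them and projecting only the newest token.

```python
import math

# Hypothetical 2x2 projection matrices for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[0.3, 0.9], [0.8, 0.4]]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all stored keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def step_fresh(tokens):
    """No cache: re-project keys and values for every token seen so far."""
    keys = [matvec(W_K, t) for t in tokens]
    values = [matvec(W_V, t) for t in tokens]
    return attend(matvec(W_Q, tokens[-1]), keys, values), len(tokens)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        """Cached: project only the new token and reuse stored keys and values."""
        self.keys.append(matvec(W_K, token))
        self.values.append(matvec(W_V, token))
        return attend(matvec(W_Q, token), self.keys, self.values), 1

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]   # toy token vectors
cache = KVCache()
for t in range(len(tokens)):
    fresh_out, fresh_work = step_fresh(tokens[: t + 1])
    cached_out, cached_work = cache.step(tokens[t])
    same = all(math.isclose(a, b) for a, b in zip(fresh_out, cached_out))
    print(f"step {t + 1}: projections fresh={fresh_work} cached={cached_work} same output={same}")
```

The outputs match at every step, while the uncached path projects a growing number of tokens each time. The same reuse across requests that share a prefix is the idea behind prompt caching.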

first-party demo

Prompt optimizer walkthrough

Start with one weak notification prompt, score variants against a rubric, and inspect the winning prompt.

open demo

The first two demos are self-hosted copies of TensorFlow Playground and Transformer Explainer. Upstream license notices and local hosting notes are preserved in this app.

Economics

Model behavior is only half the product decision. This calculator makes the unit economics visible: input tokens, output tokens, cached tokens, and call volume all compound into product margin.

cost calculator

price shape

illustrative, last checked May 2026

input tokens, cached tokens, output tokens, calls / day, input $ / 1M, output $ / 1M, cached $ / 1M

each call

$0.01

daily

$118

monthly

$3,546
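The arithmetic behind a calculator like this is short enough to write out. The sketch below assumes the three token counts are billed separately at their own per-million prices and that a month is 30 days. The token counts, call volume, and prices in the example are invented for illustration and are not any provider's actual rates or the inputs behind the figures shown here.

```python
def cost_per_call(input_tokens, cached_tokens, output_tokens,
                  input_price, cached_price, output_price):
    """Token counts are per call; prices are dollars per 1M tokens."""
    return (input_tokens * input_price
            + cached_tokens * cached_price
            + output_tokens * output_price) / 1_000_000

def daily_and_monthly(per_call, calls_per_day, days_per_month=30):
    """Scale one call up to daily and monthly spend (30-day month assumed)."""
    daily = per_call * calls_per_day
    return daily, daily * days_per_month

# Hypothetical workload and prices.
per_call = cost_per_call(
    input_tokens=2_000, cached_tokens=1_000, output_tokens=500,
    input_price=3.00, cached_price=0.30, output_price=15.00,
)
daily, monthly = daily_and_monthly(per_call, calls_per_day=10_000)

print(f"each call: ${per_call:.4f}")
print(f"daily:     ${daily:,.0f}")
print(f"monthly:   ${monthly:,.0f}")
```

A cost that rounds to a cent per call still becomes thousands of dollars a month at this volume. In this example the output tokens cost more than the input despite being a quarter of the count, because they are priced higher per token.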

Glossary

Token

A model vocabulary piece. It might be a word, part of a word, a space, punctuation, or another learned fragment.

Embedding

A learned numeric representation where token relationships can start carrying useful meaning.

Attention

A mechanism for deciding which earlier tokens matter when updating the current token representation.

Parameter

A learned knob in the model. Larger networks have more knobs and more possible strategies.

Temperature

A sampling control that changes how sharply the model favors high-probability next tokens. It is not a fix for unclear context.

Context engineering

Changing the information and structure given to the model so the desired answer becomes more likely.

Post-training

Improving behavior after pretraining through prompts, examples, SFT, adapters, preferences, rewards, or routing.

KV cache

Saved keys and values from tokens the model has already processed, which let it continue a generation without recomputing the whole prefix.

Info Snacks

Reflective prompt optimization

A family of methods that improve prompts by scoring, reflecting on failures, and writing better variants.

Open weights

A model whose trained weights are published, so you can run, inspect, host, and adapt it yourself.

Inference

Running a trained model to produce useful output.

Evals

A repeatable way to turn model behavior into evidence.

Model routing

Choosing the right model path for each request.

Context windows

The amount of material a model can consider at once.

Recursive Language Model

A model loop that improves output through propose, critique, revise, and verify steps.

Context rot

When accumulated context makes the model's job harder instead of easier.

Hallucination

Confident output that is not grounded in the task, sources, tools, or real world.

Time to first token

How long the user waits before streaming begins.

Tokens per second

How quickly a model generates after output begins.

Prompt caching

Reusing computation for repeated prompt prefixes.

Distillation

Teaching a cheaper model to imitate useful behavior from a stronger one.

Tool calling

Structured model output that asks software to take an action.

Latency

The time between request and useful output.

NVIDIA A100

The older workhorse GPU for many production inference and training jobs.

NVIDIA H100

The Hopper GPU that became the default high-end AI accelerator.

NVIDIA H200

A Hopper-generation GPU with much more memory for larger models and context.

NVIDIA B200 and GB200

Blackwell-generation hardware for frontier-scale training and serving.

Tools

Tokenization viewer (live)

Paste text, inspect token pieces, and see context shift the next-token distribution before temperature changes sampling.

Scaling intuition (live)

Compare millions, billions, and trillions with seconds, model size, active tokens, and relative compute.

Embedding neighborhoods (live)

Click through a toy embedding space to inspect neighbors, clusters, and vector analogies.

KV cache conveyor (live)

Animate prefix processing, cached keys and values, and the difference between fresh input and cached continuation.

Prompt optimizer walkthrough (live)

Run a canned Understudy task, generate prompt families, score outputs with a rubric, and export the winning prompt.

Final Lab

The practical endpoint is an Understudy run that works on a fresh machine: install prerequisites, choose a provider key, generate prompt candidates, score them with a rubric, inspect failures, and export the winning prompt. The browser walkthrough is live now; the local CLI lab should become the capstone.

step 1: setup

step 2: optimize

step 3: export

Feedback

This is intentionally soft-launched. The most useful feedback is where the mental model breaks: which demo feels too abstract, which concept needs a bridge, and which exercise would make the lesson stick.

send feedback

Sources

TensorFlow Playground

Interactive neural-network visualization, Apache 2.0 licensed and self-hosted with analytics removed.

Transformer Explainer

Interactive transformer visualization, MIT licensed and self-hosted with visible attribution.

Transformer Explainer paper

Cho, Kim, Karpekov, Helbling, Wang, Lee, Hoover, and Chau, IEEE VIS 2024.

Attention Is All You Need

Original transformer architecture paper.

Understudy agent

Local-first capture, evaluation, and optimization workflow behind this curriculum.