Clawdemy Lessons

Free AI literacy for everyday users. Bite-size narrated lessons that turn fear into fluency, one topic at a time.
https://clawdemy.org/en
Clawdemy (hello@clawdemy.org)
Copyright © 2026 RBJ Global LLC. All rights reserved.

AI-authored commits and PRs
https://clawdemy.org/lessons/git-workflow/ai-authored-commits-and-prs/lesson/
Lesson 15 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). What changes in a git workflow when an AI agent is the one typing the code. Co-authorship conventions including the standard Co-Authored-By trailer and the "Generated with Claude Code" marker. What human review specifically looks for in AI-authored diffs. How PR descriptions and release notes evolve to acknowledge AI contributions honestly. By the end of L15 you can author commits and PRs with AI contributions using clear conventions and review AI-authored work with the specific failure modes in mind.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 1:00:00

Branches as parallel work surfaces
https://clawdemy.org/lessons/git-workflow/branches/lesson/
Lesson 5 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The lesson that opens Phase 2. You learn what a branch actually is (a movable pointer to a commit), how to create and switch branches, and why git's branching model is what makes collaboration possible. The snapshot mental model from L1 starts to pay off most here.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 23:00

Cherry-pick and stash
https://clawdemy.org/lessons/git-workflow/cherry-pick-and-stash/lesson/
Lesson 11 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). Two surgical tools every working developer reaches for. Cherry-pick copies a single commit from one branch onto another, the canonical way to backport hotfixes. Stash saves in-progress work without committing, the safety net for context switches. By the end of L11 you can backport a fix across branches, set aside dirty work to handle an emergency, and recover safely from both.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 1:05:00

Commit hygiene
https://clawdemy.org/lessons/git-workflow/commit-hygiene/lesson/
Lesson 3 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The lesson where mechanics become discipline. You can commit; now you learn to commit WELL: meaningful messages, atomic scope, the staging area as a thinking tool, and the Conventional Commits convention that most professional teams use.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 20:00

Merge conflicts
https://clawdemy.org/lessons/git-workflow/merge-conflicts/lesson/
Lesson 7 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). Merge conflicts are mechanical, not catastrophic. You learn why they happen, how to read git's conflict markers, the step-by-step resolution process, the five conflict types (textual, logical, semantic, delete-modify, rename-edit), and when to abort and retry. By the end you can resolve any conflict you'll encounter in two-person collaboration.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 40:00

Multi-agent integration patterns
https://clawdemy.org/lessons/git-workflow/multi-agent-integration-patterns/lesson/
Lesson 14 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). Three patterns for integrating work from parallel AI agents. Shared origin (simplest, most common). Per-agent fork (strongest isolation). Shared worktrees (fastest iteration, lead controls everything). How the lead orchestrates a fleet, runs integration, catches semantic conflicts that git cannot see, and produces a clean integration branch. By the end of L14 you can launch a multi-agent fleet, choose the right integration pattern, and avoid the failure modes that bite teams new to AI-collaborative development.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 1:20:00

Pull requests
https://clawdemy.org/lessons/git-workflow/pull-requests/lesson/
Lesson 6 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). Pull requests are where modern engineering culture lives. You learn the mechanical flow (push branch, open PR, address review, merge), how to write a PR description that respects the reviewer's time, the three merge strategies, and the etiquette of giving and receiving code review.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 35:00

Rebase, deeper
https://clawdemy.org/lessons/git-workflow/rebase-deeper/lesson/
Lesson 12 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). Pre-lesson orientation for L12. Scope, learning outcomes, prerequisites, reading map, and what L12 deliberately does not cover.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 1:10:00

Releases and tags
https://clawdemy.org/lessons/git-workflow/releases-and-tags/lesson/
Lesson 10 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). How specific commits become formal releases. You learn what a git tag is, the difference between lightweight and annotated tags, semantic versioning (semver), how to write release notes, and how releases work across the four workflows from L9. By the end you can mark, push, and announce a release for any project.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 35:00

Remotes and forks
https://clawdemy.org/lessons/git-workflow/remotes-and-forks/lesson/
Lesson 8 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). How branches travel between machines and between repositories. You learn what a remote is, how push and fetch and pull work, the difference between origin and upstream, the fork-based contribution model used by open-source projects, and how to safely force-push when you need to. Closes Phase 2.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 40:00

Team workflows: GitHub Flow, GitFlow, Trunk-based, Forking
https://clawdemy.org/lessons/git-workflow/team-workflows/lesson/
Lesson 9 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The four production team workflows that build on the primitives from Phase 2. You learn what each one prescribes, when it fits, and how to choose. You see the role of branch protection rules and CI gates. By the end you can read any company's git workflow documentation and know what's prescribed versus optional, including how to set up your own open-source project for outside contributions.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 45:00

The future of git in an AI world
https://clawdemy.org/lessons/git-workflow/the-future-of-git-in-an-ai-world/lesson/
Lesson 16 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The closing lesson. A grounded look at where git might evolve as AI authorship becomes routine. What new primitives might emerge from the patterns we already see. Which fundamentals will not change. How to stay calm and grounded as the tooling shifts under your feet. By the end of L16 you have a calibrated sense of what to watch for, what to ignore, and why the snapshot model you learned in L1 is still the right mental model fifteen lessons later.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 42:00

Undoing things
https://clawdemy.org/lessons/git-workflow/undoing-things/lesson/
Lesson 4 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The lesson that closes Phase 1. You learn to recover from mistakes safely. Discard working changes. Unstage staged changes. Undo commits without losing work. The reflog as your safety net. By the end you have a confident solo git workflow.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 27:00

Why git exists
https://clawdemy.org/lessons/git-workflow/why-git-exists/lesson/
Lesson 1 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The problem version control solves, the snapshot mental model that underpins git, and why every later lesson in this track depends on getting this foundation right. L1 is command-free by design: it builds the mental model that makes every command in L2 onward coherent.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 13:00

Worktrees and parallel agents
https://clawdemy.org/lessons/git-workflow/worktrees-and-parallel-agents/lesson/
Lesson 13 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). Pre-lesson orientation for L13. Scope, learning outcomes, prerequisites, reading map, and what L13 deliberately does not cover.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 1:00:00

Your first repo
https://clawdemy.org/lessons/git-workflow/your-first-repo/lesson/
Lesson 2 of Track 7 (Git Workflow: From Solo to Multi-Agent Teams). The hands-on lesson where the snapshot model meets actual commands. You will feel the pain of manual tracking for ten minutes, then meet git init as relief, then learn the working-staging-repo mental model that makes commit hygiene make sense.
Published: Thu, 11 Jun 2026 00:00:00 GMT; Duration: 22:00

API keys and the OAuth path
https://clawdemy.org/lessons/getting-started/api-keys-and-provider-oauth/lesson/
How Clawless connects to AI providers. What an API key actually is, where yours lives once you save it, the BYOK billing model with no Clawless markup, and the OAuth path that lets ChatGPT subscribers skip per-token charges on OpenAI models.
Published: Fri, 05 Jun 2026 00:00:00 GMT; Duration: 10:00

CostGuard and where your data goes
https://clawdemy.org/lessons/getting-started/costguard-and-privacy-posture/lesson/
The two anxieties of the first week with Clawless. CostGuard is the spending safety net that watches your BYOK usage against a monthly cap. The data path is your computer, the AI provider, your computer, with no Clawless server holding your conversations.
Published: Fri, 05 Jun 2026 00:00:00 GMT; Duration: 12:00

Your first conversation and picking a model
https://clawdemy.org/lessons/getting-started/first-conversation-and-model-selector/lesson/
The first hands-on Clawless lesson. Send your first message, find the model picker in the dock row, switch models mid-conversation without losing your place, and use the provider-prefixed pattern to reach off-list models. The capability the rest of the track sits on.
Published: Fri, 05 Jun 2026 00:00:00 GMT; Duration: 11:00

How Clawless remembers (and forgets)
https://clawdemy.org/lessons/getting-started/memory-system-overview/lesson/
A tour of the memory system. The distinction between conversation history and memory, the four tiers (Pinned, Insights, General, Decayed), the three pathways memories get in, the Memory panel where you control them, and the privacy rule for what should and should not be saved.
Published: Fri, 05 Jun 2026 00:00:00 GMT; Duration: 11:00

AI governance: the policy layer above any individual deployment
https://clawdemy.org/lessons/ai-safety-and-alignment/ai-governance/lesson/
Lesson 9 of Track 23, the track's closing lesson. Hendrycks Ch 8 brings governance as the layer outside any individual AI system. Four-layer taxonomy: corporate, national, international, compute. Each layer with its mechanisms, its strengths, its limits. Why compute governance has become the field's central lever (compute is physical, excludable, quantifiable). The L9 capability is to situate one real governance proposal inside the taxonomy. Track closure: what the nine lessons produced and what the track does not pretend to be.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 14:00

AI safety as a field: what it studies and why it is a discipline, not a stance
https://clawdemy.org/lessons/ai-safety-and-alignment/ai-safety-as-a-field/lesson/
Opener of Track 23 (AI Safety and Alignment). Frames AI safety as a field with a subject (catastrophic AI risks), a vocabulary (the four-bucket typology, the specification-vs-proxy-gaming distinctions), a method (descriptive, attributed, cross-disciplinary), and a connection to neighboring fields (safety engineering, complex systems, governance). The capability bar is the paragraph-write: state what AI safety studies and why it is a discipline rather than a stance, in roughly 6-8 sentences.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 13:00

Beneficial AI and machine ethics: moral uncertainty as the substrate
https://clawdemy.org/lessons/ai-safety-and-alignment/beneficial-ai-and-machine-ethics/lesson/
Lesson 7 of Track 23, opener of Phase 3 (ethics and governance). Hendrycks Ch 6 turns from 'what fails' to 'what are we trying to do?' Moral uncertainty as the foundational concept: the field does not have a single correct ethical framework to hand to an AI system. Three strategies (My Favorite Theory, expected choiceworthiness, moral parliament). Social welfare functions (utilitarian vs prioritarian) as aggregation tools. Cost-benefit analysis blind spots. Fairness criteria not jointly satisfiable. The L4 callback: outer alignment is harder because there is no single goal to capture.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 13:00

Collective action and multi-agent dynamics: when many AI systems share an environment
https://clawdemy.org/lessons/ai-safety-and-alignment/collective-action-and-multi-agent-dynamics/lesson/
Lesson 8 of Track 23. Hendrycks Ch 7 takes the multi-stakeholder framing L7 introduced and works it at the formal level. Game theory as the analytic tool: Nash equilibria that are Pareto inefficient, prisoner's dilemmas. Three collective-action failure modes (race to the bottom, free rider, escalation). Four cooperation mechanisms (reciprocity, reputation, group selection, institutional) with AI-specific limits. The cooperation tension: mechanisms designed to align AIs with humans can produce AI-AI coalitions that marginalize humans. Evolutionary pressures formalize L2's natural-selection bucket.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 14:00

Complex systems and emergent risk: why correct components produce incorrect systems
https://clawdemy.org/lessons/ai-safety-and-alignment/complex-systems/lesson/
Lesson 6 of Track 23, the lesson that closes Phase 2. Hendrycks Ch 5 brings in the complex-systems framing: a system assembled from individually-correct components can still produce behavior the designers did not predict and cannot easily prevent. The chapter draws on the normal-accident-theory lineage (Perrow 1984). Four properties (emergence, nonlinearity, feedback loops, tight coupling). Why L5's Swiss-cheese composition rule breaks when layers are not genuinely independent. AI-specific patterns: tightly-coupled-to-environment deployments, multi-agent emergence, emergent capabilities, model monoculture. Capability is to propose system-structure design changes that reduce complex-systems risk without addressing any component-level bug.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 13:00

The four catastrophic risk categories
https://clawdemy.org/lessons/ai-safety-and-alignment/four-catastrophic-risk-categories/lesson/
Lesson 2 of Track 23 (AI Safety and Alignment). Takes the four buckets named in L1 and works each in detail: malicious use, AI race, organizational risks, rogue AIs. Each bucket comes with its sub-mechanisms, the historical analogies Hendrycks anchors against, and the intervention levers that move the dial inside that bucket. The capability is the classify-and-defend three-step move: name the bucket, name the sub-mechanism, name the lever.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 14:00

Monitoring and robustness: two halves of the deployment-time safety problem
https://clawdemy.org/lessons/ai-safety-and-alignment/monitoring-and-robustness/lesson/
Lesson 3 of Track 23 (AI Safety and Alignment), opener of Phase 2. Hendrycks Ch 3.2 + 3.3 split the deployment-time safety surface into two halves. Robustness covers system-side failures (the model breaks under conditions not seen in training: adversarial, distribution-shift, prompt-injection, trojan, Goodhart). Monitoring covers observation-side failures (operators do not notice: interpretability gaps, anomaly-detection lag, confabulated explanations, sandbagging-prone evaluations). Both halves are needed; Swiss-cheese intuition. Capability is the four-step classify-and-defend on incident reports.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 13:00

Safety engineering for AI systems: borrowing the toolkit
https://clawdemy.org/lessons/ai-safety-and-alignment/safety-engineering/lesson/
Lesson 5 of Track 23. Hendrycks Ch 4 reaches into safety engineering (the field that grew up around nuclear plants, aviation, chemical processing) and asks what tools transfer to AI. Nines of reliability as the quantitative metric, eight safe-design principles (defense in depth as the centerpiece), tail events as the failure shape that dominates expected harm. Swiss-cheese composition as the unifying intuition. Capability is the move from vocabulary to use: pick one tool, name one deployment decision, show how the tool constrains the decision.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 14:00

The alignment problem: three failure modes that sit underneath robustness and monitoring
https://clawdemy.org/lessons/ai-safety-and-alignment/the-alignment-problem/lesson/
Lesson 4 of Track 23. Hendrycks Ch 3.4 takes the substrate under L3 head-on: even a perfectly robust and perfectly monitored system can be pursuing the wrong objective. Three named failure modes (specification gaming, proxy gaming, deceptive alignment), each with a worked toy example. The inner-vs-outer alignment frame as the unifying decomposition. The structural reason deceptive alignment is the hardest of the three. Why alignment is what the rest of the track keeps returning to.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 14:00

Shipping a Claude application
https://clawdemy.org/lessons/building-with-claude/shipping-a-claude-application/lesson/
Lesson 12 of Track 22 (Building with Claude), the track closer. The five production disciplines (cost monitoring with the Usage and Cost Admin API; latency budgets per surface; eval-set discipline; rollout via feature flags + canary + A/B + rehearsed rollback; lifecycle handling per Anthropic's deprecation policy). The Usage and Cost Admin API endpoints (GET /v1/organizations/usage_report/messages and GET /v1/organizations/cost_report; Admin API key sk-ant-admin... required, distinct from regular API keys; bucket widths 1m/1h/1d; group by model + workspace_id + api_key_id + service_tier + context_window + inference_geo + speed; data appears within about 5 minutes; Priority Tier costs use a different billing model and never appear in the Cost endpoint; NOT available on Claude Platform on AWS or individual accounts). Anthropic's deprecation policy (Active → Legacy → Deprecated → Retired; at least 60 days notice for publicly released models; Console Usage Export for audit; temperature/top_p/top_k return 400 on Opus 4.7+ at non-default values). The rollout checklist that pulls L1-L11 plus the five disciplines into a single deploy gate.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 16:00

Subagents and Claude Managed Agents
https://clawdemy.org/lessons/building-with-claude/subagents-and-managed-agents/lesson/
Lesson 11 of Track 22 (Building with Claude). Two Anthropic-specific primitives for lesson 9's patterns 4 (orchestrator-workers) and 6 (autonomous agent). Subagents are separate agent instances your main agent can spawn for focused subtasks (Claude Agent SDK; programmatic via agents parameter on query() / filesystem-based in .claude/agents/ / built-in general-purpose), with context isolation (only the final message returns to the parent), parallelization (concurrent subagents), specialized instructions (per-subagent system prompt), and tool restrictions (per-subagent tools whitelist); per-subagent model is a cost lever (smaller cheaper model for some subagents). Claude Managed Agents is the Anthropic-hosted harness (managed-agents-2026-04-01 beta endpoint family with three POST endpoints /v1/agents + /v1/environments + /v1/sessions and an SSE event stream): Anthropic provides the agent loop, sandbox (cloud or self-hosted), tool execution, runtime, and stateful session storage; you POST user events and receive agent events. NOT ZDR or HIPAA BAA eligible (stateful by design). Decision frame: self-built L8 loop for control + ZDR; Subagents inside that loop for orchestrator-workers + parallelization + cost; Managed Agents for long-running asynchronous work when you skip the harness.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 16:00

Brief: Challenges and open problems (closes Phase 3 and Track 18)
https://clawdemy.org/lessons/deep-reinforcement-learning/challenges-and-open-problems/lesson/
Editorial brief for Lesson 18 of Track 18. The final lesson. Four open frontiers (sample efficiency, safety and alignment, generalization, real-world deployment), how each maps onto T18 algorithms, the tensions across frontiers, T18 syllabus recap, and where the track fits in the curriculum.
Published: Thu, 04 Jun 2026 00:00:00 GMT; Duration: 14:00
Four open frontiers (sample efficiency, safety and alignment, generalization, real-world deployment), how each maps onto T18 algorithms, the tensions across frontiers, T18 syllabus recap, and where the track fits in the curriculum.Brief: Explorationhttps://clawdemy.org/lessons/deep-reinforcement-learning/exploration/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/exploration/lesson/Editorial brief for Lesson 16 of Track 18. Three exploration families, the easy-vs-hard exploration distinction as the dominant decision criterion, and the RND-on-Montezuma's-Revenge breakthrough as the field's clearest result.Thu, 04 Jun 2026 00:00:00 GMTClawdemy14:00falseEditorial brief for Lesson 16 of Track 18. Three exploration families, the easy-vs-hard exploration distinction as the dominant decision criterion, and the RND-on-Montezuma's-Revenge breakthrough as the field's clearest result.Brief: Multi-task RL and meta-RLhttps://clawdemy.org/lessons/deep-reinforcement-learning/multi-task-meta-rl/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/multi-task-meta-rl/lesson/Editorial brief for Lesson 17 of Track 18. Multi-task RL vs meta-RL distinction, three meta-RL families (MAML, RL², PEARL), and the foundation-model connection as the modern-AI parallel.Thu, 04 Jun 2026 00:00:00 GMTClawdemy13:00falseEditorial brief for Lesson 17 of Track 18. Multi-task RL vs meta-RL distinction, three meta-RL families (MAML, RL², PEARL), and the foundation-model connection as the modern-AI parallel.Brief: Offline RL algorithms (BCQ, CQL, IQL)https://clawdemy.org/lessons/deep-reinforcement-learning/offline-rl-algorithms/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/offline-rl-algorithms/lesson/Editorial brief for Lesson 15 of Track 18. Second of two offline-RL lessons. Three algorithm families that fix the L14 failure by different mechanisms. Decision rubric for which to pick when. 
BC sanity check as universal baseline.Thu, 04 Jun 2026 00:00:00 GMTClawdemy14:00falseEditorial brief for Lesson 15 of Track 18. Second of two offline-RL lessons. Three algorithm families that fix the L14 failure by different mechanisms. Decision rubric for which to pick when. BC sanity check as universal baseline.Brief: Offline RL, the problemhttps://clawdemy.org/lessons/deep-reinforcement-learning/offline-rl-problem/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/offline-rl-problem/lesson/Editorial brief for Lesson 14 of Track 18. The first of two offline-RL lessons. Defines the setting (fixed dataset, no further interaction), names the failure mode (extrapolation error compounded by Bellman propagation), and sets up the next lesson (BCQ / CQL / IQL as the three families of fixes).Thu, 04 Jun 2026 00:00:00 GMTClawdemy14:00falseEditorial brief for Lesson 14 of Track 18. The first of two offline-RL lessons. Defines the setting (fixed dataset, no further interaction), names the failure mode (extrapolation error compounded by Bellman propagation), and sets up the next lesson (BCQ / CQL / IQL as the three families of fixes).Diffusion models II, training and samplinghttps://clawdemy.org/lessons/generative-models-and-diffusion/diffusion-ii-training-and-sampling/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/diffusion-ii-training-and-sampling/lesson/Lesson 13 of Track 19 (Generative Models and Diffusion). DDPM from L12 sampled in a thousand Markov-chain steps; that is too slow for production. This lesson covers the two moves that made diffusion practical: DDIM (a deterministic non-Markovian sampler that uses the same trained network with far fewer steps) and classifier-free guidance (the conditioning trick behind every modern text-to-image system). 
Closes with the latency-quality Pareto frontier that governs every production diffusion deployment.Thu, 04 Jun 2026 00:00:00 GMTClawdemy14:00falseLesson 13 of Track 19 (Generative Models and Diffusion). DDPM from L12 sampled in a thousand Markov-chain steps; that is too slow for production. This lesson covers the two moves that made diffusion practical: DDIM (a deterministic non-Markovian sampler that uses the same trained network with far fewer steps) and classifier-free guidance (the conditioning trick behind every modern text-to-image system). Closes with the latency-quality Pareto frontier that governs every production diffusion deployment.Score-based diffusion via SDEs, the unifying viewhttps://clawdemy.org/lessons/generative-models-and-diffusion/score-based-diffusion-via-sdes/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/score-based-diffusion-via-sdes/lesson/Lesson 14 of Track 19 (Generative Models and Diffusion). L11 derived denoising score matching, L12 derived the DDPM Markov chain, L13 derived DDIM, and they all converged on the same noise-prediction loss with the same trained network. This lesson writes the continuous-time stochastic differential equation that underlies all three: the forward chain as a discretization of a forward SDE, the reverse chain as the reverse SDE, the noise predictor as the score function up to a scalar, and the probability flow ODE as the deterministic sampler with tractable likelihood evaluation.Thu, 04 Jun 2026 00:00:00 GMTClawdemy17:00falseLesson 14 of Track 19 (Generative Models and Diffusion). L11 derived denoising score matching, L12 derived the DDPM Markov chain, L13 derived DDIM, and they all converged on the same noise-prediction loss with the same trained network. 
This lesson writes the continuous-time stochastic differential equation that underlies all three: the forward chain as a discretization of a forward SDE, the reverse chain as the reverse SDE, the noise predictor as the score function up to a scalar, and the probability flow ODE as the deterministic sampler with tractable likelihood evaluation.The four-paradigm landscape and where modern systems sithttps://clawdemy.org/lessons/generative-models-and-diffusion/the-four-paradigm-landscape/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/the-four-paradigm-landscape/lesson/Lesson 15 of Track 19 (Generative Models and Diffusion), the capstone of the track. Returns to the four-paradigm map from lesson 1 with every paradigm's training objective, sampling procedure, and trade-off characterized in full from the thirteen intervening lessons. Places modern systems (autoregressive language models, latent diffusion image generators, GAN-based face generators, video diffusion, and multimodal hybrids) on the map explicitly. The deliverable of the track is paradigm fluency: identify training objective + sampling procedure + trade-offs from any system release.Thu, 04 Jun 2026 00:00:00 GMTClawdemy15:00falseLesson 15 of Track 19 (Generative Models and Diffusion), the capstone of the track. Returns to the four-paradigm map from lesson 1 with every paradigm's training objective, sampling procedure, and trade-off characterized in full from the thirteen intervening lessons. Places modern systems (autoregressive language models, latent diffusion image generators, GAN-based face generators, video diffusion, and multimodal hybrids) on the map explicitly. 
The deliverable of the track is paradigm fluency: identify training objective + sampling procedure + trade-offs from any system release.Agent Skills and Claude Codehttps://clawdemy.org/lessons/building-with-claude/agent-skills-and-claude-code/lesson/https://clawdemy.org/lessons/building-with-claude/agent-skills-and-claude-code/lesson/Lesson 10 of Track 22 (Building with Claude). The durable-instructions layer for the L9 patterns, plus a worked agent harness that uses it. Agent Skills are filesystem-resident packages (a SKILL.md with YAML frontmatter plus optional additional .md files and scripts) Claude reads on-demand. The progressive-disclosure architecture (Level 1 metadata always loaded around 100 tokens per Skill; Level 2 SKILL.md body loaded when triggered under 5K tokens; Level 3 bundled resources loaded as needed effectively unlimited). Three surfaces (Claude API with skill_id in container plus three beta headers; Claude Code via filesystem at ~/.claude/skills/ or .claude/skills/; claude.ai via upload). Claude Code as the worked agent: an agentic coding tool available in terminal, IDE, desktop, and browser; reads CLAUDE.md for project context; uses MCP from L6 natively; hooks for before/after actions; sub-agents and the Agent SDK for custom workflows. How Skills + Claude Code together let any of the L9 patterns become durable, shareable, version-controlled team artifacts.Wed, 03 Jun 2026 00:00:00 GMTClawdemy15:00falseLesson 10 of Track 22 (Building with Claude). The durable-instructions layer for the L9 patterns, plus a worked agent harness that uses it. Agent Skills are filesystem-resident packages (a SKILL.md with YAML frontmatter plus optional additional .md files and scripts) Claude reads on-demand. The progressive-disclosure architecture (Level 1 metadata always loaded around 100 tokens per Skill; Level 2 SKILL.md body loaded when triggered under 5K tokens; Level 3 bundled resources loaded as needed effectively unlimited). 
Three surfaces (Claude API with skill_id in container plus three beta headers; Claude Code via filesystem at ~/.claude/skills/ or .claude/skills/; claude.ai via upload). Claude Code as the worked agent: an agentic coding tool available in terminal, IDE, desktop, and browser; reads CLAUDE.md for project context; uses MCP from L6 natively; hooks for before/after actions; sub-agents and the Agent SDK for custom workflows. How Skills + Claude Code together let any of the L9 patterns become durable, shareable, version-controlled team artifacts.From single call to agent loophttps://clawdemy.org/lessons/building-with-claude/from-single-call-to-agent-loop/lesson/https://clawdemy.org/lessons/building-with-claude/from-single-call-to-agent-loop/lesson/Lesson 8 of Track 22 (Building with Claude). Phase 3 opener. The transition from a one-shot single-call pattern to a multi-turn loop where the model decides what to do next. The workflow-vs-agent distinction with Anthropic's verbatim definitions (workflow: predefined code paths; agent: dynamically directs its own processes and tool usage). The augmented LLM building block (retrieval + tools + memory) underneath both. The canonical loop in 30 lines (a while bounded by max_iterations that calls messages.create, appends the assistant turn, then dispatches on stop_reason). The full stop_reason vocabulary the loop handles (end_turn, tool_use, pause_turn from L5, max_tokens, stop_sequence, model_context_window_exceeded from L7, the compaction value from L7, and refusal for safety declines with stop_details.category). tool_choice for steering (auto / any / tool / none) and its small but real token-cost difference. The four loop disciplines (hard max_iterations cap, tool inventory is the surface area, the L7 cost-and-staleness levers stay engaged, explicit stop_reason dispatch). The post's framework guidance: start by using LLM APIs directly.Wed, 03 Jun 2026 00:00:00 GMTClawdemy15:00falseLesson 8 of Track 22 (Building with Claude). 
Phase 3 opener. The transition from a one-shot single-call pattern to a multi-turn loop where the model decides what to do next. The workflow-vs-agent distinction with Anthropic's verbatim definitions (workflow: predefined code paths; agent: dynamically directs its own processes and tool usage). The augmented LLM building block (retrieval + tools + memory) underneath both. The canonical loop in 30 lines (a while bounded by max_iterations that calls messages.create, appends the assistant turn, then dispatches on stop_reason). The full stop_reason vocabulary the loop handles (end_turn, tool_use, pause_turn from L5, max_tokens, stop_sequence, model_context_window_exceeded from L7, the compaction value from L7, and refusal for safety declines with stop_details.category). tool_choice for steering (auto / any / tool / none) and its small but real token-cost difference. The four loop disciplines (hard max_iterations cap, tool inventory is the surface area, the L7 cost-and-staleness levers stay engaged, explicit stop_reason dispatch). The post's framework guidance: start by using LLM APIs directly.Model Context Protocolhttps://clawdemy.org/lessons/building-with-claude/model-context-protocol/lesson/https://clawdemy.org/lessons/building-with-claude/model-context-protocol/lesson/Lesson 6 of Track 22 (Building with Claude). The cross-provider tool layer. 
What MCP is (an open standard for connecting AI applications to external systems, governed at modelcontextprotocol.io, not Anthropic-proprietary); the MCP connector pattern (Claude as the MCP client, talking directly to remote MCP servers from inside a Messages API call, no MCP client code in your application); the request shape (mcp_servers array + mcp_toolset in the tools array; the mcp-client-2025-11-20 beta header; allowlist/denylist/defer_loading configuration); the response shape (mcp_tool_use + mcp_tool_result inline blocks); the decision frame for when to reach for MCP versus inline tool definitions from lessons 4 and 5; the connector's active limits (tools only, HTTPS only, not on Bedrock or Vertex AI, not ZDR eligible, OAuth is yours); and the cross-provider value of speaking one tool protocol across many model providers.Wed, 03 Jun 2026 00:00:00 GMTClawdemy14:00falseLesson 6 of Track 22 (Building with Claude). The cross-provider tool layer. What MCP is (an open standard for connecting AI applications to external systems, governed at modelcontextprotocol.io, not Anthropic-proprietary); the MCP connector pattern (Claude as the MCP client, talking directly to remote MCP servers from inside a Messages API call, no MCP client code in your application); the request shape (mcp_servers array + mcp_toolset in the tools array; the mcp-client-2025-11-20 beta header; allowlist/denylist/defer_loading configuration); the response shape (mcp_tool_use + mcp_tool_result inline blocks); the decision frame for when to reach for MCP versus inline tool definitions from lessons 4 and 5; the connector's active limits (tools only, HTTPS only, not on Bedrock or Vertex AI, not ZDR eligible, OAuth is yours); and the cross-provider value of speaking one tool protocol across many model providers.Prompt caching and context 
managementhttps://clawdemy.org/lessons/building-with-claude/prompt-caching-and-context-management/lesson/https://clawdemy.org/lessons/building-with-claude/prompt-caching-and-context-management/lesson/Lesson 7 of Track 22 (Building with Claude). Phase 2 closer. The cost-and-staleness levers for a tool-heavy session. Prompt caching (cache_control on the system prompt, tool definitions across L4 + L5 + L6, and stable message content; 5-minute and 1-hour TTLs; 1.25x / 2.0x write multipliers and the 0.1x cache-hit price; minimum 4,096 token floor on Opus 4.7 and Haiku 4.5; the four-breakpoint maximum). The context-window picture (1M tokens on current Opus and Sonnet 4.x, 200K elsewhere; context rot as a real ceiling; the model_context_window_exceeded stop reason). Server-side compaction (compact_20260112; default trigger 150K input tokens; the recommended default for long-running sessions; cache the system prompt separately so it survives). Context editing (clear_tool_uses_20250919 for tool-result clearing in agentic workflows; clear_thinking_20251015 for extended-thinking sessions). The unifying frame: cache the stable parts, compact the long parts, clear the heavy parts, reach for each deliberately.Wed, 03 Jun 2026 00:00:00 GMTClawdemy15:00falseLesson 7 of Track 22 (Building with Claude). Phase 2 closer. The cost-and-staleness levers for a tool-heavy session. Prompt caching (cache_control on the system prompt, tool definitions across L4 + L5 + L6, and stable message content; 5-minute and 1-hour TTLs; 1.25x / 2.0x write multipliers and the 0.1x cache-hit price; minimum 4,096 token floor on Opus 4.7 and Haiku 4.5; the four-breakpoint maximum). The context-window picture (1M tokens on current Opus and Sonnet 4.x, 200K elsewhere; context rot as a real ceiling; the model_context_window_exceeded stop reason). 
Server-side compaction (compact_20260112; default trigger 150K input tokens; the recommended default for long-running sessions; cache the system prompt separately so it survives). Context editing (clear_tool_uses_20250919 for tool-result clearing in agentic workflows; clear_thinking_20251015 for extended-thinking sessions). The unifying frame: cache the stable parts, compact the long parts, clear the heavy parts, reach for each deliberately.Six effective-agent patternshttps://clawdemy.org/lessons/building-with-claude/six-effective-agent-patterns/lesson/https://clawdemy.org/lessons/building-with-claude/six-effective-agent-patterns/lesson/Lesson 9 of Track 22 (Building with Claude). The canonical shapes the L8 loop substrate takes. Five workflow patterns from the Anthropic engineering post 'Building Effective AI Agents' (Schluntz + Zhang, 2024-12-19) plus the open-ended agent itself, in order of increasing model autonomy: prompt chaining (decompose into a fixed sequence; trade latency for accuracy); routing (classify and direct to a specialized followup; the cleanest place to apply lesson 3's effort dial); parallelization (sectioning for independent subtasks; voting for diverse outputs on the same task); orchestrator-workers (a central LLM dynamically breaks down and delegates); evaluator-optimizer (generator and critic in a loop); the autonomous agent (open-ended problems where steps cannot be hardcoded). For each: the verbatim Anthropic definition, when to use, named examples from the post, the sketch on the L8 loop substrate, and the trade-off that decides whether the pattern fits. Plus the four-question decision tree, composition patterns, and the cross-pattern principle (the post's summary thesis): success is not about the most sophisticated system; it is about the right system for your needs.Wed, 03 Jun 2026 00:00:00 GMTClawdemy16:00falseLesson 9 of Track 22 (Building with Claude). The canonical shapes the L8 loop substrate takes. 
Five workflow patterns from the Anthropic engineering post 'Building Effective AI Agents' (Schluntz + Zhang, 2024-12-19) plus the open-ended agent itself, in order of increasing model autonomy: prompt chaining (decompose into a fixed sequence; trade latency for accuracy); routing (classify and direct to a specialized followup; the cleanest place to apply lesson 3's effort dial); parallelization (sectioning for independent subtasks; voting for diverse outputs on the same task); orchestrator-workers (a central LLM dynamically breaks down and delegates); evaluator-optimizer (generator and critic in a loop); the autonomous agent (open-ended problems where steps cannot be hardcoded). For each: the verbatim Anthropic definition, when to use, named examples from the post, the sketch on the L8 loop substrate, and the trade-off that decides whether the pattern fits. Plus the four-question decision tree, composition patterns, and the cross-pattern principle (the post's summary thesis): success is not about the most sophisticated system; it is about the right system for your needs.Choosing your model and the effort dialhttps://clawdemy.org/lessons/building-with-claude/choosing-your-model-and-the-effort-dial/lesson/https://clawdemy.org/lessons/building-with-claude/choosing-your-model-and-the-effort-dial/lesson/Lesson 3 of Track 22 (Building with Claude), closes Phase 1. 
The model-selection conversation: the three current families (Opus 4.8 at $5/$25 per MTok as the current flagship, Sonnet 4.6 at $3/$15, Haiku 4.5 at $1/$5; Opus 4.7 supported as a legacy 4.7 deployment with the same pricing and posture as 4.8) and how to pick among them, the model-ID convention (dateless pinned snapshots on the 4.6 generation and later, date-suffixed canonical IDs on earlier ones), the effort parameter that controls per-call token spending across all the models that support it (NOT Haiku 4.5), adaptive thinking (the new mode on Opus 4.8 / Opus 4.7 / Sonnet 4.6 / Opus 4.6) versus the older manual extended-thinking mode (still on Haiku 4.5, deprecated on Sonnet 4.6 / Opus 4.6, NOT supported on Opus 4.8 or 4.7), and a worked cost example (100k daily calls all-Opus $1,000 across 2-call classifier+answer, Haiku-classifier-plus-Sonnet-answer mix $480, about 52 percent cheaper). Per the docs at this lesson's drafting; check the Anthropic Models overview at platform.claude.com for the current published lineup.Wed, 27 May 2026 00:00:00 GMTClawdemy14:00falseLesson 3 of Track 22 (Building with Claude), closes Phase 1. 
The model-selection conversation: the three current families (Opus 4.8 at $5/$25 per MTok as the current flagship, Sonnet 4.6 at $3/$15, Haiku 4.5 at $1/$5; Opus 4.7 supported as a legacy 4.7 deployment with the same pricing and posture as 4.8) and how to pick among them, the model-ID convention (dateless pinned snapshots on the 4.6 generation and later, date-suffixed canonical IDs on earlier ones), the effort parameter that controls per-call token spending across all the models that support it (NOT Haiku 4.5), adaptive thinking (the new mode on Opus 4.8 / Opus 4.7 / Sonnet 4.6 / Opus 4.6) versus the older manual extended-thinking mode (still on Haiku 4.5, deprecated on Sonnet 4.6 / Opus 4.6, NOT supported on Opus 4.8 or 4.7), and a worked cost example (100k daily calls all-Opus $1,000 across 2-call classifier+answer, Haiku-classifier-plus-Sonnet-answer mix $480, about 52 percent cheaper). Per the docs at this lesson's drafting; check the Anthropic Models overview at platform.claude.com for the current published lineup.Server-side tools and built-inshttps://clawdemy.org/lessons/building-with-claude/server-side-tools-and-built-ins/lesson/https://clawdemy.org/lessons/building-with-claude/server-side-tools-and-built-ins/lesson/Lesson 5 of Track 22 (Building with Claude). The other half of L4's client-vs-server distinction. Three categories of Anthropic-provided tools: server tools that Anthropic executes (web_search at $0.01 per search with always-on citations, code_execution that is free when web_search_20260209 or web_fetch_20260209 is in the same request, web_fetch for retrieving a specific URL); Anthropic-schema client tools you still execute but do not have to author the schema for (bash, computer use as a beta with sandbox-the-environment discipline, memory, text_editor); and tool_search (the scale tool that dynamically loads tools from a catalog of up to 10,000 with two variants, regex and BM25). 
The server_tool_use response shape that differs from L4's tool_use (results inline, no round-trip in your code), the pause_turn stop reason for server-side multi-iteration loops, and the pricing stack (standard tokens + per-tool fees + server-tool results in context).Wed, 27 May 2026 00:00:00 GMTClawdemy14:00falseLesson 5 of Track 22 (Building with Claude). The other half of L4's client-vs-server distinction. Three categories of Anthropic-provided tools: server tools that Anthropic executes (web_search at $0.01 per search with always-on citations, code_execution that is free when web_search_20260209 or web_fetch_20260209 is in the same request, web_fetch for retrieving a specific URL); Anthropic-schema client tools you still execute but do not have to author the schema for (bash, computer use as a beta with sandbox-the-environment discipline, memory, text_editor); and tool_search (the scale tool that dynamically loads tools from a catalog of up to 10,000 with two variants, regex and BM25). The server_tool_use response shape that differs from L4's tool_use (results inline, no round-trip in your code), the pause_turn stop reason for server-side multi-iteration loops, and the pricing stack (standard tokens + per-tool fees + server-tool results in context).The Messages API in productionhttps://clawdemy.org/lessons/building-with-claude/the-messages-api-in-production/lesson/https://clawdemy.org/lessons/building-with-claude/the-messages-api-in-production/lesson/Lesson 2 of Track 22 (Building with Claude). The gap between a one-shot script (lesson 1) and code you can put behind real users. 
Streaming for interactive UIs and long generations (Python *client.messages.stream(...)* as a context manager, TypeScript *.stream(...).on('text', ...)* as an event-emitter, both with full-message helpers); the stop_reason vocabulary for non-tool-using calls (end_turn, max_tokens, stop_sequence, tool_use, refusal for safety declines with stop_details.category) with forward-refs to L5 pause_turn and L7 model_context_window_exceeded / compaction; the small error-code map (429 rate limits, 500/529 platform errors, 413 too-large) and how to classify each for retry; what the official SDKs do for you automatically (the canonical retryable set: connection errors, 408, 409, 429, and any 5xx; about two retries with exponential backoff) and what you still own (idempotency on the tool side); the request_id you must log from day one; the Message Batches API (50 percent cheaper, async, finishes in under an hour) for high-volume non-interactive work.Wed, 27 May 2026 00:00:00 GMTClawdemy13:00falseLesson 2 of Track 22 (Building with Claude). The gap between a one-shot script (lesson 1) and code you can put behind real users. 
Streaming for interactive UIs and long generations (Python *client.messages.stream(...)* as a context manager, TypeScript *.stream(...).on('text', ...)* as an event-emitter, both with full-message helpers); the stop_reason vocabulary for non-tool-using calls (end_turn, max_tokens, stop_sequence, tool_use, refusal for safety declines with stop_details.category) with forward-refs to L5 pause_turn and L7 model_context_window_exceeded / compaction; the small error-code map (429 rate limits, 500/529 platform errors, 413 too-large) and how to classify each for retry; what the official SDKs do for you automatically (the canonical retryable set: connection errors, 408, 409, 429, and any 5xx; about two retries with exponential backoff) and what you still own (idempotency on the tool side); the request_id you must log from day one; the Message Batches API (50 percent cheaper, async, finishes in under an hour) for high-volume non-interactive work.Tool use, the foundationhttps://clawdemy.org/lessons/building-with-claude/tool-use-the-foundation/lesson/https://clawdemy.org/lessons/building-with-claude/tool-use-the-foundation/lesson/Lesson 4 of Track 22 (Building with Claude), opens Phase 2. The jump from one-shot calls to letting Claude reach beyond its training corpus through function calls. 
Three-field tool definition (name + description + input_schema as JSON Schema), the four-step request-response loop (app sends tools → model returns tool_use block with stop_reason:tool_use → app executes → app sends tool_result back), the two ordering rules the API enforces with 400 on violation (tool_result must immediately follow tool_use; tool_result blocks come FIRST in the user message's content array, text comes AFTER), tool_choice options (auto / any / tool / none), parallel tool use (multiple tool_use blocks fan out, one user message with N tool_result blocks fans in), error handling with is_error and the docs' instructive-error rule, the token cost overhead (a few hundred system-prompt tokens per the Tool use overview's per-model table; any and specific-tool modes slightly higher than auto and none), the client-vs-server distinction (this lesson is client tools; lesson 5 is server tools and Anthropic-schema client tools like bash and computer use), and the structured-outputs feature (separate from tool use, sometimes confused with it). The primitive every Phase 2 and Phase 3 lesson extends.
Wed, 27 May 2026 00:00:00 GMT | Clawdemy | 13:00

Your first Claude API call
https://clawdemy.org/lessons/building-with-claude/your-first-claude-api-call/lesson/
Lesson 1 of Track 22 (Building with Claude), the track opener. The jump from chatting with Claude in a browser to building with Claude from your own code. A working first API call (one cURL, one Python), the structured response object (id, content as an array of blocks, stop_reason, usage), how multi-turn conversations actually work (the API is stateless, you send the full history), and the system parameter that separates instructions from messages. The smallest end-to-end pattern every later lesson in Track 22 extends.
Wed, 27 May 2026 00:00:00 GMT | Clawdemy | 12:00

Securing agents: defending against an attacker
https://clawdemy.org/lessons/ai-agents-and-tool-use/securing-agents/lesson/
Lesson 11 of Track 20 (AI Agents and Tool Use), and the closer of Phase 3. Lesson 10 made an agent trustworthy in the absence of attackers. This one takes up the other half: an agent under attack. The lesson names the three principal attack categories (hijacking the agent's goal, abusing the agent's tools, exfiltrating data through it), traces each to the structural fact that text and data share one channel into the model, builds the defense-in-depth toolkit (capability gating, input handling, output validation and routing, sandboxing, human-in-the-loop, tamper-evident audit logs), and stays honest about the fact that no combination of defenses eliminates the attack surface. Closes the track.
Tue, 26 May 2026 00:00:00 GMT | Clawdemy | 11:00

Multimodal agents in production
https://clawdemy.org/lessons/multimodal-ai/multimodal-agents-in-production/lesson/
Lesson 9 of Track 24 (Multimodal AI), in Phase 4 (Advanced multimodal directions). Phases 2 through 4 covered architectures, generative models, and frontier directions. Real systems also have to ship and survive contact with millions of users. This lesson covers what changes when multimodal AI lives inside a shipping product: the gap between benchmark performance and real-world usability, RL co-design with the product (RLHF and RLAIF as practical levers), the asymmetric-verification idea, the production-specific constraints multimodal raises (variable input sizes, tool-use latency budgets, output-streaming quirks), and the discipline of separating what engineering settles from what engineering only informs.
Tue, 26 May 2026 00:00:00 GMT | Clawdemy | 13:00

Where multimodal AI is going
https://clawdemy.org/lessons/multimodal-ai/where-multimodal-ai-is-going/lesson/
Lesson 10 of Track 24 (Multimodal AI), the closer of Phase 4 (Advanced multimodal directions) and the closer of the whole track. Nine lessons walked from 'what multimodal AI actually is' through encode-then-fuse, native multimodal, reasoning with tools, image and video generation, JEPA and world modeling, scientific applications, and production engineering. This closer steps back and names six cross-cutting threads that run through them all (the unifying patterns of the field as it stands in 2026), surfaces what the track did NOT cover, and points to where the field is going. Pairs with the L1 opener as the second Clawdemy-authored bookend; together they frame the structural-mirror arc of the eight CS25-mapped technical lessons.
Tue, 26 May 2026 00:00:00 GMT | Clawdemy | 13:00

Data, part 2, filtering, deduplication, mixing, synthetic
https://clawdemy.org/lessons/build-an-llm-from-scratch/data-filtering/lesson/
Lesson 12 of Track 15. The later stages of the funnel from lesson 11: heuristic and classifier filtering, exact / near-duplicate / substring deduplication, mixing weights (increasingly learned rather than hand-tuned), and the fast-growing category of synthetic data. Taught technical-not-legal throughout: legal and policy debates about training data are out of scope here.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Data, part 1, sources and datasets
https://clawdemy.org/lessons/build-an-llm-from-scratch/data-sources/lesson/
Lesson 11 of Track 15. Where the trillions of training tokens scaling laws demand actually come from. Six source categories (web crawls, wikis, books, code, math/academic, social/forum), the reference open datasets (The Pile, RedPajama, FineWeb, RefinedWeb), the 50-to-1000x raw-to-final funnel, and the sampling-weight intuitions that shape what the model becomes good at.
Taught technical-not-legal throughout: legal and policy debates around training data are out of scope here.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Evaluation, measuring a language model
https://clawdemy.org/lessons/build-an-llm-from-scratch/evaluation/lesson/
Lesson 10 of Track 15. Scaling laws predict loss; what you care about is capability. This lesson covers the four benchmark formats (multiple-choice, executable, instruction-following, open-ended), the four reasons evaluation is hard (construct validity, contamination, format sensitivity, open-ended scoring), the practical defenses against each, and the layered pragmatic stack modern LLM teams actually run. The discipline of treating any single number with suspicion is what bridges loss to capability honestly.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

How models run on hardware, GPUs and TPUs
https://clawdemy.org/lessons/build-an-llm-from-scratch/gpus-and-tpus/lesson/
Lesson 5 of Track 15, opening Phase 2. Phase 1 built the model; this phase makes it run fast. The lesson opens the chip itself: how a GPU executes math (SIMT, streaming multiprocessors, tensor cores), the memory hierarchy that decides whether the math is fed (HBM, SRAM, registers), how TPUs differ (systolic arrays), and why hardware shapes architecture choices in lesson 2's terms.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Inference, serving a trained model fast
https://clawdemy.org/lessons/build-an-llm-from-scratch/inference/lesson/
Lesson 8 of Track 15, closing Phase 2. Inference is a different cost problem than training: mostly memory bandwidth in decode, not compute. This lesson covers the prefill/decode split, the KV cache as the central object, and the techniques that turn memory-bound decode into something efficient: continuous batching, paged attention, speculative decoding, and quantization, plus a note on how parallelism shows up differently at inference than at training.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Writing fast kernels, Triton and XLA
https://clawdemy.org/lessons/build-an-llm-from-scratch/kernels-triton-xla/lesson/
Lesson 6 of Track 15. The code-level lever for raising arithmetic intensity from lesson 2. What a kernel is, why fusing operations is the single biggest performance lever (keep intermediates in SRAM/registers, round-trip HBM once), and the two practical paths: Triton (write block-level kernels in Python; the compiler handles warps/registers/tiling) and XLA (a graph compiler that fuses standard ops automatically). FlashAttention as the worked example: same math, ~2-4x faster, large memory savings.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Training across many devices, parallelism
https://clawdemy.org/lessons/build-an-llm-from-scratch/parallelism/lesson/
Lesson 7 of Track 15, collapsing CS336 Lectures 7 and 8. Lesson 2's 16N memory accounting already exceeds one GPU; frontier models are far larger.
This lesson covers the three classic parallelism schemes (data, tensor, pipeline), the modern sharded variant (FSDP/ZeRO), the within-node vs across-nodes placement rules, and how 3D parallelism combines all of them for frontier-scale training. The lesson-2 accounting becomes an actionable cluster configuration.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Post-training, SFT and RLHF
https://clawdemy.org/lessons/build-an-llm-from-scratch/post-training-sft-rlhf/lesson/
Lesson 13 of Track 15. How a pretrained base model becomes a usable assistant. Supervised fine-tuning on instruction-response data, then preference tuning on `(prompt, A, B, preferred)` data via RLHF (reward model + PPO) or its simpler successor DPO (closed-form-derived loss; no reward model, no RL step; modern default). Taught technical-primer throughout: what the methods do mechanically, with no contested-alignment-as-safety framing.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Reasoning and alignment, RL with verifiable rewards
https://clawdemy.org/lessons/build-an-llm-from-scratch/reasoning-rl/lesson/
Lesson 14 of Track 15, the track capstone. Builds on CS336 Lecture 16 (post-training RLVR); the RL-as-systems framing is the lesson's own synthesis. RL with verifiable rewards (RLVR) replaces RLHF's learned reward model with a verifiable check (math grader, code tests, puzzle validator); GRPO is the modern algorithm, in TRL alongside SFTTrainer/DPOTrainer. DeepSeek R1 and Open R1 are the landscape anchors. RL at LLM scale is mostly a systems problem (sample + verify + train workers), and the lesson closes the track with the synthesis: you can now build the whole pipeline, and the durable method outlasts the next frontier. Taught technical-primer; contested alignment debates out of scope.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Scaling laws, predicting what bigger gets you
https://clawdemy.org/lessons/build-an-llm-from-scratch/scaling-laws/lesson/
Lesson 9 of Track 15, opening Phase 3.
Scaling laws turn the budget question (bigger model or more data?) from folklore into arithmetic. This lesson collapses CS336 Lectures 9 and 11 per Phase 0: the power-law form, the Kaplan-to-Chinchilla shift (D ~ 20N, roughly 20 tokens per parameter), how the laws turn a fixed compute budget into an optimal (N, D), and how inference cost pushes modern open models past Chinchilla-optimal in practice.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Overfitting and the bias-variance tradeoff
https://clawdemy.org/lessons/classical-machine-learning/bias-variance-tradeoff/lesson/
Lesson 13 of Track 10 (Classical Machine Learning), the opener of Phase 4 (Knowing whether your model is any good). We have casually mentioned overfitting many times; now we make it precise. There are two distinct ways a model can fail to generalize, and they pull in opposite directions. This lesson names them (bias and variance), shows why making one smaller usually makes the other larger, teaches the foundational diagnostic in machine learning (reading training and test error together), and folds in regularization as the standard low-variance dial.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

Reading the results: the confusion matrix, precision, recall, and ROC
https://clawdemy.org/lessons/classical-machine-learning/classification-metrics/lesson/
Lesson 15 of Track 10 (Classical Machine Learning), closing Phase 4 (Knowing whether your model is any good) and closing the track. Accuracy is the metric beginners reach for first, and on imbalanced data it lies catastrophically. This lesson covers the metrics that tell the truth: the confusion matrix and its derived precision and recall, the threshold tradeoff that connects to the logistic-regression dial from lesson 4, and the ROC curve with its AUC summary. With it, you can read a classifier honestly and pick the right metric for the problem in front of you.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

Train, test, and cross-validation
https://clawdemy.org/lessons/classical-machine-learning/cross-validation/lesson/
Lesson 14 of Track 10 (Classical Machine Learning), in Phase 4 (Knowing whether your model is any good).
The previous lesson said the diagnostic is to compare training and test error. That depends on having an HONEST test error. This lesson covers how to get one: the simple train/test split, the three-way split with a validation set for tuning, and k-fold cross-validation, the standard way to get a stable test-error estimate from limited data. It also names the data-leakage traps that quietly turn an honest evaluation into an optimistic lie.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

Squeezing dimensions: PCA
https://clawdemy.org/lessons/classical-machine-learning/pca/lesson/
Lesson 11 of Track 10 (Classical Machine Learning), in Phase 3 (Finding structure without labels). Clustering grouped unlabeled points. The other great unsupervised job is the opposite: compression. When every data point has dozens or hundreds of features, you need to boil them down to a handful that still capture the signal. PCA does that by finding new axes along which the data varies most. This lesson builds the directions-of-maximum-variance intuition, shows the 2D-to-1D picture, names what a principal component is, and is clear about the linear assumption that the next lesson will work around.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

Seeing high-dimensional data: t-SNE
https://clawdemy.org/lessons/classical-machine-learning/t-sne/lesson/
Lesson 12 of Track 10 (Classical Machine Learning), closing Phase 3 (Finding structure without labels). PCA was great at compression but flat: its straight axes can hide curved or clustered structure. t-SNE is built for a different job, producing a 2D picture in which similar high-dimensional points end up near each other so you can see the clusters. The catch is that the picture is deceptive in specific ways. This lesson shows what t-SNE reveals, what it does not, and how to read its plots without over-reading them.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 12:00

Recovering the third dimension, 3D vision
https://clawdemy.org/lessons/computer-vision/3d-vision/lesson/
The world is three-dimensional; photographs are two-dimensional.
Every camera capture collapses one dimension (depth) that has to be recovered if a vision system wants to interact with the world physically. This lesson covers how vision recovers 3D structure from 2D images. We meet the depth cues (stereo disparity, monocular priors, motion), the 3D representations (depth maps, voxels, point clouds, meshes, implicit / SDFs, NeRF), the standard methods (monocular depth like MiDaS, multi-view stereo, Structure from Motion via COLMAP, NeRF, 3D Gaussian Splatting), and work one stereo-disparity-to-depth calculation by hand (`Z = (f · b) / d`).
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 13:00

The architectures that cracked vision, AlexNet to ResNet
https://clawdemy.org/lessons/computer-vision/cnn-architectures/lesson/
Lesson 5 introduced the conv layer. This lesson is the story of how it actually got stacked, between 2012 and 2015, into the architectures that cracked computer vision. We walk four landmarks (AlexNet, VGG, GoogLeNet, ResNet) with their key ideas, parameter counts, and ImageNet results, then explain ResNet's residual block (`y = F(x) + x`) and why identity shortcuts solved the optimization-difficulty problem that had capped depth.
The folded subsection on training at scale covers data parallelism, model parallelism, and the engineering tricks (mixed precision, learning-rate warmup, the linear scaling rule) that let modern vision models train on hundreds to thousands of accelerators while the underlying gradient descent algorithm stays unchanged.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

How machines see local patterns, convolution
https://clawdemy.org/lessons/computer-vision/convolution-and-cnns/lesson/
Phase 2 opener. The general-purpose classifier from Phase 1 would technically work on images, but its first layer is wasteful (a single FC neuron on a 224x224x3 input holds 150,528 weights) and blind to the spatial structure of images. This lesson replaces that layer with the convolution: a small learned filter slides spatially across the input, computing a dot product with each local patch and producing a feature map of where its pattern occurred.
We work one filter (a vertical-edge detector) by hand on a 5x5 image, name the three hyperparameters (depth K, stride S, padding P), state the exact output spatial-size formula `(W - F + 2P) / S + 1`, and count the parameter savings (AlexNet's first conv layer = 34,944 parameters, the same number for any input image size). The training loop on top is unchanged.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Beyond what is it, detection, segmentation, and seeing inside the net
https://clawdemy.org/lessons/computer-vision/detection-segmentation-visualizing/lesson/
Classification answers 'what is in this image?' Real-world vision often needs more. This lesson covers the three task families that go beyond classification. **Detection** produces lists of (class, bounding box) per image (R-CNN family vs YOLO; anchor boxes; IoU + mAP evaluation). **Segmentation** labels every pixel (semantic with FCN / U-Net vs instance with Mask R-CNN).
**Visualization** lets us peek inside trained networks (saliency, occlusion, Grad-CAM, t-SNE, DeepDream) with an honest caveat that these are debugging tools, not full explanations. We work one IoU computation by hand in the body and another in practice, and the training loop on top is unchanged across all three task families.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Generating images by denoising, diffusion
https://clawdemy.org/lessons/computer-vision/diffusion-models/lesson/
VAEs were stable but blurry; GANs were sharp but unstable. Diffusion models take a third approach that has largely replaced both for high-quality image generation since around 2020. The trick is to gradually corrupt training images with noise (a fixed forward process), train a network to predict and reverse the noise (the learned reverse process), and then run the network in reverse: start from pure noise and iteratively denoise into an image.
This lesson covers diffusion at vision-context intuition level, works one forward noising step by hand, names the trade-off (high quality + stable training, but slow iterative inference), and explains how text-to-image systems (Stable Diffusion, Imagen, DALL-E 2/3) add language conditioning on top with classifier-free guidance. The L11 VAE makes a comeback as latent diffusion's first-stage encoder.
Mon, 25 May 2026 00:00:00 GMT | Clawdemy | 14:00

Teaching machines to imagine, GANs and VAEs
https://clawdemy.org/lessons/computer-vision/gans-and-vaes/lesson/
Every architecture in this track so far has been discriminative (image in, label out). This lesson opens the generative side. We distinguish discriminative from generative modeling, walk the two pre-2020 generative-image-model families (VAEs and GANs) at intuition level, and work the reparameterization trick `z = μ + σ · ε` by hand. The VAE-vs-GAN trade-off (smooth-but-blurry vs sharp-but-hard-to-train) sets up why neither was a perfect solution and motivates diffusion (next lesson).
Full mechanical derivations live in sister tracks (T19 for the VAE's ELBO, T24 for GAN training dynamics); this lesson stays at the vision-applied-use level.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseComputer vision among people, the human-centered viewhttps://clawdemy.org/lessons/computer-vision/human-centered-ai/lesson/https://clawdemy.org/lessons/computer-vision/human-centered-ai/lesson/Closing lesson of Track 16. T16 built classifiers, detectors, segmenters, generative models, 3D recovery, vision-language systems, and world models, and many of them are deployed in the real world. The final question this track owes is what these systems get right and wrong in deployment, and how to reason about those strengths and failures as engineering concerns. We catalog the standard failure modes (distribution shift, adversarial examples, OOD inputs, shortcut learning, calibration / overconfidence), treat bias as a property of training data with concrete measurement (disaggregated reporting) and mitigation (data / model / evaluation engineering), and close with the trustworthiness gap between benchmark accuracy and real-world reliability.
Policy debates around vision systems are real, important, and outside this lesson's scope; the right forum for those is different.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseTelling pictures apart with one score, linear classifiershttps://clawdemy.org/lessons/computer-vision/linear-classifiers/lesson/https://clawdemy.org/lessons/computer-vision/linear-classifiers/lesson/Lesson 1 named the strategy (learn from labeled examples); this lesson is the simplest machine that actually carries it out. The linear classifier flattens an image into a column of numbers, multiplies it by a learned weight matrix, adds a learned bias, and reads off one score per class.
We define the score function `s = W · x + b`, ground it in CIFAR-10's shapes (x is 3072 numbers, W is 10 by 3072, 10 scores out), compute a small prediction by hand, see what each row of W really is (a learned per-class template), look at the geometric (hyperplane) view, and meet the structural limit (one template per class) that motivates everything that follows.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseHow a classifier learns, loss and optimizationhttps://clawdemy.org/lessons/computer-vision/loss-and-optimization/lesson/https://clawdemy.org/lessons/computer-vision/loss-and-optimization/lesson/Lesson 2 left us with a classifier (s = W · x + b) and no way to set its knobs. This lesson defines both halves of the answer. A loss function turns 'predictions match labels' into a single number to drive down (we define multiclass SVM and softmax / cross-entropy and work each on the same worked example); regularization adds a penalty on large weights for better generalization; and gradient descent is the loop that nudges W and b in the negative-gradient direction with step size set by the learning rate. We name analytic vs numerical gradients and mini-batch / SGD as the practical realization.
That four-step cycle (forward pass, loss, gradient, step) is how every classifier in this track, including the giants ahead, actually trains.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLearning features instead of coding them, neural networks and backprophttps://clawdemy.org/lessons/computer-vision/neural-networks-and-backprop/lesson/https://clawdemy.org/lessons/computer-vision/neural-networks-and-backprop/lesson/The Phase 1 capstone. Lesson 2 capped us at one template per class; lesson 3 gave us a training loop. This lesson lifts that cap. Stacking two linear layers gains nothing on its own (the composition collapses to one linear layer), so we insert a non-linearity (ReLU) between them. The hidden layer now produces learned features of the image instead of operating on raw pixels, which broke the multi-modal limit and ended the hand-engineered-features era of computer vision. Computing the gradient through every weight in every layer is then made tractable by backpropagation, the chain rule applied recursively through the network's computational graph: one forward pass plus one backward pass yields gradients for every weight at once.
By the end, the full general-purpose image-classifier training loop is in place.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLearning from images without labels, self-supervised visionhttps://clawdemy.org/lessons/computer-vision/self-supervised-vision/lesson/https://clawdemy.org/lessons/computer-vision/self-supervised-vision/lesson/Phase 3 opener. Every supervised model so far in this track has needed labeled images, and labels are expensive (ImageNet's million labels took years). Self-supervised learning lets a model learn useful visual features from unlabeled images alone, by constructing pretext tasks whose labels come from the data itself. We walk the pretext-task history (rotation, jigsaw, colorization), the contrastive-learning shift (SimCLR, MoCo, BYOL) with one cosine similarity by hand, and masked image modeling (MAE, DINO/DINOv2). The pre-train-then-fine-tune workflow that powers most modern vision-language and multimodal systems lives here, and it is the engine that makes vision feasible in label-scarce domains (medical imaging, satellite, scientific data) where unlabeled data is abundant.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Sequence tools for vision, recurrence and attentionhttps://clawdemy.org/lessons/computer-vision/sequence-tools-for-vision/lesson/https://clawdemy.org/lessons/computer-vision/sequence-tools-for-vision/lesson/A single image is a static scene; many vision tasks involve sequences (captions are sequences of words, videos are sequences of frames, and the Vision Transformer treats an image itself as a sequence of patches). This lesson covers the two sequence-processing tools (recurrence and attention) at the level needed for vision applications. Recurrence (RNN, LSTM, GRU) processes a sequence one step at a time and carries a hidden state forward; attention compares every position to every other in parallel and returns a weighted average of values. We cover vision applications (CNN-RNN captioning, CNN-attention captioning, CNN-RNN video, Vision Transformer), work one attention computation by hand, and route to sister tracks (T12 L2 for recurrence; T5 multi-lesson + T14 for transformers) for the deep mechanics.
Combining Lec 7+8 into one lesson is a deliberate Phase 0 choice to avoid duplicating sister-track depth.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseTeaching machines to understand videohttps://clawdemy.org/lessons/computer-vision/video-understanding/lesson/https://clawdemy.org/lessons/computer-vision/video-understanding/lesson/A photo is one moment; a video is a sequence of moments stretched across time. This lesson walks the standard ways of adding the time dimension to a vision system, from the surprisingly competitive single-frame baseline through late and early fusion, 3D convolutions (~3x param cost per filter; C3D and I3D), two-stream networks (RGB appearance + optical-flow motion; SlowFast for the modern descendant), CNN-plus-RNN (cross-link to L7), and video transformers (TimeSformer's divided space-time attention as a practical factorization). The training loop is unchanged across all of them.
We work the 2D-vs-3D conv parameter-count ratio in the body and again in practice, and emphasize the practitioner discipline of always running the single-frame baseline as the floor any video model must beat.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseConnecting pictures and words, vision and languagehttps://clawdemy.org/lessons/computer-vision/vision-and-language/lesson/https://clawdemy.org/lessons/computer-vision/vision-and-language/lesson/Modern AI systems do not treat images and language as separate problems; they share a representation. This lesson covers CLIP's two-tower contrastive setup (image encoder + text encoder trained jointly on ~400M web image-text pairs), the downstream applications that fall out of the trained joint embedding space (zero-shot classification, image-text retrieval, captioning, VQA), modern general-purpose vision-language models (VLMs), and the economic frame that closes Phase 3 (image-text pairs are abundant on the web; CLIP-scale pre-training exploits that abundance).
We work one image-text cosine similarity by hand and one zero-shot-classification reasoning exercise.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseModels that imagine the world, world modelinghttps://clawdemy.org/lessons/computer-vision/world-modeling/lesson/https://clawdemy.org/lessons/computer-vision/world-modeling/lesson/Every vision system so far in this track has been reactive (process current input, output an answer). World modeling extends vision to predictive: given the past, predict the future. Self-driving trajectory prediction, robotics planning, video generation, and model-based reinforcement learning are all variants. This lesson covers world modeling at vision-context level: the three-piece architecture (encoder + dynamics + optional decoder), the central pixel-space-vs-latent-space prediction trade-off (worked with a parameter-cost calculation), landmark architectures (World Models, Dreamer family, MuZero, JEPA, Sora-style video world models), and the cross-track ties to T18 (model-based RL depth) and T24 (production-scale video generation depth).Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Actor-critic methodshttps://clawdemy.org/lessons/deep-reinforcement-learning/actor-critic/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/actor-critic/lesson/Lesson 5 of Track 18 (Deep Reinforcement Learning) and the close of Phase 1. REINFORCE's central problem is variance. Actor-critic methods reduce it by training a second network alongside the policy, a critic that estimates the value function and supplies a baseline (or a bootstrapped target) for the policy update. On the same sigmoid bandit from L4, the optimal baseline takes the variance from 0.0625 to zero (SNR from 1 to infinity); in practice the critic is learned and the variance reduction costs some bias. The lesson presents the two-network split, the advantage estimators (MC, TD, n-step, GAE), the bias-variance tradeoff, and the family that includes A2C/A3C, SAC, PPO, and the RLHF post-training step.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Brief: Control as inference (closes Phase 2)https://clawdemy.org/lessons/deep-reinforcement-learning/control-as-inference/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/control-as-inference/lesson/Editorial brief for Lesson 12 of Track 18. Phase 2 L12 (final): the optimality-conditioned graphical model, the soft Bellman backup, and the unification of SAC + RLHF + DPO. Closes Phase 2.Mon, 25 May 2026 00:00:00 GMTClawdemy18:00falseBrief: DQN (replay buffer, target network, double Q-learning)https://clawdemy.org/lessons/deep-reinforcement-learning/dqn/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/dqn/lesson/Editorial brief for Lesson 7 of Track 18 (Deep Reinforcement Learning). Phase 2 lesson 2: each DQN engineering trick mapped to a leg of the deadly triad named in L6. Worked: closed-form max-overestimation bias on a small example.Mon, 25 May 2026 00:00:00 GMTClawdemy15:00falseImitation learning and behavioral cloninghttps://clawdemy.org/lessons/deep-reinforcement-learning/imitation-learning/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/imitation-learning/lesson/Lesson 2 of Track 18 (Deep Reinforcement Learning).
The simplest approach to producing a policy is to ignore the reward entirely and copy an expert. Collect a dataset of (state, expert action) pairs and train a network by supervised learning to predict the expert's action. This is behavioral cloning. It is appealing because it turns RL into supervised learning, with no environment interaction during training. It breaks because small per-step errors compound across long trajectories: BC's worst-case expected mistakes scale as O(εT²) in episode length, where the linear-in-T analog (achieved by DAgger's on-policy correction) is O(εT). The lesson works the bound numerically, presents DAgger as the standard fix, names where BC is good enough anyway (short horizons, supervised LLM fine-tuning), and shows why long-horizon imitation generally needs either DAgger or genuine reinforcement learning.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Introduction to deep reinforcement learninghttps://clawdemy.org/lessons/deep-reinforcement-learning/introduction-to-deep-rl/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/introduction-to-deep-rl/lesson/Lesson 1 of Track 18 (Deep Reinforcement Learning), and the track opener. Reinforcement learning is the third major regime of machine learning: an agent acts in an environment, receives rewards over time (often delayed), and learns a policy that maximizes the accumulated reward. The deep variant replaces classical RL's lookup tables with neural networks, which gives the field its reach (Atari, AlphaGo, robotics, RLHF for LLMs) and its difficulty (credit assignment, distribution shift, broken convergence guarantees). This lesson situates the field, names the agent-environment loop and its vocabulary (state, action, reward, policy, return), works one discounted-return computation, and previews the difficulties the rest of the track responds to.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00false
Brief: Model-based RL, learning the dynamicshttps://clawdemy.org/lessons/deep-reinforcement-learning/model-based-learning/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/model-based-learning/lesson/Editorial brief for Lesson 9 of Track 18. Phase 2 L9: opens the P-branch of the dispatch table. Worked: least-squares fit of a linear-Gaussian model (dual-path with inspection); compounding-error rollout.Mon, 25 May 2026 00:00:00 GMTClawdemy16:00falseBrief: Planning with a learned modelhttps://clawdemy.org/lessons/deep-reinforcement-learning/planning-with-models/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/planning-with-models/lesson/Editorial brief for Lesson 10 of Track 18. Phase 2 L10: closes the P-branch and the L3 dispatch-table tour. Worked: one full CEM iteration by hand; MPC horizon-decision exercise.Mon, 25 May 2026 00:00:00 GMTClawdemy16:00falsePolicy gradients (REINFORCE)https://clawdemy.org/lessons/deep-reinforcement-learning/policy-gradients/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/policy-gradients/lesson/Lesson 4 of Track 18 (Deep Reinforcement Learning). The most direct way to improve a neural-network policy is to follow the gradient of the expected return with respect to its parameters.
The obstacle: the expectation is over trajectories sampled by the policy itself, so standard differentiation does not apply. The log-derivative trick is the one calculus identity that solves it, and the algorithm that falls out is REINFORCE (Williams, 1992). The lesson derives it from scratch, shows why the environment dynamics drop out (making deep RL model-free), works a sigmoid bandit with dual-path validation (analytic expected gradient vs single-sample variance), introduces the rewards-to-go and baseline-subtraction refinements that yield the advantage A^π = Q^π - V^π from lesson 3, and names the high-variance failure mode that the rest of the policy-gradient family exists to manage.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseBrief: PPO (trust regions, clipped surrogate, RLHF workhorse)https://clawdemy.org/lessons/deep-reinforcement-learning/ppo/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/ppo/lesson/Editorial brief for Lesson 8 of Track 18. Phase 2 lesson 3: the on-policy resolution to the deadly triad.
Derive PPO from importance-sampled surrogate via TRPO; show the asymmetric clip behavior; situate as the RLHF workhorse.Mon, 25 May 2026 00:00:00 GMTClawdemy16:00falseRL fundamentals (MDPs, returns, value, and policy)https://clawdemy.org/lessons/deep-reinforcement-learning/rl-fundamentals/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/rl-fundamentals/lesson/Lesson 3 of Track 18 (Deep Reinforcement Learning). The first lesson framed the agent-environment loop; the second showed why imitation alone is not enough. This lesson makes the loop precise. The Markov decision process tuple (S, A, P, R, γ) is the formal object every RL algorithm in the rest of the track is defined against, and the value functions (V, Q, advantage A = Q - V) and Bellman equation it gives you are the language those algorithms speak. The lesson defines each object, solves a small 2-state Bellman system by hand (V(s0) = 1/(1-γ²) = 5.263 at γ=0.9), verifies it by direct geometric summation, and ends with the dispatch table that every later T18 lesson plugs into.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Brief: RLHF (opens Phase 3)https://clawdemy.org/lessons/deep-reinforcement-learning/rlhf/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/rlhf/lesson/Editorial brief for Lesson 13 of Track 18. Phase 3 opener: the InstructGPT pipeline derived as the variational solution from L11/L12. §6 watch-zone discipline applied throughout (operational instruments + empirical/value distinction).Mon, 25 May 2026 00:00:00 GMTClawdemy18:00falseBrief: Value-based RL (Q-learning, the deadly triad)https://clawdemy.org/lessons/deep-reinforcement-learning/value-based-rl/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/value-based-rl/lesson/Editorial brief for Lesson 6 of Track 18 (Deep Reinforcement Learning). Phase 2 opener: the Q-branch of the dispatch table. Derive Q-learning from the Bellman optimality equation, run Q-iteration on a small MDP with dual-path validation, name the deadly triad.Mon, 25 May 2026 00:00:00 GMTClawdemy15:00falseBrief: Variational inference for RLhttps://clawdemy.org/lessons/deep-reinforcement-learning/variational-inference/lesson/https://clawdemy.org/lessons/deep-reinforcement-learning/variational-inference/lesson/Editorial brief for Lesson 11 of Track 18.
Phase 2 L11: ELBO, reparameterization trick, the two RL applications. Sets up L12 control-as-inference.Mon, 25 May 2026 00:00:00 GMTClawdemy17:00falseAutoregressive models, factoring by the chain rulehttps://clawdemy.org/lessons/generative-models-and-diffusion/autoregressive-models/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/autoregressive-models/lesson/Lesson 2 of Track 19 (Generative Models and Diffusion), and the first math-density lesson. An autoregressive model factors any joint distribution into a product of conditionals using the chain rule of probability, learns each conditional with a neural network that respects causality, and trains by minimizing the negative log-likelihood (= next-token cross-entropy). This is the math behind every modern large language model, in one identity and one architectural constraint.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseDiffusion models I, the forward and reverse processeshttps://clawdemy.org/lessons/generative-models-and-diffusion/diffusion-i-forward-and-reverse-processes/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/diffusion-i-forward-and-reverse-processes/lesson/Lesson 12 of Track 19 (Generative Models and Diffusion).
A diffusion model defines a fixed forward Markov chain that progressively noises data into Gaussian noise, then learns a reverse Markov chain that denoises step by step back to data. This lesson builds the DDPM derivation, derives the closed-form forward-sampling shortcut that makes training feasible, shows that the simplified DDPM loss is denoising score matching at the timestep's noise level, and walks the training and sampling loops. Opens the §6 watch territory for L12-L14 with the five-layer in-body checkpoint pattern.Mon, 25 May 2026 00:00:00 GMTClawdemy16:00falseLesson 12 of Track 19 (Generative Models and Diffusion). A diffusion model defines a fixed forward Markov chain that progressively noises data into Gaussian noise, then learns a reverse Markov chain that denoises step by step back to data. This lesson builds the DDPM derivation, derives the closed-form forward-sampling shortcut that makes training feasible, shows that the simplified DDPM loss is denoising score matching at the timestep's noise level, and walks the training and sampling loops. Opens the §6 watch territory for L12-L14 with the five-layer in-body checkpoint pattern.Energy-based models, the partition-function problemhttps://clawdemy.org/lessons/generative-models-and-diffusion/energy-based-models/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/energy-based-models/lesson/Lesson 10 of Track 19 (Generative Models and Diffusion), opening Phase 3. An energy-based model defines an unnormalized density by naming an energy function E(x) and dividing by a normalization constant Z. The architectural freedom on E is the paradigm's main appeal; the intractability of Z is the entire engineering challenge. 
This lesson derives the model, walks the maximum-likelihood gradient (which contains a hard-to-estimate negative-phase expectation), and identifies the conceptual escape (the partition function vanishes under the x-gradient, so the score function is computable directly) that opens the score-matching framework in the next lesson and the diffusion paradigm in lessons 12-14.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 10 of Track 19 (Generative Models and Diffusion), opening Phase 3. An energy-based model defines an unnormalized density by naming an energy function E(x) and dividing by a normalization constant Z. The architectural freedom on E is the paradigm's main appeal; the intractability of Z is the entire engineering challenge. This lesson derives the model, walks the maximum-likelihood gradient (which contains a hard-to-estimate negative-phase expectation), and identifies the conceptual escape (the partition function vanishes under the x-gradient, so the score function is computable directly) that opens the score-matching framework in the next lesson and the diffusion paradigm in lessons 12-14.Evaluating generative modelshttps://clawdemy.org/lessons/generative-models-and-diffusion/evaluating-generative-models/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/evaluating-generative-models/lesson/Lesson 9 of Track 19 (Generative Models and Diffusion), closing Phase 2. How do you compare a VAE to a GAN to a diffusion model, when likelihood is exact for some paradigms, bounded for others, and unavailable for the rest? This lesson covers the paradigm-agnostic evaluation toolkit (FID, Inception Score, precision and recall for distributions, human preference studies) and the cross-paradigm reading that each paradigm has its own fingerprint of evaluation instruments.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 9 of Track 19 (Generative Models and Diffusion), closing Phase 2. 
How do you compare a VAE to a GAN to a diffusion model, when likelihood is exact for some paradigms, bounded for others, and unavailable for the rest? This lesson covers the paradigm-agnostic evaluation toolkit (FID, Inception Score, precision and recall for distributions, human preference studies) and the cross-paradigm reading that each paradigm has its own fingerprint of evaluation instruments.GANs, the minimax gamehttps://clawdemy.org/lessons/generative-models-and-diffusion/gans-the-minimax-game/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/gans-the-minimax-game/lesson/Lesson 7 of Track 19 (Generative Models and Diffusion). A generative adversarial network drops the likelihood objective entirely and trains by a minimax game between a generator and a discriminator. This lesson states the objective, derives the optimal discriminator at fixed generator, shows that the implicit divergence the generator minimizes is the Jensen-Shannon divergence (not the forward KL of Phase 1), and explains why mode collapse and training instability are paradigm-level features of that divergence choice rather than incidental bugs.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 7 of Track 19 (Generative Models and Diffusion). A generative adversarial network drops the likelihood objective entirely and trains by a minimax game between a generator and a discriminator. 
This lesson states the objective, derives the optimal discriminator at fixed generator, shows that the implicit divergence the generator minimizes is the Jensen-Shannon divergence (not the forward KL of Phase 1), and explains why mode collapse and training instability are paradigm-level features of that divergence choice rather than incidental bugs.Latent variables and the ELBOhttps://clawdemy.org/lessons/generative-models-and-diffusion/latent-variables-and-the-elbo/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/latent-variables-and-the-elbo/lesson/Lesson 5 of Track 19 (Generative Models and Diffusion), opening Phase 2 (latent-and-adversarial). A latent-variable model introduces a hidden code z behind the data, with p_model(x) = ∫ p(x|z) p(z) dz, an intractable marginal. The evidence lower bound (ELBO) is a tractable lower bound on log p_model(x), derived in two lines using Jensen's inequality, that splits into a reconstruction term and a KL regularizer. The gap between ELBO and log p_model(x) is itself a KL divergence (from the variational posterior to the true posterior), and maximizing the ELBO closes that gap automatically.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 5 of Track 19 (Generative Models and Diffusion), opening Phase 2 (latent-and-adversarial). A latent-variable model introduces a hidden code z behind the data, with p_model(x) = ∫ p(x|z) p(z) dz, an intractable marginal. The evidence lower bound (ELBO) is a tractable lower bound on log p_model(x), derived in two lines using Jensen's inequality, that splits into a reconstruction term and a KL regularizer. 
The gap between ELBO and log p_model(x) is itself a KL divergence (from the variational posterior to the true posterior), and maximizing the ELBO closes that gap automatically.Maximum likelihood and the KL viewhttps://clawdemy.org/lessons/generative-models-and-diffusion/maximum-likelihood-and-the-kl-view/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/maximum-likelihood-and-the-kl-view/lesson/Lesson 3 of Track 19 (Generative Models and Diffusion). The previous lesson minimized the negative log-likelihood without saying why. This lesson derives it from first principles: maximum likelihood is the empirical version of minimizing the forward KL divergence from the data distribution to the model, and that single derivation explains why every likelihood-based paradigm in this track (autoregressive, flow, VAE-via-ELBO) shares the same training objective.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00falseLesson 3 of Track 19 (Generative Models and Diffusion). The previous lesson minimized the negative log-likelihood without saying why. This lesson derives it from first principles: maximum likelihood is the empirical version of minimizing the forward KL divergence from the data distribution to the model, and that single derivation explains why every likelihood-based paradigm in this track (autoregressive, flow, VAE-via-ELBO) shares the same training objective.Normalizing flows, change of variables for distributionshttps://clawdemy.org/lessons/generative-models-and-diffusion/normalizing-flows/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/normalizing-flows/lesson/Lesson 4 of Track 19 (Generative Models and Diffusion), closing Phase 1. A normalizing flow parameterizes the model distribution exactly through an invertible transformation from a simple base distribution. The math is the multidimensional change-of-variables formula plus a Jacobian determinant that rescales density to conserve probability. 
This lesson derives the formula, builds it into the same NLL training objective from L3, and shows what architectural constraints (invertibility, tractable Jacobian) flows must satisfy to deliver exact likelihood and parallel sampling in one paradigm.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00falseLesson 4 of Track 19 (Generative Models and Diffusion), closing Phase 1. A normalizing flow parameterizes the model distribution exactly through an invertible transformation from a simple base distribution. The math is the multidimensional change-of-variables formula plus a Jacobian determinant that rescales density to conserve probability. This lesson derives the formula, builds it into the same NLL training objective from L3, and shows what architectural constraints (invertibility, tractable Jacobian) flows must satisfy to deliver exact likelihood and parallel sampling in one paradigm.Score matching and score-based generationhttps://clawdemy.org/lessons/generative-models-and-diffusion/score-matching/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/score-matching/lesson/Lesson 11 of Track 19 (Generative Models and Diffusion). The previous lesson ended on an observation: the partition function vanishes under the x-gradient, so the score function is computable directly. This lesson is what you do with that observation. Score matching trains a model to estimate the score directly without ever computing Z. The practical form, denoising score matching, reduces the objective to a noise-prediction MSE that scales to high-dimensional data and sets up the diffusion paradigm in the next three lessons.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 11 of Track 19 (Generative Models and Diffusion). The previous lesson ended on an observation: the partition function vanishes under the x-gradient, so the score function is computable directly. This lesson is what you do with that observation. 
Score matching trains a model to estimate the score directly without ever computing Z. The practical form, denoising score matching, reduces the objective to a noise-prediction MSE that scales to high-dimensional data and sets up the diffusion paradigm in the next three lessons.VAE training in practice, the reparameterization trickhttps://clawdemy.org/lessons/generative-models-and-diffusion/vae-training-in-practice/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/vae-training-in-practice/lesson/Lesson 6 of Track 19 (Generative Models and Diffusion). The previous lesson derived the ELBO abstractly. This lesson takes it to a concrete variational autoencoder, where the encoder and decoder are neural networks. The reparameterization trick (writing a stochastic sample as a deterministic function of the parameters plus an independent noise variable) makes the ELBO differentiable, the closed-form Gaussian KL makes the regularizer cheap, and the per-example loss is one Monte Carlo reconstruction plus one closed-form KL.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 6 of Track 19 (Generative Models and Diffusion). The previous lesson derived the ELBO abstractly. This lesson takes it to a concrete variational autoencoder, where the encoder and decoder are neural networks. The reparameterization trick (writing a stochastic sample as a deterministic function of the parameters plus an independent noise variable) makes the ELBO differentiable, the closed-form Gaussian KL makes the regularizer cheap, and the per-example loss is one Monte Carlo reconstruction plus one closed-form KL.GAN training in practice, Wasserstein loss and gradient penaltyhttps://clawdemy.org/lessons/generative-models-and-diffusion/wgan-gradient-penalty/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/wgan-gradient-penalty/lesson/Lesson 8 of Track 19 (Generative Models and Diffusion). 
The previous lesson showed that the original GAN minimizes Jensen-Shannon divergence and inherits paradigm-level pathologies (vanishing gradients, mode collapse, no clean stopping criterion). This lesson keeps the minimax framework but changes the divergence to the Wasserstein distance, which gives meaningful gradients even when the data and generator distributions barely overlap. The architectural change is the gradient penalty, which softly enforces the 1-Lipschitz constraint the Wasserstein formulation requires. The result is WGAN-GP, the production-grade GAN variant most adversarial systems actually use.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 8 of Track 19 (Generative Models and Diffusion). The previous lesson showed that the original GAN minimizes Jensen-Shannon divergence and inherits paradigm-level pathologies (vanishing gradients, mode collapse, no clean stopping criterion). This lesson keeps the minimax framework but changes the divergence to the Wasserstein distance, which gives meaningful gradients even when the data and generator distributions barely overlap. The architectural change is the gradient penalty, which softly enforces the 1-Lipschitz constraint the Wasserstein formulation requires. The result is WGAN-GP, the production-grade GAN variant most adversarial systems actually use.What a generative model is, and the four-paradigm maphttps://clawdemy.org/lessons/generative-models-and-diffusion/what-a-generative-model-is/lesson/https://clawdemy.org/lessons/generative-models-and-diffusion/what-a-generative-model-is/lesson/The opener of Track 19 (Generative Models and Diffusion). A generative model learns a distribution well enough to sample new data from it; almost every modern AI system that produces something (text, images, audio, video) is one. 
This lesson defines generative precisely, lays out the four paradigms the whole track is organized around (autoregressive, latent-variable, adversarial, score-based / diffusion), and shows how to place any modern system on that map at a glance.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseThe opener of Track 19 (Generative Models and Diffusion). A generative model learns a distribution well enough to sample new data from it; almost every modern AI system that produces something (text, images, audio, video) is one. This lesson defines generative precisely, lays out the four paradigms the whole track is organized around (autoregressive, latent-variable, adversarial, score-based / diffusion), and shows how to place any modern system on that map at a glance.Agentshttps://clawdemy.org/lessons/llm-ops-and-production/agents/lesson/https://clawdemy.org/lessons/llm-ops-and-production/agents/lesson/Lesson 10 of Track 21. What an LLM agent is (the lesson-4 tool-use loop with the model deciding when to stop), the three foundational patterns (function-calling agents, ReAct, plan-and-execute), the three tests for whether a task should be an agent (variable shape + bounded tools + acceptable cost), the five engineering failure modes (loops, wrong paths, compound cost, harder evaluation, brittle tool boundaries), and how lesson 7's LLMOps discipline scales to trajectory-level evaluation. Taught technical-primer: WHAT, WHEN, WHAT-GOES-WRONG, HOW; agent-autonomy and contested-alignment debates explicitly out of scope.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00falseLesson 10 of Track 21. 
What an LLM agent is (the lesson-4 tool-use loop with the model deciding when to stop), the three foundational patterns (function-calling agents, ReAct, plan-and-execute), the three tests for whether a task should be an agent (variable shape + bounded tools + acceptable cost), the five engineering failure modes (loops, wrong paths, compound cost, harder evaluation, brittle tool boundaries), and how lesson 7's LLMOps discipline scales to trajectory-level evaluation. Taught technical-primer: WHAT, WHEN, WHAT-GOES-WRONG, HOW; agent-autonomy and contested-alignment debates explicitly out of scope.Augmented language models, retrieval and toolshttps://clawdemy.org/lessons/llm-ops-and-production/augmented-llms/lesson/https://clawdemy.org/lessons/llm-ops-and-production/augmented-llms/lesson/Lesson 4 of Track 21, opening Phase 2 (building production apps). The two patterns that take an LLM beyond what it was trained on: retrieval-augmented generation (RAG, with its seven moving parts and the trade-offs that decide whether it works), and tool use (the four-step loop where the model calls functions you define). Modern apps often implement RAG as a tool, letting the model decide when retrieval is needed. Every move lives against the three productive limits from lesson 2.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 4 of Track 21, opening Phase 2 (building production apps). The two patterns that take an LLM beyond what it was trained on: retrieval-augmented generation (RAG, with its seven moving parts and the trade-offs that decide whether it works), and tool use (the four-step loop where the model calls functions you define). Modern apps often implement RAG as a tool, letting the model decide when retrieval is needed. 
Every move lives against the three productive limits from lesson 2.Industry perspective: where the field is goinghttps://clawdemy.org/lessons/llm-ops-and-production/industry-perspective/lesson/https://clawdemy.org/lessons/llm-ops-and-production/industry-perspective/lesson/Lesson 11 of Track 21. The track capstone. Synthesizes the 10 lessons that came before (arc: demo to production-grade application) against the industry perspective from a Full Stack Deep Learning Bootcamp fireside chat with Peter Welinder (OpenAI). Three rules for reading a fireside (attribute, separate, generate questions). Five durable bets the field has converged on. Three concrete reader moves post-track. Treated as synthesis + careful read of a primary source, not as a forecast; speaker views are attributed as views, not absorbed as canon.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLaunch an LLM app in one hourhttps://clawdemy.org/lessons/llm-ops-and-production/launch-an-llm-app/lesson/https://clawdemy.org/lessons/llm-ops-and-production/launch-an-llm-app/lesson/Lesson 1 of Track 21, the production-tier track that opens by shipping. The track inverts the usual order: build a working LLM application first, then learn what makes it actually good. 
This lesson covers the five components of a minimum-viable LLM app (hosted model, API key, prompt template, application code, UI + deployment), builds one in about thirty lines of Python (Streamlit plus the Anthropic Claude API or another provider's), and maps honestly to the gaps the rest of the track refines (retrieval, prompt engineering, UX, observability).Mon, 25 May 2026 00:00:00 GMTClawdemy11:00falseLLM foundations for productionhttps://clawdemy.org/lessons/llm-ops-and-production/llm-foundations/lesson/https://clawdemy.org/lessons/llm-ops-and-production/llm-foundations/lesson/Lesson 2 of Track 21. The working picture a production builder needs after shipping the minimum app. A hosted LLM is a stateless next-token function bounded by three productive limits: context length (a hard input budget shared by system + retrieved + history + max_tokens output), cost per token (input vs output priced separately, output usually several times more, compounds at scale), and latency (TTFT + output_tokens / tokens_per_second; streaming masks it). The constraints under which every later design decision lives.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00falseLesson 2 of Track 21. The working picture a production builder needs after shipping the minimum app. 
A hosted LLM is a stateless next-token function bounded by three productive limits: context length (a hard input budget shared by system + retrieved + history + max_tokens output), cost per token (input vs output priced separately, output usually several times more, compounds at scale), and latency (TTFT + output_tokens / tokens_per_second; streaming masks it). The constraints under which every later design decision lives.LLMOpshttps://clawdemy.org/lessons/llm-ops-and-production/llmops/lesson/https://clawdemy.org/lessons/llm-ops-and-production/llmops/lesson/Lesson 7 of Track 21, closing Phase 2. The operational layer that keeps an LLM application working over time: the LLM analogue of DevOps and MLOps. Five engineering pillars: observability (log enough to debug), evaluation in production (sample + score live; A/B test changes), prompt versioning (treat prompts as code), cost and latency monitoring (dashboards + alerts), and regression testing (suite run before every change; makes model upgrades safe). The smallest practical first stack is days, not months, and the tools matter less than the discipline.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 7 of Track 21, closing Phase 2. The operational layer that keeps an LLM application working over time: the LLM analogue of DevOps and MLOps. Five engineering pillars: observability (log enough to debug), evaluation in production (sample + score live; A/B test changes), prompt versioning (treat prompts as code), cost and latency monitoring (dashboards + alerts), and regression testing (suite run before every change; makes model upgrades safe). The smallest practical first stack is days, not months, and the tools matter less than the discipline.Project walkthrough, a real LLM application end to endhttps://clawdemy.org/lessons/llm-ops-and-production/project-walkthrough/lesson/https://clawdemy.org/lessons/llm-ops-and-production/project-walkthrough/lesson/Lesson 5 of Track 21. 
The bootcamp's worked example, askFSDL (a Q&A app over the FSDL course materials), read for the production decisions it embeds at each pipeline stage: knowledge-source scoping, content-shaped chunking with metadata, source-carrying retrieval, a scope-honest citation-asking system prompt, streaming generation with citations, and logging that seeds LLMOps. The complexity is in the decisions, not the line count: real apps of this shape are a few hundred lines.Mon, 25 May 2026 00:00:00 GMTClawdemy11:00falsePrompt engineering, "Learn to Spell"https://clawdemy.org/lessons/llm-ops-and-production/prompt-engineering/lesson/https://clawdemy.org/lessons/llm-ops-and-production/prompt-engineering/lesson/Lesson 3 of Track 21, closing Phase 1. Prompt engineering is the single highest-leverage application skill, and the prompt is the spec for what the assistant is. This lesson covers the toolkit (clarity, format constraints, few-shot, chain-of-thought, system prompts, persona, delimiters, end-placement, negatives used sparingly), when a prompt fix beats a code fix (the largest, cheapest category of failures), the discipline that turns prompting into engineering (version + test on held-out examples), and where prompts run out (retrieval, tool use, fine-tuning, lessons 4 and 9).Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 3 of Track 21, closing Phase 1. Prompt engineering is the single highest-leverage application skill, and the prompt is the spec for what the assistant is. 
This lesson covers the toolkit (clarity, format constraints, few-shot, chain-of-thought, system prompts, persona, delimiters, end-placement, negatives used sparingly), when a prompt fix beats a code fix (the largest, cheapest category of failures), the discipline that turns prompting into engineering (version + test on held-out examples), and where prompts run out (retrieval, tool use, fine-tuning, lessons 4 and 9).Training your own LLMhttps://clawdemy.org/lessons/llm-ops-and-production/training-your-own-llm/lesson/https://clawdemy.org/lessons/llm-ops-and-production/training-your-own-llm/lesson/Lesson 9 of Track 21. The deep dive on the fine-tune point of the build-vs-buy spectrum from lesson 8. When training your own (smaller, specialized) model is the right move for a production application (the three-things-true-at-once test), the staged pipeline most teams should follow (open checkpoint → curated SFT data → LoRA training → optional DPO → eval → A/B test), the practical tools (TRL, Axolotl, managed compute), the economics that decide payback, and how fine-tuning fits the mix architecture. Taught technical-primer: mechanical when/how, with broader debates explicitly out of scope.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00falseLesson 9 of Track 21. The deep dive on the fine-tune point of the build-vs-buy spectrum from lesson 8. When training your own (smaller, specialized) model is the right move for a production application (the three-things-true-at-once test), the staged pipeline most teams should follow (open checkpoint → curated SFT data → LoRA training → optional DPO → eval → A/B test), the practical tools (TRL, Axolotl, managed compute), the economics that decide payback, and how fine-tuning fits the mix architecture. 
Taught technical-primer: mechanical when/how, with broader debates explicitly out of scope.UX for language user interfaceshttps://clawdemy.org/lessons/llm-ops-and-production/ux-for-luis/lesson/https://clawdemy.org/lessons/llm-ops-and-production/ux-for-luis/lesson/Lesson 6 of Track 21. A language user interface is a new interaction surface, and the patterns that make one usable are different from the patterns of forms and buttons. The five core patterns (streaming, citations, regeneration, hedging, recoverable failure), the supporting details that lift quality, and a critique-this-UX checklist. Taught as interaction-design throughout: content-policy and moderation debates are out of scope here.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00falseLesson 6 of Track 21. A language user interface is a new interaction surface, and the patterns that make one usable are different from the patterns of forms and buttons. The five core patterns (streaming, citations, regeneration, hedging, recoverable failure), the supporting details that lift quality, and a critique-this-UX checklist. Taught as interaction-design throughout: content-policy and moderation debates are out of scope here.What's next, the LLM landscape in motionhttps://clawdemy.org/lessons/llm-ops-and-production/whats-next/lesson/https://clawdemy.org/lessons/llm-ops-and-production/whats-next/lesson/Lesson 8 of Track 21, opening Phase 3. A survey of the six directions the LLM landscape is moving (longer context, multimodality, smaller specialized models, the build-vs-buy spectrum, agents, reasoning models), what each changes for a builder reading through lesson 2's productive limits, and how three of them set up the deeper Phase 3 lessons that follow. Survey-lean: lighter pedagogy, breadth-over-depth, points forward.Mon, 25 May 2026 00:00:00 GMTClawdemy10:00falseLesson 8 of Track 21, opening Phase 3. 
A survey of the six directions the LLM landscape is moving (longer context, multimodality, smaller specialized models, the build-vs-buy spectrum, agents, reasoning models), what each changes for a builder reading through lesson 2's productive limits, and how three of them set up the deeper Phase 3 lessons that follow. Survey-lean: lighter pedagogy, breadth-over-depth, points forward.From language models to large multimodal modelshttps://clawdemy.org/lessons/multimodal-ai/from-llms-to-lmms/lesson/https://clawdemy.org/lessons/multimodal-ai/from-llms-to-lmms/lesson/Lesson 2 of Track 24 (Multimodal AI), opening Phase 2 (Building large multimodal models). L1 said the most common way to build a multimodal model is to take an existing LLM and attach a vision encoder. This lesson walks that path concretely through CogVLM: how a pretrained vision transformer produces image tokens, how a bridge projects them into the LLM's input space, where CogVLM goes deeper than LLaVA-style designs with its 'visual expert' lanes, and how the model is trained in two stages without destroying the underlying language abilities.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 2 of Track 24 (Multimodal AI), opening Phase 2 (Building large multimodal models). L1 said the most common way to build a multimodal model is to take an existing LLM and attach a vision encoder. 
This lesson walks that path concretely through CogVLM: how a pretrained vision transformer produces image tokens, how a bridge projects them into the LLM's input space, where CogVLM goes deeper than LLaVA-style designs with its 'visual expert' lanes, and how the model is trained in two stages without destroying the underlying language abilities.Joint embedding predictive architectures (JEPA) and world modelinghttps://clawdemy.org/lessons/multimodal-ai/jepa-and-world-modeling/lesson/https://clawdemy.org/lessons/multimodal-ai/jepa-and-world-modeling/lesson/Lesson 7 of Track 24 (Multimodal AI), opening Phase 4 (Advanced multimodal directions). Phases 2 and 3 used the same underlying training objective: generative pretraining, where the model predicts a raw output (next token, next denoising step) that can be compared against ground truth. This lesson covers the most articulated alternative direction: JEPA, which predicts representations in embedding space rather than raw outputs. The bet is that focusing capacity on semantic structure instead of surface detail produces better representations for understanding and for world modeling. Covers the I-JEPA and V-JEPA recipes, the world-modeling connection, where JEPA sits today (research-strong, not production-dominant), and applies the operational scope test to separate technique from autonomy-philosophy questions.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 7 of Track 24 (Multimodal AI), opening Phase 4 (Advanced multimodal directions). Phases 2 and 3 used the same underlying training objective: generative pretraining, where the model predicts a raw output (next token, next denoising step) that can be compared against ground truth. This lesson covers the most articulated alternative direction: JEPA, which predicts representations in embedding space rather than raw outputs. 
The bet is that focusing capacity on semantic structure instead of surface detail produces better representations for understanding and for world modeling. Covers the I-JEPA and V-JEPA recipes, the world-modeling connection, where JEPA sits today (research-strong, not production-dominant), and applies the operational scope test to separate technique from autonomy-philosophy questions.Multimodal world models for sciencehttps://clawdemy.org/lessons/multimodal-ai/multimodal-world-models-for-science/lesson/https://clawdemy.org/lessons/multimodal-ai/multimodal-world-models-for-science/lesson/Lesson 8 of Track 24 (Multimodal AI), in Phase 4 (Advanced multimodal directions). Lesson 7 introduced JEPA-style world models as a different training paradigm aimed at predicting future semantic state. This lesson takes that framing into a specific scientific application: drug discovery. The 'world' becomes biological (cells, molecules, pathways), the 'future state' becomes how a drug candidate will perturb a biological system, and the central new discipline is the sharp line between 'model performs well on a biological benchmark' and 'this is medically useful' (a distinction that has tripped up the medical-AI literature repeatedly). Covers the data-heterogeneity challenge biology raises, the multimodal world model framing applied to it, where Noetik.ai sits in 2026, and the medical-AI-specialized operational scope test.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 8 of Track 24 (Multimodal AI), in Phase 4 (Advanced multimodal directions). Lesson 7 introduced JEPA-style world models as a different training paradigm aimed at predicting future semantic state. This lesson takes that framing into a specific scientific application: drug discovery. 
The 'world' becomes biological (cells, molecules, pathways), the 'future state' becomes how a drug candidate will perturb a biological system, and the central new discipline is the sharp line between 'model performs well on a biological benchmark' and 'this is medically useful' (a distinction that has tripped up the medical-AI literature repeatedly). Covers the data-heterogeneity challenge biology raises, the multimodal world model framing applied to it, where Noetik.ai sits in 2026, and the medical-AI-specialized operational scope test.Native multimodal intelligencehttps://clawdemy.org/lessons/multimodal-ai/native-multimodal-intelligence/lesson/https://clawdemy.org/lessons/multimodal-ai/native-multimodal-intelligence/lesson/Lesson 3 of Track 24 (Multimodal AI), in Phase 2 (Building large multimodal models). Lesson 2 ended on a sharp limit: in the encode-then-fuse pattern, the vision encoder and the LLM were trained separately and bridged afterward. Native multimodal architectures take the opposite bet, training one transformer on a mixed stream of text, image, audio, and video tokens from the very first step. This lesson contrasts the two designs, names what 'native' actually buys (joint co-evolution, deeper cross-modal grounding, first-class generation of any modality, low-latency interaction), walks the architectural shape, and names the costs (tokenizer design, data scale, compute, output expense).Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 3 of Track 24 (Multimodal AI), in Phase 2 (Building large multimodal models). Lesson 2 ended on a sharp limit: in the encode-then-fuse pattern, the vision encoder and the LLM were trained separately and bridged afterward. Native multimodal architectures take the opposite bet, training one transformer on a mixed stream of text, image, audio, and video tokens from the very first step. 
This lesson contrasts the two designs, names what 'native' actually buys (joint co-evolution, deeper cross-modal grounding, first-class generation of any modality, low-latency interaction), walks the architectural shape, and names the costs (tokenizer design, data scale, compute, output expense).Reasoning over multimodal inputshttps://clawdemy.org/lessons/multimodal-ai/reasoning-over-multimodal-inputs/lesson/https://clawdemy.org/lessons/multimodal-ai/reasoning-over-multimodal-inputs/lesson/Lesson 4 of Track 24 (Multimodal AI), closing Phase 2 (Building large multimodal models). Lessons 2 and 3 covered how multimodal models perceive multiple modalities. A different capability matters once they can perceive: reasoning over those modalities. Modern reasoning models spend significant inference compute generating chain-of-thought before answering; when that reasoning extends to images, with tool use and deliberative safety checks, you get a qualitatively different system. This lesson walks how that combination (reasoning + multimodal + tool use, plus the deliberative-alignment safety technique) actually works.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 4 of Track 24 (Multimodal AI), closing Phase 2 (Building large multimodal models). Lessons 2 and 3 covered how multimodal models perceive multiple modalities. A different capability matters once they can perceive: reasoning over those modalities. Modern reasoning models spend significant inference compute generating chain-of-thought before answering; when that reasoning extends to images, with tool use and deliberative safety checks, you get a qualitatively different system. 
This lesson walks how that combination (reasoning + multimodal + tool use, plus the deliberative-alignment safety technique) actually works.Transformers for video generationhttps://clawdemy.org/lessons/multimodal-ai/transformers-for-video-generation/lesson/https://clawdemy.org/lessons/multimodal-ai/transformers-for-video-generation/lesson/Lesson 6 of Track 24 (Multimodal AI), closing Phase 3 (Generative multimodal models). Lesson 5 walked the U-Net to DiT shift for image generation. This lesson takes the same DiT-family architecture and asks what changes when the output is video: a new temporal dimension (spacetime patches), a compute explosion that latent compression in both space and time must manage, a captioned-video data problem, and one central new technical challenge (temporal consistency). It covers Meta's Movie Gen and OpenAI's Sora as the production landscape, and is explicit about two additional out-of-scope conversations video raises beyond image generation (real-person reanimation, video provenance with temporal-coherence requirements).Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 6 of Track 24 (Multimodal AI), closing Phase 3 (Generative multimodal models). Lesson 5 walked the U-Net to DiT shift for image generation. This lesson takes the same DiT-family architecture and asks what changes when the output is video: a new temporal dimension (spacetime patches), a compute explosion that latent compression in both space and time must manage, a captioned-video data problem, and one central new technical challenge (temporal consistency). 
It covers Meta's Movie Gen and OpenAI's Sora as the production landscape, and is explicit about two additional out-of-scope conversations video raises beyond image generation (real-person reanimation, video provenance with temporal-coherence requirements).Transformers in diffusion models for image generationhttps://clawdemy.org/lessons/multimodal-ai/transformers-in-diffusion/lesson/https://clawdemy.org/lessons/multimodal-ai/transformers-in-diffusion/lesson/Lesson 5 of Track 24 (Multimodal AI), opening Phase 3 (Generative multimodal models). Phase 2 covered models that accept images as input. Phase 3 turns to the opposite direction: models that output images. Modern image generation runs on diffusion, and a specific architectural shift drove its recent quality jump: replacing the convolutional U-Net backbone (Stable Diffusion 1.x, DALL-E 2) with a transformer backbone (DiT). This lesson covers the shift, what it buys (scaling laws, global structure, architectural unification), how text conditioning has folded back into the same transformer machinery (MM-DiT), and is explicit about what is and isn't in scope (technique and architecture in; use-case, provenance, sector-policy, training-data-licensing, and likeness-rights conversations are deferred to their own forums).Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 5 of Track 24 (Multimodal AI), opening Phase 3 (Generative multimodal models). Phase 2 covered models that accept images as input. Phase 3 turns to the opposite direction: models that output images. Modern image generation runs on diffusion, and a specific architectural shift drove its recent quality jump: replacing the convolutional U-Net backbone (Stable Diffusion 1.x, DALL-E 2) with a transformer backbone (DiT). 
This lesson covers the shift, what it buys (scaling laws, global structure, architectural unification), how text conditioning has folded back into the same transformer machinery (MM-DiT), and is explicit about what is and isn't in scope (technique and architecture in; use-case, provenance, sector-policy, training-data-licensing, and likeness-rights conversations are deferred to their own forums).What multimodal AI actually ishttps://clawdemy.org/lessons/multimodal-ai/what-multimodal-ai-actually-is/lesson/https://clawdemy.org/lessons/multimodal-ai/what-multimodal-ai-actually-is/lesson/Lesson 1 of Track 24 (Multimodal AI), the opener of Phase 1 (Orientation). A moment ago you saw a face, heard a voice, and read a caption all at the same time, and your brain treated all of it as one thing. For most of AI's history a model could only handle one modality at a time. Multimodal AI is the family of systems built to break that wall. This opener defines what 'multimodal' actually means, names the modalities and the central fusion challenge, and lays out the operating modes (multimodal input, multimodal output, both) that the rest of the track explores in depth.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00falseLesson 1 of Track 24 (Multimodal AI), the opener of Phase 1 (Orientation). A moment ago you saw a face, heard a voice, and read a caption all at the same time, and your brain treated all of it as one thing. For most of AI's history a model could only handle one modality at a time. Multimodal AI is the family of systems built to break that wall. 
This opener defines what 'multimodal' actually means, names the modalities and the central fusion challenge, and lays out the operating modes (multimodal input, multimodal output, both) that the rest of the track explores in depth.Function approximation and deep RLhttps://clawdemy.org/lessons/reinforcement-learning-foundations/function-approximation-and-deep-rl/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/function-approximation-and-deep-rl/lesson/Lesson 9 of Track 17. Tables don't scale; Atari, Go, and robotics state spaces are too big. Function approximation replaces the table with a parameterized function (linear features or a neural network), keeps the Bellman recursion intact, and lets one update generalize across all states via shared parameters. This lesson works a single semi-gradient step on a linear Q, explains why the deadly triad (TD + off-policy + function approximation) can diverge, and shows how DQN's experience replay and target network make value-based deep RL stable.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00falseLesson 9 of Track 17. Tables don't scale; Atari, Go, and robotics state spaces are too big. Function approximation replaces the table with a parameterized function (linear features or a neural network), keeps the Bellman recursion intact, and lets one update generalize across all states via shared parameters. This lesson works a single semi-gradient step on a linear Q, explains why the deadly triad (TD + off-policy + function approximation) can diverge, and shows how DQN's experience replay and target network make value-based deep RL stable.Markov Decision Processeshttps://clawdemy.org/lessons/reinforcement-learning-foundations/markov-decision-processes/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/markov-decision-processes/lesson/Lesson 2 of Track 17. The first lesson sketched the agent-environment loop informally; this one nails it down. 
The Markov Decision Process is the universal contract of RL: a tuple (states, actions, transition probabilities, reward function, discount factor) plus the Markov property, on which every algorithm in the rest of the track operates. This lesson lays out the tuple, explains the Markov property as a property of the state representation (the Atari frame-stacking story), defines a trajectory and the discounted return, walks the return at three discount values on a small example, and draws the planning-versus-learning boundary between Phase 2 and Phase 3.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Monte Carlo predictionhttps://clawdemy.org/lessons/reinforcement-learning-foundations/monte-carlo-prediction/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/monte-carlo-prediction/lesson/Lesson 6 of Track 17 and the opener of Phase 3 (model-free learning). Phase 2 assumed you know P and R; Phase 3 is the real-world case where you do not. Monte Carlo prediction is the simplest model-free way to evaluate a policy: play episodes, average the observed returns, let the law of large numbers do the rest. 
This lesson lays out first-visit and every-visit MC, runs a 3-state worked example through five episodes that shows both convergence and the variance failure mode, and frames MC as the unbiased extreme of a bias-variance spectrum TD learning sits at the other end of.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Policy gradient and the path to modern RLhttps://clawdemy.org/lessons/reinforcement-learning-foundations/policy-gradient-and-the-path-to-modern-rl/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/policy-gradient-and-the-path-to-modern-rl/lesson/Lesson 10 of Track 17, the close. Lessons 4-9 learned a value function and read the policy off as greedy; this lesson flips the script: parameterize the policy directly, then take gradient steps that increase the probability of actions that lead to high return. The capstone writes the REINFORCE update, walks one policy-gradient step on a tiny softmax policy (the probability of a rewarded action climbs from 0.50 to about 0.55), places actor-critic as the variance fix that produces PPO and the modern workhorses, and closes the track with the bridge to RLHF for large language models.Mon, 25 May 2026 00:00:00 GMTClawdemy14:00false
Policy iterationhttps://clawdemy.org/lessons/reinforcement-learning-foundations/policy-iteration/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/policy-iteration/lesson/Lesson 4 of Track 17 and the opener of Phase 2. The Bellman equation said value is recursive; policy iteration is the first algorithm that actually computes the optimal policy from it. The algorithm alternates two simple steps, evaluate the current policy by solving its Bellman expectation equation, then improve the policy by acting greedily, and provably converges to pi^* in any finite MDP. This lesson lays out both steps, runs the algorithm end-to-end on a two-state MDP through two iterations, and introduces the generalized-policy-iteration lens that ties almost every later RL method together.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Q-learning: model-free controlhttps://clawdemy.org/lessons/reinforcement-learning-foundations/q-learning/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/q-learning/lesson/Lesson 8 of Track 17 and the close of Phase 3. MC and TD prediction estimated V^pi from samples; Q-learning is the control counterpart that estimates Q^* and acts greedily. Its update is TD's bootstrap on Q with a max-over-actions in the target -- combining value iteration's Bellman optimality recursion with sample-based learning. This lesson works five Q-learning steps on a 2-state-2-action MDP (greedy policy already pi^* after 5 updates), contrasts on-policy SARSA with off-policy Q-learning, explains why exploration is required, and previews the DQN bridge with the deadly-triad caveat.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Temporal-difference learninghttps://clawdemy.org/lessons/reinforcement-learning-foundations/temporal-difference-learning/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/temporal-difference-learning/lesson/Lesson 7 of Track 17. 
Monte Carlo waited until an episode ended to compute a return; TD learning updates after every single step using a bootstrapped one-step target. This lesson writes the TD(0) update, walks four episodes of a deterministic chain through clean monotonic convergence (with value visibly propagating backward from the terminal one bootstrap per episode), compares MC and TD on the bias-variance axis, and places TD as the foundation under Q-learning, SARSA, DQN, and actor-critic.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Value functions and the Bellman equationshttps://clawdemy.org/lessons/reinforcement-learning-foundations/value-functions-and-the-bellman-equations/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/value-functions-and-the-bellman-equations/lesson/Lesson 3 of Track 17 and the close of Phase 1. With the MDP nailed down, you need a way to say how good things are. The state-value V and action-value Q answer that, the expected total reward from a state or a state-action pair under a policy. Their defining property is recursive: value here equals one step of reward plus the discounted value at the next state. That recursion, in two forms (expectation under a policy, and optimality over the best action), is the Bellman equation, the mathematical heart of reinforcement learning.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
Value iterationhttps://clawdemy.org/lessons/reinforcement-learning-foundations/value-iteration/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/value-iteration/lesson/Lesson 5 of Track 17 and the close of Phase 2. Policy iteration did full evaluation between improvements; value iteration is the simpler sibling that interleaves them completely. The update is a direct sweep of the Bellman optimality equation. This lesson runs value iteration four steps on the same MDP as the previous lesson so the comparison is direct, shows the greedy policy stabilizes long before V converges (a standard early-stopping trick), and places value iteration as the extreme point of generalized policy iteration that pre-figures Q-learning and DQN.Mon, 25 May 2026 00:00:00 GMTClawdemy13:00false
What reinforcement learning actually ishttps://clawdemy.org/lessons/reinforcement-learning-foundations/what-reinforcement-learning-actually-is/lesson/https://clawdemy.org/lessons/reinforcement-learning-foundations/what-reinforcement-learning-actually-is/lesson/The opener of Track 17 (Reinforcement Learning Foundations). RL is a third paradigm beside supervised and unsupervised learning, the one where an agent learns from interaction with consequences. This lesson sets up the agent-environment-reward loop, explains what makes RL harder than supervised learning (no oracle, delayed reward, distribution shift from the policy), introduces the exploration-versus-exploitation dilemma that every method in the track is, underneath, an answer to, and tours where RL shows up, from board games to robotics to the RLHF behind modern chatbots.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00false
Higher-order derivativeshttps://clawdemy.org/lessons/visual-math-calculus/higher-order-derivatives/lesson/https://clawdemy.org/lessons/visual-math-calculus/higher-order-derivatives/lesson/Lesson 12 of Track 8 (Visual Math: Calculus). A derivative is itself a function, so you can differentiate it again. The second derivative f'' measures how the slope is changing, which means acceleration in physics (Newton's F = ma is written in second derivatives) and curvature on a graph (cups upward when f'' > 0, downward when f'' < 0). It powers the second-derivative test that sorts maxima from minima at critical points, gives the oscillation equation f'' = -f that governs springs, sound, and waves, and shows that every derivative of e^x is e^x. In machine learning, the same curvature information drives Newton's method, the Hessian, and loss-landscape analysis.Mon, 25 May 2026 00:00:00 GMTClawdemy10:00false
Taylor serieshttps://clawdemy.org/lessons/visual-math-calculus/taylor-series/lesson/https://clawdemy.org/lessons/visual-math-calculus/taylor-series/lesson/Lesson 13 of Track 8 (Visual Math: Calculus), and the track's finale. Complicated functions like sine and the exponential are hard to compute directly; polynomials are easy. The Taylor series rebuilds any well-behaved function near a point out of its derivatives there. It works the expansion f(x) is approximately f(a) + f'(a)(x-a) + f''(a)/2! (x-a)^2 + ..., shows why the factorials are required (the matching property), builds the clean series for e^x, sin, and cos, reveals the small-angle approximation and L'Hopital as first-order Taylor in disguise, and shows that Newton's method, gradient descent, the neural tangent kernel, and the way hardware computes transcendentals are all Taylor at work. The arc that opened with a circle closes here, with a single polynomial standing in for any function.Mon, 25 May 2026 00:00:00 GMTClawdemy12:00false
Attention alternatives and mixture of expertshttps://clawdemy.org/lessons/build-an-llm-from-scratch/attention-alternatives-and-moe/lesson/https://clawdemy.org/lessons/build-an-llm-from-scratch/attention-alternatives-and-moe/lesson/Lesson 4 of Track 15, closing Phase 1. The two variations that make modern LLMs efficient, one per sublayer. Standard attention is quadratic in length and its KV cache dominates inference; multi-query and grouped-query attention shrink that cache, and sliding-window attention bounds long-context cost. Mixture of experts replaces the single FFN with many experts plus a router, decoupling total parameters (capacity, memory) from active parameters (per-token compute). Both are resource-allocation moves in lesson 2's terms.Sun, 24 May 2026 00:00:00 GMTClawdemy13:00false
Counting the cost, FLOPs, memory, and arithmetic intensityhttps://clawdemy.org/lessons/build-an-llm-from-scratch/counting-the-cost/lesson/https://clawdemy.org/lessons/build-an-llm-from-scratch/counting-the-cost/lesson/Lesson 2 of Track 15. 
Efficiency is the track's through-line, and this lesson is the accounting that makes it concrete: estimate a model's compute before you spend it (matmul FLOPs, the 6ND training rule), its memory (parameters, gradients, optimizer states, activations, the 16N estimate), and its arithmetic intensity (compute-bound versus memory-bound), plus reading the tensor reshaping that dominates model code with einops.Sun, 24 May 2026 00:00:00 GMTClawdemy14:00false
What "from scratch" means, and the tokenizerhttps://clawdemy.org/lessons/build-an-llm-from-scratch/from-scratch-and-the-tokenizer/lesson/https://clawdemy.org/lessons/build-an-llm-from-scratch/from-scratch-and-the-tokenizer/lesson/Lesson 1 of Track 15, the deepest tier on the site. This track builds an LLM from scratch, the real thing, the way frontier labs do. This opener lays out what 'from scratch' actually entails end to end, why efficiency (FLOPs, memory, hardware) is the through-line, and then builds the model's first component: the tokenizer. It covers why subword beats character- and word-level tokens and how byte-level BPE works, the procedure you will implement by hand.Sun, 24 May 2026 00:00:00 GMTClawdemy13:00false
The Transformer architecture and its hyperparametershttps://clawdemy.org/lessons/build-an-llm-from-scratch/the-architecture/lesson/https://clawdemy.org/lessons/build-an-llm-from-scratch/the-architecture/lesson/Lesson 3 of Track 15. The model itself. Modern LLMs share one skeleton, a decoder-only Transformer with a residual stream, and differ in a handful of converged choices (pre-norm, RMSNorm, gated SwiGLU activations, RoPE positions, no biases, weight tying). This lesson lays out that skeleton, those choices, and the hyperparameters that size a model (d_model, n_layers, n_heads, d_ff, vocab, context), tying the parameter count back to the cost accounting of lesson 2.Sun, 24 May 2026 00:00:00 GMTClawdemy14:00false
Turning weak learners strong: boostinghttps://clawdemy.org/lessons/classical-machine-learning/boosting/lesson/https://clawdemy.org/lessons/classical-machine-learning/boosting/lesson/Lesson 7 of Track 10 (Classical Machine Learning), in Phase 2 (Teaching a machine to decide). A random forest grows many trees independently and averages them. Boosting takes the opposite approach: build trees one at a time, each trained to fix the mistakes the previous ones made. 
This lesson contrasts boosting's sequential error-correction with the forest's parallel averaging, walks AdaBoost and gradient boosting at the level of intuition, traces the residual-shrinking idea by hand, and explains why gradient-boosted trees dominate tabular data.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 7 of Track 10 (Classical Machine Learning), in Phase 2 (Teaching a machine to decide). A random forest grows many trees independently and averages them. Boosting takes the opposite approach: build trees one at a time, each trained to fix the mistakes the previous ones made. This lesson contrasts boosting's sequential error-correction with the forest's parallel averaging, walks AdaBoost and gradient boosting at the level of intuition, traces the residual-shrinking idea by hand, and explains why gradient-boosted trees dominate tabular data.Asking the right questions: decision treeshttps://clawdemy.org/lessons/classical-machine-learning/decision-trees/lesson/https://clawdemy.org/lessons/classical-machine-learning/decision-trees/lesson/Lesson 5 of Track 10 (Classical Machine Learning), in Phase 2 (Teaching a machine to decide). Where logistic regression draws one straight boundary, a decision tree asks a sequence of yes/no questions, like a flowchart, funnelling each example to a prediction. This lesson shows how to read and trace a tree, how it is built by choosing the question that best separates the classes, why an unrestrained tree overfits, and why a single tree is powerful but unstable, the flaw random forests fix next.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 5 of Track 10 (Classical Machine Learning), in Phase 2 (Teaching a machine to decide). Where logistic regression draws one straight boundary, a decision tree asks a sequence of yes/no questions, like a flowchart, funnelling each example to a prediction. 
This lesson shows how to read and trace a tree, how it is built by choosing the question that best separates the classes, why an unrestrained tree overfits, and why a single tree is powerful but unstable, the flaw random forests fix next.Fitting a line: linear regressionhttps://clawdemy.org/lessons/classical-machine-learning/fitting-a-line-linear-regression/lesson/https://clawdemy.org/lessons/classical-machine-learning/fitting-a-line-linear-regression/lesson/Lesson 2 of Track 10 (Classical Machine Learning), in Phase 1 (What learning from data means). Linear regression is the simplest supervised algorithm and the mental model behind every model that has weights. This lesson defines what 'best-fit line' actually means (the line that minimizes the sum of squared residuals), works the comparison by hand on a tiny dataset, teaches you to read a slope and intercept as a real-world relationship, extends to multiple features, and sets up the question lesson 3 answers: how do you actually find that line?Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 2 of Track 10 (Classical Machine Learning), in Phase 1 (What learning from data means). Linear regression is the simplest supervised algorithm and the mental model behind every model that has weights. This lesson defines what 'best-fit line' actually means (the line that minimizes the sum of squared residuals), works the comparison by hand on a tiny dataset, teaches you to read a slope and intercept as a real-world relationship, extends to multiple features, and sets up the question lesson 3 answers: how do you actually find that line?Building a hierarchy: hierarchical clusteringhttps://clawdemy.org/lessons/classical-machine-learning/hierarchical-clustering/lesson/https://clawdemy.org/lessons/classical-machine-learning/hierarchical-clustering/lesson/Lesson 10 of Track 10 (Classical Machine Learning), in Phase 3 (Finding structure without labels). 
K-means made you pick the number of clusters up front; hierarchical clustering does not. It builds a whole tree of nested groups, from every point alone up to one big cluster, and lets you read structure at any scale. This lesson shows the bottom-up merging process, how to read a dendrogram, and the key skill of choosing where to cut the tree.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 10 of Track 10 (Classical Machine Learning), in Phase 3 (Finding structure without labels). K-means made you pick the number of clusters up front; hierarchical clustering does not. It builds a whole tree of nested groups, from every point alone up to one big cluster, and lets you read structure at any scale. This lesson shows the bottom-up merging process, how to read a dendrogram, and the key skill of choosing where to cut the tree.How models actually learn: gradient descenthttps://clawdemy.org/lessons/classical-machine-learning/how-models-learn-gradient-descent/lesson/https://clawdemy.org/lessons/classical-machine-learning/how-models-learn-gradient-descent/lesson/Lesson 3 of Track 10 (Classical Machine Learning), closing Phase 1 (What learning from data means). Lesson 2 defined the best-fit line but not how to find it. Gradient descent is the answer, and it is how nearly every modern model learns. This lesson builds the foggy-hillside intuition, names the gradient and the learning rate, traces the downhill loop by hand on a toy loss, and shows why this one procedure scales from a two-parameter line to a billion-parameter network.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 3 of Track 10 (Classical Machine Learning), closing Phase 1 (What learning from data means). Lesson 2 defined the best-fit line but not how to find it. Gradient descent is the answer, and it is how nearly every modern model learns. 
This lesson builds the foggy-hillside intuition, names the gradient and the learning rate, traces the downhill loop by hand on a toy loss, and shows why this one procedure scales from a two-parameter line to a billion-parameter network.Grouping without labels: k-means clusteringhttps://clawdemy.org/lessons/classical-machine-learning/k-means-clustering/lesson/https://clawdemy.org/lessons/classical-machine-learning/k-means-clustering/lesson/Lesson 9 of Track 10 (Classical Machine Learning), the opener of Phase 3 (Finding structure without labels). Every model so far needed labels; clustering drops them. You have data and no answers, and you want the natural groups hiding in it. K-means is the workhorse. This lesson walks the assign-and-update loop by hand, shows how to choose the number of clusters, and is honest about when clustering helps and when it invents groups that are not there.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 9 of Track 10 (Classical Machine Learning), the opener of Phase 3 (Finding structure without labels). Every model so far needed labels; clustering drops them. You have data and no answers, and you want the natural groups hiding in it. K-means is the workhorse. This lesson walks the assign-and-update loop by hand, shows how to choose the number of clusters, and is honest about when clustering helps and when it invents groups that are not there.From a line to a probability: logistic regressionhttps://clawdemy.org/lessons/classical-machine-learning/logistic-regression/lesson/https://clawdemy.org/lessons/classical-machine-learning/logistic-regression/lesson/Lesson 4 of Track 10 (Classical Machine Learning), the opener of Phase 2 (Teaching a machine to decide). Many real questions are yes-or-no, and a straight line cannot answer them: it runs past 1 and below 0, where probabilities cannot go. Logistic regression keeps the line's weighted sum and squashes it through an S-shaped curve into a probability. 
This lesson shows why a line fails, how the sigmoid fixes it, where the decision boundary sits, and how the model is fit by the gradient descent from lesson 3.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 4 of Track 10 (Classical Machine Learning), the opener of Phase 2 (Teaching a machine to decide). Many real questions are yes-or-no, and a straight line cannot answer them: it runs past 1 and below 0, where probabilities cannot go. Logistic regression keeps the line's weighted sum and squashes it through an S-shaped curve into a probability. This lesson shows why a line fails, how the sigmoid fixes it, where the decision boundary sits, and how the model is fit by the gradient descent from lesson 3.Wisdom of crowds: random forestshttps://clawdemy.org/lessons/classical-machine-learning/random-forests/lesson/https://clawdemy.org/lessons/classical-machine-learning/random-forests/lesson/Lesson 6 of Track 10 (Classical Machine Learning), in Phase 2 (Teaching a machine to decide). A single decision tree is unstable and overfits. The random forest fixes that with the wisdom of crowds: grow hundreds of trees, each on a slightly different slice of the data and features, and let them vote. This lesson shows where the diversity comes from (bagging plus random feature subsets), why averaging many overfit trees cancels their noise and lowers variance, and what you trade away to get it.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 6 of Track 10 (Classical Machine Learning), in Phase 2 (Teaching a machine to decide). A single decision tree is unstable and overfits. The random forest fixes that with the wisdom of crowds: grow hundreds of trees, each on a slightly different slice of the data and features, and let them vote. 
This lesson shows where the diversity comes from (bagging plus random feature subsets), why averaging many overfit trees cancels their noise and lowers variance, and what you trade away to get it.Drawing the widest margin: support vector machineshttps://clawdemy.org/lessons/classical-machine-learning/support-vector-machines/lesson/https://clawdemy.org/lessons/classical-machine-learning/support-vector-machines/lesson/Lesson 8 of Track 10 (Classical Machine Learning), closing Phase 2 (Teaching a machine to decide). Many lines can separate two classes; the support vector machine picks the one with the widest gap between them, the boundary running down the middle of the widest possible street. This lesson builds the maximum-margin idea, explains support vectors and the soft margin, and unpacks the kernel trick that lets a straight-boundary method carve curved boundaries by lifting the data into a higher dimension.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 8 of Track 10 (Classical Machine Learning), closing Phase 2 (Teaching a machine to decide). Many lines can separate two classes; the support vector machine picks the one with the widest gap between them, the boundary running down the middle of the widest possible street. This lesson builds the maximum-margin idea, explains support vectors and the soft margin, and unpacks the kernel trick that lets a straight-boundary method carve curved boundaries by lifting the data into a higher dimension.What machine learning actually ishttps://clawdemy.org/lessons/classical-machine-learning/what-machine-learning-actually-is/lesson/https://clawdemy.org/lessons/classical-machine-learning/what-machine-learning-actually-is/lesson/Lesson 1 of Track 10 (Classical Machine Learning), the opener of Phase 1 (What learning from data means). Machine learning flips traditional programming: instead of writing the rules, you hand the machine labeled examples and let it infer the rules itself. 
This lesson draws that line, splits the field into supervised learning (labeled, predicting numbers or categories) and unsupervised learning (unlabeled, finding structure), names when machine learning is the wrong tool, and lands the rule that governs the whole track: a model is only as good as its performance on data it has never seen.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseLesson 1 of Track 10 (Classical Machine Learning), the opener of Phase 1 (What learning from data means). Machine learning flips traditional programming: instead of writing the rules, you hand the machine labeled examples and let it infer the rules itself. This lesson draws that line, splits the field into supervised learning (labeled, predicting numbers or categories) and unsupervised learning (unlabeled, finding structure), names when machine learning is the wrong tool, and lands the rule that governs the whole track: a model is only as good as its performance on data it has never seen.Why seeing is hard for machineshttps://clawdemy.org/lessons/computer-vision/why-seeing-is-hard/lesson/https://clawdemy.org/lessons/computer-vision/why-seeing-is-hard/lesson/The opener of Phase 1 (Foundations for vision) and the Track 16 entry point. A computer handed a photo receives only a grid of numbers, with no object or meaning inside. This lesson builds the central problem of computer vision: the semantic gap between pixels and meaning, why the same object produces wildly different numbers (viewpoint, scale, deformation, occlusion, illumination, clutter, intra-class variation), why hand-written rules collapse, and the data-driven shift (collect labeled images, train, evaluate on the unseen) that the rest of the track is built on.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseThe opener of Phase 1 (Foundations for vision) and the Track 16 entry point. A computer handed a photo receives only a grid of numbers, with no object or meaning inside. 
This lesson builds the central problem of computer vision: the semantic gap between pixels and meaning, why the same object produces wildly different numbers (viewpoint, scale, deformation, occlusion, illumination, clutter, intra-class variation), why hand-written rules collapse, and the data-driven shift (collect labeled images, train, evaluate on the unseen) that the rest of the track is built on.Backpropagation and the chain rulehttps://clawdemy.org/lessons/neural-network-intuition/backpropagation-and-the-chain-rule/lesson/https://clawdemy.org/lessons/neural-network-intuition/backpropagation-and-the-chain-rule/lesson/Lesson 9 of Track 11 (Neural Network Intuition), and the most math-leaning lesson in the track. Lesson 8 kept saying backprop figures out how much each knob should change without computing it; this lesson names the how-much. It is the chain rule applied through the layers. It uses the chain rule without teaching it (Track 8 does that), shows why the cost is a deeply nested function, works the smallest chain by hand (dC/dw1 as a product of four simple factors = 3), reveals that the chain-rule product is exactly lesson 8's backward flow of desires, explains why running it backward reuses shared factors so one sweep yields the whole gradient, and locates the vanishing-gradient difficulty in the same product of rates.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 9 of Track 11 (Neural Network Intuition), and the most math-leaning lesson in the track. Lesson 8 kept saying backprop figures out how much each knob should change without computing it; this lesson names the how-much. It is the chain rule applied through the layers. 
It uses the chain rule without teaching it (Track 8 does that), shows why the cost is a deeply nested function, works the smallest chain by hand (dC/dw1 as a product of four simple factors = 3), reveals that the chain-rule product is exactly lesson 8's backward flow of desires, explains why running it backward reuses shared factors so one sweep yields the whole gradient, and locates the vanishing-gradient difficulty in the same product of rates.Gradient descent, step by stephttps://clawdemy.org/lessons/neural-network-intuition/gradient-descent-step-by-step/lesson/https://clawdemy.org/lessons/neural-network-intuition/gradient-descent-step-by-step/lesson/Lesson 7 of Track 11 (Neural Network Intuition), and the close of the learning arc. Three lessons built to this: learning is minimizing the cost, the negative gradient points downhill, and now we take the walk. This lesson gives the gradient descent update rule (new value = old value minus learning rate times slope), runs it by hand until the cost slides toward zero, shows how a badly chosen learning rate makes training diverge or crawl, frames training as a repeated loop, names stochastic gradient descent as the real-world shortcut, and flags the one thing it assumes but does not explain: how the gradient itself gets computed. That is backpropagation, the subject of Phase 3.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 7 of Track 11 (Neural Network Intuition), and the close of the learning arc. Three lessons built to this: learning is minimizing the cost, the negative gradient points downhill, and now we take the walk. 
This lesson gives the gradient descent update rule (new value = old value minus learning rate times slope), runs it by hand until the cost slides toward zero, shows how a badly chosen learning rate makes training diverge or crawl, frames training as a repeated loop, names stochastic gradient descent as the real-world shortcut, and flags the one thing it assumes but does not explain: how the gradient itself gets computed. That is backpropagation, the subject of Phase 3.Neurons as numbers, layers as structurehttps://clawdemy.org/lessons/neural-network-intuition/neurons-and-layers/lesson/https://clawdemy.org/lessons/neural-network-intuition/neurons-and-layers/lesson/Lesson 2 of Track 11 (Neural Network Intuition). The last lesson named the goal, a function from 784 numbers to 10, and left it sealed. This lesson opens it up. Inside is nothing exotic: layers of neurons, where a neuron is just a container holding one number between 0 and 1 (its activation). It traces a real pixel into the 784-neuron input layer, reads a guess off the 10-neuron output layer by finding the tallest activation, meets the two hidden layers in between, and explains why this one-directional design is called feedforward. The edges-to-loops story of what hidden layers do is offered as a hope to hold loosely, not a proven fact.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseLesson 2 of Track 11 (Neural Network Intuition). The last lesson named the goal, a function from 784 numbers to 10, and left it sealed. This lesson opens it up. Inside is nothing exotic: layers of neurons, where a neuron is just a container holding one number between 0 and 1 (its activation). It traces a real pixel into the 784-neuron input layer, reads a guess off the 10-neuron output layer by finding the tallest activation, meets the two hidden layers in between, and explains why this one-directional design is called feedforward. 
The edges-to-loops story of what hidden layers do is offered as a hope to hold loosely, not a proven fact.Seeing it whole, and where nexthttps://clawdemy.org/lessons/neural-network-intuition/seeing-it-whole-and-where-next/lesson/https://clawdemy.org/lessons/neural-network-intuition/seeing-it-whole-and-where-next/lesson/Lesson 10 of Track 11 (Neural Network Intuition), the synthesis finale. Ten lessons ago a messy handwritten 3 was something you could read instantly but not explain; now you can explain it down to the arithmetic. This closing lesson adds no new machinery. It assembles the whole picture in one breath (function, layers, neurons, cost, landscape, gradient descent, backpropagation), walks one full training step end to end on that very 3, is honest about what the track did not cover (architectures, optimizers, regularization, fine-tuning, code), leaves you with one durable image (a row of dials and a landscape, a patient walk downhill), and routes you to three next tracks: build it yourself (T13), understand modern LLMs (T5), or use AI to build things (T20).Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 10 of Track 11 (Neural Network Intuition), the synthesis finale. Ten lessons ago a messy handwritten 3 was something you could read instantly but not explain; now you can explain it down to the arithmetic. This closing lesson adds no new machinery. 
It assembles the whole picture in one breath (function, layers, neurons, cost, landscape, gradient descent, backpropagation), walks one full training step end to end on that very 3, is honest about what the track did not cover (architectures, optimizers, regularization, fine-tuning, code), leaves you with one durable image (a row of dials and a landscape, a patient walk downhill), and routes you to three next tracks: build it yourself (T13), understand modern LLMs (T5), or use AI to build things (T20).The cost landscapehttps://clawdemy.org/lessons/neural-network-intuition/the-cost-landscape/lesson/https://clawdemy.org/lessons/neural-network-intuition/the-cost-landscape/lesson/Lesson 6 of Track 11 (Neural Network Intuition). Lesson 5 turned learning into a clean goal, make the cost small, but left us standing in a 13,000-dimensional space with no idea which way to move. This lesson gives that space a shape. It pictures the cost as a landscape of hills and valleys (each knob setting a point, its cost the height), explains why high dimensions are fine even though they cannot be drawn, introduces the gradient as the direction of steepest uphill, and shows why stepping along the negative gradient lowers the cost fastest. It works the downhill step by hand in one and two dimensions, and ends on an honest caveat: downhill walking reaches a local minimum, not always the deepest valley.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 6 of Track 11 (Neural Network Intuition). Lesson 5 turned learning into a clean goal, make the cost small, but left us standing in a 13,000-dimensional space with no idea which way to move. This lesson gives that space a shape. It pictures the cost as a landscape of hills and valleys (each knob setting a point, its cost the height), explains why high dimensions are fine even though they cannot be drawn, introduces the gradient as the direction of steepest uphill, and shows why stepping along the negative gradient lowers the cost fastest. 
It works the downhill step by hand in one and two dimensions, and ends on an honest caveat: downhill walking reaches a local minimum, not always the deepest valley.The whole network as one functionhttps://clawdemy.org/lessons/neural-network-intuition/the-whole-network-as-one-function/lesson/https://clawdemy.org/lessons/neural-network-intuition/the-whole-network-as-one-function/lesson/Lesson 4 of Track 11 (Neural Network Intuition), and the close of the structure arc. The first three lessons named a goal and built the parts; this lesson steps back to see the whole machine, and it turns out to be exactly the function promised in lesson 1. Running it is the forward pass: lesson 3's neuron formula applied layer by layer. It evaluates a tiny network end to end by hand, introduces the f(x; w, b) framing that separates the per-use input from the fixed weights and biases, and shows that the same skeleton behaves completely differently depending only on its parameter values. The chapter's payoff: a network is a function, and all its capability lives in those numbers, which sets up the question Phase 2 answers, how the right numbers get found.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseLesson 4 of Track 11 (Neural Network Intuition), and the close of the structure arc. The first three lessons named a goal and built the parts; this lesson steps back to see the whole machine, and it turns out to be exactly the function promised in lesson 1. Running it is the forward pass: lesson 3's neuron formula applied layer by layer. It evaluates a tiny network end to end by hand, introduces the f(x; w, b) framing that separates the per-use input from the fixed weights and biases, and shows that the same skeleton behaves completely differently depending only on its parameter values. 
The chapter's payoff: a network is a function, and all its capability lives in those numbers, which sets up the question Phase 2 answers, how the right numbers get found.Weights, biases, and the squishhttps://clawdemy.org/lessons/neural-network-intuition/weights-biases-and-the-squish/lesson/https://clawdemy.org/lessons/neural-network-intuition/weights-biases-and-the-squish/lesson/Lesson 3 of Track 11 (Neural Network Intuition). Lesson 2 said hidden neurons get their number from the layer before but never said how. This lesson is the how: the single computation every neuron runs. Multiply each incoming activation by a weight, add them up, add a bias, and squash the result into range with an activation function (sigmoid or ReLU). It works one neuron by hand both ways, explains that weights set attention and biases set eagerness, and counts the knobs, showing the small 784-16-16-10 digit network already needs 13,002 weights and biases while modern networks have billions. The punchline: a network's behavior lives entirely in those parameter values.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 3 of Track 11 (Neural Network Intuition). Lesson 2 said hidden neurons get their number from the layer before but never said how. This lesson is the how: the single computation every neuron runs. Multiply each incoming activation by a weight, add them up, add a bias, and squash the result into range with an activation function (sigmoid or ReLU). It works one neuron by hand both ways, explains that weights set attention and biases set eagerness, and counts the knobs, showing the small 784-16-16-10 digit network already needs 13,002 weights and biases while modern networks have billions. 
The punchline: a network's behavior lives entirely in those parameter values.What backpropagation is really doinghttps://clawdemy.org/lessons/neural-network-intuition/what-backpropagation-is-really-doing/lesson/https://clawdemy.org/lessons/neural-network-intuition/what-backpropagation-is-really-doing/lesson/Lesson 8 of Track 11 (Neural Network Intuition), and the opener of the backpropagation arc. Lesson 7 confessed a gap: gradient descent needs the gradient, and we never said how to get it. This lesson gives the intuition behind the answer, backpropagation, with no calculus. Brute force (nudge each knob, re-run the network) is hopeless at 13,000 knobs, so instead we ask what each output neuron wants, watch those wishes turn into adjustments to weights and biases plus requests of the previous layer, and see those requests roll backward layer by layer. A single forward pass plus a single backward sweep yields the whole gradient for about the cost of running the network once, and averaging the wishes over many examples is why learning needs lots of data.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 8 of Track 11 (Neural Network Intuition), and the opener of the backpropagation arc. Lesson 7 confessed a gap: gradient descent needs the gradient, and we never said how to get it. This lesson gives the intuition behind the answer, backpropagation, with no calculus. Brute force (nudge each knob, re-run the network) is hopeless at 13,000 knobs, so instead we ask what each output neuron wants, watch those wishes turn into adjustments to weights and biases plus requests of the previous layer, and see those requests roll backward layer by layer. 
A single forward pass plus a single backward sweep yields the whole gradient for about the cost of running the network once, and averaging the wishes over many examples is why learning needs lots of data.What learning really meanshttps://clawdemy.org/lessons/neural-network-intuition/what-learning-really-means/lesson/https://clawdemy.org/lessons/neural-network-intuition/what-learning-really-means/lesson/Lesson 5 of Track 11 (Neural Network Intuition), and the opener of the learning arc. Phase 1 ended on a cliffhanger: a network only works once its roughly 13,000 weights and biases are set well, so how do we find good values? This lesson builds the measure that makes the search possible, the cost function: a single number for how wrong the network is right now. It writes the desired answer as a one-hot output, works the cost by hand on a confident-correct output (about 0.0129) and a total shrug (0.90), reframes cost as a function of the knobs C(w, b), and collapses learning into one idea: adjust the weights and biases to make that number small. The catch (13,000 dials, a bumpy surface, no brute force) sets up lessons 6 and 7.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseLesson 5 of Track 11 (Neural Network Intuition), and the opener of the learning arc. Phase 1 ended on a cliffhanger: a network only works once its roughly 13,000 weights and biases are set well, so how do we find good values? This lesson builds the measure that makes the search possible, the cost function: a single number for how wrong the network is right now. It writes the desired answer as a one-hot output, works the cost by hand on a confident-correct output (about 0.0129) and a total shrug (0.90), reframes cost as a function of the knobs C(w, b), and collapses learning into one idea: adjust the weights and biases to make that number small. 
The catch (13,000 dials, a bumpy surface, no brute force) sets up lessons 6 and 7.Build and share a demohttps://clawdemy.org/lessons/practical-transformers/build-and-share-a-demo/lesson/https://clawdemy.org/lessons/practical-transformers/build-and-share-a-demo/lesson/Lesson 9 of Track 14 and the start of Phase 3. Everything so far has lived in a notebook; this lesson ships. Wrap any model in a browser interface with a few lines of Gradio (gr.Interface plus launch), put your inference code in the function, match components to the model's inputs and outputs, share it with a temporary public link, and publish it permanently on Hugging Face Spaces, all without writing any frontend code.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseLesson 9 of Track 14 and the start of Phase 3. Everything so far has lived in a notebook; this lesson ships. Wrap any model in a browser interface with a few lines of Gradio (gr.Interface plus launch), put your inference code in the function, match components to the model's inputs and outputs, share it with a temporary public link, and publish it permanently on Hugging Face Spaces, all without writing any frontend code.Curating high-quality datasetshttps://clawdemy.org/lessons/practical-transformers/curating-datasets/lesson/https://clawdemy.org/lessons/practical-transformers/curating-datasets/lesson/Lesson 11 of Track 14. The last lesson ended on a line worth taking seriously: a model is only as good as its data. This lesson is about that data, why quality (not model size) is increasingly the lever that decides results, and how to curate and evaluate a training dataset with Argilla, the human-in-the-loop annotation and feedback platform that turns raw data into something worth training on, then exports it back to the Hub.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 11 of Track 14. The last lesson ended on a line worth taking seriously: a model is only as good as its data. 
This lesson is about that data, why quality (not model size) is increasingly the lever that decides results, and how to curate and evaluate a training dataset with Argilla, the human-in-the-loop annotation and feedback platform that turns raw data into something worth training on, then exports it back to the Hub.Debug your training and get unstuckhttps://clawdemy.org/lessons/practical-transformers/debug-and-get-unstuck/lesson/https://clawdemy.org/lessons/practical-transformers/debug-and-get-unstuck/lesson/Lesson 8 of Track 14 and the close of Phase 2. The most universally useful lesson in the track: how to read a Python traceback (bottom to top), debug a pipeline by forming and checking a hypothesis, recognize where training pipelines commonly break, build a minimal reproducible example, and ask the community for help in a way that actually gets answered. These skills outlast every specific API in the track.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 8 of Track 14 and the close of Phase 2. The most universally useful lesson in the track: how to read a Python traceback (bottom to top), debug a pipeline by forming and checking a hypothesis, recognize where training pipelines commonly break, build a minimal reproducible example, and ask the community for help in a way that actually gets answered. These skills outlast every specific API in the track.Fine-tune a pretrained model on your own datahttps://clawdemy.org/lessons/practical-transformers/fine-tune-on-your-data/lesson/https://clawdemy.org/lessons/practical-transformers/fine-tune-on-your-data/lesson/Lesson 3 of Track 14, the hands-on heart of Phase 1. Take a pretrained model and continue training it on a task-specific dataset using the Trainer, then measure whether it actually improved. 
You will meet the data collator (dynamic padding), the expected head-swap warning, the TrainingArguments config object, the Trainer itself, and the evaluation discipline of compute_metrics, fine-tuning BERT on the MRPC dataset to about 86% accuracy in a few minutes.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 3 of Track 14, the hands-on heart of Phase 1. Take a pretrained model and continue training it on a task-specific dataset using the Trainer, then measure whether it actually improved. You will meet the data collator (dynamic padding), the expected head-swap warning, the TrainingArguments config object, the Trainer itself, and the evaluation discipline of compute_metrics, fine-tuning BERT on the MRPC dataset to about 86% accuracy in a few minutes.Fine-tuning LLMs, supervised and instruction tuninghttps://clawdemy.org/lessons/practical-transformers/fine-tuning-llms/lesson/https://clawdemy.org/lessons/practical-transformers/fine-tuning-llms/lesson/Lesson 10 of Track 14, the first LLM-frontier lesson. The assistant-style models you use went through a different fine-tuning than the classifier of lesson 3. This lesson distinguishes task fine-tuning from supervised fine-tuning (SFT), shows when to reach for SFT versus prompting, explains the chat-formatted data and chat templates it needs, introduces the SFTTrainer from TRL, and covers how LoRA makes fine-tuning large models affordable. It stays strictly at a mechanical, how-it-works level.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 10 of Track 14, the first LLM-frontier lesson. The assistant-style models you use went through a different fine-tuning than the classifier of lesson 3. This lesson distinguishes task fine-tuning from supervised fine-tuning (SFT), shows when to reach for SFT versus prompting, explains the chat-formatted data and chat templates it needs, introduces the SFTTrainer from TRL, and covers how LoRA makes fine-tuning large models affordable. 
It stays strictly at a mechanical, how-it-works level.Reasoning models and the road aheadhttps://clawdemy.org/lessons/practical-transformers/reasoning-models-and-the-road-ahead/lesson/https://clawdemy.org/lessons/practical-transformers/reasoning-models-and-the-road-ahead/lesson/Lesson 12 of Track 14, the track capstone. You started not knowing what a transformer was; you can now run, fine-tune, share, curate for, and ship one. This final lesson looks at the current frontier, reasoning models: what they add over ordinary LLMs, how reinforcement learning trains a model to think before it answers, where the open Hugging Face ecosystem fits (Open R1), and the durable working method that outlasts any specific frontier. It stays at a mechanical, how-it-works level.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 12 of Track 14, the track capstone. You started not knowing what a transformer was; you can now run, fine-tune, share, curate for, and ship one. This final lesson looks at the current frontier, reasoning models: what they add over ordinary LLMs, how reinforcement learning trains a model to think before it answers, where the open Hugging Face ecosystem fits (Open R1), and the durable working method that outlasts any specific frontier. It stays at a mechanical, how-it-works level.Run a model in a few lines, pipelines and Auto classeshttps://clawdemy.org/lessons/practical-transformers/run-a-model-in-a-few-lines/lesson/https://clawdemy.org/lessons/practical-transformers/run-a-model-in-a-few-lines/lesson/Lesson 2 of Track 14, and the first one where you run code. It starts with the two-line pipeline() call that runs a whole task, then opens the box: the three steps a pipeline hides (a tokenizer, the model, postprocessing) reproduced by hand with the Auto classes. 
You will see input_ids and attention_mask, the difference between AutoModel and AutoModelForSequenceClassification, why models output logits instead of probabilities, and the single from_pretrained idiom the whole library runs on.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 2 of Track 14, and the first one where you run code. It starts with the two-line pipeline() call that runs a whole task, then opens the box: the three steps a pipeline hides (a tokenizer, the model, postprocessing) reproduced by hand with the Auto classes. You will see input_ids and attention_mask, the difference between AutoModel and AutoModelForSequenceClassification, why models output logits instead of probabilities, and the single from_pretrained idiom the whole library runs on.Share your work on the Hubhttps://clawdemy.org/lessons/practical-transformers/share-on-the-hub/lesson/https://clawdemy.org/lessons/practical-transformers/share-on-the-hub/lesson/Lesson 4 of Track 14, the close of Phase 1. Push a model and tokenizer to the Hugging Face Hub so anyone can load them with from_pretrained, write a model card so the work is actually usable, and understand why sharing is the engine of the whole ecosystem. You will authenticate, compare the three upload routes (push_to_hub API, the huggingface_hub library, git/git-lfs), see what a model repo contains, and learn why the model card is the real deliverable.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseLesson 4 of Track 14, the close of Phase 1. Push a model and tokenizer to the Hugging Face Hub so anyone can load them with from_pretrained, write a model card so the work is actually usable, and understand why sharing is the engine of the whole ecosystem. 
You will authenticate, compare the three upload routes (push_to_hub API, the huggingface_hub library, git/git-lfs), see what a model repo contains, and learn why the model card is the real deliverable.The main NLP tasks, end to endhttps://clawdemy.org/lessons/practical-transformers/the-main-nlp-tasks/lesson/https://clawdemy.org/lessons/practical-transformers/the-main-nlp-tasks/lesson/Lesson 7 of Track 14, where everything comes together. The six common NLP tasks (sequence and token classification, question answering, masked and causal language modeling, summarization, translation) all follow one loop; what changes is the head, the label shape, and the metric. This lesson builds the real applied skill: looking at a problem, naming which task it is, and choosing the right `AutoModelFor<Task>` head, data shape, and metric, plus the two recurring wrinkles of token alignment and sequence-to-sequence.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 7 of Track 14, where everything comes together. The six common NLP tasks (sequence and token classification, question answering, masked and causal language modeling, summarization, translation) all follow one loop; what changes is the head, the label shape, and the metric. This lesson builds the real applied skill: looking at a problem, naming which task it is, and choosing the right `AutoModelFor<Task>` head, data shape, and metric, plus the two recurring wrinkles of token alignment and sequence-to-sequence.Tokenizers up closehttps://clawdemy.org/lessons/practical-transformers/tokenizers-up-close/lesson/https://clawdemy.org/lessons/practical-transformers/tokenizers-up-close/lesson/Lesson 6 of Track 14. Open the tokenizer you have called since lesson 2. 
This lesson walks the four-stage pipeline a fast tokenizer runs (normalization, pre-tokenization, the subword model, postprocessing), explains why fast tokenizers are fast and what offsets and word IDs buy you, names the three subword algorithms (BPE, WordPiece, Unigram) and who uses them, and trains a brand-new tokenizer on a corpus of Python code with train_new_from_iterator, cutting token counts by about a quarter.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 6 of Track 14. Open the tokenizer you have called since lesson 2. This lesson walks the four-stage pipeline a fast tokenizer runs (normalization, pre-tokenization, the subword model, postprocessing), explains why fast tokenizers are fast and what offsets and word IDs buy you, names the three subword algorithms (BPE, WordPiece, Unigram) and who uses them, and trains a brand-new tokenizer on a corpus of Python code with train_new_from_iterator, cutting token counts by about a quarter.Wrangling data with the Datasets libraryhttps://clawdemy.org/lessons/practical-transformers/wrangle-data-with-datasets/lesson/https://clawdemy.org/lessons/practical-transformers/wrangle-data-with-datasets/lesson/Lesson 5 of Track 14 and the start of Phase 2. Real data is never as tidy as the GLUE dataset made it look, so this lesson turns to the datasets library: load data from the Hub or your own files, then clean and transform it at scale with map and filter, the batched=True superpower that makes it fast, the Arrow backend that handles data larger than RAM, and the train_test_split discipline that prepares data for training.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 5 of Track 14 and the start of Phase 2. 
Real data is never as tidy as the GLUE dataset made it look, so this lesson turns to the datasets library: load data from the Hub or your own files, then clean and transform it at scale with map and filter, the batched=True superpower that makes it fast, the Arrow backend that handles data larger than RAM, and the train_test_split discipline that prepares data for training.Updating beliefs with evidence: Bayes' theoremhttps://clawdemy.org/lessons/statistics-and-probability/bayes-theorem/lesson/https://clawdemy.org/lessons/statistics-and-probability/bayes-theorem/lesson/Lesson 7 of Track 9 and the close of Phase 2. Bayes' theorem converts the chance of A given B into the chance of B given A, and it is the mathematics of updating a belief when evidence arrives. This lesson builds Bayes from natural frequencies, re-derives lesson 1's base-rate result exactly (a 99%-accurate test that is still 50% right on a positive), shows how a second test updates again to 99%, and connects it to spam filters, base-rate neglect, and combining a prior with new data.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 7 of Track 9 and the close of Phase 2. Bayes' theorem converts the chance of A given B into the chance of B given A, and it is the mathematics of updating a belief when evidence arrives. This lesson builds Bayes from natural frequencies, re-derives lesson 1's base-rate result exactly (a 99%-accurate test that is still 50% right on a positive), shows how a second test updates again to 99%, and connects it to spam filters, base-rate neglect, and combining a prior with new data.When one event tells you about another: conditional probability and independencehttps://clawdemy.org/lessons/statistics-and-probability/conditional-probability-and-independence/lesson/https://clawdemy.org/lessons/statistics-and-probability/conditional-probability-and-independence/lesson/Lesson 6 of Track 9. The multiplication rule needed independence, but the events that matter in AI are dependent. 
This lesson defines conditional probability (the chance of A given B), reads it off a two-way table, generalizes the multiplication rule to dependent events, redefines independence in those terms, and hammers the subject's costliest confusion: the chance of A given B is not the chance of B given A. It sets up Bayes' theorem in the next lesson.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 6 of Track 9. The multiplication rule needed independence, but the events that matter in AI are dependent. This lesson defines conditional probability (the chance of A given B), reads it off a two-way table, generalizes the multiplication rule to dependent events, redefines independence in those terms, and hammers the subject's costliest confusion: the chance of A given B is not the chance of B given A. It sets up Bayes' theorem in the next lesson.How sure are we? confidence intervalshttps://clawdemy.org/lessons/statistics-and-probability/confidence-intervals/lesson/https://clawdemy.org/lessons/statistics-and-probability/confidence-intervals/lesson/Lesson 12 of Track 9. A single measured number hides its uncertainty; a confidence interval shows it, turning '90% accurate' into '90%, give or take 4 points.' This lesson builds the interval as estimate plus or minus a margin of error (about two standard errors for 95%), shows how data and confidence trade off against width, and corrects the interpretation almost everyone gets wrong: a 95% interval is not a 95% probability that the truth is in this particular range.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 12 of Track 9. A single measured number hides its uncertainty; a confidence interval shows it, turning '90% accurate' into '90%, give or take 4 points.' 
This lesson builds the interval as estimate plus or minus a margin of error (about two standard errors for 95%), shows how data and confidence trade off against width, and corrects the interpretation almost everyone gets wrong: a 95% interval is not a 95% probability that the truth is in this particular range.Testing a claim: hypothesis testing and p-valueshttps://clawdemy.org/lessons/statistics-and-probability/hypothesis-testing-and-p-values/lesson/https://clawdemy.org/lessons/statistics-and-probability/hypothesis-testing-and-p-values/lesson/Lesson 13 of Track 9. Confidence intervals hinted a difference might be noise; hypothesis testing makes the call. This lesson sets up the null and alternative, explains the logic of assuming the null and measuring how surprising the data is, defines the p-value carefully, and dismantles the misreadings that make it the most abused number in science: it is not the probability the null is true, significant is not important, and failing to reject is not proof.Sun, 24 May 2026 00:00:00 GMTClawdemy13:00falseLesson 13 of Track 9. Confidence intervals hinted a difference might be noise; hypothesis testing makes the call. This lesson sets up the null and alternative, explains the logic of assuming the null and measuring how surprising the data is, defines the p-value carefully, and dismantles the misreadings that make it the most abused number in science: it is not the probability the null is true, significant is not important, and failing to reject is not proof.Probability foundationshttps://clawdemy.org/lessons/statistics-and-probability/probability-foundations/lesson/https://clawdemy.org/lessons/statistics-and-probability/probability-foundations/lesson/Lesson 5 of Track 9 and the opener of Phase 2. A probability is a number from 0 to 1, and combining probabilities takes just three rules: the complement (and the at-least-one shortcut), the addition rule for OR (subtract the overlap), and the multiplication rule for independent ANDs. 
This lesson works each on dice, coins, and cards, flags that multiplication needs independence, and connects the rules to pipeline reliability and how a language model scores a sentence.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 5 of Track 9 and the opener of Phase 2. A probability is a number from 0 to 1, and combining probabilities takes just three rules: the complement (and the at-least-one shortcut), the addition rule for OR (subtract the overlap), and the multiplication rule for independent ANDs. This lesson works each on dice, coins, and cards, flags that multiplication needs independence, and connects the rules to pipeline reliability and how a language model scores a sentence.Random variables and expected valuehttps://clawdemy.org/lessons/statistics-and-probability/random-variables-and-expected-value/lesson/https://clawdemy.org/lessons/statistics-and-probability/random-variables-and-expected-value/lesson/Lesson 8 of Track 9 and the opener of Phase 3. A random variable is a number whose value comes from chance (a payoff, a count, a loss), and its expected value is the long-run average it settles toward. This lesson defines random variables and their distributions, computes expected value and variance by hand, and shows why expected value is the backbone of machine-learning objectives: the thing a loss function minimizes and a reward an agent maximizes.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 8 of Track 9 and the opener of Phase 3. A random variable is a number whose value comes from chance (a payoff, a count, a loss), and its expected value is the long-run average it settles toward. 
This lesson defines random variables and their distributions, computes expected value and variance by hand, and shows why expected value is the backbone of machine-learning objectives: the thing a loss function minimizes and a reward an agent maximizes.From sample to population: sampling and the central limit theoremhttps://clawdemy.org/lessons/statistics-and-probability/sampling-and-the-central-limit-theorem/lesson/https://clawdemy.org/lessons/statistics-and-probability/sampling-and-the-central-limit-theorem/lesson/Lesson 11 of Track 9 and the opener of Phase 4. Every number measured on a sample is an estimate that varies from sample to sample. This lesson separates a sample statistic from the population parameter it estimates, introduces the standard error (sigma over root n) and the square-root law behind 'more data helps,' and states the central limit theorem, the reason sample means are approximately normal for large enough samples, whatever the data's shape, which makes the rest of inference possible.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 11 of Track 9 and the opener of Phase 4. Every number measured on a sample is an estimate that varies from sample to sample. This lesson separates a sample statistic from the population parameter it estimates, introduces the standard error (sigma over root n) and the square-root law behind 'more data helps,' and states the central limit theorem, the reason sample means are approximately normal for large enough samples, whatever the data's shape, which makes the rest of inference possible.Statistics in machine learninghttps://clawdemy.org/lessons/statistics-and-probability/statistics-in-machine-learning/lesson/https://clawdemy.org/lessons/statistics-and-probability/statistics-in-machine-learning/lesson/Lesson 14 of Track 9, the capstone. 
It walks every tool from the track into a real machine-learning workflow: describing data, reading model outputs as conditional probabilities, expected value as the training objective, and the heart of it, evaluation as inference (a test set is a sample, a metric is an estimate with a confidence interval, comparing models is a hypothesis test). It draws a clean boundary to the Classical ML track for the model-scoring toolkit and closes on the through-line: statistics is the discipline of not fooling yourself about uncertainty.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 14 of Track 9, the capstone. It walks every tool from the track into a real machine-learning workflow: describing data, reading model outputs as conditional probabilities, expected value as the training objective, and the heart of it, evaluation as inference (a test set is a sample, a metric is an estimate with a confidence interval, comparing models is a hypothesis test). It draws a clean boundary to the Classical ML track for the model-scoring toolkit and closes on the through-line: statistics is the discipline of not fooling yourself about uncertainty.Summarizing data: center and spreadhttps://clawdemy.org/lessons/statistics-and-probability/summarizing-data-center-and-spread/lesson/https://clawdemy.org/lessons/statistics-and-probability/summarizing-data-center-and-spread/lesson/Lesson 2 of Track 9. Before any model learns, someone summarizes the data, and the summary can mislead. This lesson covers the two questions every summary answers (where is the center, how spread out is it), the mean-versus-median tradeoff under skew, how to compute variance and standard deviation by hand, and why standardizing features by their mean and standard deviation is one of machine learning's most common first steps.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 2 of Track 9. Before any model learns, someone summarizes the data, and the summary can mislead. 
This lesson covers the two questions every summary answers (where is the center, how spread out is it), the mean-versus-median tradeoff under skew, how to compute variance and standard deviation by hand, and why standardizing features by their mean and standard deviation is one of machine learning's most common first steps.Counts and trials: the binomial distributionhttps://clawdemy.org/lessons/statistics-and-probability/the-binomial-distribution/lesson/https://clawdemy.org/lessons/statistics-and-probability/the-binomial-distribution/lesson/Lesson 10 of Track 9 and the close of Phase 3. When you count successes in a fixed number of independent yes-or-no trials, the binomial distribution gives the probabilities. This lesson lays out the four conditions, builds the exactly-k probability formula, works it on coins and a model's accuracy, gives the n-times-p expected-count shortcut, separates exactly-k from at-least-k, and connects it to accuracy as a binomial count.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 10 of Track 9 and the close of Phase 3. When you count successes in a fixed number of independent yes-or-no trials, the binomial distribution gives the probabilities. This lesson lays out the four conditions, builds the exactly-k probability formula, works it on coins and a model's accuracy, gives the n-times-p expected-count shortcut, separates exactly-k from at-least-k, and connects it to accuracy as a binomial count.The bell curve: the normal distributionhttps://clawdemy.org/lessons/statistics-and-probability/the-normal-distribution/lesson/https://clawdemy.org/lessons/statistics-and-probability/the-normal-distribution/lesson/Lesson 9 of Track 9. The bell curve named in the histogram lesson gets made precise. 
This lesson explains how a continuous distribution carries probability as area under a curve, defines the normal by its mean and standard deviation, gives the 68-95-99.7 rule, formalizes the z-score as the standardization met earlier, and connects the normal to AI: feature standardization, the default model of noise, and outlier detection.Sun, 24 May 2026 00:00:00 GMTClawdemy12:00falseLesson 9 of Track 9. The bell curve named in the histogram lesson gets made precise. This lesson explains how a continuous distribution carries probability as area under a curve, defines the normal by its mean and standard deviation, gives the 68-95-99.7 rule, formalizes the z-score as the standardization met earlier, and connects the normal to AI: feature standardization, the default model of noise, and outlier detection.The shape of data: distributions and histogramshttps://clawdemy.org/lessons/statistics-and-probability/the-shape-of-data-distributions-and-histograms/lesson/https://clawdemy.org/lessons/statistics-and-probability/the-shape-of-data-distributions-and-histograms/lesson/Lesson 3 of Track 9. A center and spread summarize data, but a histogram shows its shape, and shape carries information no single number can. This lesson builds the histogram, names the shapes (symmetric, skewed, uniform, bimodal, bell), reconnects skew to the mean-versus-median gap, and shows why inspecting a feature's distribution before modeling catches outliers, hidden subpopulations, and class imbalance that summary numbers miss.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 3 of Track 9. A center and spread summarize data, but a histogram shows its shape, and shape carries information no single number can. 
This lesson builds the histogram, names the shapes (symmetric, skewed, uniform, bimodal, bell), reconnects skew to the mean-versus-median gap, and shows why inspecting a feature's distribution before modeling catches outliers, hidden subpopulations, and class imbalance that summary numbers miss.When two things move together: correlationhttps://clawdemy.org/lessons/statistics-and-probability/when-two-things-move-together-correlation/lesson/https://clawdemy.org/lessons/statistics-and-probability/when-two-things-move-together-correlation/lesson/Lesson 4 of Track 9 and the close of Phase 1. Correlation measures how tightly two quantities move together; this lesson reads the scatterplot, interprets the correlation coefficient between -1 and +1, warns that it sees only straight lines, and spends real time on the most misused idea in data analysis: correlation is not causation. It connects to machine learning (redundant features, spurious signals) and draws a clean line to where prediction proper lives, the Classical Machine Learning track.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 4 of Track 9 and the close of Phase 1. Correlation measures how tightly two quantities move together; this lesson reads the scatterplot, interprets the correlation coefficient between -1 and +1, warns that it sees only straight lines, and spends real time on the most misused idea in data analysis: correlation is not causation. It connects to machine learning (redundant features, spurious signals) and draws a clean line to where prediction proper lives, the Classical Machine Learning track.Why AI runs on statisticshttps://clawdemy.org/lessons/statistics-and-probability/why-ai-runs-on-statistics/lesson/https://clawdemy.org/lessons/statistics-and-probability/why-ai-runs-on-statistics/lesson/The opener of Track 9 (Statistics & Probability for AI). 
Every AI system speaks in probabilities, not certainties: a spam filter says 98% spam, a model reports 0.91 confidence, a recommender ranks by likelihood. This orientation lesson situates statistics and probability as the language AI uses to reason under uncertainty. It explains why uncertainty is unavoidable, splits the two directions of statistical reasoning (probability forward, statistics backward), maps where each idea in the track shows up inside real systems, and works the base-rate example to show why a 99%-accurate test can be right only half the time.Sun, 24 May 2026 00:00:00 GMTClawdemy10:00falseThe opener of Track 9 (Statistics & Probability for AI). Every AI system speaks in probabilities, not certainties: a spam filter says 98% spam, a model reports 0.91 confidence, a recommender ranks by likelihood. This orientation lesson situates statistics and probability as the language AI uses to reason under uncertainty. It explains why uncertainty is unavoidable, splits the two directions of statistical reasoning (probability forward, statistics backward), maps where each idea in the track shows up inside real systems, and works the base-rate example to show why a 99%-accurate test can be right only half the time.The chain rule, visuallyhttps://clawdemy.org/lessons/visual-math-calculus/chain-rule-visually/lesson/https://clawdemy.org/lessons/visual-math-calculus/chain-rule-visually/lesson/Lesson 6 of Track 8 (Visual Math: Calculus). The product rule handled functions multiplied; the chain rule handles functions nested one inside another, like sin(x^2). It says rates multiply through a composition: d/dx(f(g(x))) = f'(g(x)) * g'(x), the outer derivative (evaluated at the inner function) times the inner derivative. 
The lesson reads a composition as a pipeline whose stage-rates compound, drills the classic 'evaluated at the inner function' error, works several examples (polynomial, trig-with-power, double nesting, an e preview), and shows that this is the single most-used calculus rule in machine learning because backpropagation is the chain rule applied through a network's layers.Sun, 24 May 2026 00:00:00 GMTClawdemy11:00falseLesson 6 of Track 8 (Visual Math: Calculus). The product rule handled functions multiplied; the chain rule handles functions nested one inside another, like sin(x^2). It says rates multiply through a composition: d/dx(f(g(x))) = f'(g(x)) * g'(x), the outer derivative (evaluated at the inner function) times the inner derivative. The lesson reads a composition as a pipeline whose stage-rates compound, drills the classic 'evaluated at the inner function' error, works several examples (polynomial, trig-with-power, double nesting, an e preview), and shows that this is the single most-used calculus rule in machine learning because backpropagation is the chain rule applied through a network's layers.The essence of calculushttps://clawdemy.org/lessons/visual-math-calculus/essence-of-calculus/lesson/https://clawdemy.org/lessons/visual-math-calculus/essence-of-calculus/lesson/Lesson 1 of Track 8 (Visual Math: Calculus), and the orientation for the whole track. You know the area of a circle is πR², but almost nobody can say why. Rebuilding it from scratch turns out to contain all of calculus in miniature. 
This lesson slices the disk into thin rings, unrolls each into a rectangle of area about 2πr·dr, sums them into the area under the line 2πr (a triangle that works out to exactly πR²), and in doing so names the two pillars (rates and accumulation) and the surprising fact that they are inverses, the Fundamental Theorem of Calculus, seen on a circle before any term is defined carefully.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Implicit differentiation
https://clawdemy.org/lessons/visual-math-calculus/implicit-differentiation/lesson/
Lesson 8 of Track 8 (Visual Math: Calculus). Every derivative so far assumed you could write y as a clean function of x, but most real relations (like the circle x^2 + y^2 = 25) tie x and y together without untangling. Implicit differentiation finds the slope anyway, and it is just the chain rule applied to a relationship: treat y as a function of x, differentiate both sides, attach a dy/dx to every y term, and solve. The lesson works the circle (dy/dx = -x/y, checked perpendicular to the radius), derives the ln(x) derivative from e^y = x, handles a relation that cannot be untangled, and introduces related rates (the sliding ladder) as the time-based twin.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Integration and the fundamental theorem
https://clawdemy.org/lessons/visual-math-calculus/integration-and-the-fundamental-theorem/lesson/
Lesson 10 of Track 8 (Visual Math: Calculus), opening Phase 3. The first lesson found a circle's area by slicing it into rings, integration done informally. This lesson makes accumulation precise: it defines the definite integral as a limit of thin rectangles (a Riemann sum) and states the fundamental theorem of calculus, which ties accumulation to differentiation. To add up a quantity over a range, find a function whose rate of change is that quantity (an antiderivative) and subtract its endpoint values: integral from a to b of f = F(b) - F(a). Antiderivatives are the derivative rules run backward, and the lesson closes the circle by computing the integral of 2*pi*r as pi*R^2.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00 | false

Limits, done carefully
https://clawdemy.org/lessons/visual-math-calculus/limits-done-carefully/lesson/
Lesson 9 of Track 8 (Visual Math: Calculus), closing Phase 2. Every derivative in this track has secretly been a limit, the value the rise-over-run ratio approaches as the interval shrinks. This lesson examines the limit itself: what 'approaches' really means, made precise by the epsilon-delta idea (for any demanded precision, an input window exists), and how L'Hopital's rule rescues the awkward 0/0 and infinity/infinity forms the rate definition keeps producing. It works several limits (sin x / x = 1, (e^x - 1)/x = 1, a twice-applied case = 1/2, (ln x)/x = 0), shows why the rule works (leading first-order behavior), and bundles three short source chapters into one.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00 | false

The power rule from geometry
https://clawdemy.org/lessons/visual-math-calculus/power-rule-from-geometry/lesson/
Lesson 3 of Track 8 (Visual Math: Calculus). Last lesson computed derivatives by grinding through binomial expansions; the answers (2t for t-squared, 3t-squared for t-cubed) hide a pattern, the power rule. This lesson shows where it comes from by reasoning about growing squares and cubes: nudge the side of a square and you add two strips plus a vanishing corner, nudge a cube and you add three slabs. So d/dt(t^n) = n*t^(n-1), where n counts the faces that grow and t^(n-1) is each face's size. It extends to negative and fractional powers, adds the constant-multiple and sum rules, and turns polynomial differentiation into a quick scan.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

The product rule, visually
https://clawdemy.org/lessons/visual-math-calculus/product-rule-visually/lesson/
Lesson 5 of Track 8 (Visual Math: Calculus), opening Phase 2.
When two functions are multiplied, the natural guess for the derivative (multiply the derivatives) is wrong. The right answer, the product rule d/dx(f*g) = f'*g + f*g', has two terms, and one picture shows why: let f be a rectangle's width and g its height, so f*g is its area; nudging x adds a top strip and a side strip (the two terms) plus a tiny corner block that vanishes (exactly the wrong f'*g' guess). The lesson works several examples, cross-checks against the power rule, and extends to three or more factors (one term per factor).
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

The derivative as a rate
https://clawdemy.org/lessons/visual-math-calculus/the-derivative-as-a-rate/lesson/
Lesson 2 of Track 8 (Visual Math: Calculus). A derivative is supposed to be the rate of change at a single instant, but over an instant nothing changes, so how can there be a rate? This lesson resolves that paradox with one idea: the derivative is the value the average rate (rise over run) approaches as the measuring interval shrinks to zero.
It computes a free-fall velocity and the derivative of t-cubed from scratch, shows the secant line pivoting into the tangent (so 'rate at an instant' becomes 'slope at a point'), and demystifies dy/dx as limit notation rather than a fraction of infinitesimals.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Trig derivatives from geometry
https://clawdemy.org/lessons/visual-math-calculus/trig-derivatives-from-geometry/lesson/
Lesson 4 of Track 8 (Visual Math: Calculus). The power rule handled powers of t, but sine and cosine are not powers of anything, so they get their own picture: a point moving around the unit circle at unit speed. From that single image both trig derivatives fall out, d/dx(sin x) = cos x and d/dx(cos x) = -sin x, with the minus on cosine because its coordinate shrinks as the point climbs. The lesson reads the derivatives off the point's velocity (the position rotated a quarter turn), sanity-checks them against the curve shapes, and shows two payoffs: the small-angle approximation sin(x) ≈ x and the f'' = -f equation that makes sine the universal shape of oscillation. Radians are what keep it all free of stray factors.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Why area equals slope
https://clawdemy.org/lessons/visual-math-calculus/why-area-equals-slope/lesson/
Lesson 11 of Track 8 (Visual Math: Calculus). Last lesson stated the fundamental theorem and showed how to use it; this lesson explains why it is true, with one geometric observation. Define the area function A(x) = integral from a to x of f, the area accumulated up to a moving right end. Extend it by a sliver dx and the new area is a thin rectangle of height f(x), so A(x+dx) - A(x) is about f(x)*dx, and in the limit A'(x) = f(x): the derivative of the area function is the original curve. That single fact is the fundamental theorem, and it shows why integration and differentiation are inverse operations: the slope of an accumulated area is the thing being accumulated.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Why e is special
https://clawdemy.org/lessons/visual-math-calculus/why-e-is-special/lesson/
Lesson 7 of Track 8 (Visual Math: Calculus). Everyone knows e is about 2.718, and almost nobody knows why that number earns its own letter. The answer is not its digits but a behavior: e is the one base for which the exponential is its own derivative, d/dx(e^x) = e^x. This lesson shows that the derivative of any exponential a^x is M(a)*a^x, that the multiplier M(a) crosses 1 between bases 2 and 3 (the crossing point being e), and that combined with the chain rule it gives d/dx(e^(kx)) = k*e^(kx), the solution to 'rate proportional to value'. That single property is why e threads through growth, decay, softmax, and the sigmoid.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Deriving the 3D cross product from duality
https://clawdemy.org/lessons/visual-math-linear-algebra/3d-cross-product-via-duality/lesson/
Lesson 11 of Track 4 (Visual Math: Linear Algebra). The 3D cross product has a formula that looks like something you just have to memorize. You do not. This lesson derives it from scratch by combining the duality idea from the dot-product lesson with the determinant-as-volume idea, and the famous criss-cross formula, along with its three geometric properties, falls out on its own.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00 | false

Stepping up to 3D
https://clawdemy.org/lessons/visual-math-linear-algebra/3d-transformations/lesson/
Lesson 5 of Track 4 (Visual Math: Linear Algebra), and the close of Phase 1. Everything so far lived on a flat plane. This lesson steps into three dimensions and shows that almost nothing changes: a third basis vector, a third column, one more number per vector, and every rule you already know carries straight over. The same leap takes you to the hundreds of dimensions a real model uses.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Vectors that aren't arrows, abstract vector spaces
https://clawdemy.org/lessons/visual-math-linear-algebra/abstract-vector-spaces/lesson/
Lesson 15 of Track 4 (Visual Math: Linear Algebra), the capstone. The very first lesson said a vector is anything you can add and scale coherently, even if it is not an arrow or a list. This final lesson cashes that promise: functions and polynomials are vectors too, the derivative is an honest matrix, and every tool you built across the track works on objects you cannot draw, including the high-dimensional spaces AI actually lives in.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00 | false

Coordinates as a choice, change of basis
https://clawdemy.org/lessons/visual-math-linear-algebra/change-of-basis/lesson/
Lesson 13 of Track 4 (Visual Math: Linear Algebra). A vector's coordinates are not a fact about the vector; they are a description relative to a basis you happened to choose.
This lesson makes that operational: how to translate a vector's coordinates from one basis to another and back, and how the same transformation gets a different matrix in a different basis via the M-inverse-A-M sandwich.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 12:00 | false

Solving by area ratios, Cramer's rule
https://clawdemy.org/lessons/visual-math-linear-algebra/cramers-rule/lesson/
Lesson 12 of Track 4 (Visual Math: Linear Algebra). Several lessons ago we said the solution to a linear system is the inverse times the target, but never computed it. Cramer's rule is one way to get the answer directly from the matrix entries, and it falls out of one idea you already have: a linear transformation scales every area by its determinant.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Cross products as signed area
https://clawdemy.org/lessons/visual-math-linear-algebra/cross-products/lesson/
Lesson 10 of Track 4 (Visual Math: Linear Algebra), opening Phase 3.
The dot product measured how much two vectors line up; the cross product measures how much they spread apart, the area they span, with a sign that records which way they turn. In 2D it is one signed number, and it turns out to be exactly the determinant you already know.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 9:00 | false

The determinant
https://clawdemy.org/lessons/visual-math-linear-algebra/determinant/lesson/
Lesson 6 of Track 4 (Visual Math: Linear Algebra), opening Phase 2. A linear transformation stretches and squashes space; the determinant is the single number that says by how much, and whether it flips space inside out. This lesson builds that number from the area of the unit square, derives the ad-bc formula, and shows why a zero determinant signals a collapse that cannot be undone.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Dot products and projection
https://clawdemy.org/lessons/visual-math-linear-algebra/dot-products/lesson/
Lesson 9 of Track 4 (Visual Math: Linear Algebra), closing Phase 2.
The dot product turns two vectors into a single number, and it has two formulas that look unrelated yet always agree. This lesson computes it both ways, explains why they match (duality), and cashes the promise from the very first lesson about how AI compares vectors in attention, cosine similarity, and inside every neuron.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

The stubborn vectors, eigenvectors and eigenvalues
https://clawdemy.org/lessons/visual-math-linear-algebra/eigenvectors-and-eigenvalues/lesson/
Lesson 14 of Track 4 (Visual Math: Linear Algebra). When a transformation moves the plane, most vectors get knocked off their own line. A few stubborn ones stay on their line and only get scaled. Those are eigenvectors, the scaling factor is the eigenvalue, and in the eigenvector basis the transformation becomes a clean diagonal matrix, the simplest it can look.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 13:00 | false

Undoing a transformation, and when you cannot
https://clawdemy.org/lessons/visual-math-linear-algebra/inverses-column-space-null-space/lesson/
Lesson 7 of Track 4 (Visual Math: Linear Algebra). Last lesson ended on a warning: when the determinant is zero, information is lost. This lesson makes that precise. It builds the inverse (the undo button), shows it exists only when the determinant is nonzero, and introduces the two ideas that explain exactly what a collapse destroys: column space (everything reachable) and null space (everything crushed to zero).
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Linear transformations as moves
https://clawdemy.org/lessons/visual-math-linear-algebra/linear-transformations/lesson/
Lesson 3 of Track 4 (Visual Math: Linear Algebra). A matrix looks like a grid of numbers with no obvious meaning. This lesson shows what it actually is: a record of where the two basis vectors land. That single idea turns matrix-vector multiplication from a rule you memorize into a picture you can sketch, and lets you read what any 2x2 matrix does to space straight off its columns.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Matrix multiplication as composition
https://clawdemy.org/lessons/visual-math-linear-algebra/matrix-multiplication/lesson/
Lesson 4 of Track 4 (Visual Math: Linear Algebra). Matrix multiplication has a reputation as an arbitrary rows-times-columns rule. It is not arbitrary: multiplying two matrices means doing one transformation, then another. This lesson shows why the product is computed the way it is, why you read it right to left, why order matters (AB is not BA), and why grouping does not.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Matrices between dimensions
https://clawdemy.org/lessons/visual-math-linear-algebra/nonsquare-matrices/lesson/
Lesson 8 of Track 4 (Visual Math: Linear Algebra). Every matrix so far has been square, taking a space back to a space of the same size. Drop that assumption.
A rectangular matrix moves between dimensions, embedding a small space into a bigger one or projecting a big space down into a smaller one, and the rules you already know (columns, rank, null space) still tell the whole story.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Spans and basis
https://clawdemy.org/lessons/visual-math-linear-algebra/spans-and-basis/lesson/
Lesson 2 of Track 4 (Visual Math: Linear Algebra). Give yourself a couple of vectors and the only two operations you know, adding and scaling, and ask which points you can reach. The answer is the span, and it leads straight to a basis (the smallest set that reaches everything), to linear independence, and to what the dimension of a space really means.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 9:00 | false

What vectors actually are
https://clawdemy.org/lessons/visual-math-linear-algebra/what-vectors-actually-are/lesson/
The opener of Track 4 (Visual Math: Linear Algebra).
The word vector means an arrow in physics, a list of numbers in code, and an abstract object in a math textbook, and this lesson shows they are one object seen from three angles. It connects the arrow and the list through a coordinate system, pins down the two operations (addition and scaling) that actually define a vector, and shows why this single idea is the atom that everything later in AI math is built from.
Sun, 24 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

What transformers do, and why they took over AI
https://clawdemy.org/lessons/practical-transformers/what-transformers-do/lesson/
Track 14 opens here. The thing that wrote back to you in a chat box this week was almost certainly a transformer, a specific architecture from 2017. This lesson gives the working description (tokens in, tokens out, attention in the middle), explains why transformers replaced the older sequential models, sorts the three architectural shapes you will meet, walks a short timeline, separates the expensive pre-training step from cheap fine-tuning, names the limits honestly, and places the Hugging Face ecosystem the rest of the track is built on. No math required.
Sat, 23 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Agents that retrieve their own information: agentic RAG
https://clawdemy.org/lessons/ai-agents-and-tool-use/agentic-rag/lesson/
Lesson 6 of Track 20 (AI Agents and Tool Use). An agent that answers from your documents has to find the right passage first. The classic technique, RAG, is a fixed pipeline: always retrieve, then answer. This lesson shows the agentic turn, where retrieval becomes a tool the agent decides whether and when to call, can run more than once, and can judge whether what came back is good enough. It builds the classical-versus-agentic contrast on a worked example, names the cost of that adaptability, and shows that agentic RAG is assembled entirely from earlier pieces (the loop, the tool call, the tool definition, and memory).
Fri, 22 May 2026 00:00:00 GMT | Clawdemy | 10:00 | false

Building trustworthy agents
https://clawdemy.org/lessons/ai-agents-and-tool-use/building-trustworthy-agents/lesson/
Lesson 10 of Track 20 (AI Agents and Tool Use), and the opener of Phase 3. An agent that works in a demo is not the same as one you can put in front of real users. Even with no attacker in sight, agents fail in characteristic ways: they call tools that do not exist, loop without making progress, answer wrongly with full confidence, mishandle tool errors, act on missing information, and over-share data. This lesson names those six own-failure modes and the guardrail that contains each, gives the blast-radius principle for human-in-the-loop, and draws a clean line between trustworthiness (the agent failing on its own) and security (the agent under attack), which the next lesson covers.
Fri, 22 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Choosing an agent framework
https://clawdemy.org/lessons/ai-agents-and-tool-use/choosing-an-agent-framework/lesson/
Lesson 3 of Track 20 (AI Agents and Tool Use). Lessons 1 and 2 showed an agent is a loop you could write yourself, so the first real question is not which framework to use but whether to use one at all. This lesson frames the honest hand-roll-versus-framework tradeoff, then surveys the framework landscape by category (orchestration/multi-agent, retrieval-first, graph/state-machine, managed service) so you can match a framework to your task by fit rather than by popularity, and closes on why understanding the loop keeps you free of any one library.
Fri, 22 May 2026 00:00:00 GMT | Clawdemy | 11:00 | false

Giving agents memory
https://clawdemy.org/lessons/ai-agents-and-tool-use/giving-agents-memory/lesson/
Lesson 5 of Track 20 (AI Agents and Tool Use). Every agent so far started fresh and forgot you the moment it answered.
This lesson separates the two kinds of memory an agent has: the short-term context that lasts one run, and the persistent memory that survives across runs. Then it tackles the harder question, which is not how to store memory but what is actually worth remembering, with the costs of over-remembering (context, staleness, privacy) and the kinds of facts worth keeping.Fri, 22 May 2026 00:00:00 GMTClawdemy11:00falseHow tool use turns a model into an agenthttps://clawdemy.org/lessons/ai-agents-and-tool-use/how-tool-use-turns-a-model-into-an-agent/lesson/https://clawdemy.org/lessons/ai-agents-and-tool-use/how-tool-use-turns-a-model-into-an-agent/lesson/Lesson 2 of Track 20 (AI Agents and Tool Use). Lesson 1 said an agent is a model in a loop with tools; this lesson opens up the move that powers it. A language model cannot run code, so how does it call a tool? It does not: it emits a structured request, and the loop around it runs the tool and feeds the result back. The lesson traces that four-step exchange (describe, request, execute, decide) end to end, watches the model choose between tools, shows how a tool failure is just another result, and explains why a tool call is only text in an agreed shape.Fri, 22 May 2026 00:00:00 GMTClawdemy11:00false
Agents that self-check: metacognitionhttps://clawdemy.org/lessons/ai-agents-and-tool-use/metacognition/lesson/https://clawdemy.org/lessons/ai-agents-and-tool-use/metacognition/lesson/Lesson 9 of Track 20 (AI Agents and Tool Use), and the closer of Phase 2. The previous lesson made agents more reliable by adding more of them. This one does it with a cheaper move: have a single agent check its own work before committing. Metacognition is an agent thinking about its own thinking, a reflection step where it asks whether its answer or plan is actually right. The lesson connects the self-correction threads from earlier lessons, weighs reflection against adding an agent, and stays honest about what a second look can and cannot catch.Fri, 22 May 2026 00:00:00 GMTClawdemy10:00falseMany agents working together: multi-agent systemshttps://clawdemy.org/lessons/ai-agents-and-tool-use/multi-agent-systems/lesson/https://clawdemy.org/lessons/ai-agents-and-tool-use/multi-agent-systems/lesson/Lesson 8 of Track 20 (AI Agents and Tool Use). 
A plan's sub-tasks often look like separate jobs, which raises a tempting question: should each go to its own agent? Multi-agent systems split work across several specialized agents that coordinate. This lesson is about when that actually helps and when it just adds overhead. The honest framing is fit, not ranking: coordination is not free, and many tasks are better served by one well-designed generalist agent.Fri, 22 May 2026 00:00:00 GMTClawdemy11:00falsePlanning: breaking a goal into stepshttps://clawdemy.org/lessons/ai-agents-and-tool-use/planning/lesson/https://clawdemy.org/lessons/ai-agents-and-tool-use/planning/lesson/Lesson 7 of Track 20 (AI Agents and Tool Use). Every agent so far decided one move at a time, reacting to whatever just happened. That works until the task is too big to wing. This lesson is the planning turn: the agent looks ahead, breaks a goal into an ordered set of sub-tasks, and decides the shape of the work before it starts. It traces a plan being built and executed, shows how an agent replans when reality does not cooperate, and gives the rule for getting the grain size of the steps right.Fri, 22 May 2026 00:00:00 GMTClawdemy10:00false
The tool-use design pattern in depthhttps://clawdemy.org/lessons/ai-agents-and-tool-use/the-tool-use-design-pattern-in-depth/lesson/https://clawdemy.org/lessons/ai-agents-and-tool-use/the-tool-use-design-pattern-in-depth/lesson/Lesson 4 of Track 20 (AI Agents and Tool Use), opening Phase 2. Lesson 2 showed how a tool call works; this lesson is how to get the model to make the right call, which comes down to the tool definition. The model picks tools and fills arguments from their descriptions and nothing else, so a tool it misuses is almost always one described badly. The lesson walks bad-versus-good tool definitions, fixes vague parameters, adds negative guidance, draws boundaries between overlapping tools, and makes outputs legible.Fri, 22 May 2026 00:00:00 GMTClawdemy11:00falseWhat makes an AI an "agent"https://clawdemy.org/lessons/ai-agents-and-tool-use/what-makes-an-ai-an-agent/lesson/https://clawdemy.org/lessons/ai-agents-and-tool-use/what-makes-an-ai-an-agent/lesson/The opener of Track 20 (AI Agents and Tool Use). Everyone says AI agents will change everything, but few stop to define an agent. This lesson gives a definition that holds up: an agent is a model wrapped in a perceive-decide-act loop with tools. 
It contrasts an agent with the chatbot you already use, traces the loop through a multi-step task, takes apart the four-part anatomy that makes a plain model agentic (model, system prompt, tools, loop), places the idea in the older history of AI agents, and gives an honest test for when a task actually warrants an agent.Fri, 22 May 2026 00:00:00 GMTClawdemy11:00falseThe handwritten-digit problemhttps://clawdemy.org/lessons/neural-network-intuition/the-handwritten-digit-problem/lesson/https://clawdemy.org/lessons/neural-network-intuition/the-handwritten-digit-problem/lesson/The opener of Track 11 (Neural Network Intuition). Recognizing a messy handwritten 3 is effortless for you and brutally hard to write as a computer program. This lesson shows why a digit is just a grid of brightness numbers to a computer, why rule-writing falls apart on real handwriting, why handwritten digits became the classic first problem in machine learning, and the paradigm shift that powers almost all of modern AI: stop writing rules, start showing labeled examples.Fri, 22 May 2026 00:00:00 GMTClawdemy8:00false
BERT, part one: the bidirectional encoder and its structural tokenshttps://clawdemy.org/lessons/ai-foundations/bert-architecture/lesson/https://clawdemy.org/lessons/ai-foundations/bert-architecture/lesson/Lesson 8 of Phase 2 (How models think) in Track 5. BERT is the encoder-only branch's defining model. This lesson covers the architecture only: dropping the decoder, why bidirectional self-attention (no causal mask) is the move, the structural tokens (CLS, SEP) that shape the input, and the three additive embeddings (token + position + segment). Pretraining objectives and fine-tuning patterns are in the next lesson.Sat, 09 May 2026 00:00:00 GMTClawdemy13:00falseBERT, part two: pretraining objectives and the train-then-fine-tune workflowhttps://clawdemy.org/lessons/ai-foundations/bert-pretraining-and-fine-tuning/lesson/https://clawdemy.org/lessons/ai-foundations/bert-pretraining-and-fine-tuning/lesson/Lesson 9 of Phase 2 (How models think) in Track 5. Bidirectionality forced new pretraining objectives. This lesson walks MLM with the 80/10/10 mix, NSP, the two-stage train-then-fine-tune workflow, and the two common fine-tuning patterns (CLS-head classification, per-token span detection). 
The previous lesson covered BERT's architecture; this lesson is what it was trained to do.Sat, 09 May 2026 00:00:00 GMTClawdemy13:00falseHow chain of thought makes models think out loudhttps://clawdemy.org/lessons/ai-foundations/chain-of-thought-prompting/lesson/https://clawdemy.org/lessons/ai-foundations/chain-of-thought-prompting/lesson/Phase 5 closer in our adaptation of Stanford CME 295 Lectures 3 and 6. Asking a model to produce reasoning steps before its answer reliably improves accuracy on multi-step problems. This lesson covers what chain-of-thought prompting is, the two flavors (zero-shot and few-shot), why it works, when it fails, and how it sets up Phase 6's reasoning models.Fri, 08 May 2026 00:00:00 GMTClawdemy12:00falseHow agent loops workhttps://clawdemy.org/lessons/ai-foundations/how-agent-loops-work/lesson/https://clawdemy.org/lessons/ai-foundations/how-agent-loops-work/lesson/Phase 6 closer in our adaptation of Stanford CME 295 Lecture 7. An agent is a tool-using LLM that loops. 
This lesson covers the observe-plan-act pattern, how multiple tool calls compose into longer-horizon work, the multi-agent setting and the A2A protocol, and the safety threads (data exfiltration, prompt injection, tool misuse) that weave through everything.Fri, 08 May 2026 00:00:00 GMTClawdemy13:00falseHow models call functionshttps://clawdemy.org/lessons/ai-foundations/how-models-call-functions/lesson/https://clawdemy.org/lessons/ai-foundations/how-models-call-functions/lesson/Phase 6 lesson on function calling and tool use in our adaptation of Stanford CME 295 Lecture 7. RAG fetched unstructured text. Function calling fetches structured data from APIs (or triggers structured actions). This lesson covers the three-stage mechanism, how function-calling models are trained, and what the LLM actually sees.Fri, 08 May 2026 00:00:00 GMTClawdemy13:00falseHow models know word orderhttps://clawdemy.org/lessons/ai-foundations/how-models-know-word-order/lesson/https://clawdemy.org/lessons/ai-foundations/how-models-know-word-order/lesson/The Phase 1 closer to the 'how text gets read' arc. Self-attention processes all tokens in parallel and loses the implicit position signal that older recurrent models had for free. 
The 2017 transformer paper added position information back as a vector (sinusoidal or learned) added to the input embedding. This lesson covers why position info has to exist at all and what the original two answers were, deliberately stopping before the modern attention-injected schemes (Phase 2 picks those up after attention is taught).Fri, 08 May 2026 00:00:00 GMTClawdemy12:00falseHow reasoning models think differentlyhttps://clawdemy.org/lessons/ai-foundations/how-reasoning-models-think/lesson/https://clawdemy.org/lessons/ai-foundations/how-reasoning-models-think/lesson/Phase 6 opener in our adaptation of Stanford CME 295 Lecture 6. Reasoning models are trained to produce long internal reasoning chains as part of their policy, not just when prompted. This lesson covers what makes them different from standard LLMs, the compute-budget framing, the major reasoning benchmarks (AIME, GSM8K, HumanEval, SWE-bench, CodeForces), and how to read a Pass@K claim correctly.Fri, 08 May 2026 00:00:00 GMTClawdemy13:00false
How we evaluate models, LLM-as-a-Judgehttps://clawdemy.org/lessons/ai-foundations/how-we-evaluate-models/lesson/https://clawdemy.org/lessons/ai-foundations/how-we-evaluate-models/lesson/Phase 7 opener in our adaptation of Stanford CME 295 Lecture 8. Evaluating an LLM is itself an LLM-shaped problem. This lesson covers the LLM-as-a-Judge pattern (one LLM rates another), how it's set up in practice, and the three named biases (position, verbosity, self-enhancement) that production LaaJ systems must defend against.Fri, 08 May 2026 00:00:00 GMTClawdemy12:00falseHow few-shot examples teach in contexthttps://clawdemy.org/lessons/ai-foundations/in-context-learning-and-few-shot/lesson/https://clawdemy.org/lessons/ai-foundations/in-context-learning-and-few-shot/lesson/Phase 5 lesson on in-context learning and few-shot prompting in our adaptation of Stanford CME 295 Lecture 3. The model's weights are frozen at inference; you can still shape its immediate behavior by putting examples in the prompt. This lesson covers what zero-shot, one-shot, and few-shot mean, why in-context learning works at all, when examples help, and when detailed instructions can do better.Fri, 08 May 2026 00:00:00 GMTClawdemy12:00false
New ways to generate, speculative decoding and diffusion LLMshttps://clawdemy.org/lessons/ai-foundations/new-ways-to-generate/lesson/https://clawdemy.org/lessons/ai-foundations/new-ways-to-generate/lesson/Phase 7 lesson on alternatives to standard autoregressive generation in our adaptation of Stanford CME 295 Lectures 3 and 9. Speculative decoding speeds up generation while preserving the target model's output distribution. Diffusion LLMs borrow from image generation: start from all-mask, denoise into text in parallel refinement passes.Fri, 08 May 2026 00:00:00 GMTClawdemy12:00falseHow RLHF and DPO align modelshttps://clawdemy.org/lessons/ai-foundations/rlhf-and-dpo/lesson/https://clawdemy.org/lessons/ai-foundations/rlhf-and-dpo/lesson/The Phase 4 closer in our adaptation of Stanford CME 295 Lecture 5. The reward model from the previous lesson can score completions but cannot update an LLM by itself. This lesson covers the two methods that close that gap: RLHF (using PPO and a KL penalty against the reference model) and DPO (the supervised shortcut that derives the same objective without a reward model).Fri, 08 May 2026 00:00:00 GMTClawdemy14:00false
Transformers beyond text, ViT and Mixture-of-Expertshttps://clawdemy.org/lessons/ai-foundations/transformers-beyond-text/lesson/https://clawdemy.org/lessons/ai-foundations/transformers-beyond-text/lesson/Phase 7 lesson on transformer adaptations in our adaptation of Stanford CME 295 Lecture 9. The transformer block has been reused for non-text inputs (Vision Transformers) and rewired for sparse routing (Mixture-of-Experts). This lesson covers what each enables and why both matter for understanding modern AI.Fri, 08 May 2026 00:00:00 GMTClawdemy11:00falseWhere to be careful, a safety lens on what you've learnedhttps://clawdemy.org/lessons/ai-foundations/where-to-be-careful/lesson/https://clawdemy.org/lessons/ai-foundations/where-to-be-careful/lesson/Track 5 closer. A pull-together of every safety thread woven through Phases 4 through 7: alignment and reward hacking (Phase 4), prompt injection (Phase 5), data exfiltration and tool misuse (Phase 6), evaluation biases (Phase 7). The lesson names what was woven so a coherent safety frame remains.Fri, 08 May 2026 00:00:00 GMTClawdemy13:00false
Why benchmarks can misleadhttps://clawdemy.org/lessons/ai-foundations/why-benchmarks-can-mislead/lesson/https://clawdemy.org/lessons/ai-foundations/why-benchmarks-can-mislead/lesson/Phase 7 lesson on benchmark literacy in our adaptation of Stanford CME 295 Lecture 8. Benchmark numbers are easy to compare and easy to get wrong about. This lesson covers the major benchmark categories (knowledge, reasoning, coding, common sense), what each one actually measures, and the structural reasons benchmark scores can rise faster than real capability.Fri, 08 May 2026 00:00:00 GMTClawdemy12:00falseWhy tool-using models failhttps://clawdemy.org/lessons/ai-foundations/why-tool-using-models-fail/lesson/https://clawdemy.org/lessons/ai-foundations/why-tool-using-models-fail/lesson/Phase 7 lesson on tool-use failure modes in our adaptation of Stanford CME 295 Lecture 8. Tool-use failures fall into three buckets: tool-prediction errors (the LLM picked wrong), tool-execution errors (the tool itself misbehaved), and synthesis errors (the LLM mishandled the structured response). This lesson walks all three with named sub-failures and the lecturer's debugging methodology.Fri, 08 May 2026 00:00:00 GMTClawdemy13:00false
How preferences become reward signalshttps://clawdemy.org/lessons/ai-foundations/preferences-into-reward-signals/lesson/https://clawdemy.org/lessons/ai-foundations/preferences-into-reward-signals/lesson/The second lesson of Phase 4 in our adaptation of Stanford CME 295 Lecture 5. SFT teaches the model what to predict, not what not to predict. This lesson covers how that gap is filled: what a preference pair is, why pairwise comparison is the standard collection format, and how the resulting data is used to train a reward model. The reward model is stage one of RLHF and the bridge between human preferences and the RL update in the next lesson.Thu, 07 May 2026 00:00:00 GMTClawdemy19:00falsePretraining: how a model learns language by predicting the next wordhttps://clawdemy.org/lessons/ai-foundations/how-models-are-pretrained/lesson/https://clawdemy.org/lessons/ai-foundations/how-models-are-pretrained/lesson/Lesson 1 of Phase 3 (How models are trained at scale) in Track 5. Pretraining is the most expensive single thing in modern AI (millions of dollars per run, months of GPU time on large clusters), and for the decoder-only models that dominate generative AI today, also the simplest. Feed the model the open internet, ask it to predict the next token, repeat trillions of times. 
The lesson traces the path from older one-model-per-task paradigms to the transfer-learning shape we have today, walks one training step concretely (the cat sat on the [_]: predicted distribution over vocabulary, training signal is whatever was actually next, cross-entropy loss is the negative log of the probability the model assigned to the right answer), names Common Crawl + code repositories + books as the dominant data sources, and grounds the scale (Llama 4 Scout ~40T tokens, frontier scale roughly doubled to tripled since Llama 3's 15T).Wed, 06 May 2026 00:00:00 GMTClawdemy22:00falseWhy pretraining is a memory engineering problem (parallelism and Flash Attention)https://clawdemy.org/lessons/ai-foundations/parallelism-and-flash-attention/lesson/https://clawdemy.org/lessons/ai-foundations/parallelism-and-flash-attention/lesson/Lesson 3 of Phase 3 (How models learn from text: pretraining and scale) in Track 5. A Chinchilla-aligned pretraining run does not fit on one GPU, and attention turns out to be memory-bound rather than compute-bound. 
This lesson covers the four engineering tricks that make Chinchilla-scale training tractable on real hardware: data parallelism, the ZeRO optimization, model parallelism, and Flash Attention. Three distribute memory across many GPUs; one rearranges the memory hierarchy inside a single GPU.Wed, 06 May 2026 00:00:00 GMTClawdemy26:00falseWhy precision matters: quantization and mixed precisionhttps://clawdemy.org/lessons/ai-foundations/quantization-and-mixed-precision/lesson/https://clawdemy.org/lessons/ai-foundations/quantization-and-mixed-precision/lesson/Lesson 4 of Phase 3 (How models learn from text: pretraining and scale) in Track 5. The Phase 3 closer. The fourth and last memory lever in the pretraining-engineering toolkit. Lower-precision floating-point representations cost less memory per parameter and run faster on hardware that supports them. Quantization converts a trained model from one precision to another; mixed precision training uses different precisions in different parts of one training step to keep the savings without losing the model in numerical noise.Wed, 06 May 2026 00:00:00 GMTClawdemy18:00false
Why scale matters: scaling laws and the Chinchilla rulehttps://clawdemy.org/lessons/ai-foundations/why-scale-matters/lesson/https://clawdemy.org/lessons/ai-foundations/why-scale-matters/lesson/Lesson 2 of Phase 3 (How models learn from text: pretraining and scale) in Track 5. Pretraining works because of scale, the lecturer says. This lesson gives that claim its empirical foundation. Two papers: the Kaplan scaling laws (loss falls predictably with more compute, more data, more parameters), and the Chinchilla compute-optimal rule (with a fixed budget, train on roughly 20 tokens per parameter). Together they explain why GPT-3 was undertrained and what changed when frontier labs rebalanced toward more data.Wed, 06 May 2026 00:00:00 GMTClawdemy20:00falseHow transformers scale to real-world data: sliding windows and KV-cache savingshttps://clawdemy.org/lessons/ai-foundations/attention-efficiency-tricks/lesson/https://clawdemy.org/lessons/ai-foundations/attention-efficiency-tricks/lesson/Lesson 6 of Phase 2 (How models think) in Track 5. Two efficiency problems, two distinct fixes. 
The compute problem (standard self-attention is O(n^2) in sequence length) is solved by sliding window attention, where each token attends to a local neighborhood and the receptive field grows with layer stacking the same way it does in CNNs. The memory problem (the KV cache that speeds up decoding gets large fast) is solved by sharing key and value projections across attention heads, in the MHA -> MQA -> GQA progression. The lesson keeps the two problems cleanly separate so you know what each fix targets, and closes with the 2026 context-window mainstream framing (1M-2M tokens normal, Llama 4 Scout's 10M as the exception).Sun, 03 May 2026 00:00:00 GMTClawdemy22:00falseLesson 6 of Phase 2 (How models think) in Track 5. Two efficiency problems, two distinct fixes. The compute problem (standard self-attention is O(n^2) in sequence length) is solved by sliding window attention, where each token attends to a local neighborhood and the receptive field grows with layer stacking the same way it does in CNNs. The memory problem (the KV cache that speeds up decoding gets large fast) is solved by sharing key and value projections across attention heads, in the MHA -> MQA -> GQA progression. The lesson keeps the two problems cleanly separate so you know what each fix targets, and closes with the 2026 context-window mainstream framing (1M-2M tokens normal, Llama 4 Scout's 10M as the exception).How these models keep improving: DistilBERT and RoBERTahttps://clawdemy.org/lessons/ai-foundations/bert-derivatives-distilbert-roberta/lesson/https://clawdemy.org/lessons/ai-foundations/bert-derivatives-distilbert-roberta/lesson/Lesson 10 of Phase 2 (How models think) in Track 5. The Phase 2 closer. BERT had three real limitations (context length, latency, pretraining complexity). 
DistilBERT addressed the latency-and-size limit through knowledge distillation (Hinton's soft-targets framing, KL divergence, halve the layer count and train the student to match BERT's output distribution; the result is about 40% smaller at roughly 66M parameters vs BERT-base's 110M with almost the same downstream performance). RoBERTa addressed the pretraining-complexity limit by dropping NSP entirely (it turned out not to help), masking dynamically on every epoch instead of once during data prep, and training MLM longer on much more data. Two distinct improvements, often combined as DistilRoBERTa for production systems where both speed and quality matter.Sun, 03 May 2026 00:00:00 GMTClawdemy18:00falseLesson 10 of Phase 2 (How models think) in Track 5. The Phase 2 closer. BERT had three real limitations (context length, latency, pretraining complexity). DistilBERT addressed the latency-and-size limit through knowledge distillation (Hinton's soft-targets framing, KL divergence, halve the layer count and train the student to match BERT's output distribution; the result is about 40% smaller at roughly 66M parameters vs BERT-base's 110M with almost the same downstream performance). RoBERTa addressed the pretraining-complexity limit by dropping NSP entirely (it turned out not to help), masking dynamically on every epoch instead of once during data prep, and training MLM longer on much more data. Two distinct improvements, often combined as DistilRoBERTa for production systems where both speed and quality matter.How transformers turn input into output: encoder-decoder and T5's span corruptionhttps://clawdemy.org/lessons/ai-foundations/encoder-decoder-and-t5-span-corruption/lesson/https://clawdemy.org/lessons/ai-foundations/encoder-decoder-and-t5-span-corruption/lesson/Lesson 7 of Phase 2 (How models think) in Track 5. 
The original 2017 transformer was an encoder-decoder, designed for machine translation: an encoder processes the input, a decoder generates the output, and cross-attention lets the decoder look back at the encoder while writing. Modern LLMs went decoder-only, but there is a real branch (T5, mT5, byT5) that kept the encoder-decoder shape and changed the pretraining objective from next-token prediction to span corruption (chunks masked from the input, decoder reconstructs them). The lesson walks the family, builds the span-corruption mechanism through the lecturer's teddy-bear example, names byT5's UTF-8 byte tokenizer (1-4 bytes per character), and closes on why decoder-only eventually dominated despite T5's strengths.Sun, 03 May 2026 00:00:00 GMTClawdemy18:00falseLesson 7 of Phase 2 (How models think) in Track 5. The original 2017 transformer was an encoder-decoder, designed for machine translation: an encoder processes the input, a decoder generates the output, and cross-attention lets the decoder look back at the encoder while writing. Modern LLMs went decoder-only, but there is a real branch (T5, mT5, byT5) that kept the encoder-decoder shape and changed the pretraining objective from next-token prediction to span corruption (chunks masked from the input, decoder reconstructs them). The lesson walks the family, builds the span-corruption mechanism through the lecturer's teddy-bear example, names byT5's UTF-8 byte tokenizer (1-4 bytes per character), and closes on why decoder-only eventually dominated despite T5's strengths.How prompting works: mechanics, system prompts, and prompt injectionhttps://clawdemy.org/lessons/ai-foundations/how-prompting-works/lesson/https://clawdemy.org/lessons/ai-foundations/how-prompting-works/lesson/Lesson 2 of Phase 5 (How we steer models at inference) in Track 5. Modern AI assistants follow instructions because they were post-trained to. 
This lesson covers what to do with that: what a prompt actually is at the token level (just input tokens conditioning the next-token prediction loop), what a system prompt is (a high-priority hint backed by an API contract plus a training-time bias, not a hard sandbox), and why every instruction-tuned model is structurally vulnerable to prompt injection (the same property that makes the model useful, instruction-following over input tokens, is what makes it follow injected instructions). Few-shot prompting and chain-of-thought get their own dedicated lessons next.Sun, 03 May 2026 00:00:00 GMTClawdemy14:00falseLesson 2 of Phase 5 (How we steer models at inference) in Track 5. Modern AI assistants follow instructions because they were post-trained to. This lesson covers what to do with that: what a prompt actually is at the token level (just input tokens conditioning the next-token prediction loop), what a system prompt is (a high-priority hint backed by an API contract plus a training-time bias, not a hard sandbox), and why every instruction-tuned model is structurally vulnerable to prompt injection (the same property that makes the model useful, instruction-following over input tokens, is what makes it follow injected instructions). Few-shot prompting and chain-of-thought get their own dedicated lessons next.How RAG works: feeding the model what it does not already knowhttps://clawdemy.org/lessons/ai-foundations/how-rag-works/lesson/https://clawdemy.org/lessons/ai-foundations/how-rag-works/lesson/A lesson in Phase 6 (How models reason and act) in Track 5. The earlier lessons covered how the model works and how to talk to it. This one covers how to give it knowledge it does not already have. 
The lesson walks the retrieval-augmented generation pipeline (chunking, embedding, retrieval, prompt construction, grounded generation), explains the bi-encoder + cross-encoder two-stage retrieval pattern that production systems use (bi-encoder for fast recall over millions of chunks, cross-encoder for precision over a few hundred), names HyDE (hypothetical document embeddings) as the standard fix for the question-vs-answer shape mismatch, frames how 1M-2M-token context windows complement rather than displace RAG in 2026, and closes on indirect prompt injection as the structural security issue RAG creates and why it is harder to defend against than direct injection.Sun, 03 May 2026 00:00:00 GMTClawdemy25:00falseA lesson in Phase 6 (How models reason and act) in Track 5. The earlier lessons covered how the model works and how to talk to it. This one covers how to give it knowledge it does not already have. The lesson walks the retrieval-augmented generation pipeline (chunking, embedding, retrieval, prompt construction, grounded generation), explains the bi-encoder + cross-encoder two-stage retrieval pattern that production systems use (bi-encoder for fast recall over millions of chunks, cross-encoder for precision over a few hundred), names HyDE (hypothetical document embeddings) as the standard fix for the question-vs-answer shape mismatch, frames how 1M-2M-token context windows complement rather than displace RAG in 2026, and closes on indirect prompt injection as the structural security issue RAG creates and why it is harder to defend against than direct injection.Why transformers need stability to learn: LayerNorm, pre-norm, and RMSNormhttps://clawdemy.org/lessons/ai-foundations/layer-norm-and-rmsnorm/lesson/https://clawdemy.org/lessons/ai-foundations/layer-norm-and-rmsnorm/lesson/Lesson 5 of Phase 2 (How models think) in Track 5. Two changes the field made to the original transformer's normalization. 
The first is where the LayerNorm sits relative to the sub-layer (post-norm 2017 -> pre-norm modern). The second is what the normalization computes (LayerNorm -> RMSNorm, which skips the mean subtraction and the learned shift). Both show up explicitly in modern model cards. The lesson also corrects the historical motivation: the original BatchNorm internal-covariate-shift framing did not survive scrutiny (Santurkar et al. 2018); the currently accepted explanation is that normalization smooths the loss landscape, not that it reduces ICS.Sun, 03 May 2026 00:00:00 GMTClawdemy18:00falseLesson 5 of Phase 2 (How models think) in Track 5. Two changes the field made to the original transformer's normalization. The first is where the LayerNorm sits relative to the sub-layer (post-norm 2017 -> pre-norm modern). The second is what the normalization computes (LayerNorm -> RMSNorm, which skips the mean subtraction and the learned shift). Both show up explicitly in modern model cards. The lesson also corrects the historical motivation: the original BatchNorm internal-covariate-shift framing did not survive scrutiny (Santurkar et al. 2018); the currently accepted explanation is that normalization smooths the loss landscape, not that it reduces ICS.How modern models inject position into attention (RoPE)https://clawdemy.org/lessons/ai-foundations/position-embeddings-and-rope/lesson/https://clawdemy.org/lessons/ai-foundations/position-embeddings-and-rope/lesson/The Phase 2 lesson on the architectural shift from input-added position embeddings (sinusoidal, learned) to attention-injected ones. Covers the 'closer-tokens-more-similar' intuition that motivates the shift, the two intermediate schemes (T5 relative bias, ALiBi), and the RoPE deep-dive. 
Phase 1 covered the original 2017 schemes; this lesson covers what modern LLMs do differently.Sun, 03 May 2026 00:00:00 GMTClawdemy20:00falseThe Phase 2 lesson on the architectural shift from input-added position embeddings (sinusoidal, learned) to attention-injected ones. Covers the 'closer-tokens-more-similar' intuition that motivates the shift, the two intermediate schemes (T5 relative bias, ALiBi), and the RoPE deep-dive. Phase 1 covered the original 2017 schemes; this lesson covers what modern LLMs do differently.How instruction tuning makes a model helpfulhttps://clawdemy.org/lessons/ai-foundations/how-models-learn-to-be-helpful/lesson/https://clawdemy.org/lessons/ai-foundations/how-models-learn-to-be-helpful/lesson/Lesson 1 of Phase 4 (How models learn to be helpful) in Track 5. A pretrained transformer is a great autocompleter, not an assistant. Supervised fine-tuning (SFT) is the bridge: same next-token-prediction objective as pretraining, but on a much smaller curated set of instruction-response examples (typically thousands to hundreds of thousands rather than trillions). The lesson covers what SFT changes (response shape: when the model sees an instruction, the most-likely continuation is now a response rather than more text in the same style) and what stays the same (the knowledge already in the weights), why a few high-quality examples can transform surface behavior, where parameter-efficient methods like LoRA fit, what kind of model you have after SFT (instruction-following but not yet preference-aligned), and the structural limitation (no negative signal: SFT can teach what to predict but not what NOT to predict) that makes the next lesson on preference data necessary.Thu, 30 Apr 2026 00:00:00 GMTClawdemy18:00falseLesson 1 of Phase 4 (How models learn to be helpful) in Track 5. A pretrained transformer is a great autocompleter, not an assistant. 
Supervised fine-tuning (SFT) is the bridge: same next-token-prediction objective as pretraining, but on a much smaller curated set of instruction-response examples (typically thousands to hundreds of thousands rather than trillions). The lesson covers what SFT changes (response shape: when the model sees an instruction, the most-likely continuation is now a response rather than more text in the same style) and what stays the same (the knowledge already in the weights), why a few high-quality examples can transform surface behavior, where parameter-efficient methods like LoRA fit, what kind of model you have after SFT (instruction-following but not yet preference-aligned), and the structural limitation (no negative signal: SFT can teach what to predict but not what NOT to predict) that makes the next lesson on preference data necessary.Token by token: how a transformer generates texthttps://clawdemy.org/lessons/ai-foundations/how-text-is-generated/lesson/https://clawdemy.org/lessons/ai-foundations/how-text-is-generated/lesson/Lesson 1 of Phase 5 (How we steer models at inference) in Track 5. This one shows what a trained transformer actually does at runtime: predict the next token, sample from a distribution, append, then repeat. 
The lesson walks the autoregressive prediction loop end-to-end (forward pass, logits, softmax, sample, append), compares the decoding strategies that shape the sample step (greedy, pure sampling, top-k, top-p, plus temperature as a separate dial), explains KV caching honestly (it removes the recompute cost that would have made naive generation grow quadratically with output length, so per-token cost grows linearly with cache length, not constant; the dominant constant per-token model cost is what makes streaming feel steady until contexts get long), and closes on speculative decoding as the production layer on top (TensorRT-LLM, vLLM, SGLang ship it natively in 2026).Wed, 29 Apr 2026 00:00:00 GMTClawdemy22:00falseLesson 1 of Phase 5 (How we steer models at inference) in Track 5. This one shows what a trained transformer actually does at runtime: predict the next token, sample from a distribution, append, then repeat. The lesson walks the autoregressive prediction loop end-to-end (forward pass, logits, softmax, sample, append), compares the decoding strategies that shape the sample step (greedy, pure sampling, top-k, top-p, plus temperature as a separate dial), explains KV caching honestly (it removes the recompute cost that would have made naive generation grow quadratically with output length, so per-token cost grows linearly with cache length, not constant; the dominant constant per-token model cost is what makes streaming feel steady until contexts get long), and closes on speculative decoding as the production layer on top (TensorRT-LLM, vLLM, SGLang ship it natively in 2026).Multi-head attention: many lenses on the same sentencehttps://clawdemy.org/lessons/ai-foundations/multi-head-attention/lesson/https://clawdemy.org/lessons/ai-foundations/multi-head-attention/lesson/Lesson 2 of Phase 2 (How models think) in Track 5. 
One attention head can only weight every token one way per token, so it has to choose which structure to track in a sentence that has many running through it at once. Real transformers run 8 to 32 heads in parallel, each with its own Q, K, V projections, looking at the same sentence through a different lens. The lesson builds the one-head-isn't-enough intuition (back to the animal-street-it example), walks the split-run-concatenate pattern (h smaller heads each at d_k = d_model / h, concatenated and projected through W_O), traces the dimension flow on a 12-head 768-dim example, and closes with what real model specs mean by head counts and the 2026 production variants beyond vanilla MHA (MQA, GQA, MLA).Wed, 29 Apr 2026 00:00:00 GMTClawdemy22:00falseLesson 2 of Phase 2 (How models think) in Track 5. One attention head can only weight every token one way per token, so it has to choose which structure to track in a sentence that has many running through it at once. Real transformers run 8 to 32 heads in parallel, each with its own Q, K, V projections, looking at the same sentence through a different lens. The lesson builds the one-head-isn't-enough intuition (back to the animal-street-it example), walks the split-run-concatenate pattern (h smaller heads each at d_k = d_model / h, concatenated and projected through W_O), traces the dimension flow on a 12-head 768-dim example, and closes with what real model specs mean by head counts and the 2026 production variants beyond vanilla MHA (MQA, GQA, MLA).The transformer block: where everything comes togetherhttps://clawdemy.org/lessons/ai-foundations/transformer-block/lesson/https://clawdemy.org/lessons/ai-foundations/transformer-block/lesson/Lesson 3 of Phase 2 (How models think) in Track 5. Tokens, embeddings, attention, multi-head attention. All the load-bearing pieces. This lesson assembles them into a real transformer block: the repeating unit stacked many times to build a real model. 
Covers the four wrapping pieces (position encoding, feed-forward network, residual connections, layer normalization), why each one is structurally required (attention alone is order-blind, has no per-token nonlinearity, suffers rank collapse without an FFN, and cannot be stacked deep without residuals + normalization), the Pre-LN vs Post-LN ordering (Pre-LN is the modern default; the original 2017 paper used Post-LN), and what every component on the canonical 'Attention Is All You Need' architecture diagram represents.Wed, 29 Apr 2026 00:00:00 GMTClawdemy25:00falseLesson 3 of Phase 2 (How models think) in Track 5. Tokens, embeddings, attention, multi-head attention. All the load-bearing pieces. This lesson assembles them into a real transformer block: the repeating unit stacked many times to build a real model. Covers the four wrapping pieces (position encoding, feed-forward network, residual connections, layer normalization), why each one is structurally required (attention alone is order-blind, has no per-token nonlinearity, suffers rank collapse without an FFN, and cannot be stacked deep without residuals + normalization), the Pre-LN vs Post-LN ordering (Pre-LN is the modern default; the original 2017 paper used Post-LN), and what every component on the canonical 'Attention Is All You Need' architecture diagram represents.Embeddings: how words become vectors with meaninghttps://clawdemy.org/lessons/ai-foundations/how-words-become-vectors/lesson/https://clawdemy.org/lessons/ai-foundations/how-words-become-vectors/lesson/Lesson 2 of Phase 1 (How models read text) in Track 5. Token IDs are just arbitrary numbers with no meaning attached. Embeddings are the dense vectors that fix that, by carrying meaning into the model as geometry: similar words are close together on a high-dimensional map, and certain consistent kinds of difference (gender, tense, country and capital) point along consistent directions. 
The lesson builds the words-on-a-map intuition, walks through the lookup-table mechanism (one row per token, the embedding matrix W_E), pays off the king-queen demonstration as actual vector arithmetic (Mikolov et al., Word2Vec 2013, predating transformers by four years), and lands on why every modern semantic-search and retrieval-augmented system in production runs on this idea.Tue, 28 Apr 2026 00:00:00 GMTClawdemy22:00falseLesson 2 of Phase 1 (How models read text) in Track 5. Token IDs are just arbitrary numbers with no meaning attached. Embeddings are the dense vectors that fix that, by carrying meaning into the model as geometry: similar words are close together on a high-dimensional map, and certain consistent kinds of difference (gender, tense, country and capital) point along consistent directions. The lesson builds the words-on-a-map intuition, walks through the lookup-table mechanism (one row per token, the embedding matrix W_E), pays off the king-queen demonstration as actual vector arithmetic (Mikolov et al., Word2Vec 2013, predating transformers by four years), and lands on why every modern semantic-search and retrieval-augmented system in production runs on this idea.How AI reads: turning text into tokenshttps://clawdemy.org/lessons/ai-foundations/how-ai-reads-tokens/lesson/https://clawdemy.org/lessons/ai-foundations/how-ai-reads-tokens/lesson/The opener of Phase 1 (How models read text) in Track 5. A transformer never sees raw text; it sees a sequence of integer IDs called tokens. This lesson walks why neither whole words nor individual characters work as units, what byte-pair encoding does (with one merge worked by hand), why a token is atomic to the model (the structural reason older models fail at letter-counting), and how special tokens like BOS, EOS, and chat-role markers create the prompt-injection surface that becomes load-bearing later.Mon, 27 Apr 2026 00:00:00 GMTClawdemy20:00falseThe opener of Phase 1 (How models read text) in Track 5. 
A transformer never sees raw text; it sees a sequence of integer IDs called tokens. This lesson walks why neither whole words nor individual characters work as units, what byte-pair encoding does (with one merge worked by hand), why a token is atomic to the model (the structural reason older models fail at letter-counting), and how special tokens like BOS, EOS, and chat-role markers create the prompt-injection surface that becomes load-bearing later.Inside the transformer: how attention decides which word goes with whichhttps://clawdemy.org/lessons/ai-foundations/how-attention-works/lesson/https://clawdemy.org/lessons/ai-foundations/how-attention-works/lesson/The opener of Phase 2 (How models think) in Track 5. Self-attention is how a transformer figures out, for every word, which other words in the sentence it should be paying attention to. The lesson opens on the canonical *the animal didn't cross the street because it was too tired* example (your reading brain connects *it* to *animal*, not *street*, without effort), names what RNNs structurally couldn't do (long-range decay, no parallelism), builds the query-key-value (Q-K-V) library analogy, walks the three-step formula (similarity / scale by sqrt(d_k) / softmax-weighted sum) without burying you in linear algebra, distinguishes self-attention from cross-attention by which sequence supplies each vector, and works one full attention computation by hand on three tokens so the formula stops being a black box.Sun, 26 Apr 2026 00:00:00 GMTClawdemy25:00falseThe opener of Phase 2 (How models think) in Track 5. Self-attention is how a transformer figures out, for every word, which other words in the sentence it should be paying attention to. 
The lesson opens on the canonical *the animal didn't cross the street because it was too tired* example (your reading brain connects *it* to *animal*, not *street*, without effort), names what RNNs structurally couldn't do (long-range decay, no parallelism), builds the query-key-value (Q-K-V) library analogy, walks the three-step formula (similarity / scale by sqrt(d_k) / softmax-weighted sum) without burying you in linear algebra, distinguishes self-attention from cross-attention by which sequence supplies each vector, and works one full attention computation by hand on three tokens so the formula stops being a black box.AI won't replace you. But it will expose you.https://clawdemy.org/lessons/getting-started/ai-wont-replace-you/lesson/https://clawdemy.org/lessons/getting-started/ai-wont-replace-you/lesson/The mission lesson. AI amplifies capability, but only if there's something worth amplifying. That something is your human delta. Start here.Wed, 22 Apr 2026 00:00:00 GMTClawdemy12:00falseThe mission lesson. AI amplifies capability, but only if there's something worth amplifying. That something is your human delta. Start here.