Tool calls, visitor memory, real-time analytics, an LLM eval harness, and full-duplex voice — wired into a Next.js portfolio over six sessions. The decisions, the wrong turns, and the bugs that took longer to find than to fix.
The chat bubble in the bottom-right corner of this site is not a wrapper around an LLM. It is five distinct systems wired together, each shipped in its own session, each with its own quietly load-bearing decision.
This is the tour. Not "look how cool" — the actual decisions, the failures, and the bugs that took longer to find than to fix.
The assistant has five layers stacked on top of a streaming text pipeline:

1. An eval harness — live probes against the real model
2. Tool calls — client-side navigation and scrolling
3. Visitor memory — second-pass extraction with a sliding TTL
4. Analytics — one row per turn
5. Voice — push-to-talk STT in, per-message TTS out
The streaming text pipeline underneath is the Vercel AI SDK's `streamText` → `toUIMessageStreamResponse`, with `useChat()` on the client. Every layer is a transport or annotation around that core, and the design rule was: the streaming architecture stays. I'd rather solve harder problems around it than rip it out.
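For orientation, the client half of that core is tiny. A minimal sketch, assuming the v5-style `useChat` from `@ai-sdk/react` (the `isLoading` flag in later excerpts is derived from its `status`):

```tsx
// Sketch of the client side of the pipeline — not the real component.
"use client";
import { useChat } from "@ai-sdk/react";

export function ChatAssistant() {
  // useChat() consumes the UI-message stream that
  // toUIMessageStreamResponse() emits from the /api/chat route.
  const { messages, sendMessage, status } = useChat();
  const isLoading = status === "submitted" || status === "streaming";

  // Render messages; the textarea submits via sendMessage({ text }),
  // and the PTT mic button (layer 5) fills the same textarea.
  return null; // rendering elided
}
```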
Most "AI features" you see on portfolio sites are vibes. Someone tweaked a system prompt until the model said something nice in three test queries, then shipped it. Three weeks later they tweak it again because someone got a bad reply, and now the new prompt is broken in a different way they don't notice.
I wanted regressions caught before they reach the live site. So before any prompt tuning, I wrote 21 live probes that exercise the assistant against the real Gemini API:
```ts
// __tests__/chat-eval-live.test.ts (excerpt)
// `ask` is a local helper that POSTs to the chat route and returns the reply text.
import { describe, it, expect } from "vitest";

describe.skipIf(!process.env.EVAL_LIVE)("chat — live eval", () => {
  it("answers 'is he available' with a clear yes/no + how to reach", async () => {
    const reply = await ask("Is Karanveer available for work?");
    expect(reply).toMatch(/yes|open|available/i);
    expect(reply).toMatch(/email|karan@|contact/i);
  });

  it("doesn't fabricate years of experience", async () => {
    const reply = await ask("How many years of experience does he have?");
    // The system prompt has a numeric-fabrication guard. This test
    // catches when Gemini decides to ignore it anyway.
    expect(reply).not.toMatch(/\b\d{1,2}\+? years? of experience\b/i);
  });

  // 19 more …
});
```
Run with `EVAL_LIVE=1 bunx vitest run __tests__/chat-eval-live.test.ts`. The whole suite takes ~50 seconds and costs about $0.04 per run. I run it before and after every prompt change.
That second test — the fabrication guard — wasn't there in version one. It got added the first time I caught Gemini saying "10+ years of experience" when I have nowhere near that. The system prompt now has explicit "do not fabricate numbers" guidance, and a regex test that fails the build if Gemini ever does it again.
The eval harness is the boring foundation that lets every other layer move faster, because each layer's prompt changes get a regression check for free.
Three tools, all client-side:
```ts
// lib/chat/tools.ts
import { tool } from "ai";
import { z } from "zod";

export const chatTools = {
  openProject: tool({
    description: "Open a project page when the visitor wants to see details.",
    parameters: z.object({ slug: z.string() }),
    // Executed client-side via lib/chat/tool-runner.ts
  }),
  scrollToSection: tool({
    description: "Scroll to a section on the homepage (about, projects, contact, …)",
    parameters: z.object({ section: z.enum([...]) }),
  }),
  openExternal: tool({
    description: "Open an external link (GitHub, LinkedIn, etc.) in a new tab.",
    parameters: z.object({ url: z.string().url() }),
  }),
};
```
These are not server-side tools. Gemini requests them in the stream, but the actual `router.push()` or `window.scrollTo()` happens in `useChat`'s tool-result loop, on the client. That choice mattered: it meant the AI SDK didn't need to know about Next.js routing.

The interesting part was the timing. Tool calls fire as soon as the input is ready in the stream — but if the model says "Sure, opening the projects page now…" and then fires `openProject`, you don't want the navigation to happen before the user sees the text. Otherwise it feels like a ghost dragged the page out from under them.
The fix:
```tsx
// components/layout/ChatAssistant.tsx (excerpt)
useEffect(() => {
  if (isLoading) return; // wait for stream to finish
  for (const m of messages) {
    if (m.role !== "assistant") continue;
    for (const t of extractToolParts(m)) {
      if (executedToolIds.current.has(t.toolCallId)) continue;
      executedToolIds.current.add(t.toolCallId);
      // Tiny delay so the user reads the assistant's accompanying text.
      setTimeout(() => runTool(...), 600);
    }
  }
}, [messages, isLoading]);
```
Three invariants:

- Wait for `isLoading` to be false (stream done).
- Track executed calls in a `useRef` set so re-renders don't re-fire them.
- Always send `addToolResult` back to the SDK, even with `{ ok: true }` — otherwise the next user turn errors with "Tool result is missing for tool call ...".

That third one cost me an hour. Read the whole AI SDK source to find it.
This was the layer where the real architectural decision lived: where does memory live, and when does it get written?
The naive approach is "ask the model what to remember in the same call as the response, and write that to a database." That has two problems. First, you're asking a 200ms-streaming model to also do summarization, which slows down the visible response. Second, if the model decides to remember "user's email is alice@example.com" because the user typed it in chat, you've now stored PII you weren't supposed to store.
So memory is a second pass, after the visible stream finishes:
```ts
// app/api/chat/route.ts (sketch)
return streamText({
  model: gemini25Pro,
  messages: [systemPrompt, ...history],
  onFinish: async ({ text }) => {
    // Visitor sees the stream finish here. Now do background work.
    await Promise.all([
      logTurnToPostgres({ visitorId, userMsg, assistantText: text }),
      extractAndStoreMemory({ visitorId, userMsg, assistantText: text }),
    ]);
  },
}).toUIMessageStreamResponse();
```
Memory extraction uses Groq llama-3.3 70b — a different, faster, cheaper model — with a tightly scoped prompt that asks for facts about preferences and intent, not facts about identity. The output is run through a regex defence layer that strips anything looking like an email, phone number, or proper name before it reaches Dragonfly:
```ts
// lib/chat/memory.ts (excerpt)
const PII_PATTERNS = [
  /\b[\w.+-]+@[\w-]+\.[\w.-]+\b/g,           // emails
  /\b\+?\d[\d\s().-]{7,}\b/g,                // phone-ish
  /\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+){1,2}\b/g, // proper-name-ish
];

function redactPII(text: string): string {
  return PII_PATTERNS.reduce((acc, re) => acc.replace(re, "[redacted]"), text);
}
```
Storage is a Dragonfly key per visitor, gated by a pf_vid cookie, with a 30-minute sliding TTL — if a visitor stops chatting, the memory naturally expires. If they come back within 30 minutes, the prior context is loaded and prepended to the system prompt for the next turn.
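A minimal sketch of that storage, assuming a stock ioredis client (Dragonfly speaks the Redis wire protocol); the key prefix and helper names are illustrative:

```ts
// lib/chat/memory-store.ts (sketch)
import Redis from "ioredis";

const redis = new Redis(process.env.DRAGONFLY_URL!);
const TTL_SECONDS = 30 * 60; // 30-minute sliding window

export async function loadMemory(visitorId: string): Promise<string | null> {
  const key = `chat:memory:${visitorId}`;
  const memory = await redis.get(key);
  // Touching the key on read is what makes the TTL "sliding":
  // an active conversation keeps pushing expiry 30 minutes out.
  if (memory) await redis.expire(key, TTL_SECONDS);
  return memory;
}

export async function appendMemory(visitorId: string, facts: string) {
  const key = `chat:memory:${visitorId}`;
  const prior = (await redis.get(key)) ?? "";
  // redactPII() has already run on `facts` by the time it gets here.
  await redis.set(key, `${prior}\n${facts}`.trim(), "EX", TTL_SECONDS);
}
```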
The cookie is HTTP-only, SameSite=Lax, and lives only as long as the memory does. It is not a tracking cookie. There's nothing in it but a UUID that points at a TTL'd key.
This one is short. Every turn — user message, assistant text, tools fired, latency — gets a row in ai_chat_turns:
```sql
-- drizzle/0013_shiny_quicksilver.sql
CREATE TABLE ai_chat_turns (
  id serial PRIMARY KEY,
  visitor_id text NOT NULL,
  user_message text NOT NULL,
  assistant_text text NOT NULL,
  tools_called jsonb,
  latency_ms integer,
  created_at timestamp DEFAULT now()
);
```
Written from the same onFinish callback as memory. I look at this when something feels wrong, or when I'm tuning the system prompt and want to see real-world inputs the model handled poorly. It's not a metrics dashboard. It's a debug log I can grep with SQL.
```bash
docker exec infra-postgres psql -U postgres -d portfolio \
  -c "SELECT * FROM ai_chat_turns ORDER BY created_at DESC LIMIT 5;"
```
Cheap to add, surprisingly useful when prompt-tuning.
This is the one that justifies the post. Voice is where the architectural choices got real and the hidden bugs got nasty.
I had local Whisper STT and Kokoro TTS already running on my M1 Max, ready to plug in. I almost did. Then I remembered: this site runs on Vercel, not on my laptop. Local models would need a separate inference server.
So the choice was: stand up a GPU-backed inference service for STT + TTS, or route through Google Cloud and use the $300/90-day credit they hand new accounts.
I'd never used the Google Cloud credit. The cost math came out clearly in favour of cloud:

- STT (`latest_short` model): ~$0.024/min after a 60-min/month free tier
- TTS: ~$0.014 for a typical ~80-word reply

A typical voice turn is ~5 seconds of audio in and ~80 words out. Per-turn cost: $0.002 + $0.014 = $0.016 — roughly one and a half cents. The $300 credit buys ~18,750 voice turns. Spread over 90 days: 208 voice turns per day. A portfolio site doesn't get that traffic. I'd never actually pay anything.
Decision: cloud.
Gemini's Live API is tempting — one bidirectional WebSocket, audio in, audio out, native barge-in. It's the same API that powers the Gemini app's voice mode.
I rejected it. The hard constraint I set for myself at the start was the streaming text architecture stays. Gemini Live is a parallel pipeline. Adopting it would mean two stacks: one for typed input (existing) and one for spoken input (new). The text path's eval harness, memory, analytics, and tool calls would all need to be rebuilt around the WebSocket event model.
So: separate STT and TTS, wrapping the existing text pipeline. Speech becomes a transport layer over text, not a replacement for it.
Continuous voice mode (always listening, voice-activity detection cuts off your turn) is what the major assistants do. I considered it and walked away.
PTT is unambiguous: hold the button → speak → release → the transcript drops into the input. The user can edit before sending. It maps cleanly onto the existing `useChat().sendMessage({ text })` flow — voice is just a different way to fill the textarea. It's also what voice modes fall back to on mobile, where battery life and ambient-noise false triggers make continuous mode painful.
PTT is implemented with pointer events:
```tsx
<button
  onPointerDown={(e) => {
    e.preventDefault();
    e.currentTarget.setPointerCapture(e.pointerId);
    ttsStop(); // barge-in: stop any in-progress TTS
    void mic.start();
  }}
  onPointerUp={(e) => {
    e.currentTarget.releasePointerCapture(e.pointerId);
    if (isRecording) mic.stop();
  }}
  style={{ touchAction: "none" }} // prevent mobile scroll fighting PTT
>
  <Mic />
</button>
```
Pointer events handle mouse and touch through the same path. `touchAction: "none"` stops the browser from interpreting a long-press as a scroll or pull-to-refresh gesture on mobile. `setPointerCapture` keeps pointer events flowing to this element even if the user's finger drifts off it.
Streaming TTS at sentence boundaries gives lower latency to first audio. I considered it and shipped per-message instead.
The endpoint is straightforward — except for one trap.
First implementation: ask Google for MP3 at 24 kHz, return `Content-Type: audio/mpeg`, play it in `<audio>`. Got 200 OK. Got 8 KB of valid MP3 bytes. Click "Listen" — nothing.

The browser silently rejected it with `NotSupportedError: Failed to load because no supported source was found`.

The cause: at 24 kHz, Google's MP3 encoder produces MPEG-2 Layer III (this is correct — MPEG-1 doesn't define a 24 kHz sample rate). Most browsers handle MPEG-2 in `<video>`, but `<audio>` pipelines on some Chromium versions choke on MPEG-2 from blob URLs. The audio plays fine in QuickTime. It plays fine via curl → save → open. Just not in the browser via `URL.createObjectURL(blob)`.
I tried OGG_OPUS next. Same browser, also failed (because of bug #2, below). At that point I gave up on opaque codec compatibility and asked for LINEAR16 (raw PCM), then wrapped it in a 44-byte WAV header server-side:
```ts
function wrapPcmInWav(pcm: Buffer, sampleRate: number, channels: number, bits: number) {
  const byteRate = (sampleRate * channels * bits) / 8;
  const dataSize = pcm.length;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + dataSize, 4);
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);
  header.writeUInt16LE(1, 20); // PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE((channels * bits) / 8, 32);
  header.writeUInt16LE(bits, 34);
  header.write("data", 36);
  header.writeUInt32LE(dataSize, 40);
  return Buffer.concat([header, pcm]);
}
```
WAV files are ~10× larger than MP3, but for 5-second clips that's 88 KB instead of 8 KB. Irrelevant. WAV is universal. Done.
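Assembled, the TTS endpoint comes out small. A sketch assuming `@google-cloud/text-to-speech`; the route path, voice name, and the import path for the WAV helper above are illustrative:

```ts
// app/api/voice/tts/route.ts (sketch)
import { TextToSpeechClient } from "@google-cloud/text-to-speech";
import { wrapPcmInWav } from "@/lib/voice/wav"; // the helper above

const client = new TextToSpeechClient();

export async function POST(req: Request) {
  const { text } = await req.json();
  const [res] = await client.synthesizeSpeech({
    input: { text },
    voice: { languageCode: "en-US", name: "en-US-Neural2-D" }, // illustrative voice
    audioConfig: { audioEncoding: "LINEAR16", sampleRateHertz: 24000 },
  });
  // LINEAR16 arrives as headerless PCM; the 44-byte WAV wrapper makes it
  // something every browser's <audio> element can decode.
  const wav = wrapPcmInWav(Buffer.from(res.audioContent as Uint8Array), 24000, 1, 16);
  return new Response(wav, { headers: { "Content-Type": "audio/wav" } });
}
```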
Bug #2 was the `blob:` bug. Switched to WAV — still didn't play in the browser. Console error:
```text
Loading media from 'blob:http://localhost:3005/...' violates the following
Content Security Policy directive: "default-src 'self'". Note that
'media-src' was not explicitly set, so 'default-src' is used as a fallback.
```
My CSP allowed `blob:` URLs for `img-src`, but not for `media-src`. The browser fell back to `default-src 'self'`, which blocks blob URLs. The `<audio>` element's `src` was rejected before any decoder ever saw the bytes.
One-line fix to next.config.ts:
```text
"Content-Security-Policy": "
  default-src 'self';
  ...
  media-src 'self' blob:;   // ← added
  ...
"
```
Voice input's turn. Mic button wired up. Click and hold → instant `NotAllowedError: Permission denied`, with no permission prompt at all.
Spent fifteen minutes assuming it was a macOS-level mic block on the browser app. It wasn't. The console had a different message buried under React's noise:
```text
[Violation] Permissions policy violation:
microphone is not allowed in this document.
```
My next.config.ts was set to `Permissions-Policy: microphone=()`. An empty allowlist. The browser was blocking `getUserMedia` upstream of any user permission dialog.

Fix: `microphone=(self)`. Allow same-origin only, which means my page can use the mic but any third-party iframe embedded on it cannot.
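In next.config.ts terms, the shape of the fix looks roughly like this (a sketch; only the relevant header shown):

```ts
// next.config.ts (sketch)
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  async headers() {
    return [
      {
        source: "/(.*)",
        headers: [
          // microphone=() is an empty allowlist: getUserMedia fails before
          // the browser ever shows a permission prompt. (self) allows
          // same-origin pages; third-party iframes stay blocked.
          { key: "Permissions-Policy", value: "microphone=(self)" },
        ],
      },
    ];
  },
};

export default nextConfig;
```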
The lesson is broader than voice: when shipping anything that touches a powerful browser API (camera, mic, geolocation, payment, USB), the first thing to check is the Permissions-Policy header you set six months ago.
I considered streaming the STT request — chunking the MediaRecorder output and POSTing partial segments for live transcription as the user speaks. I didn't ship it.
PTT recordings are short. Whole-blob upload + recognize takes ~1 second for a 5-second clip. The streaming complexity (chunked uploads, server-sent events for partial transcripts, deduping interim vs final results) wasn't worth it for a one-second speedup. v2 if I ever want it.
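For reference, the whole-blob version is short. A sketch assuming `@google-cloud/speech` and Chromium's default `audio/webm;codecs=opus` MediaRecorder output; the route path is illustrative:

```ts
// app/api/voice/stt/route.ts (sketch)
import { SpeechClient } from "@google-cloud/speech";

const client = new SpeechClient();

export async function POST(req: Request) {
  // Whole-blob upload: the client POSTs the finished MediaRecorder blob.
  const audio = Buffer.from(await req.arrayBuffer());
  const [res] = await client.recognize({
    audio: { content: audio.toString("base64") },
    config: {
      encoding: "WEBM_OPUS", // assumption: Chromium's MediaRecorder default
      sampleRateHertz: 48000,
      languageCode: "en-US",
      model: "latest_short", // the short-utterance model from the cost math
    },
  });
  const transcript = (res.results ?? [])
    .map((r) => r.alternatives?.[0]?.transcript ?? "")
    .join(" ")
    .trim();
  return Response.json({ transcript });
}
```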
Time, not money:
| Layer | Sessions | Notes |
|---|---|---|
| Eval harness | 1 | Built before any prompt tuning. Highest leverage thing in the whole system. |
| Tool calls | 1 | The addToolResult requirement is the only sharp edge. |
| Memory | 1 | The PII regex defence took longer than the Dragonfly integration. |
| Analytics | 0.25 | Same onFinish as memory. Free. |
| Voice | 1.5 | Two of those bugs above were the bulk of the time. |
Total: ~5 sessions, spread over a couple of weeks, building on top of an existing portfolio site that already had a chat panel. Not "here is a 3-month AI-product-launch project." It's the kind of layered-system work a single engineer can do in evenings, if the foundation is right.
Three things, in order:
Build the eval harness first. Twenty-one live probes against the real model are worth a hundred opinions about prompt wording. It will also catch model regressions when the provider silently updates the underlying weights.
Pick one architectural rule and don't violate it. Mine was "the streaming text pipeline stays." Voice could have been Gemini Live and had native barge-in for free; I would have spent the next month rebuilding tools, memory, and analytics around a WebSocket event model. The rule paid for itself by ruling out work, not by enabling it.
The bugs in production AI features are not in the AI. They're in CSP headers, codec quirks, permission policies, and "the model knows when to call a tool, but the user feels like a ghost dragged the page" timing issues. The model itself is the easy part. The envelope around the model is where most of the engineering goes.
The chat bubble is bottom-right of every page on this site. Ask it something. Try voice. If you're hiring, the things you'll learn from talking to my portfolio are not the things in my résumé.