How we built meetings on LiveKit and Deepgram

A call is where decisions get made, and almost always the one place in your stack that leaves nothing behind. We built meetings and huddles into Plain so the call leaves a durable, structured record instead: LiveKit carries the room, and Deepgram turns speech into a stream of speaking turns, each one a first-class row the rest of the platform can point at, query, and link to. Here's how it fits together, and why the speaking turn is the primitive that makes it cheap.

Jamie Davenport

13 min

Every surface in a developer platform leaves something behind. An issue has a number. A PR has a diff and a merge commit. A doc has a URL. You can link to them, query them, come back to them in six months, because the platform is built to produce durable records.

Then there's the call. The call is where you actually decided to build the thing, cut the scope, or agreed the bug was real. And when it ends, all it leaves is a calendar event that already happened. The highest-bandwidth surface you have, the one where the decisions are made, is the one that produces no record at all.

So someone volunteers to take notes. Someone says "I'll file the issues after." And the action items get carried, by hand and from memory, out of the conversation and into the tracker later, if they get carried at all. That hand-off is a context switch: you leave the place where the decision happened, open another tool, and reconstruct what was just said well enough to write it down.

We wanted Plain's meetings to close that gap, so a call leaves a record the platform can actually read: not a transcript parked in another tool, but structured data in the same system as your issues, PRs, and docs. This post is how we built it: the media layer on LiveKit, and the transcript primitive fed by Deepgram that turns everything said in a call into a stream of addressable rows.

The workaround became the convention

The standard fix is a better note-taker, and the current generation of them is genuinely good. Drop a bot into the call, get a clean transcript and a tidy summary in your inbox afterward. Granola, Otter, Fireflies: they do the hard parts well, and we're not going to pretend otherwise.

But look at where the output lands. It lands in their product, or in a doc, or an email: next to the place your work actually lives, never inside it. The transcript is a blob of text in a system that has no idea what an issue, a PR, or a repo is. You can't link an issue to the moment it was raised, query a call the way you query your tracker, or come back to it as anything but a wall of text. The note-taking got automated. The transcript stayed inert.

That's the part we think is backwards. A transcript shouldn't be a dead artifact you file away in another tool. If the conversation lives in the same system as your issues, PRs, and docs, the transcript can be a first-class record like any other: addressable down to the sentence, queryable, something the rest of the platform can point at. Getting there starts with the media.

The room: everyone in it is a participant

Start with the media. A call needs audio, video, and screen sharing moving between people in real time, which is a problem with mature, boring answers, so we didn't reinvent it. LiveKit carries the room. Camera, screen share, and microphone are all just tracks a participant publishes, toggled through one small set of controls; the layout follows the tracks, so the moment someone publishes a Track.Source.ScreenShare track the grid reshapes into a speaker view with the screen on the main stage. Sharing a diff is a first-class thing to do on a call, because the call is a real working session.

The part that matters for everything downstream is who else is in the room. The bot that transcribes the call isn't a backend integration bolted onto the side of LiveKit. It's a participant in the room, exactly like a person, distinguished only by the grant on its token:

lib/calls/livekit.server.ts (server)
// A human joins able to publish media and subscribe to others.
token.addGrant({
  room: roomName,
  roomJoin: true,
  canPublish: true,
  canSubscribe: true,
});

// The transcriber joins with a deliberately narrower grant: it can
// listen, and it can publish *data* (live captions), but it can
// never put audio or video into the room.
token.addGrant({
  room: roomName,
  roomJoin: true,
  canSubscribe: true,
  canPublish: false,
  canPublishData: true,
});

That one difference is load-bearing. "Is transcription on?" becomes "is the transcriber participant in the room?", and "stop transcribing" becomes "remove that participant", so the room itself is the control channel with no second signalling protocol to invent. And because the bot's authority is just a token grant, it can sit in a live call with no ability to hijack the media: it hears everything and can put nothing into the room but text.
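A minimal sketch of what "the room is the control channel" can look like from the web app's side. `RoomAdmin` is a hypothetical interface, narrowed to two methods shaped like `listParticipants` and `removeParticipant` on livekit-server-sdk's `RoomServiceClient`; the `TRANSCRIBER_IDENTITY` value, the helper names, and the in-memory fake are assumptions for illustration, not Plain's code.

```typescript
// Hypothetical value; the post uses TRANSCRIBER_IDENTITY but never shows it.
const TRANSCRIBER_IDENTITY = "transcriber";

// Narrow, hypothetical view of a room admin client. The two methods are
// shaped like listParticipants / removeParticipant on livekit-server-sdk's
// RoomServiceClient, so a real client should fit it structurally.
interface RoomAdmin {
  listParticipants(room: string): Promise<{ identity: string }[]>;
  removeParticipant(room: string, identity: string): Promise<void>;
}

// "Is transcription on?" is just "is the transcriber in the room?".
async function isTranscribing(admin: RoomAdmin, room: string): Promise<boolean> {
  const participants = await admin.listParticipants(room);
  return participants.some((p) => p.identity === TRANSCRIBER_IDENTITY);
}

// "Stop transcribing" is "remove that participant". Returns false when the
// transcriber was not in the room, so a double click is a harmless no-op.
async function stopTranscribing(admin: RoomAdmin, room: string): Promise<boolean> {
  if (!(await isTranscribing(admin, room))) return false;
  await admin.removeParticipant(room, TRANSCRIBER_IDENTITY);
  return true;
}

// In-memory stand-in so the sketch runs without a LiveKit server.
function fakeRoomAdmin(rooms: Record<string, string[]>): RoomAdmin {
  return {
    async listParticipants(room) {
      return (rooms[room] ?? []).map((identity) => ({ identity }));
    },
    async removeParticipant(room, identity) {
      rooms[room] = (rooms[room] ?? []).filter((id) => id !== identity);
    },
  };
}
```

Nothing here stores an "is transcribing" flag anywhere: the participant list is the state, so there is no second source of truth to drift out of sync with the room.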

The primitive: one speaking turn

The transcriber subscribes to exactly one thing. Not video, not screen-share audio, just each person's microphone:

packages/transcribe/src/session.ts (server)
room.on(RoomEvent.TrackSubscribed, (track, publication, participant) => {
  if (track.kind !== TrackKind.KIND_AUDIO) return;
  if (publication.source !== TrackSource.SOURCE_MICROPHONE) return;
  if (participant.identity === TRANSCRIBER_IDENTITY) return;
  startTrack(track, publication, participant);
});

Each microphone track streams to Deepgram, and the fragments that come back get folded into the primitive the whole feature is built on: the speaking turn. A turn is one person talking until someone else takes over or they go quiet. It is one row, and the schema is the place to understand everything that follows:

packages/db/src/schema/transcripts.ts (db)
export type TranscriptTurnKind = "speech";

export const transcriptTurn = pgTable(
  "transcript_turn",
  {
    id: text("id").primaryKey().$defaultFn(() => ulid()),
    organizationId: text("organization_id").notNull(),
    // Exactly one of call_id / meeting_id is set. conversation_id is
    // denormalized for huddle turns (meetings have no conversation).
    conversationId: text("conversation_id").references(() => conversation.id),
    callId: text("call_id").references(() => call.id),
    meetingId: text("meeting_id").references(() => meeting.id),
    // Per-room monotonic order, assigned by the transcriber session.
    // Ordering by seq (not started_at) keeps interleaved crosstalk stable.
    seq: integer("seq").notNull(),
    kind: text("kind").$type<TranscriptTurnKind>().notNull().default("speech"),
    // LiveKit identity: a user id for members, guest_<ulid> for guests.
    speakerIdentity: text("speaker_identity").notNull(),
    speakerUserId: text("speaker_user_id").references(() => user.id),
    speakerName: text("speaker_name").notNull(), // snapshotted for permanence
    text: text("text").notNull(),
    startedAt: timestamp("started_at", { withTimezone: true }).notNull(),
    endedAt: timestamp("ended_at", { withTimezone: true }).notNull(),
  },
  (t) => [
    // Reads are always "one call's turns" or "one meeting's turns", by seq.
    index("transcript_turn_call_idx").on(t.callId, t.seq),
    index("transcript_turn_meeting_idx").on(t.meetingId, t.seq),
  ],
);

A few decisions in that table do most of the work.

It stores turns, not raw speech-to-text fragments. We'll get to why in the next section, but the payoff is that a turn is a real object: one person, one thing they said, one ulid primary key. That id is what later lets anything in the platform point at an exact moment in the conversation, with no extra machinery.

speakerName is snapshotted onto the row rather than joined from the user table, so a meeting guest's turns stay attributable forever, even after the guest is gone and there was never an account to join to. speakerUserId is filled in only when the LiveKit identity maps to a real Plain user.

One table serves both surfaces. A huddle (ambient, attached to a conversation) and a scheduled meeting (with external guests) are different lifecycles around the same transcript primitive: exactly one of callId or meetingId is set, and everything built on turns works for both for free.

And seq: a per-room monotonic counter the session assigns. Ordering by seq rather than wall-clock keeps interleaved crosstalk stable, and it's what keeps the live panel and the post-call record in one order, a property a later section leans on.
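To make the seq point concrete, here is a small sketch with made-up rows. It assumes a turn's start time is derived from when its first fragment arrives and how long that fragment was, so a long first fragment can carry an earlier timestamp than a turn that opened before it. `TurnRow` is a trimmed, hypothetical stand-in for the real row.

```typescript
// Trimmed, hypothetical stand-in for a transcript_turn row.
interface TurnRow {
  id: string;
  seq: number;
  speakerName: string;
  startedAtMs: number;
}

const bySeq = (a: TurnRow, b: TurnRow): number => a.seq - b.seq;
const byStartedAt = (a: TurnRow, b: TurnRow): number => a.startedAtMs - b.startedAtMs;

// Made-up crosstalk: Ada's short fragment arrived first and opened seq 1.
// Grace's longer fragment arrived second, but its derived start is earlier.
const turns: TurnRow[] = [
  { id: "t1", seq: 1, speakerName: "Ada", startedAtMs: 10000 },
  { id: "t2", seq: 2, speakerName: "Grace", startedAtMs: 9200 },
];

// Order the live panel showed, and the order a seq-sorted read returns.
const liveOrder = turns.slice().sort(bySeq).map((t) => t.id);

// Order a wall-clock sort would return later: the two turns swap places.
const clockOrder = turns.slice().sort(byStartedAt).map((t) => t.id);
```

With these rows `liveOrder` is `t1, t2` while `clockOrder` is `t2, t1`, which is the reshuffle that sorting by seq avoids.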

Assembling turns from fragments

Streaming speech-to-text doesn't hand you turns. It hands you a fragment every few seconds, so one person talking for a minute yields a dozen of them: a lousy unit to read, and a lousy thing to attach an issue to. Turning that stream into turns is a small state machine, kept pure so the merging rules are testable without LiveKit or Deepgram in the loop:

packages/transcribe/src/turns.ts (server)
// Fold one final STT fragment into the speaker's turn.
addFinal(speaker: TurnSpeaker, text: string, startMs: number, endMs: number): TurnEvent[] {
  const events = this.expire(startMs); // close anyone who's gone quiet for gapMs
  const trimmed = text.trim();
  if (trimmed.length === 0) return events;

  const existing = this.open.get(speaker.identity);
  if (existing) {
    existing.text = `${existing.text} ${trimmed}`; // grow the open turn in place
    existing.endedAtMs = endMs;
    existing.lastFinalAtMs = endMs; // restart the silence clock from the latest fragment
    if (existing.text.length >= this.opts.maxTurnChars) {
      this.open.delete(speaker.identity);
      events.push({ type: "close", turn: existing });
    } else {
      events.push({ type: "update", turn: existing });
    }
    return events;
  }

  // First fragment from this speaker: open a new turn and take the next seq.
  const turn = {
    id: this.reserved.get(speaker.identity) ?? this.opts.newId(),
    seq: ++this.seq,
    speaker,
    text: trimmed,
    startedAtMs: startMs,
    endedAtMs: endMs,
    lastFinalAtMs: endMs,
  };
  this.open.set(speaker.identity, turn);
  events.push({ type: "open", turn });
  return events;
}

The assembler keeps one open turn per speaker, so when two people talk over each other both turns stay open in parallel and each one's text stays coherent; they interleave in the panel by seq. A turn closes when its speaker goes quiet for gapMs, or when its text grows past maxTurnChars so a monologue can't produce one unbounded row.
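The excerpt above calls `expire` without showing it. What follows is a self-contained sketch of how the pieces might fit together, not the real `turns.ts`: the types, the `expire` body, and the choice to measure silence from `lastFinalAtMs` are all assumptions, and the reserved-id lookup from the excerpt is left out.

```typescript
interface TurnSpeaker { identity: string; name: string }

interface Turn {
  id: string;
  seq: number;
  speaker: TurnSpeaker;
  text: string;
  startedAtMs: number;
  endedAtMs: number;
  lastFinalAtMs: number;
}

type TurnEvent = { type: "open" | "update" | "close"; turn: Turn };

interface AssemblerOpts { gapMs: number; maxTurnChars: number; newId: () => string }

class TurnAssembler {
  private open = new Map<string, Turn>(); // one open turn per speaker identity
  private seq = 0;
  private opts: AssemblerOpts;

  constructor(opts: AssemblerOpts) {
    this.opts = opts;
  }

  // Close every open turn whose speaker has been quiet for at least gapMs.
  expire(nowMs: number): TurnEvent[] {
    const events: TurnEvent[] = [];
    this.open.forEach((turn, identity) => {
      if (nowMs - turn.lastFinalAtMs >= this.opts.gapMs) {
        this.open.delete(identity);
        events.push({ type: "close", turn });
      }
    });
    return events;
  }

  // Fold one final fragment into the speaker's open turn, or open a new one.
  addFinal(speaker: TurnSpeaker, text: string, startMs: number, endMs: number): TurnEvent[] {
    const events = this.expire(startMs);
    const trimmed = text.trim();
    if (trimmed.length === 0) return events;

    const existing = this.open.get(speaker.identity);
    if (existing) {
      existing.text = `${existing.text} ${trimmed}`;
      existing.endedAtMs = endMs;
      existing.lastFinalAtMs = endMs; // restart the silence clock
      const full = existing.text.length >= this.opts.maxTurnChars;
      if (full) this.open.delete(speaker.identity);
      events.push({ type: full ? "close" : "update", turn: existing });
      return events;
    }

    const turn: Turn = {
      id: this.opts.newId(),
      seq: ++this.seq,
      speaker,
      text: trimmed,
      startedAtMs: startMs,
      endedAtMs: endMs,
      lastFinalAtMs: endMs,
    };
    this.open.set(speaker.identity, turn);
    events.push({ type: "open", turn });
    return events;
  }
}
```

Because the class takes timestamps as arguments and an id factory as an option, a test can drive it with fixed numbers and a counter: no clock, no network, and the merging rules are the only thing under test.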

The events it emits map straight onto the database, and this is where the turn-not-fragment decision pays off. An open is an INSERT; an update or close is an UPDATE of the same row, grown in place:

packages/transcribe/src/session.ts (server)
async function applyTurnEvent(job: TranscriptionJob, event: TurnEvent): Promise<void> {
  const { turn } = event;
  if (event.type === "open") {
    await db.insert(transcriptTurn).values({
      id: turn.id,
      organizationId: job.organizationId,
      conversationId: job.conversationId ?? null,
      callId: job.callId ?? null,
      meetingId: job.meetingId ?? null,
      seq: turn.seq,
      speakerIdentity: turn.speaker.identity,
      speakerUserId: await speakerUserId(turn.speaker.identity),
      speakerName: turn.speaker.name,
      text: turn.text,
      startedAt: new Date(turn.startedAtMs),
      endedAt: new Date(turn.endedAtMs),
    });
    return;
  }
  await db
    .update(transcriptTurn)
    .set({ text: turn.text, endedAt: new Date(turn.endedAtMs), updatedAt: new Date() })
    .where(eq(transcriptTurn.id, turn.id));
}

The Deepgram side is deliberately ordinary. Each track gets its own streaming connection (nova-3, 16 kHz linear PCM), with interim results on so we can caption mid-sentence:

packages/transcribe/src/session.ts (server)
const connection = await deepgram.listen.v1.connect({
  model: "nova-3",
  encoding: "linear16",
  sample_rate: String(STT_SAMPLE_RATE),
  channels: "1",
  interim_results: "true",
  smart_format: "true",
});

connection.on("message", (message) => {
  if (message.type !== "Results") return;
  const text = message.channel.alternatives[0]?.transcript ?? "";
  if (text.trim().length === 0) return;
  const endMs = Date.now();
  const startMs = endMs - Math.round((message.duration ?? 0) * 1000);
  if (message.is_final) {
    applyEvents(assembler.addFinal(speaker, text, startMs, endMs)); // persisted
    return;
  }
  // Interim results are caption-only and never hit Postgres.
});

Final fragments go through the assembler and into the database. Interim fragments are caption-only: they're published live so you see words appear as they're spoken, but they never become a row, because a row is a settled fact and an interim is a guess.

Captions are the same row, arriving live

Live captions ride LiveKit text streams on the ecosystem-standard lk.transcription topic, and the design rule there is the one detail that makes the client simple: every caption message carries the full current text of one turn, keyed by the turn's id.

packages/transcribe/src/protocol.ts (server)
export const TRANSCRIPTION_TOPIC = "lk.transcription";
export const ATTR_FINAL = "lk.transcription_final";
// Our attributes: the turn row id + ordering + speaker attribution. The
// stream's sender is the transcriber, not the speaker, so speaker info
// rides in the message rather than coming from the publisher.
export const ATTR_TURN_ID = "plain.turn_id";
export const ATTR_TURN_SEQ = "plain.turn_seq";
export const ATTR_SPEAKER_IDENTITY = "plain.speaker_identity";
export const ATTR_SPEAKER_NAME = "plain.speaker_name";

Because the message holds the whole turn and is keyed by id, a client replaces rather than appends: interim updates overwrite each other in place, and the final update (the lk.transcription_final attribute) settles the turn. The id in the caption is the same id as the transcript_turn row, which collapses two things people usually build separately. A panel can seed its history from Postgres and merge live caption updates on top with no dedupe heuristics, because the live message and the stored row share a primary key. The live experience and the permanent record aren't two systems; one is just the other arriving in real time.
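On the client, replace-by-id can be as small as a map. The sketch below is an illustration under assumptions: `PanelTurn` and the function names are hypothetical, and decoding the text stream and its attributes into a `PanelTurn` is left out.

```typescript
// Hypothetical client-side shape, built either from a stored row or from a
// decoded caption message. Both carry the same turn id.
interface PanelTurn {
  id: string; // same value as the transcript_turn primary key
  seq: number;
  speakerName: string;
  text: string; // always the full current text of the turn
  final: boolean;
}

type Panel = Map<string, PanelTurn>;

// Replace, never append: the incoming message is the whole turn.
function applyTurn(panel: Panel, incoming: PanelTurn): void {
  panel.set(incoming.id, incoming);
}

// History from Postgres goes through the same path as live captions.
function seedPanel(stored: PanelTurn[]): Panel {
  const panel: Panel = new Map();
  stored.forEach((row) => applyTurn(panel, row));
  return panel;
}

// Render order is seq, whether a turn arrived from history or live.
function renderOrder(panel: Panel): PanelTurn[] {
  return Array.from(panel.values()).sort((a, b) => a.seq - b.seq);
}
```

A turn that shows up in both the seeded history and the live stream lands on the same key, so the panel holds it once and no dedupe pass is needed.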

Where all this runs

Everything above is software architecture. The deployment behind it comes down to three facts: there's one media service we don't host, one worker service we do, and a queue made out of the database we already had.

The media plane is the one piece we don't run. LiveKit is the SFU, and it's LiveKit Cloud, not a box we operate. That's not only a build-versus-buy call; it's forced by where the rest of Plain lives. We host on Railway, and Railway's public ingress is HTTP and raw TCP only, with "no inbound UDP, period," as our own infra notes put it. A WebRTC SFU needs UDP to move media (SRTP, and TURN for the fallback), so an SFU simply can't accept traffic on Railway, not as a tuning problem but as a "that port doesn't exist" problem. So the media plane lives on LiveKit Cloud, every other service runs on Railway, and the two meet over wss:// and signed webhooks. We kept the self-hosted-LiveKit door open by avoiding Cloud-only features, but the day we walk through it is the day we have somewhere with UDP to put the SFU, and that somewhere isn't Railway.

The transcriber is its own service, and the reason is written into its package description: it exists separately "because the LiveKit client is a native module the web app's self-contained bundle can't carry (and audio decode is real CPU)." Both halves of that matter. The web app builds to one self-contained server artifact, which a native module can't ride along in; and decoding everyone's audio is genuine CPU work that has no business sitting next to request serving. So apps/transcriber is a thin process: a /healthz endpoint for Railway's probe, and a pg-boss worker holding up to eight live sessions at once, each session a room it has joined.

apps/transcriber/railway.json (config)
{
  "build": { "builder": "DOCKERFILE", "dockerfilePath": "apps/transcriber/Dockerfile" },
  "deploy": {
    "healthcheckPath": "/healthz",
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 10
  }
}

The seam between web and transcriber is the database. There's no RPC between the two services and no internal HTTP. Starting transcription is a row in a queue:

packages/transcribe/src/queue.ts (server)
export const TRANSCRIBE_QUEUE = "transcribe.start";

export async function enqueueTranscription(data: TranscriptionJob): Promise<void> {
  const boss = await getProducer();
  await boss.send(TRANSCRIBE_QUEUE, data, {
    // Collapses double-starts (two people hitting the button) while a job
    // for the room is queued or active.
    singletonKey: data.roomName,
    // A transcription session isn't retryable: if it dies, the transcriber
    // participant just vanishes from the room and someone toggles it back on.
    retryLimit: 0,
    // The job stays active for the whole call, so its expiry sits just above
    // the session's own 4h hard cap.
    expireInSeconds: 4 * 60 * 60 + 300,
  });
}

pg-boss runs over the same Postgres everything else uses, so "start transcribing this room" is an insert the web app makes and the transcriber picks up, no service-to-service networking required. Stopping is the part I like most: it's deliberately not a second queue message, because a queued stop could race a still-queued start into a ghost session. Instead the web app kicks the transcriber participant out of the LiveKit room (removeParticipant), and the session, seeing itself disconnected, flushes its open turns and exits. The room is the control channel for its own transcription, the same way a participant leaving is how a call winds down.
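The shutdown half of that can be sketched as a pure drain over whatever turns are still open. The names below are hypothetical; in a real session this would presumably run from the room's disconnect handler, with `persist` being the same update path used for every other close event.

```typescript
// Hypothetical shape of a turn that is still open when the session ends.
interface OpenTurn {
  id: string;
  seq: number;
  speakerIdentity: string;
  text: string;
}

type TurnCloseEvent = { type: "close"; turn: OpenTurn };

// Drain every open turn as a "close" event, lowest seq first, and leave
// the map empty so a second call has nothing left to do.
function flushOpenTurns(openTurns: Map<string, OpenTurn>): TurnCloseEvent[] {
  const events: TurnCloseEvent[] = Array.from(openTurns.values())
    .sort((a, b) => a.seq - b.seq)
    .map((turn) => ({ type: "close" as const, turn }));
  openTurns.clear();
  return events;
}

// Hypothetical shutdown path: flush, persist each close, report the count.
function endSession(
  openTurns: Map<string, OpenTurn>,
  persist: (event: TurnCloseEvent) => void,
): number {
  const events = flushOpenTurns(openTurns);
  events.forEach((event) => persist(event));
  return events.length;
}
```

Clearing the map inside the flush makes the shutdown idempotent, which matters if a disconnect and a hard-cap timeout both try to end the same session.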

So the whole infrastructure fits in a sentence: the media plane is the one service we don't run, the transcript is assembled in a CPU-heavy worker that joins the room as a participant, and a queue on the one shared Postgres is all the glue between that worker and the rest of the platform.

What falls out for free

Get the primitive right and the rest is small.

Summaries are a row that's also a job. When a transcribed call ends, the same LiveKit webhook that closes the room enqueues a summary. The summary row doubles as its own job state: pending until a worker reads the turns, runs the model, and flips it to complete with the markdown (or failed). It's automatic on end, or on demand from the panel, and because it's the turns it reads, it works identically for a huddle and a scheduled meeting.
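A sketch of the row-as-job idea, under assumptions: the status names come from the description above, while the `SummaryRow` shape, the `error` field, and the synchronous `summarize` callback (standing in for the model call, which would be async in practice) are hypothetical.

```typescript
type SummaryStatus = "pending" | "complete" | "failed";

// Hypothetical row shape; only the three statuses come from the post.
interface SummaryRow {
  id: string;
  status: SummaryStatus;
  markdown: string | null;
  error: string | null;
}

interface SpokenTurn {
  speakerName: string;
  text: string;
}

// Run the job the row describes and return the row's next state.
function runSummary(
  row: SummaryRow,
  turns: SpokenTurn[],
  summarize: (transcript: string) => string,
): SummaryRow {
  // Only pending rows are work; anything else has already been settled.
  if (row.status !== "pending") return row;
  try {
    const transcript = turns.map((t) => `${t.speakerName}: ${t.text}`).join("\n");
    return { ...row, status: "complete", markdown: summarize(transcript) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ...row, status: "failed", error: message };
  }
}
```

Because the function only reads turns, it has no idea whether they came from a huddle or a scheduled meeting, which is the "works identically" property in miniature.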

Pointing at a moment is just a relation. Every turn has a stable ulid, so attaching an issue, a doc, or a follow-up to the exact sentence that was said is one row in the relation table we already have. No new concept, no migration. The transcript is addressable down to the turn, which is the hook everything we build on top of it hangs from.

Late joiners and the post-call record are the same query. History seeds from Postgres ordered by seq; live updates arrive keyed by the same ids and merge on top. Whether you open the panel mid-call or a week later, you're reading the same rows in the same order.

What we're not claiming

The dedicated meeting tools have a head start on the craft, and it would be silly to pretend otherwise. Years of tuning have gone into their diarization, their noise handling, their summary quality. A young feature riding nova-3 and a good prompt is not going to out-transcribe them on day one, and we won't claim it will. If the most important thing to you is the most accurate possible transcript of a call, those tools are very good at that, and that's a real answer.

What they structurally cannot do is the thing this whole post is about. They aren't the place your work lives, so the transcript they produce is always a record about your work, sitting outside it. We didn't set out to be better at transcription. We built the call inside the platform, so the transcript is a first-class record in the same database as your issues, PRs, and docs: addressable down to the turn, queryable like anything else, something the rest of the platform can read. That's an advantage of location, not of cleverness, and it's the one we actually have.

The shape of the bet

None of the pieces is exotic. LiveKit moves the media. Deepgram turns audio into text. The work was in the seams between them: making the speaking turn a first-class row with a stable id and a shared order, so a transcript isn't a blob of text but a stream of addressable moments the rest of the platform can point at, query, and link to.

The bet is that the value of a call was never the recording: it was everything the platform can do once the call is data it can read. Every tool that stops at a transcript leaves that to a tired human at the end of a meeting. We think it belongs to the platform, because the platform is already where the work goes. We've made the call leave a stream of addressable moments; what reads that stream, and turns what was said into the work that comes out of it, is a story for the next post.


This is an early take, and the design is still moving. Plain is in early access, and we're looking for a handful of teams to partner with closely as we build it. If that's you, join the alpha.
