One Workspace, Many Services: The Rust Architecture Behind Clickk
When people look at the Clickk backend for the first time, the reaction is usually some version of “why is this built like a company of fifty engineers when the team has never been more than a handful?”
It is Rust. It is microservices, a set of small independent programs rather than one big one. It runs on EKS (managed Kubernetes on AWS), set up with Terraform and deployed by GitHub Actions and ArgoCD. On paper that is the kind of stack you reach for when you have a platform team and a pager rotation. Clickk has had neither.
This is a deep dive into that gap, and an honest account of why the backend is built the way it is.
Two answers run underneath everything below, and neither is really about Rust.
The first is the team. I co-founded Clickk, and I architected and built the backend, but the original product idea was my co-founder’s, not mine, and the code was never the work of one pair of hands. A number of engineers came and went over the life of the project, some stayed, and several were crucial to what Clickk became. A codebase that real people rotate through has a particular need: it has to hold onto the knowledge that leaves with them. A lot of these decisions are really about making the system, rather than any one person’s memory, the place that knowledge lives.
The second is how we planned to reach people. The plan was never to buy ads and hope. It was to launch through a warm network of influencers, creators that people close to the company already had relationships with, and let them bring their audiences. That single choice has a strange consequence for a backend. A creator is not a steady stream of users; a creator is a switch. One person with the right following points their fans at a Clickk page and you do not get a gentle ramp, you get a flash flood, and the size of it depends entirely on who flipped the switch. A niche creator brings a trickle; a big one can bring orders of magnitude more. The load is clustered, creator-shaped, and genuinely impossible to predict in advance.
If that is your launch strategy, you cannot build a system that assumes smooth, uniform traffic. You need to be able to stand a capability up quickly when a beta creator teaches you something surprising, pour resources into the parts that get hammered, and starve or retire the parts that do not, independently and fast. Some pieces of Clickk will always run hotter than others. The service that serves offers and the one that records engagement are where a creator’s audience actually lands, so they are the obvious hot spots, while something like lead deletion sits quiet most of the time. That requirement, elastic and lopsided and unpredictable load allocated by need, shaped these decisions at least as much as any benchmark did.
This post walks through the decisions that look unorthodox. I am not going to pretend they were all correct. Some I would make again without hesitation; a couple cost more than they should have. I have tried to write it so that if you ship Rust for a living you get the specifics, and if you just like understanding how things are built you never get lost.
Decision 1: Rust, when nothing about the product demanded it
The conventional advice for a small, early team is to optimize for velocity. Reach for the language with the deepest ecosystem and the fewest ceremony taxes, ship, and rewrite later if you are lucky enough to need to. By that logic this should have been TypeScript or Python.
I chose Rust anyway, and the reasons were boring on purpose:
- The compiler is institutional memory. Safe Rust refuses to compile code with whole categories of bug in it: a value that might be absent when you forgot to check, a fallible result whose value you tried to use without dealing with the error, two threads writing to the same data at once without synchronization. On a team where people come and go, that matters more than it looks. When someone new touches a module whose original author has moved on, the compiler is the part of the system that still remembers the rules. Those bugs do not reach production because they do not reach a green build.
- One language, top to bottom. The lambdas, the workers, the services, and the shared library are all Rust. There is no context switch between a service language and a glue language, and just as importantly there is one domain for everyone to learn. When an engineer joins, there is a single mental model to onboard into; when one leaves, the next person is reading the same language they already know. A changing team is cheapest to run when everyone can read everything.
- Performance is a side effect, not the point, until the day it is. I did not pick Rust to win benchmarks. But remember the load shape. When a creator flips the switch, the cost of surviving that flood is partly how much each request costs to serve, and Rust’s small, predictable footprint means the same modest cluster absorbs a spike that a heavier runtime would need more machines to handle.
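To make the "institutional memory" point concrete, here is a minimal sketch using only the standard library. The `Offer` type, its fields, and the function names are hypothetical and exist only for illustration; they are not taken from the Clickk codebase. The thing to notice is that the return types carry the rules: a caller cannot reach `offer.title` without first deciding what happens when the lookup misses or the input is malformed.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical offer record; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Offer {
    pub title: String,
    pub discount_percent: u8,
}

/// Everything that can go wrong, spelled out in one type.
#[derive(Debug, PartialEq)]
pub enum OfferError {
    NotFound(u32),
    InvalidDiscount(String),
}

impl fmt::Display for OfferError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            OfferError::NotFound(id) => write!(f, "offer {} not found", id),
            OfferError::InvalidDiscount(raw) => write!(f, "invalid discount: {:?}", raw),
        }
    }
}

/// Parsing can fail, so the signature says so. The `u8` is only
/// reachable through the `Ok` arm.
pub fn parse_discount(raw: &str) -> Result<u8, OfferError> {
    let value: u8 = raw
        .trim()
        .parse()
        .map_err(|_| OfferError::InvalidDiscount(raw.to_string()))?;
    if value > 100 {
        return Err(OfferError::InvalidDiscount(raw.to_string()));
    }
    Ok(value)
}

/// A lookup can miss, so it returns `Option` rather than a bare reference.
pub fn find_offer(offers: &HashMap<u32, Offer>, id: u32) -> Option<&Offer> {
    offers.get(&id)
}

/// The `None` case has to be converted or matched before `offer.title`
/// is usable; forgetting to do so is a compile error, not a runtime crash.
pub fn discounted_title(offers: &HashMap<u32, Offer>, id: u32) -> Result<String, OfferError> {
    let offer = find_offer(offers, id).ok_or(OfferError::NotFound(id))?;
    Ok(format!("{} ({}% off)", offer.title, offer.discount_percent))
}
```

Someone who inherits this module does not need to remember that a lookup can miss. The signature of `find_offer` tells them, and the build stops them if they ignore it.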
The honest cost: Rust slows you down at exactly the moments an early team can least afford it. Compile times tax every iteration, which sits in real tension with the “stand a service up quickly” goal I just spent several paragraphs justifying. The ecosystem, while good, has sharp edges where Python would have a mature library ready to import. And “one shared language” cuts both ways: it makes the codebase legible to whoever joins, but it also narrows who can join, because the pool of people fluent in Rust is smaller than the pool fluent in TypeScript, and that bites hardest on the day you actually need to hire. I made the bet anyway, because Clickk is meant to be maintained for years and the upfront tax buys a maintenance discount later. If the only goal were to validate an idea this quarter, it would be the wrong call.
Decision 2: Microservices that live in a single Cargo workspace
This is the one that draws the most confused looks. Microservices and a monorepo are usually pitched as opposites. Clickk is both at once.
A Cargo workspace is one repository that holds many separately shippable programs but builds and version-manages them as a single unit. Every Clickk service is a member of it:
[workspace]
resolver = "2"
members = [
    "auth_service",
    "user_service",
    "content_offer_service",
    "engagement_service",
    "kinesis_lambda",
    "common",
    "osint_service",
    "lead_deletion_lambda",
    "metrics_worker",
    "aeo_service",
    "moment_render_worker"
]

[workspace.dependencies]
actix-web = "4.5.1"
serde = "1.0.228"
tokio = { version = "1.0", features = ["full"] }
# ...
There are two separate ideas bundled in here, and they earn their keep for different reasons.
The “many services” half is what makes the load story possible. Because the offer service and the engagement service are their own programs, they are their own deployable, scalable units. When a creator’s audience floods in, Kubernetes can add more copies of just those two, the hot spots, and leave the quiet services alone. We can ship a fix to engagement at noon without redeploying the world. That independence is what lets resources follow demand instead of being spread evenly across services that do not need them equally. A single big program could scale too, but only as one indivisible block: to give the busy part more headroom you would have to clone the idle parts along with it.
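The "clone the idle parts along with it" argument is just arithmetic, and a small sketch shows it. Every number below is made up for illustration (per-replica memory, replica counts); none of it is a measurement from Clickk. It also assumes, simplistically, that a monolith's footprint is the sum of its parts.

```rust
/// One deployable unit with a hypothetical per-replica memory footprint.
pub struct Service {
    pub name: &'static str,
    pub memory_mb: u32,
    /// Replicas this service would need on its own during a spike.
    pub replicas_needed: u32,
}

/// Memory used when each service scales independently:
/// every service pays only for its own replicas.
pub fn per_service_memory_mb(services: &[Service]) -> u32 {
    services.iter().map(|s| s.memory_mb * s.replicas_needed).sum()
}

/// Memory used when everything ships as one block: the whole block is
/// cloned as many times as the hungriest part requires.
pub fn monolith_memory_mb(services: &[Service]) -> u32 {
    let copies = services.iter().map(|s| s.replicas_needed).max().unwrap_or(0);
    let block: u32 = services.iter().map(|s| s.memory_mb).sum();
    copies * block
}

/// Illustrative fleet: two hot services, two quiet ones.
pub fn example_fleet() -> Vec<Service> {
    vec![
        Service { name: "content_offer_service", memory_mb: 64, replicas_needed: 10 },
        Service { name: "engagement_service", memory_mb: 64, replicas_needed: 10 },
        Service { name: "user_service", memory_mb: 64, replicas_needed: 2 },
        Service { name: "auth_service", memory_mb: 64, replicas_needed: 2 },
    ]
}
```

With these invented numbers, independent scaling needs 1536 MB during the spike while the single block needs 2560 MB, because the quiet services get cloned ten times too. The real ratio depends entirely on how lopsided the load turns out to be.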
The “single workspace” half is about keeping a many-service, many-author codebase coherent. That [workspace.dependencies] block means every service pins the same Actix, the same Tokio, the same Serde. There is no drift where one service is two minor versions behind and behaves subtly differently under load, and no situation where the engineer who set up service A quietly made different version choices than whoever set up service B. Upgrading is one edit, then one cargo build that either compiles the whole world or tells me precisely what broke.
The glue is a crate literally named common. It is where every cross-cutting concern lives exactly once:
// common/src/lib.rs
pub mod ai;
pub mod clickk_config;
pub mod crypto;
pub mod database;
pub mod dynamodb;
pub mod errors;
pub mod events;
pub mod logging;
pub mod metrics;
pub mod metrics_middleware;
pub mod middleware;
pub mod telemetry;
pub mod swagger;
Connection pooling, the auth middleware, telemetry wiring, the DynamoDB client, the event schemas, the error types. A new service imports common, gets a database pool, metrics, tracing, and token verification, and starts handling requests. There is no copy-paste boilerplate across services because there is one source of truth that the compiler enforces across all of them. In practical terms: when a beta creator teaches us something and we need to stand up a new capability, the new service starts the day with all the plumbing already attached. The marginal cost of one more service is low by design, which is exactly what a spin-it-up-fast launch plan needs.
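As a rough illustration of what "lives exactly once" buys, here is a standard-library-only sketch of a shared error type. The variants, the status mapping, and `parse_page_size` are assumptions made for this example; the real `common::errors` module may look quite different. The idea is that one `From` implementation and one status mapping, defined once, let every handler in every service use `?` without local glue.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Stand-in for a shared error type that a `common` crate might export.
#[derive(Debug, PartialEq)]
pub enum ServiceError {
    BadRequest(String),
    Unauthorized,
    NotFound,
    Internal(String),
}

impl ServiceError {
    /// One mapping from error to HTTP status, shared by every service.
    pub fn status_code(&self) -> u16 {
        match self {
            ServiceError::BadRequest(_) => 400,
            ServiceError::Unauthorized => 401,
            ServiceError::NotFound => 404,
            ServiceError::Internal(_) => 500,
        }
    }
}

impl fmt::Display for ServiceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ServiceError::BadRequest(msg) => write!(f, "bad request: {}", msg),
            ServiceError::Unauthorized => write!(f, "unauthorized"),
            ServiceError::NotFound => write!(f, "not found"),
            ServiceError::Internal(msg) => write!(f, "internal error: {}", msg),
        }
    }
}

impl std::error::Error for ServiceError {}

/// Defined once, so `?` on integer parsing works in any handler.
impl From<ParseIntError> for ServiceError {
    fn from(err: ParseIntError) -> Self {
        ServiceError::BadRequest(err.to_string())
    }
}

/// Hypothetical helper inside some service: parse a page-size parameter.
pub fn parse_page_size(raw: &str) -> Result<u32, ServiceError> {
    let size: u32 = raw.parse()?; // ParseIntError converts via the shared `From`
    if size == 0 {
        return Err(ServiceError::BadRequest("page size must be positive".to_string()));
    }
    Ok(size)
}
```

A new service gets consistent error semantics on day one, and changing the mapping is a single edit that the compiler propagates across the workspace.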
A direct consequence of this design is that the services almost never talk to each other. Each one owns its data and exposes HTTP. The orchestration (call user-service, then content-offer-service, then engagement-service to compose a page) happens in the Next.js layer, not in a backend service mesh. There is exactly one place that fans out across services, and it is the frontend.
The honest tradeoffs:
- This is microservices for the deployment, data-ownership, and independent-scaling benefits, deliberately without the operational tax of inter-service networking. Each service scales and ships on its own, but nobody has to debug a distributed call graph at 2am because there isn’t one. The cost is that the frontend carries orchestration logic that a backend purist would want behind an API gateway or a BFF service.
- The hot-spot story is still a hypothesis, not a measurement. We have predicted that the offer and engagement services are where the load will land, and shaped the system around that prediction before a single big creator has actually flipped the switch. That is premature scaling wearing a reasonable disguise. If real traffic teaches us the hot spots are elsewhere, the modular boundaries become boundaries in the wrong places, and redrawing them is not free.
- “Almost never” is not “never.” The moment_render_worker does PUT status back to content-offer-service, and there is an INTERNAL_API_TOKEN shared secret to make service-to-service calls possible when they are genuinely needed. It is the exception that proves we tried hard to avoid the rule.
- The workspace makes coupling cheap, which is a double-edged sword. It is trivially easy to reach into common and create a dependency that quietly couples two services through shared types. Keeping common to genuinely cross-cutting concerns is a discipline the compiler will not enforce for you.
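For the rare service-to-service call, a shared-secret check can be very small. The sketch below is an assumption about how such a guard might look, not the actual Clickk middleware: the `Bearer` header format and both function names are invented for illustration, and only the existence of an INTERNAL_API_TOKEN secret comes from the codebase.

```rust
/// Compare two byte strings without stopping at the first mismatching
/// byte. Only the length check returns early, so the length can leak
/// through timing but the position of the first wrong byte should not.
pub fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences instead of branching
    }
    diff == 0
}

/// Hypothetical guard for internal-only routes.
/// Assumes the caller sends `Authorization: Bearer <token>`.
pub fn is_internal_call(authorization: Option<&str>, expected_token: &str) -> bool {
    // An unset or empty secret must never authorize anything.
    if expected_token.is_empty() {
        return false;
    }
    match authorization.and_then(|header| header.strip_prefix("Bearer ")) {
        Some(presented) => constant_time_eq(presented.as_bytes(), expected_token.as_bytes()),
        None => false,
    }
}
```

Production code would more likely lean on a vetted crate such as `subtle` for the comparison, since an optimizer makes no timing promises about a hand-written loop. The sketch only shows the shape of the check and the fail-closed behavior when the secret is missing.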
If Clickk were spread across many repos with parallel release cadences, this single-workspace approach would not survive contact. For where it is now, it is the best of both worlds: the isolation of services with the refactor-everything-at-once ergonomics of a monolith.
Decision 3: Postgres and DynamoDB, split by write pattern
Clickk runs two databases and the split is not arbitrary. It follows the shape of the writes, and, not coincidentally, it mirrors the load story. The relational core changes slowly; the engagement firehose is what a creator’s audience actually generates at volume.
Postgres (through Diesel) holds the relational, transactional core. Users, content, offers, the join table that places an offer on a video at a timestamp, subscription billing fields. This is data with relationships I want to enforce, queries I want to express as joins, and a correctness bar where a foreign key constraint is a feature.
DynamoDB holds the high-write engagement firehose. Per-visitor contact metadata, fan-to-creator relationships, interaction history, offer metrics, comments. This is exactly the data that scales with audience size: when a creator brings a flood of fans, this is the table taking the brunt of it. It arrives fast, in volume, keyed by a known id, and we want predictable write throughput far more than ad hoc relational queries. DynamoDB is built for that “huge volume of simple, keyed writes” shape in a way a relational database is not.
The events that feed the high-write side are versioned and contract-tested, which is the part I am most glad I did:
// common/src/events/lead_created.rs
pub const LEAD_CREATED_EVENT_VERSION: &str = "1";
pub const LEAD_CREATED_EVENT_NAME: &str = "lead.created";
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize, JsonSchema)]
pub struct LeadCreatedEvent {
    pub version: String,
    pub event: String,
    pub event_id: String,
    pub occurred_at: DateTime<Utc>,
    pub creator_id: Uuid,
    pub lead: Lead,
}
A test in common compiles the JSON schema and asserts the exact set of top-level keys, so a careless field rename in the producer fails CI instead of silently breaking the Kinesis consumer downstream.
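The shape of that check can be sketched without any dependencies. The real test works from the generated JSON schema. The version below instead scans a serialized payload with a deliberately naive hand-rolled key scanner, which is a stand-in I wrote for this post and not how the codebase does it. The function names and the `LEAD_CREATED_KEYS` constant are likewise illustrative; only the six field names come from the struct above.

```rust
use std::collections::BTreeSet;

/// The contract: exactly these top-level keys, no more, no fewer.
pub const LEAD_CREATED_KEYS: [&str; 6] =
    ["creator_id", "event", "event_id", "lead", "occurred_at", "version"];

/// Naive scanner that collects the keys of the outermost JSON object.
/// It tracks nesting depth and string state; it is not a JSON validator
/// and does not decode escape sequences.
pub fn top_level_keys(json: &str) -> BTreeSet<String> {
    let mut keys = BTreeSet::new();
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    let mut current = String::new();
    let mut last_string: Option<String> = None;

    for c in json.chars() {
        if in_string {
            if escaped {
                current.push(c);
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
                if depth == 1 {
                    // Remember the string; it is a key only if ':' follows.
                    last_string = Some(std::mem::take(&mut current));
                } else {
                    current.clear();
                }
            } else {
                current.push(c);
            }
            continue;
        }
        match c {
            '"' => in_string = true,
            '{' | '[' => depth += 1,
            '}' | ']' => depth = depth.saturating_sub(1),
            ':' if depth == 1 => {
                if let Some(key) = last_string.take() {
                    keys.insert(key);
                }
            }
            ',' => last_string = None,
            _ => {}
        }
    }
    keys
}

/// Fails with a readable diff when the payload drifts from the contract.
pub fn check_contract(json: &str, expected: &[&str]) -> Result<(), String> {
    let actual = top_level_keys(json);
    let expected: BTreeSet<String> = expected.iter().map(|k| k.to_string()).collect();
    if actual == expected {
        return Ok(());
    }
    let missing: Vec<&String> = expected.difference(&actual).collect();
    let unexpected: Vec<&String> = actual.difference(&expected).collect();
    Err(format!("missing keys: {:?}, unexpected keys: {:?}", missing, unexpected))
}
```

Renaming `creator_id` to `creatorId` in the producer makes `check_contract` return an error naming both the missing and the unexpected key, which is the failure you want to see in CI rather than in the consumer.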
The honest tradeoff: two databases means two mental models, two sets of failure modes, and no way to join across the boundary. If a question spans both stores, you answer it in application code, which is slower to write and easier to get wrong than a SQL join. We accept that because the alternative (forcing the engagement firehose into Postgres) trades a small amount of query convenience for a connection-pool and write-throughput problem nobody wants on the exact day a creator goes big. The split is a bet that the two workloads are genuinely different, and so far it has held.
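Here is what "answer it in application code" tends to look like, as a minimal sketch. The row types and field names are hypothetical and the two slices stand in for results already fetched from each store; nothing here is the actual Clickk schema. It is a hash join written by hand: index one side by key, then probe it from the other.

```rust
use std::collections::HashMap;

/// Row as it might come back from the relational store (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct OfferRow {
    pub offer_id: u32,
    pub title: String,
}

/// Aggregate as it might come back from the key-value store (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct OfferMetrics {
    pub offer_id: u32,
    pub clicks: u64,
}

/// The combined answer neither store can produce on its own.
#[derive(Debug, PartialEq)]
pub struct OfferReport {
    pub title: String,
    pub clicks: u64,
}

/// Hand-written equivalent of `offers LEFT JOIN metrics USING (offer_id)`.
pub fn join_offers_with_metrics(offers: &[OfferRow], metrics: &[OfferMetrics]) -> Vec<OfferReport> {
    // Build side: index the metrics by the join key.
    let clicks_by_offer: HashMap<u32, u64> =
        metrics.iter().map(|m| (m.offer_id, m.clicks)).collect();

    // Probe side: keep every offer, in its original order.
    offers
        .iter()
        .map(|offer| OfferReport {
            title: offer.title.clone(),
            // No metrics yet means zero clicks, not a dropped row.
            clicks: clicks_by_offer.get(&offer.offer_id).copied().unwrap_or(0),
        })
        .collect()
}
```

Even this toy version hides decisions a SQL join would make explicit: what to do with offers that have no metrics, and what to do with metrics whose offer no longer exists. Here the first case reports zero and the second is silently ignored, and both choices are easy to get wrong without noticing.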
Decision 4: Async Diesel, and the migration off sqlx
This is the most Rust-specific decision and the one I am quietly proudest of. (If “async” is unfamiliar: it means the program can keep thousands of requests in flight at once without dedicating a whole thread to each one while it waits on the database. Under spiky, creator-driven load, that efficiency is the difference between absorbing a flood and tipping over.)
If you have written Rust with Diesel, you know the usual shape: Diesel is synchronous, so you wrap every query in web::block to push it onto a blocking threadpool and avoid stalling the async runtime. It works, but it is boilerplate on every query and it spends threads to paper over a sync/async mismatch.
Clickk does not do that. It runs diesel-async with a bb8 pool, so queries are genuinely async and awaited directly in handlers:
# common/Cargo.toml
diesel = { version = "2.2", features = ["postgres", "uuid", "chrono", "r2d2", "serde_json", "32-column-tables"] }
diesel-async = { version = "0.7.3", features = ["postgres", "bb8"] }
// user_service/src/db_access/users.rs
use diesel_async::RunQueryDsl; // async versions of execute/get_result

pub async fn create_user_db(pool: &DbPool, new_user: CreateUser) -> Result<User, ServiceError> {
    let mut conn = get_conn(pool).await?;
    Ok(diesel::insert_into(users)
        .values(&new_user)
        .get_result(&mut conn) // no web::block, this is a real await
        .await?)
}
No threadpool dance, no web::block. The query is an async function call like any other.
The genuinely fiddly part is that all of this sits behind PgBouncer in transaction-pooling mode. (PgBouncer is a connection pooler: it lets many copies of a service share a small pool of real database connections, which matters once Kubernetes is spinning up extra replicas under load.) Transaction pooling does not play nicely with the prepared-statement caching that a long-lived connection assumes, so the pool’s connection setup explicitly defuses it:
// common/src/database.rs
// Force custom plans so Postgres does not cache generic plans for prepared
// statements, as a workaround for PgBouncer transaction mode.
// This MUST happen after the connection is fully established.
diesel::sql_query("SET plan_cache_mode = force_custom_plan")
    .execute(&mut conn)
    .await?;
Getting here was not a clean greenfield decision. The backend started on sqlx and moved to Diesel, and the codebase still wears the scars: there is a migrations_sqlx_backup/ directory, the generated schema still references a _sqlx_migrations table, and the dev tooling has both worlds living side by side. The old sqlx path:
# db.just
DATABASE_URL={{DB_URL}} sqlx migrate run --source {{MIGRATIONS_DIR}}
The new Diesel path:
# justfile
migrate:
    docker compose exec user-service diesel migration run \
        --migration-dir /usr/src/clickk-backend/common/migrations
The honest tradeoff: diesel-async is younger and less traveled than sqlx. We traded a popular, well-documented tool for Diesel’s type-safe query DSL and schema generation, and paid for it once, in full, during the migration. The PgBouncer plan-cache issue is exactly the kind of sharp edge you hit on the less-worn path. I would make the same call again, because the compile-time guarantees of Diesel’s query builder are worth it, but “we rewrote our data layer mid-project” is not advice to hand out lightly.
Decision 5: Terraform, GitHub Actions, and EKS, where the system is the documentation
The infrastructure looks the most over-engineered of anything here, and I understand why. The entire AWS footprint (VPC, EKS, RDS, DynamoDB, Kinesis, S3, Lambda, IAM) is defined in Terraform modules, the whole cluster described as code instead of clicked together by hand. GitHub Actions builds and pushes images; ArgoCD watches the Helm charts and syncs the cluster to match. There is no “ssh in and deploy” step anywhere.
Part of the reason is the launch story. A system described entirely in code is a system we can change quickly and safely, adding a service, scaling a hot one, rolling something back, without anyone remembering a sequence of manual console clicks under pressure. The other part goes back to the team. When engineers come and go, undocumented infrastructure is the most dangerous kind of knowledge, because it walks out the door with whoever set it up. If the cluster is described in Terraform and the deploys are described in YAML, then the system is the documentation. Someone can come back after three weeks heads-down on a feature, or start the week after someone else left, and not have to reverse-engineer an AWS console that nobody wrote down.
One detail that follows from this discipline: database migrations are not run when a service boots. They run as an explicit step in the build pipeline.
# .github/workflows/backend-cicd-build.yml
diesel migration run
This is deliberate. Kubernetes runs several replicas of each service, which is the same property that lets us scale the hot ones. If migrations ran at startup, every replica would race to migrate the same database on every rollout, which is a category of outage worth not inventing for yourself. Making migration a discrete, ordered step decouples “change the schema” from “start the service” and removes the race entirely.
The honest tradeoff: this is real infrastructure with a real learning curve, and the day it breaks there is no separate platform team to page, because the platform team is whoever is on engineering that week. The bet is that the cost of learning Terraform and GitOps once is smaller than the cost of hand-managed infrastructure nobody can reason about later. For a product meant to run for years and scale on a creator’s schedule, that math works. For a weekend prototype it would be madness.
What I actually believe about all this
None of these decisions are clever for the sake of being clever. Each is a bet that the upfront cost buys a maintenance discount over the life of the product, and a bet that the way we mean to grow, one creator at a time in unpredictable bursts, is better served by a system that can flex than by one that is merely simple. The vision these bets serve is my co-founder’s as much as mine, and the cost of getting them wrong is carried by everyone who touches the code, me most of all as the person who chose them.
Some are unambiguous wins. The single workspace with a shared common crate is the thing that keeps a many-service backend tractable for a small team, especially one whose members change. The Postgres and DynamoDB split has held up under exactly the workload shapes it was designed for.
Some I am honest about. Rust slows down early iteration and fights the very “spin it up fast” goal I used to justify it. Two databases means no joins across the boundary. The frontend carries orchestration a purist would push down. The async-Diesel migration cost real time and left scars in the repo. And the biggest one: a lot of this is built for a flood that has not arrived yet. Designing for the day a creator brings a million fans is prudent if that day comes and premature if it does not, and we will not know which until it does.
If there is a single thread running through all of it, it is this: an architecture is not just how a system runs, it is where a team’s knowledge lives once the people who wrote it move on. Every piece of complexity I added is complexity someone has to carry, so each one had better be defensible out loud. This post is me defending them out loud. Some I would say again with full conviction. A couple I say with a wince. That, I think, is the honest state of any real system.