When I was building Momento Baby—an AI search engine for photos and videos—I thought I had the hard parts figured out:
- download media from Google Photos
- run `ffmpeg`/`ffprobe`
- call OpenAI Vision
- generate embeddings
- store everything in Postgres
It all worked beautifully… in development.
Then I shipped it to production and started doing real imports. Not “one photo”, but “a real user import”: 40–100 photos, plus videos. That’s when the truth of multimedia pipelines hits you:
everything fails, and it fails in ways you can’t reproduce locally.
At first, I focused on fault tolerance (separate DB pools, split responsibilities, isolate retries). I wrote about that in Designing Fault-Tolerant AI Pipelines.
But there was a second, sneakier problem that appeared once retries started working:
Retrying makes your system correct… and then it makes it wrong again if your side effects aren’t idempotent.
This post is about that second problem: idempotency.
How do you make a pipeline safe to retry without creating duplicates, breaking foreign keys, or wasting OpenAI money?
The symptom: retries that “work” but corrupt your data
Here’s a fun production story.
I added a “video chunking” feature to Momento Baby: split a short video into 10-second chunks, run multi-frame vision analysis per chunk, generate an embedding, and store that chunk so you can search your memories with natural language.
Conceptually:
- Import a video
- Insert the parent video row (e.g., into a `videos` table)
- For each chunk: analyze → embed → insert a `video_chunk` row
Now add production reality:
- network timeouts
- `ffprobe` stalls
- `ffmpeg` weirdness
- OpenAI returns 5xx
- tokens expire
So of course you retry (Oban makes this easy).
But then you get:
- the same chunk inserted twice
- a chunk insert that crashes because the parent video row is missing (FK violation)
- jobs that keep retrying forever because we crash instead of classifying the failure (retry vs discard)
- and my personal favorite: thumbnails that return `200 OK` but don’t render because the bytes are invalid
In practice, the “invalid bytes” bug only stopped once I started validating artifacts (e.g., running ffprobe and rejecting partial/empty downloads) before persisting anything downstream.
If that sounds chaotic, it is. The fix is not “more retries”.
The fix is: make retries safe.
What “idempotent” means for a multimedia pipeline
“Idempotent” is a fancy word for:
“You can run it multiple times and the end state is the same as running it once.”
In pipelines, it helps to separate:
- idempotent compute: you can recompute embeddings/metadata anytime
- idempotent side effects: you must not create duplicate DB rows or upload infinite thumbnails
The hard part is always the second one.
Principle 1: choose the right dedupe key
Before you touch Oban, you need one thing: a stable identity.
In Momento Baby, the natural keys were already there:
- Photo: (`email`, `google_photos_id`)
- Video: (`email`, `google_photos_id`)
- Video chunk (“moment”): (`video_id`, `start_ms`, `end_ms`)
This matters because once you have a dedupe key, the database can help you enforce idempotency.
Principle 2: enforce idempotency in Postgres (unique index)
The most common anti-pattern I see is:
- check if row exists
- if not, insert
This works in development, and it fails the first time you have concurrency.
Two workers can run at the same time, both see “no row”, and both insert. Race condition.
The correct place to enforce idempotency is the database.
For chunks, the rule is:
One chunk per (`video_id`, `start_ms`, `end_ms`).
So the DB gets a unique index:
```sql
CREATE UNIQUE INDEX video_chunks_video_id_start_ms_end_ms_uniq
  ON video_chunks (video_id, start_ms, end_ms);
```

Principle 3: treat constraint failures as expected outcomes (don’t crash jobs)
Another fun one: foreign keys.
At some point I started seeing this in Oban:
`Ecto.ConstraintError` on `video_chunks_video_id_fkey`
Meaning: a chunk insert referenced a video_id that didn’t exist in the parent table that stores videos (often videos).
In other words: the chunk job ran (or retried) before the parent video row existed—either because the parent insert failed, the parent job was still in progress, or the work got reordered by retries/concurrency.
You can solve this in three steps:
1) Tell Ecto about constraints
If you don’t declare constraints on the changeset, the Postgres constraint violation surfaces as a raised `Ecto.ConstraintError`. Oban sees a crash and retries.
Instead, you want a clean `{:error, changeset}` that you can classify as retryable vs discardable.
So add:
- `foreign_key_constraint/3` for the FK
- `unique_constraint/3` for the dedupe index
Conceptually:

```elixir
changeset
|> foreign_key_constraint(:video_id, name: "video_chunks_video_id_fkey")
|> unique_constraint([:video_id, :start_ms, :end_ms],
  name: "video_chunks_video_id_start_ms_end_ms_uniq"
)
```

2) Make “duplicate insert” a successful no-op
Once the database is enforcing uniqueness, your application should treat “already exists” as a normal outcome.
In Ecto, that usually means inserting with a conflict rule:
```elixir
Repo.insert(changeset,
  on_conflict: :nothing,
  conflict_target: [:video_id, :start_ms, :end_ms]
)
```

Now a retry (or concurrent insert) won’t create duplicates, and you don’t need “check then insert” logic.
At the Oban layer, the key is: don’t crash for known, non-retryable outcomes.
- If something is retryable (timeouts, 5xx), return `{:error, reason}` (or raise).
- If something is not retryable in this job (e.g., missing parent row / auth), return `{:discard, reason}` so the job stops.
3) Build an error taxonomy: retry vs discard
Not every failure should be retried.
My rules of thumb:
- Retry:
- network timeouts
- OpenAI 5xx
- transient ffmpeg failures
- “download incomplete”
- Discard:
- missing parent video (FK error)
- missing auth (no refresh token)
- invalid input
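Those rules of thumb can live in one small classifier that every worker funnels failures through, so the retry/discard decision is made in exactly one place. A minimal sketch (the reason atoms are illustrative, not an exhaustive taxonomy):

```elixir
# Transient failures return {:error, _} so Oban retries;
# permanent ones return {:discard, _} so the queue stays healthy.
defp classify(reason) do
  case reason do
    :timeout                              -> {:error, reason}
    {:openai, status} when status >= 500  -> {:error, reason}
    {:ffmpeg, _output}                    -> {:error, reason}
    :download_incomplete                  -> {:error, reason}
    :missing_parent_video                 -> {:discard, reason}
    :missing_refresh_token                -> {:discard, reason}
    :invalid_input                        -> {:discard, reason}
  end
end
```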
This is huge in practice: it keeps your queues healthy.
Principle 4: one job = one unit of work
This is where Oban shines.
The monolithic job approach looks like this:
download → ffmpeg → vision → embedding → DB insert → storage upload
When it fails, it retries everything. Worst case: you redo expensive work and create duplicates.
In Momento Baby, I moved towards:
- one job for video ingestion (create video record, store a single thumbnail)
- one job per chunk analysis
That gives you:
- isolated retries (a chunk retry doesn’t redo the whole video)
- natural backpressure (queue concurrency controls throughput)
- better UX (video appears quickly, analysis happens in the background)
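The fan-out can be sketched like this, assuming hypothetical `Media.ensure_video/1` and `Media.chunk_spans/2` helpers (the general shape is standard Oban: build job changesets with `new/1`, then `Oban.insert_all/1`):

```elixir
defmodule MomentoBaby.Workers.VideoIngestion do
  use Oban.Worker, queue: :ingest

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"video_id" => video_id}}) do
    # Create the video row + single thumbnail first, so the parent
    # exists before any chunk job runs; then fan out one cheap job
    # per 10-second chunk.
    with {:ok, video} <- Media.ensure_video(video_id),
         {:ok, spans} <- Media.chunk_spans(video, 10_000) do
      spans
      |> Enum.map(fn {start_ms, end_ms} ->
        MomentoBaby.Workers.ChunkAnalysis.new(%{
          video_id: video.id,
          start_ms: start_ms,
          end_ms: end_ms
        })
      end)
      |> Oban.insert_all()

      :ok
    end
  end
end
```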
Principle 5: make retries cheap
This is the part that saves money.
Two changes had a huge effect in Momento Baby:
1) “One thumbnail per video”
Initially, each chunk had its own thumbnail. That’s a lot of extra ffmpeg + storage work.
Then I realized: my UI doesn’t need per-chunk thumbnails. I only need a single thumbnail for the video card. Search quality is driven by embeddings, not thumbnails.
So chunks just reuse the parent video’s thumbnail URL.
2) Early exit if the chunk already exists
If the chunk row exists, the worker should stop immediately.
That’s idempotency at the application level, backed by the DB unique index.
One nuance: this early exit is an optimization (it saves compute/money). The correctness guarantee still comes from the database constraint.
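The early exit itself is a one-query check, run before any ffmpeg or OpenAI work. A sketch, assuming a `VideoChunk` schema module:

```elixir
import Ecto.Query

# Optimization only: the unique index remains the correctness guarantee.
def chunk_exists?(video_id, start_ms, end_ms) do
  Repo.exists?(
    from c in VideoChunk,
      where:
        c.video_id == ^video_id and
          c.start_ms == ^start_ms and
          c.end_ms == ^end_ms
  )
end
```

In the worker, `if chunk_exists?(v, s, e), do: :ok, else: analyze_and_insert(v, s, e)` turns a retry of completed work into a free no-op.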
Principle 6: prevent duplicate enqueues (Oban unique)—but don’t rely on it for correctness
There are two distinct “duplicate” problems:
- Duplicate writes (DB rows, S3 objects): solve with database constraints and deterministic storage keys.
- Duplicate enqueues (the same job inserted multiple times): solve with Oban’s `unique` option.
Oban unique is great for reducing load and avoiding job storms, especially when your pipeline chains jobs (download → analyze → store).
But it’s not a substitute for a DB uniqueness rule—because even a perfectly unique job can run twice in edge cases, and because uniqueness windows expire.
If you chain jobs, I recommend:
- `unique` to keep the queue clean
- DB uniqueness + safe `on_conflict` handling to keep your data correct
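Configuring uniqueness on the worker might look like this (the period and state list are illustrative choices, not recommendations):

```elixir
defmodule MomentoBaby.Workers.ChunkAnalysis do
  use Oban.Worker,
    queue: :analysis,
    # Skip insertion if a job with the same chunk identity was enqueued
    # in the last 10 minutes and is still in flight.
    unique: [
      period: 600,
      keys: [:video_id, :start_ms, :end_ms],
      states: [:available, :scheduled, :executing, :retryable]
    ]
end
```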
Principle 7: make external side effects idempotent (deterministic keys + content validation)
Databases are easy: they can enforce uniqueness.
External side effects are where pipelines get messy:
- writing thumbnails to object storage
- saving downloaded files to disk
- calling paid APIs (OpenAI)
The pattern that holds up in production is: make every side effect addressable by a deterministic key.
For example, instead of “upload a thumbnail”, upload to:
thumbnails/videos/{video_id}.jpg
so a retry overwrites the same object rather than creating a new one.
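In code, that is just a pure key function plus a PUT-by-key upload. A sketch assuming an S3-compatible client such as ExAws (the bucket handling is illustrative):

```elixir
# Deterministic object key: a retry overwrites instead of duplicating.
def thumbnail_key(video_id), do: "thumbnails/videos/#{video_id}.jpg"

# PUT to the same key replaces the object, so the upload itself
# is naturally idempotent.
def upload_thumbnail!(bucket, video_id, bytes) do
  bucket
  |> ExAws.S3.put_object(thumbnail_key(video_id), bytes)
  |> ExAws.request!()
end
```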
And because “200 OK” doesn’t mean “valid bytes”, validate artifacts before you persist references:
- confirm the downloaded file is readable (`ffprobe` locally)
- reject empty/partial downloads
- only then proceed to chunking / analysis
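A minimal validation sketch: ask `ffprobe` for just the container duration, and treat anything it can’t parse (or a zero duration) as an invalid artifact:

```elixir
# Rejects empty/partial downloads before anything downstream runs.
def validate_media(path) do
  args = [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    path
  ]

  with {out, 0} <- System.cmd("ffprobe", args, stderr_to_stdout: true),
       {duration, _} when duration > 0 <- Float.parse(String.trim(out)) do
    :ok
  else
    _ -> {:error, :invalid_media}
  end
end
```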
Principle 8: instrument idempotency (retries, dedupes, and spend)
Idempotency is a correctness concern, but in production it’s also a cost concern.
If you can’t answer these, you’ll eventually pay for them:
- What percentage of jobs are retries?
- How often do you hit uniqueness constraints (i.e., “dedupe saved us”)?
- Which step is burning the most OpenAI tokens?
The simplest pattern is to attach a stable trace/correlation ID to each unit of work (video import, chunk analysis) and emit metrics/events per step.
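With `:telemetry` (which Oban and Ecto already depend on), that pattern is a one-liner per step. The event name, measurements, and metadata here are illustrative:

```elixir
# One event per pipeline step; trace_id ties a whole import together.
:telemetry.execute(
  [:momento, :chunk, :insert],
  %{count: 1, openai_tokens: tokens_used},
  %{trace_id: trace_id, outcome: :deduped, video_id: video_id}
)
```

Attach a handler (or a `Telemetry.Metrics` reporter) and you can chart retry rates, dedupe hits, and token spend per step.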
If you want a deeper dive on AI cost + tracing, I wrote about it in Building AI Observability Before Your First Deploy.
Closing: the checklist I use now
- Pick stable dedupe keys (photo, video, chunk)
- Enforce uniqueness in Postgres (unique indexes)
- Convert DB constraint failures into changeset errors
- Decide retry vs discard rules
- Split monolithic work into smaller Oban jobs
- Make retries cheap (reuse artifacts, early exits)
- Prevent duplicate enqueues (Oban `unique`) to keep queues healthy
- Make external side effects idempotent (deterministic storage keys + validation)
- Instrument retries/dedupes/spend so you can see when idempotency is saving you
If you do just one thing from this post, do this:
Put the uniqueness rule in the database and teach your app to treat it as a successful outcome.
That’s how you get safe retries in the messy world of multimedia + AI.