How we curate 14,320+ literary quotes for Canto
Every passage in Canto passed through a multi-stage quality pipeline. Here is what that process actually looks like.
Canto has 14,320+ literary quotes in its library. Getting there required building a curation pipeline that could evaluate quality at scale without sacrificing the standards that make the feed worth opening.
The problem with unfiltered quote libraries
There are apps and websites with millions of quotes. Type any author’s name into Goodreads and you will find hundreds of lines attributed to them — many misattributed, decontextualized, or simply not that good.
The internet defaults to the most shareable version of anything. For quotes, that means passages that generalize well, fit inside a caption, and could apply to any life. They are not usually the ones that are specifically, irreducibly good writing. A feed built from that data feels like a highlights reel for the least interesting part of literature.
We needed a way to filter 500,000+ raw quote candidates down to a library that actually earns a reader’s attention.
The pipeline
Canto’s curation process runs in four stages.
Stage 1 — Pre-filtering. The raw dataset is filtered by basic quality signals: minimum length, language detection (English only), and removal of duplicates, fragments, and entries missing author attribution. This eliminates the bulk of noise before any deeper evaluation.
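To make the pre-filter concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Canto's production code: the `Candidate` fields, the `MIN_WORDS` cutoff, and the fragment heuristic are all hypothetical, and language detection is assumed to have happened upstream.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

MIN_WORDS = 6  # assumed cutoff; the real minimum length is not stated


@dataclass(frozen=True)
class Candidate:
    text: str
    author: Optional[str]
    language: str  # assumed to be filled in by an upstream language detector


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation so near-identical copies collide."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def looks_like_fragment(text: str) -> bool:
    """Crude heuristic: starts lowercase or lacks closing punctuation."""
    stripped = text.strip()
    if not stripped:
        return True
    return stripped[0].islower() or stripped[-1] not in ".!?\"'”’"


def prefilter(candidates: Iterable[Candidate]) -> Iterator[Candidate]:
    """Yield candidates that pass every basic quality signal."""
    seen = set()
    for c in candidates:
        if c.language != "en":                  # English only
            continue
        if not c.author or not c.author.strip():  # missing attribution
            continue
        if len(c.text.split()) < MIN_WORDS:     # too short
            continue
        if looks_like_fragment(c.text):
            continue
        key = normalize(c.text)
        if key in seen:                         # duplicate of an earlier entry
            continue
        seen.add(key)
        yield c
```

Running the cheap checks first keeps the expensive scoring stage from ever seeing most of the noise.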
Stage 2 — Automated scoring. Each candidate is scored across five dimensions:
- Literary quality — is the writing itself good? Do the sentence structure, word choice, and rhythm reflect craft?
- Emotional resonance — does it land? Does it have the quality of a line someone would want to return to?
- Clarity — does it work outside its original context, or does it require the surrounding chapter to make sense?
- Shareability — is it complete enough to stand on its own?
- Attribution confidence — can we verify the author and source with reasonable certainty?
Each dimension is scored independently. A quote that misses the minimum threshold on any one of the five is dropped. Of those that remain, only the top tier advances.
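The gating logic described above can be sketched as follows. Everything numeric here is an assumption for illustration: the 0 to 1 scale, the per-dimension minimums, the 25% top-tier fraction, and the choice to rank survivors by the mean of their five scores. The actual scoring model is not shown.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass(frozen=True)
class Scores:
    literary_quality: float
    emotional_resonance: float
    clarity: float
    shareability: float
    attribution_confidence: float


# Hypothetical minimums on an assumed 0-1 scale.
MIN_THRESHOLDS = Scores(0.6, 0.5, 0.6, 0.5, 0.8)
TOP_TIER_FRACTION = 0.25  # assumed share of passing quotes that advance


def clears_all_minimums(s: Scores, minimums: Scores = MIN_THRESHOLDS) -> bool:
    """True only if every dimension meets its own minimum."""
    return all(
        getattr(s, f.name) >= getattr(minimums, f.name) for f in fields(Scores)
    )


def overall(s: Scores) -> float:
    """Unweighted mean of the five dimensions (an assumed ranking rule)."""
    values = [getattr(s, f.name) for f in fields(Scores)]
    return sum(values) / len(values)


def select_top_tier(
    scored: Dict[str, Scores], fraction: float = TOP_TIER_FRACTION
) -> List[str]:
    """Drop anything below a minimum, then keep the best-ranked fraction."""
    passing = [(qid, s) for qid, s in scored.items() if clears_all_minimums(s)]
    passing.sort(key=lambda item: overall(item[1]), reverse=True)
    keep = max(1, round(len(passing) * fraction)) if passing else 0
    return [qid for qid, _ in passing[:keep]]
```

Note that a quote with excellent prose but weak attribution never reaches the ranking step: the per-dimension minimums act as a hard gate, and the ranking only orders what survives it.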
Stage 3 — Metadata enrichment. Surviving quotes are matched against book metadata — title, author, genre, publication details — to ensure every passage that reaches the app is correctly attributed and linked to the right source.
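One simple way to do this kind of matching is a lookup keyed on normalized title and author. The sketch below assumes a hypothetical in-memory `Book` catalog and quote records stored as dictionaries with `title` and `author` keys; a real catalog would likely need fuzzier matching than this.

```python
import unicodedata
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    genre: str
    year: int


def _norm(s: str) -> str:
    """Strip accents, case, and extra whitespace for tolerant matching."""
    decomposed = unicodedata.normalize("NFKD", s)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


def book_key(title: str, author: str) -> Tuple[str, str]:
    return _norm(title), _norm(author)


def build_index(books: Iterable[Book]) -> Dict[Tuple[str, str], Book]:
    return {book_key(b.title, b.author): b for b in books}


def enrich(quote: dict, index: Dict[Tuple[str, str], Book]) -> Optional[dict]:
    """Return the quote with canonical book metadata, or None if unmatched."""
    book = index.get(book_key(quote["title"], quote["author"]))
    if book is None:
        return None  # no confident match: do not ship a guessed attribution
    return {
        **quote,
        "title": book.title,    # canonical spelling from the catalog
        "author": book.author,
        "genre": book.genre,
        "year": book.year,
    }
```

Returning `None` for unmatched quotes, rather than passing them through with their raw attribution, is a design choice in keeping with the goal that every passage is linked to the right source.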
Stage 4 — Genre classification and editorial review. Scored and enriched quotes are organized by genre and reviewed against genre-specific editorial criteria. A romance quote and a philosophy quote are good in different ways; the pipeline accounts for that.
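As a rough illustration of genre-specific criteria, the sketch below weights the same dimension scores differently per genre. The genres, weights, and the use of only three dimensions are hypothetical; they stand in for whatever editorial criteria are actually applied.

```python
from typing import Dict

DEFAULT_WEIGHTS: Dict[str, float] = {
    "literary_quality": 1.0,
    "emotional_resonance": 1.0,
    "clarity": 1.0,
}

# Hypothetical weights: romance leans on feeling, philosophy on clarity.
GENRE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "romance": {"literary_quality": 1.0, "emotional_resonance": 2.0, "clarity": 1.0},
    "philosophy": {"literary_quality": 1.0, "emotional_resonance": 0.5, "clarity": 2.0},
}


def genre_score(scores: Dict[str, float], genre: str) -> float:
    """Weighted mean of dimension scores using the genre's weights."""
    weights = GENRE_WEIGHTS.get(genre, DEFAULT_WEIGHTS)
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total
```

With these assumed weights, a line that is emotionally strong but hard to follow out of context ranks higher as romance than as philosophy, which is the kind of difference the editorial review is meant to respect.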
Why this approach
The honest reason we built an automated pipeline is that the alternative, reading every quote by hand, does not scale to a library of this size and cannot keep pace with weekly updates.
The honest reason we did not simply take everything that scored above a threshold is that automated scoring is not perfect. The pipeline is calibrated to be conservative: it is better to drop a good quote than to surface a bad one. The library is smaller than it could be, and more consistent than it would be otherwise.
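One common way to calibrate conservatively is to pick the cutoff from a hand-labeled sample so that precision stays above a target, accepting that some good quotes fall below it. The sketch below assumes such a sample exists as `(score, is_good)` pairs; the function name and the 0.95 default target are hypothetical.

```python
from typing import List, Optional, Tuple


def pick_conservative_threshold(
    labeled: List[Tuple[float, bool]], target_precision: float = 0.95
) -> Optional[float]:
    """Return the lowest score cutoff whose precision meets the target.

    Precision at a cutoff is the share of items scoring at or above it
    that a human judged good. Returns None if no cutoff qualifies.
    """
    ranked = sorted(labeled, key=lambda pair: pair[0], reverse=True)
    best = None
    good = 0
    for i, (score, is_good) in enumerate(ranked, start=1):
        good += is_good
        # Only evaluate at the end of a run of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if good / i >= target_precision:
            best = score
    return best
```

Raising `target_precision` pushes the cutoff up and shrinks the library, which is the trade described above: fewer quotes, more consistent quality.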
What makes a quote earn its place
Across all stages, the criteria come back to the same question: would a reader stop mid-swipe for this line?
Not because it is famous. Not because the author is famous. Because the sentence itself has something — a precision, a feeling, a rhythm — that makes the reader want to read it again.
That is the standard the pipeline is calibrated against. It is imperfect. We keep adjusting it. If you find a quote in the app that should not be there, or a book you love that is not represented, email us at [email protected].