Pipeline
audia uses a LangGraph linear state machine to convert a PDF into audio. The graph runs in four steps; STT + LLM query distillation are optional pre-steps that run before the graph when using voice or text queries.
Entry points
Entry point |
Command |
|---|---|
Voice input |
|
Text query |
|
Local PDF |
|
[voice input] [text query]
│ │
▼ │
faster-whisper STT │
│ │
▼ │
LLM query distillation │ ← extracts concise ArXiv search terms
│ │ from natural speech
▼ │
Confirm / re-record? │
│ yes │
▼ ▼
Step 0 ─ ArXiv search (or local PDF)
│ fetch metadata, download PDF
│
▼
Step 1 ─ PDF extraction
│ PyMuPDF: text + metadata per page
│
▼
Step 2 ─ Heuristic pre-pass
│ Regex: strip citations, LaTeX artefacts, figure captions
│
▼
Step 3 ─ LLM curation
│ Chunked LLM pass: math → English, tables → sentences,
│ smooth spoken transitions via tail-context stitching
▼
Step 4 ─ TTS synthesis
edge-tts / kokoro / OpenAI: split into ~3800-char chunks,
synthesise in parallel, concatenate → .mp3
Step 1 — PDF extraction (pdf_processor.py)
Uses PyMuPDF to extract text and metadata from every page.
The result is a PDFResult dataclass with:
text— full extracted texttitle— document title (from metadata or first heading)num_pages— page count
Step 2 — Heuristic pre-pass (text_cleaner.py)
A fast regex-based step that strips common noise before the LLM sees the text:
Inline citations like
[1],[Smith et al. 2024]LaTeX math commands (
\alpha,\frac{...}{...}, display equations)Figure and table captions
Running headers / footers
Excess whitespace
This step reduces LLM token cost and prevents the model from being distracted by artefacts.
Step 3 — LLM curation (text_cleaner.py — llm_curate)
The core of audia. Long texts are split at paragraph boundaries into chunks of
AUDIA_LLM_MAX_CHUNK_CHARS characters. Each chunk is sent to the configured LLM with:
A system prompt instructing the model to rewrite for spoken audio
For chunks after the first: the tail of the previous curated chunk as transition context
This tail-context stitching produces smooth spoken transitions across chunk boundaries.
The system prompt instructs the LLM to:
Rewrite mathematical notation into plain English
Convert tables into narrative sentences
Condense acknowledgements to one sentence
Remove citation artefacts entirely
Maintain a natural, engaging speaking tone
Step 4 — TTS synthesis (tts.py)
The curated text is split into AUDIA_TTS_CHUNK_CHARS-character chunks and sent to the
selected TTS backend. For edge-tts, synthesis is async with a 90-second per-chunk timeout.
All chunk audio files are concatenated via soundfile into one final .mp3.
State (state.py)
The LangGraph pipeline uses a typed TypedDict state:
Key |
Type |
Description |
|---|---|---|
|
|
Input PDF path |
|
|
PyMuPDF output |
|
|
After heuristic pass |
|
|
After LLM curation |
|
|
Final .mp3 path |
|
|
Extracted document title |
|
|
Page count |
|
|
Error message if a node fails |
Debug output
Every run saves text snapshots for inspection:
~/.audia/debug/<run_id>/
1_raw.txt ← PyMuPDF output
2_preprocessed.txt ← after heuristic pass
3_curated.txt ← after LLM curation