Pipeline

audia uses a LangGraph linear state machine to convert a PDF into audio. The graph runs in four steps; STT + LLM query distillation are optional pre-steps that run before the graph when using voice or text queries.

Entry points

Entry point	Command
Voice input	`audia listen`
Text query	`audia research "<query>"`
Local PDF	`audia convert paper.pdf`

 [voice input]           [text query]
      │                       │
      ▼                       │
  faster-whisper STT          │
      │                       │
      ▼                       │
  LLM query distillation      │   ← extracts concise ArXiv search terms
      │                       │      from natural speech
      ▼                       │
  Confirm / re-record?        │
      │  yes                  │
      ▼                       ▼
Step 0 ─ ArXiv search    (or local PDF)
 │        fetch metadata, download PDF
 │
 ▼
Step 1 ─ PDF extraction
 │        PyMuPDF: text + metadata per page
 │
 ▼
Step 2 ─ Heuristic pre-pass
 │        Regex: strip citations, LaTeX artefacts, figure captions
 │
 ▼
Step 3 ─ LLM curation
 │        Chunked LLM pass: math → English, tables → sentences,
 │        smooth spoken transitions via tail-context stitching
 ▼
Step 4 ─ TTS synthesis
          edge-tts / kokoro / OpenAI: split into ~3800-char chunks,
          synthesise in parallel, concatenate → .mp3

Step 1 — PDF extraction (`pdf_processor.py`)

Uses PyMuPDF to extract text and metadata from every page. The result is a PDFResult dataclass with:

text — full extracted text
title — document title (from metadata or first heading)
num_pages — page count

Step 2 — Heuristic pre-pass (`text_cleaner.py`)

A fast regex-based step that strips common noise before the LLM sees the text:

Inline citations like [1], [Smith et al. 2024]
LaTeX math commands (\alpha, \frac{...}{...}, display equations)
Figure and table captions
Running headers / footers
Excess whitespace

This step reduces LLM token cost and prevents the model from being distracted by artefacts.

Step 3 — LLM curation (`text_cleaner.py — llm_curate`)

The core of audia. Long texts are split at paragraph boundaries into chunks of AUDIA_LLM_MAX_CHUNK_CHARS characters. Each chunk is sent to the configured LLM with:

A system prompt instructing the model to rewrite for spoken audio
For chunks after the first: the tail of the previous curated chunk as transition context

This tail-context stitching produces smooth spoken transitions across chunk boundaries.

The system prompt instructs the LLM to:

Rewrite mathematical notation into plain English
Convert tables into narrative sentences
Condense acknowledgements to one sentence
Remove citation artefacts entirely
Maintain a natural, engaging speaking tone

Step 4 — TTS synthesis (`tts.py`)

The curated text is split into AUDIA_TTS_CHUNK_CHARS-character chunks and sent to the selected TTS backend. For edge-tts, synthesis is async with a 90-second per-chunk timeout. All chunk audio files are concatenated via soundfile into one final .mp3.

State (`state.py`)

The LangGraph pipeline uses a typed TypedDict state:

Key	Type	Description
`pdf_path`	`str`	Input PDF path
`raw_text`	`str`	PyMuPDF output
`preprocessed_text`	`str`	After heuristic pass
`cleaned_text`	`str`	After LLM curation
`audio_path`	`str`	Final .mp3 path
`title`	`str`	Extracted document title
`num_pages`	`int`	Page count
`error`	`str \| None`	Error message if a node fails

Debug output

Every run saves text snapshots for inspection:

~/.audia/debug/<run_id>/
  1_raw.txt            ← PyMuPDF output
  2_preprocessed.txt   ← after heuristic pass
  3_curated.txt        ← after LLM curation

Pipeline

Entry points

Step 1 — PDF extraction (pdf_processor.py)

Step 2 — Heuristic pre-pass (text_cleaner.py)

Step 3 — LLM curation (text_cleaner.py — llm_curate)

Step 4 — TTS synthesis (tts.py)

State (state.py)