You're feeding documents into a publishing pipeline, an archival system, or a custom processing script that wants structured XML. PDFs aren't a great input format for any of those. Convert first.
Where XML still matters
Publishing workflows: academic journals, book production, legal documents — many of these use XML (JATS, DocBook, custom schemas) as the canonical format. PDFs are the rendered output, not the source.
Archives: government and library systems often standardise on XML for long-term storage.
Data extraction pipelines: custom scripts parsing XML are usually easier to write than scripts parsing PDFs directly.
Conversion to XML
Flint's convert hub outputs XML. The default schema is a generic document XML with sections, paragraphs, lists and tables tagged. For specific schemas (JATS, DocBook), use the generic XML as input to a transformation script — much easier than parsing the PDF directly.
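As a rough illustration of that transformation step, here is a minimal Python sketch that maps a generic document XML onto DocBook-style element names. The generic tag names (document, section, heading, paragraph, list, item) are assumptions made for the example, not the documented output schema, so check them against a real export first. The sketch also skips tables and the DocBook namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical generic tag -> DocBook tag. The left-hand names are assumptions
# about the generic schema; check them against real converter output.
TAG_MAP = {
    "document": "article",
    "section": "section",
    "heading": "title",
    "paragraph": "para",
    "list": "itemizedlist",
}


def to_docbook(node: ET.Element) -> ET.Element:
    """Recursively copy a generic element into its DocBook-style counterpart."""
    out = ET.Element(TAG_MAP.get(node.tag, node.tag))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        if child.tag == "item":
            # DocBook list items wrap their content in a <para>.
            item = ET.SubElement(out, "listitem")
            item.tail = child.tail
            para = ET.SubElement(item, "para")
            para.text = child.text
            for grandchild in child:
                para.append(to_docbook(grandchild))
        else:
            out.append(to_docbook(child))
    return out


# Stand-in for converter output (assumed shape, for illustration only).
GENERIC = (
    "<document><section><heading>Intro</heading>"
    "<paragraph>Hello.</paragraph>"
    "<list><item>One</item><item>Two</item></list>"
    "</section></document>"
)

docbook = to_docbook(ET.fromstring(GENERIC))
result = ET.tostring(docbook, encoding="unicode")
print(result)
```

A lookup table keeps the mapping easy to adjust once you know the real element names. For larger schemas such as JATS, an XSLT stylesheet may be a better fit than hand-written recursion.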
Structure vs presentation
XML output preserves logical structure but discards visual presentation. Headings, paragraphs and lists carry across; specific font sizes, spacing and colours don't. This is usually a feature, not a bug — XML consumers care about content, not styling.
Validating the output
Always validate generated XML against your target schema before piping it into production. Even clean conversions can have edge cases: odd characters, empty elements, ordering issues. A quick validator pass catches most of these before they break anything downstream.
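Before full schema validation, a lightweight pre-check can flag the most common edge cases. The sketch below uses only the Python standard library, and the function name precheck is made up for this example. It checks well-formedness, control characters that XML 1.0 forbids, and empty elements. It is not a substitute for validating against an XSD or RELAX NG schema, which needs a schema-aware tool and isn't shown here.

```python
import re
import xml.etree.ElementTree as ET

# Control characters XML 1.0 forbids (tab, newline and carriage return are allowed).
ILLEGAL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def precheck(xml_text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if ILLEGAL_CHARS.search(xml_text):
        problems.append("illegal control character")
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        problems.append(f"not well-formed: {exc}")
        return problems
    for elem in root.iter():
        has_text = bool((elem.text or "").strip())
        # Flag elements with no text, no children and no attributes.
        if not has_text and len(elem) == 0 and not elem.attrib:
            problems.append(f"empty element: <{elem.tag}>")
    return problems


print(precheck("<doc><p>Fine</p><p/></doc>"))
```

Returning a list rather than raising on the first problem lets a pipeline log every issue in one pass and decide for itself which ones are fatal.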
FAQ
What schema does the output use?
A generic document XML by default. Transform to your specific schema (JATS, DocBook, custom) afterwards.
Will metadata carry across?
PDF metadata (title, author, date) carries into XML attributes when present.
Can I get the XML packaged in a ZIP with images?
Yes — images export as separate files referenced from the XML, packaged in a ZIP for distribution.
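If a script consumes that ZIP, it is worth confirming that every image the XML references is actually in the archive. Below is a minimal sketch, assuming the XML file is named document.xml and images are referenced through an image element with a src attribute. Both names are hypothetical, so adjust them to match a real export.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def missing_images(archive: zipfile.ZipFile, xml_name: str = "document.xml") -> list[str]:
    """Return image paths referenced in the XML that are absent from the ZIP."""
    root = ET.fromstring(archive.read(xml_name))
    names = set(archive.namelist())
    # "image" and "src" are assumed names, not a documented schema.
    refs = [img.get("src") for img in root.iter("image")]
    return [src for src in refs if src and src not in names]


# Build a small in-memory archive to stand in for a real export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "document.xml",
        '<document><image src="images/fig1.png"/><image src="images/fig2.png"/></document>',
    )
    zf.writestr("images/fig1.png", b"placeholder bytes")

with zipfile.ZipFile(buffer) as zf:
    missing = missing_images(zf)
print(missing)
```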
Is this the same as PDF/A?
No — PDF/A is an archival flavour of PDF itself. XML is a separate text-based format.
Structured output for serious workflows. Convert your PDF to XML and pipe onward.