Extract Table of Contents from a PDF File

Daniel Weibel

Created 29 Jun 2016

Variant 1: With PDFMiner

This Python-based variant extracts the table of contents in a (pseudo) XML format.

Requires Python >= 2.6, but < 3.0.

Download the source code from https://pypi.python.org/pypi/pdfminer/
- The project is also on GitHub: https://github.com/euske/pdfminer/

Compile and install:

  tar xzf pdfminer-20140328.tar.gz && cd pdfminer-20140328
  python setup.py install

The executables /usr/local/bin/pdf2txt.py and /usr/local/bin/dumppdf.py should now be installed.

Extract the table of contents:

  dumppdf.py -T myfile.pdf
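
The (pseudo) XML output can be turned into a plain indented listing with a few lines of Python. The following is only a minimal sketch: it assumes that each entry is written as an element of the form <outline level="N" title="..."> that may contain a <pageno> element, which should be checked against the actual output of the installed PDFMiner version. The helper name parse_outlines and the sample text are hypothetical.

```python
import re
from xml.sax.saxutils import unescape

# Assumed shape of one entry in the dumppdf.py -T output (an assumption,
# not verified against every PDFMiner release):
#   <outline level="1" title="Introduction">
#   <pageno>1</pageno>
#   </outline>
OUTLINE_RE = re.compile(
    r'<outline level="(\d+)" title="(.*?)">(.*?)</outline>', re.DOTALL)
PAGENO_RE = re.compile(r'<pageno>(\d+)</pageno>')


def parse_outlines(text):
    """Return a list of (level, title, pageno) tuples; pageno may be None."""
    entries = []
    for level, title, body in OUTLINE_RE.findall(text):
        match = PAGENO_RE.search(body)
        pageno = int(match.group(1)) if match else None
        # Undo XML escaping in the title (&amp;, &lt;, &gt;, &quot;).
        entries.append((int(level), unescape(title, {"&quot;": '"'}), pageno))
    return entries


# Hypothetical sample in the assumed format.
sample = """<outlines>
<outline level="1" title="Introduction">
<pageno>1</pageno>
</outline>
<outline level="2" title="Scope &amp; Goals">
</outline>
</outlines>
"""

for level, title, pageno in parse_outlines(sample):
    # Indent each title by two spaces per nesting level.
    print("  " * (level - 1) + title, pageno)
```

To process a real file, the output of dumppdf.py -T myfile.pdf could be redirected to a text file and its contents passed to parse_outlines in place of the sample.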

Variant 2: With MuPDF

This variant extracts the table of contents in plain text format.

Once MuPDF is installed, there should be the executables /usr/local/bin/mupdf-x11 (PDF viewer) and /usr/local/bin/mutool.

Extract the table of contents:

  mutool show myfile.pdf outline

Note: subsections are indented by tabs, and page numbers are also separated from the titles by tabs.
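
Because the outline is tab-structured, it is easy to post-process. The sketch below assumes that each line consists of one leading tab per nesting level, then the title, a tab, and the page number. The exact layout may differ between MuPDF versions, so treat this as an assumption; parse_outline is a hypothetical helper name and the sample text is made up.

```python
def parse_outline(text):
    """Return a list of (depth, title, page) tuples from tab-structured text.

    Assumed line layout (may vary between MuPDF versions):
        <one tab per nesting level><title><tab><page number>
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        stripped = line.lstrip("\t")
        depth = len(line) - len(stripped)  # number of leading tabs
        # Split on the last tab so that tabs inside a title are preserved.
        title, _, page = stripped.rpartition("\t")
        if not title:
            # No tab found: treat the whole line as a title without a page.
            title, page = stripped, ""
        entries.append((depth, title, page))
    return entries


# Hypothetical sample in the assumed format.
sample = "Introduction\t1\n\tBackground\t2\n\tGoals\t4\nMethods\t7\n"

for depth, title, page in parse_outline(sample):
    # Re-indent with two spaces per level for display.
    print("  " * depth + title + " ... " + page)
```

To use it on a real document, the output of mutool show myfile.pdf outline could be redirected to a text file and its contents passed to parse_outline in place of the sample.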