pdf2textbox’s doc

A PDF-to-text converter based on pdfminer2 (which is based on pdfminer.six, which in turn is based on pdfminer). It converts PDF files with up to three columns and an optional header to text and avoids many of the pitfalls that multi-column PDF files hold for PDF conversion.

Accepts the command-line parameter -s (--slice) to indicate that only part of the PDF document is of interest. The start and end pages are then either retrieved from the document’s name, using ‘_’ or ‘|’ as delimiters, or, if they cannot be found there, requested from the user.
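As an illustration of the file-name lookup, here is a minimal sketch of how a start and end page might be read from a document’s name. The function name pages_from_name and the rule of taking the last two numeric tokens are assumptions made for this example; they are not taken from pdf2textbox itself.

```python
import re
from pathlib import Path


def pages_from_name(filename):
    """Return (start, end) parsed from the file name, or None if not found.

    Hypothetical helper: assumes the last two purely numeric tokens,
    separated by '_' or '|', are the start and end page.
    """
    stem = Path(filename).stem                # drop the ".pdf" extension
    tokens = re.split(r"[_|]", stem)          # split on either delimiter
    numbers = [int(t) for t in tokens if t.isdecimal()]
    if len(numbers) < 2:
        return None                           # caller may ask the user instead
    start, end = numbers[-2], numbers[-1]
    if start > end:
        return None                           # not a plausible page range
    return start, end
```

With this sketch, a name such as report_12_18.pdf would yield pages 12 to 18, while a None result signals that the pages would have to be requested from the user.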

What’s the difference?

pdfminer2 works well for plain text, but problems arise when the text is set in two or more columns or when pages contain indented quotes, recurring headers, page numbers, dates, and the like. The existing solutions (e.g. pyPDF, pdfminer2, pdfx, …):

Will include headers in the text flow
Will return small or very small (DetailledAggregator) text units
Will return text from left to right without acknowledging the columns
Will mix text from different columns into one text flow
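To make the column problem concrete, the following sketch orders text boxes column by column instead of strictly left to right. It is an illustration only and not pdf2textbox’s actual algorithm: the Box type, the order_boxes function and the equal-width column assumption are all hypothetical. Coordinates follow the PDF convention of an origin at the bottom left, so a larger y1 means higher on the page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical text box: bounding box plus its text."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def order_boxes(boxes, page_width, n_columns=2):
    """Sort boxes column by column, top to bottom within each column.

    Assumes equally wide columns; header detection is left out.
    """
    column_width = page_width / n_columns

    def key(box):
        # Column index from the left edge, clamped to the valid range.
        column = int(box.x0 // column_width)
        column = max(0, min(column, n_columns - 1))
        # Negate y1 so that boxes higher on the page come first.
        return (column, -box.y1)

    return sorted(boxes, key=key)
```

A naive top-to-bottom sort would interleave the left and right columns; keying on the column first keeps each column’s text together.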

Features

Converts PDF to text in the original reading order. This works well for PDF files without tables, graphs, and other non-text elements.

After the text has been extracted in boxes, another function still has to be run to strip the text of special characters, zeroes and the like.
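A minimal sketch of what such a clean-up step might look like is shown here. It assumes that the unwanted characters are non-printable ones (NUL bytes, form feeds, zero-width characters and similar); the function name clean_text and this interpretation are assumptions, not the package’s own implementation.

```python
import re
import unicodedata


def clean_text(text):
    """Remove control/format characters and collapse runs of spaces.

    Hypothetical helper: keeps newlines and tabs, drops every other
    character whose Unicode category starts with "C" (e.g. NUL, form feed).
    """
    kept = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces/tabs left behind by the removal.
    kept = re.sub(r"[ \t]+", " ", kept)
    return kept.strip()
```

Which characters count as unwanted depends on the documents at hand, so the filter would need adjusting for real input.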

Note

Often the text boxes will NOT match the paragraphs of the PDF file. A text box may end mid-sentence and be continued in the next box. This is due to the PDF file’s graphics-oriented organization of content. However, the order of the text will be correct.
