Allow parsing off-spec PDF files with prefixes before the header #362

gmalette · 2024-12-12T02:52:57Z

Whether this should be supported is a philosophical question, but the fact is that some PDF files have prefixes before the header.

As I understand it, this is contrary to the spec, but dates to the origins of Acrobat, which supported up to 1024 bytes of garbage before the actual header.

Some implementations rely on this leniency and others produce such prefixes accidentally; either way, such files are still being generated.

I think lopdf should support this bug, and allow reading more PDFs even if they're not to-spec.

gmalette · 2024-12-12T03:05:01Z

Eh, unfortunately this doesn't work: the offsets need to be adjusted, as they're relative to the header :/

Heinenen · 2024-12-12T12:19:18Z

I'm not totally sure I understand the spec correctly, but I would say the spec even allows this, cf. section 7.5.2:

The PDF file begins with the 5 characters “%PDF-” and byte offsets shall be calculated from the PERCENT SIGN (25h).
NOTE 1 This provision allows for arbitrary bytes preceding the %PDF- without impacting the viability of the PDF file and its byte offsets.

The newer versions of the spec do not seem to mention the 1024-byte limit at all. I only found it in the Acrobat (not ISO) PDF 1.7 spec, and in specs that date further back.

Appendix H
13. Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.

Now the question is whether we want to change the arbitrary limit of 1024 to something else.
But that can easily be changed in the future, so this question shouldn't be a blocker for this PR.
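
To make the 1024-byte rule concrete, here is a minimal sketch of how a reader could locate the header behind a prefix. It is not lopdf's actual implementation: `find_header_offset`, `HEADER_MARKER`, and the `search_limit` parameter are hypothetical names, and the sketch assumes that the whole `%PDF-` marker must lie within the searched range.

```rust
/// The five bytes that open a PDF header (hypothetical constant name).
const HEADER_MARKER: &[u8] = b"%PDF-";

/// Hypothetical helper: returns the position of the `%` that starts the
/// first `%PDF-` marker found within the first `search_limit` bytes,
/// or `None` if no marker is found there.
fn find_header_offset(data: &[u8], search_limit: usize) -> Option<usize> {
    // Never look past the end of the buffer or past the configured limit.
    let end = data.len().min(search_limit);
    data[..end]
        // Slide a 5-byte window over the candidate range...
        .windows(HEADER_MARKER.len())
        // ...and report the first position where it equals the marker.
        .position(|window| window == HEADER_MARKER)
}
```

Calling `find_header_offset(bytes, 1024)` would mirror the Acrobat note quoted above; passing a different limit is the knob the question about the "arbitrary" 1024 refers to.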

gmalette · 2024-12-14T19:51:16Z

@J-F-Liu unfortunately this patch will not work. The offsets need to be relative to the %PDF-{version} header, but this patch produces them relative to the start of the document, leading to an "invalid footer" error. If this behaviour is desirable, I'll rework this PR so that it works.

J-F-Liu · 2024-12-15T12:56:01Z

Yes, the offsets should be relative to the %PDF-{version} header
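
As a rough illustration of what "relative to the header" means for a reader, the sketch below turns an offset stored in the file (for example from the cross-reference table) into a position in the raw buffer by adding the header's position. The function name and parameters are hypothetical, and this is not the code from this PR or its follow-up.

```rust
/// Hypothetical helper: `header_offset` is the position of the `%` in
/// `%PDF-` within `data`, and `relative` is a byte offset as stored in
/// the file, counted from that `%`. Returns the bytes starting at the
/// referenced position, or `None` if it falls outside the buffer.
fn slice_at_relative_offset(
    data: &[u8],
    header_offset: usize,
    relative: usize,
) -> Option<&[u8]> {
    // Stored offsets ignore the prefix, so shift them by its length.
    // `checked_add` guards against overflow from a corrupt offset.
    let absolute = header_offset.checked_add(relative)?;
    // `get` returns `None` instead of panicking when out of range.
    data.get(absolute..)
}
```

With a five-byte prefix such as `junk\n` in front of `%PDF-1.7\n`, a stored offset of 9 resolves to absolute position 14, which is where the first object would start. Using the stored offset directly would land inside the header instead, which matches the kind of failure described above.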

Allow parsing off-spec PDF files with prefixes before the header

dfb170a

gmalette force-pushed the gm/parse-invalid-header branch from 8e9741e to dfb170a Compare December 12, 2024 02:58

J-F-Liu merged commit d0874c3 into J-F-Liu:main Dec 13, 2024
8 checks passed

gmalette deleted the gm/parse-invalid-header branch December 19, 2024 00:44

gmalette mentioned this pull request Dec 19, 2024

Properly support document prefixes #365

Merged
