Allow parsing off-spec PDF files with prefixes before the header #362

Merged
merged 1 commit into from
Dec 13, 2024

Conversation

gmalette
Contributor

Whether this should be supported is a philosophical question, but the fact is that some PDF files have prefixes before the header.

As I understand it, this is contrary to the spec, but dates to the origins of Acrobat, which supported up to 1024 bytes of garbage before the actual header.

Whether implementations now rely on that leniency or produce the prefix by accident, such files are still being generated.

I think lopdf should support this bug, and allow reading more PDFs even if they're not to-spec.

@gmalette gmalette force-pushed the gm/parse-invalid-header branch from 8e9741e to dfb170a Compare December 12, 2024 02:58
@gmalette
Contributor Author

Eh, unfortunately this doesn't work: the offsets need to be adjusted, as they're relative to the header :/

@Heinenen
Collaborator

I'm not totally sure if I understand the spec correctly, but I would say the spec even allows this, cf. section 7.5.2

The PDF file begins with the 5 characters “%PDF-” and byte offsets shall be calculated from the PERCENT SIGN (25h).
NOTE 1 This provision allows for arbitrary bytes preceding the %PDF- without impacting the viability of the PDF file and its byte offsets.

The newer versions of the spec do not seem to mention a 1024-byte limit; I only found it in the Acrobat (not ISO) PDF 1.7 spec (and in specs that date further back).

Appendix H
13. Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.

Now the question is whether we want to change the arbitrary limit of 1024 bytes to something else.
That can easily be changed in the future, though, so it shouldn't be a blocker for this PR.
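
For illustration, the header search discussed above could be sketched roughly as follows. This is not lopdf's actual code: `find_header_offset` and its `limit` parameter are hypothetical names, and the sketch assumes the whole `%PDF-` marker must lie within the first `limit` bytes, which is one possible reading of the Appendix H wording.

```rust
/// Hypothetical helper (not part of lopdf's API): returns the position of the
/// `%` that starts the `%PDF-` marker, searching only the first `limit` bytes.
fn find_header_offset(data: &[u8], limit: usize) -> Option<usize> {
    const MARKER: &[u8] = b"%PDF-";
    // Restrict the search window; files shorter than `limit` are searched whole.
    let window = &data[..data.len().min(limit)];
    // `windows` yields nothing if the window is shorter than the marker,
    // so very short inputs simply return `None`.
    window.windows(MARKER.len()).position(|w| w == MARKER)
}
```

Keeping the limit as a parameter means the 1024-byte figure can be changed later without touching the search itself.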

@J-F-Liu J-F-Liu merged commit d0874c3 into J-F-Liu:main Dec 13, 2024
8 checks passed
@gmalette
Contributor Author

@J-F-Liu Unfortunately, this patch will not work. The offsets need to be relative to the %PDF-{version} header, but this produces them relative to the start of the document, leading to an "invalid footer" error. If this behaviour is desirable, I'll rework this PR so that it works.

@J-F-Liu
Owner

J-F-Liu commented Dec 15, 2024

Yes, the offsets should be relative to the %PDF-{version} header
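
A minimal sketch of the adjustment being described, assuming the position of the `%PDF-` marker in the buffer is already known. `absolute_offset` is a hypothetical name, not a function from lopdf: it translates an offset stored in the file (for example a `startxref` value or an xref entry) into a position in the full buffer.

```rust
/// Hypothetical helper (not part of lopdf's API): converts an offset that is
/// relative to the `%` of `%PDF-` into an absolute position in a buffer of
/// `len` bytes, where the header starts at `header_offset`.
fn absolute_offset(header_offset: usize, relative: usize, len: usize) -> Option<usize> {
    // Guard against overflow from a corrupt or hostile offset value.
    let abs = header_offset.checked_add(relative)?;
    // Reject positions that fall outside the buffer.
    (abs < len).then_some(abs)
}
```

An alternative with the same effect is to parse the sub-slice that starts at the header, so that every stored offset can be used unchanged.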

@gmalette gmalette deleted the gm/parse-invalid-header branch December 19, 2024 00:44