ValueError: invalid literal for int() with base 16 #385

Closed
rcarraretto opened this issue Mar 10, 2020 · 4 comments · Fixed by #389
Labels
component:characters (Anything with encodings, character mappings or CJK languages), type:anomaly (Errors caused by deviations from the PDF Reference)

Comments

@rcarraretto

First of all, thanks for the tool! My team has been successfully using this.

Describe the bug

With Python 3.7.6 and pdfminer 20200124.

Given this file, when running

python pdf2txt.py -V --output_type xml file.pdf

I get incomplete XML output followed by a stack trace:

<?xml version="1.0" encoding="utf-8" ?>
<pages>
Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 188, in <module>
    sys.exit(main())
  File "/usr/local/bin/pdf2txt.py", line 182, in main
    outfp = extract_text(**vars(A))
  File "/usr/local/bin/pdf2txt.py", line 56, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/usr/local/lib/python3.7/site-packages/pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 906, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 354, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdffont.py", line 620, in __init__
    PDFSimpleFont.__init__(self, descriptor, widths, spec)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdffont.py", line 580, in __init__
    self.cid2unicode = EncodingDB.get_encoding(name, diff)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/encodingdb.py", line 108, in get_encoding
    cid2unicode[cid] = name2unicode(x.name)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/encodingdb.py", line 54, in name2unicode
    unicode_digit = int(name_without_u, base=16)
ValueError: invalid literal for int() with base 16: 'bunt'

The file is a PDF with non-embedded fonts.

To help isolate the issue, I was able to reproduce it with this docker setup.

Could this be a bug?

Thanks!

@pietermarsman added the component:characters and type: bug labels on Mar 10, 2020
@pietermarsman
Member

Hi @rcarraretto, thanks for raising this issue!

I can replicate this error. I'll look into this now.

@pietermarsman
Member

pietermarsman commented Mar 14, 2020

As far as I can see, there is an error in your PDF file. More specifically, the problem is in the /Fo12S5 font, which is defined in object 369 (after cleaning up with mutools):

369 0 obj
<<
  /Type /Font
  /Subtype /TrueType
  /Name /Fo12S5
  /BaseFont /Ubuntu-Medium
  /FirstChar 0
  /LastChar 125
  /Widths 366 0 R
  /Encoding 367 0 R
  /ToUnicode 368 0 R
  /FontDescriptor 345 0 R
>>
endobj

The encoding of this font has a list of differences with respect to the Ubuntu-Medium font.

367 0 obj
<<
  /Type /Encoding
  /Differences [ 32 /uni1F9F /uni1FA0 /uni1FA1 /uni1FA2 /uni1FA3
      /uni1FA4 /uni1FA5 /uni1FA6 /uni1FA7 /uni1FA8 /uni1FA9 /uni1FAA
      /uni1FAB /uni1FAC /uni1FAD /uni1FAE /uni1FAF /uni1FB0 /uni1FB1
      /uni1FB2 /uni1FB3 /uni1FB4 /uni1FB6 /uni1FB7 /uni1FB8 /uni1FB9
      /uni1FBA /uni1FBB /uni1FBC /uni1FBD /uni1FBE /uni1FBF /uni1FC0
      /uni1FC1 /uni1FC2 /uni1FC3 /uni1FC4 /uni1FC6 /uni1FC7 /uni1FC8
      /uni1FC9 /uni1FCA /uni1FCB /uni1FCC /uni1FCD /uni1FCE /uni1FCF
      /uni1FD0 /uni1FD1 /uni1FD2 /uni1FD3 /uni1FD6 /uni1FD7 /uni1FD8
      /uni1FD9 /uni1FDA /uni1FDB /uni1FDD /uni1FDE /uni1FDF /uni1FE0
      /uni1FE1 /uni1FE2 /uni1FE3 /uni1FE4 /uni1FE5 /uni1FE6 /uni1FE7
      /uni1FE8 /uni1FE9 /uni1FEA /uni1FEB /uni1FEC /uni1FED /uni1FEE
      /uni1FEF /uni1FF2 /uni1FF3 /uni1FF4 /uni1FF6 /uni1FF7 /uni1FF8
      /uni1FF9 /uni1FFA /uni1FFB /uni1FFC /uni1FFD /uni1FFE /uni20B9
      /uniE0FF /uniEFFD /ubuntu /uniF0FF /uniF000 ]
>>
endobj
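
For readers unfamiliar with /Differences arrays: per the PDF Reference, an integer sets the next character code, and each name that follows is assigned to consecutive codes from there. A minimal sketch of that rule is below. The function name `expand_differences` is made up for illustration, and glyph names are written as plain strings rather than PDF name objects.

```python
def expand_differences(differences):
    """Map character codes to glyph names from a /Differences-style list."""
    code_to_name = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer resets the current character code.
            code = item
        elif code is not None:
            # A name is assigned to the current code, which then advances.
            code_to_name[code] = item
            code += 1
    return code_to_name


# First few entries of the array above, written as plain Python values.
sample = [32, "uni1F9F", "uni1FA0", "uni1FA1"]
mapping = expand_differences(sample)
print(mapping)
```

Applied to the full array, the same rule assigns one code to every name in the list, including the problematic one.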

The third-to-last element is /ubuntu, which is not a valid character specification. Because it starts with /u, pdfminer.six tries to interpret it as a Unicode code point, which fails because the rest of the name is not a hexadecimal number.
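
The 'bunt' in the traceback is consistent with both the leading and the trailing u being stripped from ubuntu before the hex conversion. The snippet below reproduces that with plain Python. It assumes the name is stripped with `str.strip("u")`, which is a guess based on the error message, not a quote of the pdfminer.six source.

```python
glyph_name = "ubuntu"

# Assumption: the "u" prefix is removed with str.strip, which also removes
# the trailing "u" and leaves "bunt".
stripped = glyph_name.strip("u")
print(stripped)

try:
    int(stripped, base=16)
    error_message = None
except ValueError as exc:
    # "n" and "t" are not hexadecimal digits, so the conversion fails.
    error_message = str(exc)
print(error_message)
```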

I'm not sure how the /ubuntu value should be interpreted. After replacing it with /.notdef, the PDF file is parsed without errors.

@pietermarsman added the type:anomaly label and removed the type: bug label on Mar 14, 2020
@pietermarsman
Member

A quick fix is to skip characters whose names raise an exception during conversion. We already do this for KeyError, and I will create a PR that adds ValueError to the list.
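
A rough sketch of what such a fix could look like, not the actual patch: `toy_name2unicode` and `build_cid2unicode` are hypothetical stand-ins written for this example, not pdfminer.six functions, and the character codes are arbitrary.

```python
def toy_name2unicode(name):
    """Hypothetical stand-in for a glyph-name lookup; not pdfminer.six code."""
    if name.startswith("uni"):
        return chr(int(name[3:], base=16))
    if name.startswith("u"):
        # Raises ValueError for names such as "ubuntu".
        return chr(int(name[1:], base=16))
    # Unknown names such as ".notdef" are reported as KeyError.
    raise KeyError(name)


def build_cid2unicode(code_to_name, convert=toy_name2unicode):
    """Convert each glyph name, skipping names that cannot be converted."""
    cid2unicode = {}
    for cid, name in code_to_name.items():
        try:
            cid2unicode[cid] = convert(name)
        except (KeyError, ValueError):
            # Skip this character instead of aborting the whole page.
            continue
    return cid2unicode


result = build_cid2unicode({1: "uni20B9", 2: "ubuntu", 3: ".notdef"})
print(result)
```

With this approach the malformed name simply has no Unicode mapping, so text using that code may come out missing or unmapped, but the rest of the document is still extracted.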

@rcarraretto
Author

Cool, thanks for the fix @pietermarsman!
