ValueError: invalid literal for int() with base 16 #385

rcarraretto · 2020-03-10T15:45:46Z

First of all, thanks for the tool! My team has been successfully using this.

Describe the bug

With Python 3.7.6 and pdfminer 20200124.

Given this file, when running

python pdf2txt.py -V --output_type xml file.pdf

I get incomplete XML output followed by a stack trace:

<?xml version="1.0" encoding="utf-8" ?>
<pages>
Traceback (most recent call last):
  File "/usr/local/bin/pdf2txt.py", line 188, in <module>
    sys.exit(main())
  File "/usr/local/bin/pdf2txt.py", line 182, in main
    outfp = extract_text(**vars(A))
  File "/usr/local/bin/pdf2txt.py", line 56, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/usr/local/lib/python3.7/site-packages/pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 906, in render_contents
    self.init_resources(resources)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 354, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdffont.py", line 620, in __init__
    PDFSimpleFont.__init__(self, descriptor, widths, spec)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/pdffont.py", line 580, in __init__
    self.cid2unicode = EncodingDB.get_encoding(name, diff)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/encodingdb.py", line 108, in get_encoding
    cid2unicode[cid] = name2unicode(x.name)
  File "/usr/local/lib/python3.7/site-packages/pdfminer/encodingdb.py", line 54, in name2unicode
    unicode_digit = int(name_without_u, base=16)
ValueError: invalid literal for int() with base 16: 'bunt'

The file is a pdf with non-embedded fonts.

To help isolate the issue, I was able to reproduce it with this docker setup.

Could this be a bug?

Thanks!

pietermarsman · 2020-03-14T08:40:27Z

Hi @rcarraretto, thanks for raising this issue!

I can replicate this error. I'll look into this now.

pietermarsman · 2020-03-14T09:05:00Z

As far as I can see, there is an error in your PDF file. More specifically, the problem is in the /Fo12S5 font, which is defined in object 369 (after cleaning up with mutool):

369 0 obj
<<
  /Type /Font
  /Subtype /TrueType
  /Name /Fo12S5
  /BaseFont /Ubuntu-Medium
  /FirstChar 0
  /LastChar 125
  /Widths 366 0 R
  /Encoding 367 0 R
  /ToUnicode 368 0 R
  /FontDescriptor 345 0 R
>>
endobj

The encoding of this font defines a list of differences with respect to the Ubuntu-Medium base font.

367 0 obj
<<
  /Type /Encoding
  /Differences [ 32 /uni1F9F /uni1FA0 /uni1FA1 /uni1FA2 /uni1FA3
      /uni1FA4 /uni1FA5 /uni1FA6 /uni1FA7 /uni1FA8 /uni1FA9 /uni1FAA
      /uni1FAB /uni1FAC /uni1FAD /uni1FAE /uni1FAF /uni1FB0 /uni1FB1
      /uni1FB2 /uni1FB3 /uni1FB4 /uni1FB6 /uni1FB7 /uni1FB8 /uni1FB9
      /uni1FBA /uni1FBB /uni1FBC /uni1FBD /uni1FBE /uni1FBF /uni1FC0
      /uni1FC1 /uni1FC2 /uni1FC3 /uni1FC4 /uni1FC6 /uni1FC7 /uni1FC8
      /uni1FC9 /uni1FCA /uni1FCB /uni1FCC /uni1FCD /uni1FCE /uni1FCF
      /uni1FD0 /uni1FD1 /uni1FD2 /uni1FD3 /uni1FD6 /uni1FD7 /uni1FD8
      /uni1FD9 /uni1FDA /uni1FDB /uni1FDD /uni1FDE /uni1FDF /uni1FE0
      /uni1FE1 /uni1FE2 /uni1FE3 /uni1FE4 /uni1FE5 /uni1FE6 /uni1FE7
      /uni1FE8 /uni1FE9 /uni1FEA /uni1FEB /uni1FEC /uni1FED /uni1FEE
      /uni1FEF /uni1FF2 /uni1FF3 /uni1FF4 /uni1FF6 /uni1FF7 /uni1FF8
      /uni1FF9 /uni1FFA /uni1FFB /uni1FFC /uni1FFD /uni1FFE /uni20B9
      /uniE0FF /uniEFFD /ubuntu /uniF0FF /uniF000 ]
>>
endobj

The third-to-last element is /ubuntu, which is not a valid character specification. Because it starts with /u, pdfminer.six tries to interpret it as a hexadecimal Unicode code point, which fails.
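To illustrate where the 'bunt' in the traceback may come from, here is a minimal sketch. The helper `parse_u_glyph_name` is hypothetical and is not pdfminer.six code; it assumes the leading `u` is removed with `str.strip`, which also removes the trailing `u` of `ubuntu`.

```python
def parse_u_glyph_name(name: str) -> str:
    """Hypothetical helper: decode a 'uXXXX'-style glyph name to a character."""
    if not name.startswith("u"):
        raise KeyError(name)
    # Assumption: the prefix is removed with str.strip, which strips 'u'
    # from BOTH ends, so 'ubuntu' becomes 'bunt'.
    hex_part = name.strip("u")
    return chr(int(hex_part, base=16))


try:
    parse_u_glyph_name("ubuntu")
except ValueError as exc:
    message = str(exc)
    print(message)
```

Under that assumption the message matches the one reported above, while a well-formed name such as `u1F600` decodes normally.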

I'm not sure how the /ubuntu value should be interpreted. After replacing it with /.notdef, the PDF file is parsed without errors.

pietermarsman · 2020-03-14T09:13:44Z

A quick fix is to skip characters whose names raise an exception during conversion. We already do this for KeyError, and I will create a PR that adds ValueError to the list.
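A rough sketch of what such a fix could look like, assuming a loop over a /Differences array in which integers set the current character code and names map to consecutive codes. Both `apply_differences` and `toy_name_to_char` are hypothetical names used for illustration; the actual change is in PR #389.

```python
from typing import Callable, Dict, List, Union


def toy_name_to_char(name: str) -> str:
    """Toy decoder (hypothetical): handles only 'uniXXXX' and 'uXXXX' names."""
    if name.startswith("uni"):
        return chr(int(name[3:], 16))
    if name.startswith("u"):
        return chr(int(name[1:], 16))  # raises ValueError for 'ubuntu'
    raise KeyError(name)


def apply_differences(
    differences: List[Union[int, str]],
    name_to_char: Callable[[str], str],
) -> Dict[int, str]:
    """Build a code-to-character map, skipping names that cannot be decoded."""
    cid2unicode: Dict[int, str] = {}
    cid = 0
    for item in differences:
        if isinstance(item, int):
            cid = item  # an integer resets the current character code
            continue
        try:
            cid2unicode[cid] = name_to_char(item)
        except (KeyError, ValueError):
            pass  # unknown or malformed glyph name: leave this code unmapped
        cid += 1
    return cid2unicode


mapping = apply_differences([32, "uniE0FF", "ubuntu", "uniF0FF"], toy_name_to_char)
print(mapping)
```

With this approach the malformed name only leaves its own code unmapped; the names after it keep their positions.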

rcarraretto · 2020-03-16T10:36:02Z

Cool, thanks for the fix @pietermarsman!

pietermarsman added the component:characters and type: bug labels · Mar 10, 2020

pietermarsman added the type:anomaly label and removed the type: bug label · Mar 14, 2020

pietermarsman mentioned this issue Mar 14, 2020

Catch ValueError when converting font encoding differences to characters #389

Merged

pietermarsman closed this as completed in #389 Mar 16, 2020
