markdown to docx unusually slow #2356

garrettgman · 2015-08-10T14:33:45Z

This file prints the numbers from 1 to 10000. It takes seconds to render to HTML or PDF:

pandoc foo.md -f markdown -t latex -o testdoc.pdf
pandoc foo.md -f markdown -t html -o testdoc.html

But it takes eight minutes to render to Word:

pandoc foo.md -f markdown -t docx -o testdoc.docx

I notice this difference often when I use pandoc through R Markdown to report on data. If I try to do something more modest (like print the numbers from 1 to 1000) I do not notice much of a difference.
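For anyone without the attached file, a comparable input can be generated with a few lines of Python. This is only a sketch: the exact layout of the original foo.md is not shown in this thread, so the one-number-per-line fenced code block and the helper names `make_numbers_markdown` and `write_numbers_markdown` are assumptions.

```python
FENCE = "`" * 3  # Markdown code fence, built here to avoid a literal fence in the source


def make_numbers_markdown(n: int = 10000) -> str:
    """Return Markdown holding one fenced code block with the numbers 1..n.

    Assumption: one number per line; the original foo.md may be laid out differently.
    """
    lines = [FENCE]
    lines.extend(str(i) for i in range(1, n + 1))
    lines.append(FENCE)
    return "\n".join(lines) + "\n"


def write_numbers_markdown(path: str = "foo.md", n: int = 10000) -> None:
    """Write the generated Markdown to `path` so the pandoc commands above can read it."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(make_numbers_markdown(n))
```

Calling `write_numbers_markdown("foo.md", 1000)` gives the smaller case mentioned above, and the default arguments give the 10000-number case.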


jgm · 2015-08-10T16:30:16Z

Very strange! It's just a giant code block.

Looking at the code for Text.Pandoc.Writers.Docx, I can't see any obvious reason why there'd be a performance problem here, so this is a puzzle that needs looking into.

jgm · 2015-08-10T16:49:40Z

Some experiments: I changed the file from a fenced code block to an indented one, to allow testing arbitrary numbers of lines:

Lines	Seconds
10	0.05
20	0.09
40	0.25
80	0.99
160	9.22
320	76.94

I also tried a version where the code block has just one enormously long line (converting newlines into spaces), and that also takes forever.
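A quick way to read the table is to estimate, for each doubling of the input, the exponent k in time ~ n^k. The sketch below does that with the six measurements copied from the table. The helper name `local_exponents` is made up for illustration, and the smallest timings are probably dominated by startup cost, so the first estimates should be taken loosely.

```python
import math

# Timings copied from the table above: (lines in the code block, seconds).
timings = [(10, 0.05), (20, 0.09), (40, 0.25), (80, 0.99), (160, 9.22), (320, 76.94)]


def local_exponents(data):
    """Estimate k in time ~ n**k between each pair of consecutive measurements."""
    exponents = []
    for (n0, t0), (n1, t1) in zip(data, data[1:]):
        exponents.append(math.log(t1 / t0) / math.log(n1 / n0))
    return exponents


for (n, _), k in zip(timings[1:], local_exponents(timings)):
    print(f"up to {n:>3} lines: exponent ~ {k:.2f}")
```

On these data points the estimate climbs from below 1 for the first doubling to roughly 3 for the last two, so the running time grows far faster than linearly in the size of the code block.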

garrettgman · 2015-08-10T16:51:12Z

Thanks for looking into this, John. I should've mentioned that it began as an issue over at the rstudio/rmarkdown repository: rstudio/rmarkdown#490.

jgm · 2015-08-10T17:06:37Z

Further experiment, breaking it down to its core (now using code spans and just a string of xs):

% python -c 'print ("`" + 60000 * "x" + "`")' | pandoc -o 2356.docx

jgm · 2015-08-10T17:12:28Z

Also, --no-highlight has no real effect. This suggests that the problem is not specific to code spans. For a single line of unhighlighted text, not much more is going on than a single application of formattedString to the code. (And this is a simple function that just puts the code in some tags.)

Confirmation:

$ python -c 'print (60000 * "x")' | pandoc -o 2356.docx --no-highlight

also takes forever. This should just be a single long paragraph with regular text.
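The two one-liners can be wrapped in a small timing harness to compare input sizes. This is a sketch that assumes `pandoc` is on the PATH; the function names are invented here, and the only pandoc options used are the ones already shown in this thread (`-o` and `--no-highlight`).

```python
import os
import subprocess
import tempfile
import time


def plain_text_input(n: int) -> str:
    """One paragraph of n 'x' characters, as in the --no-highlight test."""
    return "x" * n + "\n"


def code_span_input(n: int) -> str:
    """A single inline code span containing n 'x' characters."""
    return "`" + "x" * n + "`\n"


def time_pandoc_docx(markdown: str, extra_args=()) -> float:
    """Return wall-clock seconds for pandoc to convert `markdown` (via stdin) to docx.

    Assumes a `pandoc` executable is installed and on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "out.docx")
        start = time.perf_counter()
        subprocess.run(
            ["pandoc", "-o", out, *extra_args],
            input=markdown,
            text=True,
            check=True,
        )
        return time.perf_counter() - start
```

For example, `time_pandoc_docx(plain_text_input(60000), ["--no-highlight"])` corresponds to the confirmation command, and calling it with smaller values of n shows how the time scales.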

jgm · 2015-08-10T17:38:29Z

I found the cause: commit f3aa03e, which strips out invalid characters. I think this can easily be fixed by doing the stripping in the XML file rather than the Pandoc structure (bottomUp from Text.Pandoc.Generic is inefficient).

@mpickering

jgm · 2015-08-10T17:52:16Z

@mpickering, I solved this by doing the stripping in formattedString, avoiding the use of bottomUp.
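To illustrate the general idea of stripping at the point where a string is written out, rather than in a separate pass over the whole document tree, here is a sketch in Python. It is not pandoc's Haskell code: `strip_invalid_xml_chars` and `wrap_text_run` are hypothetical names, and the character set is taken from the XML 1.0 `Char` production rather than from the commit.

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 Char production is removed:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document (one pass)."""
    return _INVALID_XML_CHARS.sub("", text)


def wrap_text_run(text: str) -> str:
    """Hypothetical leaf-level writer: sanitize, escape, then wrap in a text run.

    The stripping happens exactly once per string, at the moment it is emitted.
    """
    clean = escape(strip_invalid_xml_chars(text))
    return f'<w:r><w:t xml:space="preserve">{clean}</w:t></w:r>'
```

With this shape, the cost of sanitizing is proportional to the length of each string that is written, whatever the size of the surrounding document.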

mpickering · 2015-08-10T19:41:39Z

Sorry! Didn't realise the files got so large.

garrettgman mentioned this issue Aug 10, 2015

Very slow (impossible?) to render word_document with large data display. rstudio/rmarkdown#490

Closed

jgm closed this as completed in 0ad576e Aug 10, 2015
