markdown to docx unusually slow #2356

Closed
garrettgman opened this issue Aug 10, 2015 · 8 comments

Comments

@garrettgman

This file prints the numbers from 1 to 10000. It takes only seconds to render to HTML or PDF:

pandoc foo.md -f markdown -t latex -o testdoc.pdf
pandoc foo.md -f markdown -t html -o testdoc.html

But it takes eight minutes to render to Word:

pandoc foo.md -f markdown -t docx -o testdoc.docx

I notice this difference often when I use pandoc through R Markdown to report on data. If I try to do something more modest (like print the numbers from 1 to 1000) I do not notice much of a difference.

@jgm
Owner

jgm commented Aug 10, 2015

Very strange! It's just a giant code block.

Looking at the code for Text.Pandoc.Writers.Docx, I can't see any obvious reason why there'd be a performance problem here, so this is a puzzle that needs looking into.

@jgm
Owner

jgm commented Aug 10, 2015

Some experiments: I changed the file from a fenced code block to an indented one, to allow testing arbitrary numbers of lines:

| Lines | Seconds |
|------:|--------:|
|    10 |    0.05 |
|    20 |    0.09 |
|    40 |    0.25 |
|    80 |    0.99 |
|   160 |    9.22 |
|   320 |   76.94 |
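
To put a number on how fast the time grows, the timings above can be turned into an empirical scaling exponent: if the time grows like n^k, each doubling of the line count multiplies the time by 2^k. The following is a rough sketch in Python, not part of pandoc; the function name `doubling_exponents` is made up for illustration.

```python
import math

# Timings reported in this comment: (lines in the code block, seconds for docx output).
timings = [(10, 0.05), (20, 0.09), (40, 0.25), (80, 0.99), (160, 9.22), (320, 76.94)]


def doubling_exponents(samples):
    """For each consecutive pair of samples, estimate k in time ~ n**k."""
    result = []
    for (n0, t0), (n1, t1) in zip(samples, samples[1:]):
        k = math.log(t1 / t0) / math.log(n1 / n0)
        result.append((n0, n1, round(k, 2)))
    return result


for n0, n1, k in doubling_exponents(timings):
    print(f"{n0:>3} -> {n1:>3} lines: exponent ~ {k}")
```

On these figures the estimated exponent rises from below 1 for the smallest inputs to roughly 3 for the largest, so the slowdown is far worse than linear in the size of the code block.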

I also tried a version where the code block has just one enormously long line (converting newlines into spaces), and that also takes forever.

@garrettgman
Author

Thanks for looking into this, John. I should've mentioned that it began as an issue over at the rstudio/rmarkdown repository: rstudio/rmarkdown#490.

@jgm
Owner

jgm commented Aug 10, 2015

Further experiment, breaking it down to its core (now using code spans and just a string of xs):

% python -c 'print ("`" + 60000 * "x" + "`")' | pandoc -o 2356.docx

@jgm
Owner

jgm commented Aug 10, 2015

Also, --no-highlight has no real effect. This suggests that the problem is not specific to code spans. For a single line of unhighlighted text, not much more is going on than a single application of formattedString to the code. (And this is a simple function that just puts the code in some tags.)

Confirmation:

$ python -c 'print (60000 * "x")' | pandoc -o 2356.docx --no-highlight

also takes forever. This should just be a single long paragraph with regular text.
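
For anyone who wants to repeat these two experiments at several sizes, here is a minimal sketch of a timing harness. It assumes `pandoc` is on the `PATH` and reads markdown from standard input, as in the one-liners above; the helper names `make_inputs` and `time_pandoc` are hypothetical, and the sizes in the loop are arbitrary.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def make_inputs(n):
    """Build the two inputs from the thread: a code span of n x's and n x's of plain text."""
    return {"code_span": "`" + n * "x" + "`\n", "plain": n * "x" + "\n"}


def time_pandoc(markdown, extra_args=()):
    """Time one markdown -> docx run; return None if pandoc is not installed."""
    if shutil.which("pandoc") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.docx"
        start = time.perf_counter()
        subprocess.run(
            ["pandoc", "-f", "markdown", "-o", str(out), *extra_args],
            input=markdown.encode("utf-8"),
            check=True,
        )
        return time.perf_counter() - start


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        for name, text in make_inputs(n).items():
            print(n, name, time_pandoc(text, ["--no-highlight"]))
```

If both inputs slow down at the same rate, that points away from highlighting and code spans and toward something applied to all text.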

@jgm
Owner

jgm commented Aug 10, 2015

I found the cause: commit f3aa03e, which strips out invalid characters. I think this can easily be fixed by doing the stripping in the XML file rather than in the Pandoc structure (bottomUp from Text.Pandoc.Generic is inefficient).

@mpickering

@jgm jgm closed this as completed in 0ad576e Aug 10, 2015
@jgm
Owner

jgm commented Aug 10, 2015

@mpickering, I solved this by doing the stripping in formattedString, avoiding the use of bottomUp.
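
The general idea of stripping at the point where a string is emitted, instead of in a separate pass over the whole document tree, can be sketched as follows. This is a Python illustration only, not pandoc's Haskell code: `text_run` is a hypothetical stand-in for a string formatter, and the character ranges are the `Char` production from the XML 1.0 specification.

```python
from xml.sax.saxutils import escape


def is_valid_xml_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_xml_chars(text):
    """Drop characters that may not appear in an XML 1.0 document (one linear pass)."""
    return "".join(ch for ch in text if is_valid_xml_char(ch))


def text_run(text):
    """Hypothetical formatter: strip invalid characters, escape, and wrap in a text tag."""
    cleaned = escape(strip_invalid_xml_chars(text))
    return '<w:t xml:space="preserve">' + cleaned + "</w:t>"


print(text_run("x < y\x00"))
```

Each string is visited once, when it is written out, so the cost of stripping stays proportional to the length of the text.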

@mpickering
Collaborator

Sorry! Didn't realise the files got so large.
