Write filter to support right-to-left direction in Persian text. #2191
Comments
One option would be to check each block element for Persian
where b is the original block. This would give you output in HTML like
I don't know if this would work in browsers. If not, you
+++ Milad Khajavi [May 29 15 01:27 ]:
|
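The per-block check suggested above can be sketched roughly as follows. This is not a pandoc filter and not the code from the original comment; it is a minimal Python illustration of the detection step only, assuming that "Persian" can be approximated by "the first strongly directional character is right-to-left". The names `first_strong_direction` and `wrap_block` are hypothetical.

```python
import unicodedata
from html import escape

# Unicode bidi classes for strong characters: "R" and "AL" are
# right-to-left (Hebrew-type and Arabic-type), "L" is left-to-right.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def first_strong_direction(text, default="ltr"):
    """Return 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi in LTR_CLASSES:
            return "ltr"
    # Only digits, punctuation or whitespace were found: fall back.
    return default


def wrap_block(text):
    """Wrap a block's plain text in a div carrying an explicit dir attribute."""
    direction = first_strong_direction(text)
    return '<div dir="{}">{}</div>'.format(direction, escape(text))
```

A real filter would apply the same test to each block of the pandoc AST and wrap or annotate the block accordingly; the heuristic here looks only at the first strong character, so a Persian paragraph that opens with a Latin word would be classified as left-to-right.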
If you don't want to write a filter as jgm recommended, you can always mark it up manually:
You might also be interested in the RTL discussion on talk.commonmark.org. |
I think dealing with languages and directionality should become a functionality of pandoc itself rather than being delegated to filters. My suggestion would be to primarily rely on language tags in pandoc markdown:
Most of this is already available in pandoc:
generates
… which doesn’t look too bad as is, except for the facts that Unless declared explicitly, pandoc could then infer directionality from these language tags, and write, e.g.,
If For latex output, pandoc would just have to map
and |
@nickbart1980, wasn’t As far as I can understand, language direction may be specified in CSS:
:lang(fa-IR) {
direction: rtl;
} |
Yes, for LaTeX a comma-separated list in the metadata variable However, As to CSS, I’m not quite sure. Adding your snippet to my HTML document above looks ok in a browser (again, with the exception of the full stops). On the other hand, https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/dir recommends “As the directionality of the text is semantically related to its content and not to its presentation, it is recommended that web developers use this attribute [ |
There is an issue (#1614) exactly on this topic. It may make sense to add comments there (so developers see the real demand for this fix).
Where is the fix needed? I must confess that I still don’t get it (we have already discussed this at #2174). How about using
I wonder whether this would work also with full stops:
:lang(fa-IR) {
direction: rtl;
unicode-bidi: bidi-override;
}
The reasoning behind this recommendation would lead one to avoid as many CSS properties as possible: “[t]hat way, the text will display correctly even on a browser that doesn't support CSS or has the CSS deactivated”. I don’t see the reason why the direction should also be included in HTML (besides the language markup), if a given language can only have one direction. |
That’s not so great since you would always have to tweak the source file depending on the target format. Parsing
Unfortunately, no. |
John, On Fri, May 29, 2015 at 10:35 PM, John MacFarlane [email protected]
Milād Khājavi |
Writing On Mon, Jun 1, 2015 at 6:21 PM, nickbart1980 [email protected]
Milād Khājavi |
@nickbart1980, I don’t think so. Let’s consider the following sample:
If you have to tweak the source depending on your target, this isn’t due to the language information in the metadata. It has to do with the lack of translation among different language identification values (#1614), the non-existent special syntax for language attributes (#895), and the missing syntax for raw division and raw inline elements (#168).
To the best of my knowledge, pandoc has four variables (metadata fields) to include language information in the metadata:
Applying Occam’s razor to these variables, I think it would read: “do not create any language variable unless strictly required”. I agree that But I think that adapting My final question is: what is wrong (or what needs to be fixed) in using |
@khajavi, I don’t think I understand your proposal. But first of all, why do you need language markup? If you only require it for text direction, I wonder whether this could be achieved without language or direction tagging. It is only a guess, but isn’t the Unicode bidirectional algorithm supposed to deal with this? If you need markup for hyphenation or other language-dependent features, then you need to mark up languages. |
I need language markup for text direction in outputs like HTML and LaTeX. My proposal is that pandoc be able to detect the language of the text, here On Mon, Jun 1, 2015 at 11:00 PM, Pablo Rodríguez [email protected]
Milād Khājavi |
+++ Pablo Rodríguez [Jun 01 15 09:12 ]:
I think it's a good idea.
This makes sense to me. |
What it boils down to is, do we want
where
Both will work nicely with all formats (as soon as the latex writer maps |
@nickbart1980, many thanks for your reply. I’m afraid that the first proposal doesn’t behave as you expect in pandoc-1.14.0.1.
This gives the following HTML element:
<html xmlns="http://www.w3.org/1999/xhtml"
lang="grc, it, fr, en, de, es"
xml:lang="grc, it, fr, en, de, es">
In XML Of all formats that support language markup, only LaTeX needs the list of languages used in the document. This shouldn’t be the default in the way pandoc metadata deals with languages. This is the reason the And this is the reason why there is nothing to fix here. BTW, the proposal doesn’t work even with LaTeX (the final comma after the last language is wrong):
\documentclass[grc, it, fr, en, de, es,]{article}
If the LaTeX writer needs to be adapted to the way pandoc works, this should be done. But it is crazy to adapt pandoc to the way LaTeX works. (At least, adapting one writer is easier than adapting many.) |
Note that language and directionality are two independent properties and shouldn't be conflated:
The pandoc document metadata should have |
@ousia btw, |
As I said over at commonmark discuss, I think we should be fine with supporting In ConTeXt, we can use When using the So what about pdfLaTeX and LuaLaTeX? I guess we can forget about the former, but it would be good if we could output the same commands for both Lua- and XeLaTeX. Maybe we can redefine it somehow in our LaTeX template, that is, if there is a general-purpose rtl/bidi package for LuaLaTeX (not only Arabic or only Farsi). Is there one? Otherwise, we'll just have to tell people to use either XeLaTeX or ConTeXt. Maybe @khaledhosny can shed some light on these questions, please? :) |
@mb21, as commented in #1614, do you really think that If each language has one and only one direction (and the number of languages is finite), I guess pandoc should assign direction to the language internally. Consider a dissertation in Arabic literature written in English (or any Western language). It may easily contain over a thousand passages in Arabic. What do you think is easier to type: With ConTeXt, I typeset a book in Spanish that had about a thousand passages in ancient Greek. And I was really relieved that I didn’t have to tag any of these texts. (Just in case you wonder, |
As I wrote above, language and scripts are two independent properties and shouldn't be conflated, e.g. Azerbaijani can be written using both right-to-left (Arabic) and left-to-right (Latin or Cyrillic) scripts. But I think it's a good idea to introduce |
@mb21, I think there are different issues involved here:
There is a question about languages that may use different scripts that I don’t understand. Language markup is needed to apply resources, such as hyphenation dictionaries, to the tagged text. How would you apply the right hyphenation dictionary for a language that may use more than one script if the language markup itself doesn’t say which one it is? Directionality doesn’t help much here. This is why I think that
I know they are different issues, but they are also related. I wanted to discuss the issue of a simplified or special language attribute, so that it could be implemented at the same time as this issue (the original issue has been open for almost 26 months). |
True, but I think the (X)HTML folks have put a lot of thought into their docs, and HTML remains one of the primary output targets of pandoc. Compared to LaTeX and ConTeXt, their approach is much less of a mess and is based on ISO standards. That's why I propose to model pandoc's approach on the HTML one. But yeah, I guess pandoc could extract a script tag from the BCP 47 string, yet this would require us to come up with (and maintain) a long list of language-to-script and script-to-direction mappings. I'm sure it's doable, and if @jgm is in favour and someone gets around to implementing it, why not? Meanwhile, mirroring the HTML model provides a working solution relatively simply. |
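To make the language-to-script and script-to-direction idea concrete, here is a rough Python sketch. It is not pandoc's implementation: the function names and the two lookup tables are hypothetical and deliberately tiny, and the tag handling covers only the language, extended-language and script subtags rather than the whole BCP 47 grammar.

```python
# Illustrative tables only; a real implementation would need far more entries.
DEFAULT_SCRIPT = {  # assumed default script per language
    "fa": "Arab",
    "ar": "Arab",
    "he": "Hebr",
    "az": "Latn",
    "en": "Latn",
}
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa"}


def script_of(tag):
    """Return the explicit script subtag of a tag, else an assumed default."""
    subtags = tag.replace("_", "-").split("-")
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            return sub.title()  # normalise "ARAB" or "arab" to "Arab"
        if not (len(sub) == 3 and sub.isalpha()):
            break  # past the extended-language subtags: no script given
    return DEFAULT_SCRIPT.get(subtags[0].lower())


def direction_of(tag):
    """Return 'rtl', 'ltr', or None when the script cannot be determined."""
    script = script_of(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

With these tables, `fa-IR` resolves to right-to-left through its assumed default script, while `az-Arab` and `az-Latn-AZ` resolve differently because the script is stated explicitly. That is the Azerbaijani case raised earlier in the thread, and it shows why the mapping has to go through the script rather than from the language alone.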
To clarify, now you can write:
As soon as native syntax for |
I need to convert the Persian text like this:
# عنوان اول

این متن فارسی باید راست به چپ نشان داده شود.

This is the English paragraph, so it's direction in html should be left-to-right.
To HTML like this:
Could anyone help me write a proper pandoc filter in Haskell to solve this problem?