javascript - Convert between Markdown elements

What are the options to parse Markdown document and process its elements to output an another Markdown document?

Let's say it

```
# unaffected #
```

# H1 #

H1
==

## H2 ##

H2
--

### H3 ###

should be converted to

```
# unaffected #
```

## H1 ##

H1
--

### H2 ###

### H2 ###

#### H3 ####

in Node environment. Target element may vary (e.g. #### may be converted to **).

The document may contain other markup elements that should remain unaffected.

How it can be obtained? Obviously, not with regexps (using regexp instead of full-blown lexer will affect # unaffected #). I was hoped to use marked but it seems that it is capable only of HTML output, not Markdown.

What are the options to parse Markdown document and process its elements to output an another Markdown document?

Let's say it

```
# unaffected #
```

# H1 #

H1
==

## H2 ##

H2
--

### H3 ###

should be converted to

```
# unaffected #
```

## H1 ##

H1
--

### H2 ###

### H2 ###

#### H3 ####

in Node environment. Target element may vary (e.g. #### may be converted to **).

The document may contain other markup elements that should remain unaffected.

Share Improve this question edited Feb 13, 2016 at 10:38 asked Feb 7, 2016 at 19:54 Estus Flask 223k78 gold badges470 silver badges607 bronze badges

1 You could use a markdown parser as pandoc and a custom filter like this one : gist.github./scoavoux/27e4ca56b4101a88313a (although it turns == into ## rather than --). There seems to be pandoc wrappers for node.js like this but I don't know much about node.js and how convenient that solution would be – scoa Commented Feb 10, 2016 at 17:39
@scoa Thanks for pointing at Pandoc. I'm not too fond of using external dependencies in Node environment, it plicates the setup. But this is the solution if there are no ready to use alternatives in JS domain (still not sure about this). I will take a look at this option later, feel free to post this as an answer. – Estus Flask Commented Feb 11, 2016 at 10:39

Add a ment |

5 Answers 5

Sorted by: Reset to default 5 +300

Here is a solution with an external markdown parser, pandoc. It allows for custom filters in haskell or python to modify the input (there also is a node.js port). Here is a python filter that increases every header one level. Let's save that as header_increase.py.

from pandocfilters import toJSONFilter, Header

def header_increase(key, value, format, meta):
    if key == 'Header' and value[0] < 7:
        value[0] = value[0] + 1
        return Header(value[0], value[1], value[2])

if __name__ == "__main__":
    toJSONFilter(header_increase)

It will not affect the code block. However, it might transform setex-style headers for h1 and h2 elements (using === or ---) into atx-style headers (using #), and vice-versa.

To use the script, one could call pandoc from the mand line:

pandoc input.md --filter header_increase.py -o output.md -t markdown

With node.js, you could use pdc to call pandoc.

var pdc = require('pdc');
pdc(input_md, 'markdown', 'markdown', [ '--filter', './header_increase.py' ], function(err, result) {
  if (err)
    throw err;

  console.log(result);
});

Have you considered using HTML as an intermediate format? Once in HTML, the differences between the header types will be indistinguishable, so the Markdown -> HTML conversion will effectively normalize them for you. There are markdown -> HTML converters aplenty, and also a number of HTML -> markdown.

I put together an example using these two packages:

https://www.npmjs./package/markdown-it for Markdown -> HTML
https://www.npmjs./package/h2m for HTML -> Markdown

I don't know if you have any performance requirements here (read: this is slow...) but this is a very low investment solution. Take a look:

var md = require('markdown-it')(),
    h2m = require('h2m');

var mdContent = `
\`\`\`
# unaffected #
\`\`\`

# H1 #

H1
==

## H2 ##

H2
--

### H3 ###
`;

var htmlContent = md.render(mdContent);
var newMdContent = h2m(htmlContent, {converter: 'MarkdownExtra'});
console.log(newMdContent);

You may have to play with a mix of ponents to get the correct dialect support and whatnot. I tried a bunch and couldn't quite match your output. I think perhaps the -- is being interpreted differently? Here's the output, I'll let you decide if it is good enough:

```
# unaffected #

```

# H1 #

# H1 #

## H2 ##

## H2 ##

### H3 ###

Despite its apparent simplicity, Markdown is actually somewhat plicated to parse. Each part builds upon the next, such that to cover all edge cases you need a plete parser even if you only want to process a portion of a document.

For example, various types of block level elements can be nested inside other block level elements (lists, blockquotes, etc). Most implementations rely on a vary specific order of events within the parser to ensure that the entire document is parsed correctly. If you remove one of the earlier pieces, many of the later pieces will break. For example, Markdown markup inside code blocks is not parsed as Markdown because one of the first steps is to find and identify the code blocks so that later steps in the parsing never see the code blocks.

Therefore, to acplish your goal and cover all possible edge cases, you need a plete Markdown parser. However, as you do not want to output HTML, your options are somewhat limited and you will need to do some work to get a working solution.

There are basically three styles of Markdown parsers (I'm generalizing here):

Use regex string substitution to swap out the Markdown markup for HTML Markup within the source document.
Use a render which gets called by the parser (in each step) as it parses the document outputting a new document.
Generate a tree object or list of tokens (specifics vary by implementation) which is rendered (converted to a string) to a new document in a later step.

The original reference implementation (markdown.pl) is of the first type and probably useless to you. I simply mention it for pleteness.

Marked is of the second variety and while it could be used, you would need to write your own renderer and have the renderer modify the document at the same time as you render it. While generally a performat solution, it is not always the best method when you need to modify the document, especially if you need context from elsewhere within the document. However, you should be able to make it work.

For example, to adapt an example in the docs, you might do something like this (multiplyString borrowed from here):

function multiplyString (str, num) {
    return num ? Array(num + 1).join(str) : "";
}

renderer.heading = function (text, level) {
    return multiplyString("#", level+1) + " " + text;
}

Of course, you will also need to create renderers for all of the other block level renderer methods and inline level renderer methods which output Markdown syntax. See my ments below regarding renderers in general.

Markdown-JS is of the third variety (as it turns out Marked also provides a lower level API with access to the tokens so it could be used this way as well). As stated in its README:

Intermediate Representation

Internally the process to convert a chunk of Markdown into a chunk of HTML has three steps:

Parse the Markdown into a JsonML tree. Any references found in the parsing are stored in the attribute hash of the root node under the key references.

Convert the Markdown tree into an HTML tree. Rename any nodes that need it (bulletlist to ul for example) and lookup any references used by links or images. Remove the references attribute once done.

Stringify the HTML tree being careful not to wreck whitespace where whitespace is important (surrounding inline elements for example).

Each step of this process can be called individually if you need to do some processing or modification of the data at an intermediate stage.

You could take the tree object in either step 1 or step 2 and make your modifications. However, I would remend step 1 as the JsonML tree will more closely match the actual Markdown document as the HTML Tree in step 2 is a representation of the HTML to be output. Note that the HTML will loose some information regarding the original Markdown in any implementation. For example, were asterisks or underscores used for emphasis (*foo* vs. _foo_), or was a asterisk, dash (hyphen) or plus sign used as a list bullet? I'm not sure how much detail the JsonML tree holds (haven't used it personally), but it should certainly be more than the HTML tree in step 2.

Once you have made your modifications to the JsonML tree (perhpas using one of the tools listed here, then you will probably want to skip step 2 and implement your own step 3 which renders (stringifies) the JsonML tree back to a Markdown document.

And therein lies the hard part. It is very rare for Markdown parsers to output Markdown. In fact it is very rare for Markdown parsers to output anything except HTML. The most popular exception being Pandoc, which is a document converter for many formats of input and output. But, desiring to stay with a JavaScript solution, any library you chose will require you to write your own renderer which will output Markdown (unless a search turns up a renderer built by some other third party). Of course, once you do, if you make it available, others could benefit from it in the future. Unfortunately, building a Markdown renderer is beyond the scope of this answer.

One possible shortcut when building a renderer is that if the Markdown lib you use happens to store the position information in its list of tokens (or in some other way gives you access to the original raw Markdown on a per element basis), you could use that info in the renderer to simply copy and output the original Markdown text, except when you need to alter it. For example, the markdown-it lib offers that data on the Token.map and/or Token.markup properties. You still need to create your own renderer, but it should be easier to get the Markdown to look more like the original.

Finally, I have not personally used, nor am I remending any of the specific Markdown parsers mentioned above. They are simply popular examples of the various types of parsers to demonstrate how you could create a solution. You may find a different implementation which fits your needs better. A lengthy, albeit inplete, list is here.

You must use regexps. marked itself use Regexp for parsing the document. Why don't you?

This is some of the regexp you need, from marked.js source code on github:

var block = {
  newline: /^\n+/,
  code: /^( {4}[^\n]+\n*)+/,
  fences: noop,
  hr: /^( *[-*_]){3,} *(?:\n+|$)/,
  heading: /^ *(#{1,6}) *([^\n]+?) *#* *(?:\n+|$)/,
  nptable: noop,
  lheading: /^([^\n]+)\n *(=|-){2,} *(?:\n+|$)/,
  blockquote: /^( *>[^\n]+(\n(?!def)[^\n]+)*\n*)+/,
  list: /^( *)(bull) [\s\S]+?(?:hr|def|\n{2,}(?! )(?!\1bull )\n*|\s*$)/,
  html: /^ *(?:ment *(?:\n|\s*$)|closed *(?:\n{2,}|\s*$)|closing *(?:\n{2,}|\s*$))/,
  def: /^ *\[([^\]]+)\]: *<?([^\s>]+)>?(?: +["(]([^\n]+)[")])? *(?:\n+|$)/,
  table: noop,
  paragraph: /^((?:[^\n]+\n?(?!hr|heading|lheading|blockquote|tag|def))+)\n*/,
  text: /^[^\n]+/
};

If you really really don't want to use regexp, you can fork the marked object. and overide the Renderer object.

Marked on github is splited to two ponents. One for parsing and one for render. You can eaisly change the render to your own render. (piler)

Example of one function in Render.js:

Renderer.prototype.blockquote = function(quote) {
  return '<blockquote>\n' + quote + '</blockquote>\n';
};)

Maybe it's inplete answer. Copy unaffected into other file.

Then replace all

#space with ##space
space# with space##

科技改变生活-雨落星辰 - 所有的伟大,都源于一个勇敢的开始

javascript - Convert between Markdown elements - Stack Overflow

5 Answers 5

Intermediate Representation

与本文相关的文章

评论列表(0)