# html-to-markdown

[Go Report Card](https://goreportcard.com/report/github.com/JohannesKaufmann/html-to-markdown)
[Codecov](https://codecov.io/gh/JohannesKaufmann/html-to-markdown)
[GoDoc](http://godoc.org/github.com/JohannesKaufmann/html-to-markdown)

Convert HTML into Markdown with Go. It uses an [HTML parser](https://github.com/PuerkitoBio/goquery) to avoid the use of `regexp` as much as possible. That should prevent some [weird cases](https://stackoverflow.com/a/1732454) and allows it to be used where the input is totally unknown.

## Installation

```
go get github.com/JohannesKaufmann/html-to-markdown
```

## Usage

```go
package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	// Empty domain, CommonMark rules enabled, default options.
	converter := md.NewConverter("", true, nil)

	html := `<strong>Important</strong>`

	markdown, err := converter.ConvertString(html)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("md ->", markdown)
}
```

If you are already using [goquery](https://github.com/PuerkitoBio/goquery), you can pass a selection to `Convert`. Unlike `ConvertString`, it returns only the Markdown string.

```go
markdown := converter.Convert(selec)
```

### Using it on the command line

If you want to use `html-to-markdown` on the command line without writing any Go code, check out [`html2md`](https://github.com/suntong/html2md#usage), a CLI wrapper for `html-to-markdown` that has all of the following options and plugins built in.

## Options

The third parameter to `md.NewConverter` is `*md.Options`.

For example, you can change the delimiter around bold text from the default "`**`" to a different one (for example "`__`") by setting `StrongDelimiter`.

```go
opt := &md.Options{
	StrongDelimiter: "__", // default: **
	// ...
}
converter := md.NewConverter("", true, opt)
```

For all available options, look at the [godocs](https://godoc.org/github.com/JohannesKaufmann/html-to-markdown/#Options); for a complete example, look at the [options example](/examples/options/main.go).

## Adding Rules

```go
converter.AddRules(
	md.Rule{
		Filter: []string{"del", "s", "strike"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			// You need to return a pointer to a string (md.String is just a helper function).
			// If you return nil, the next function for that HTML element
			// will be picked. For example, you could convert an element only
			// if it has a certain class name and fall back otherwise.
			content = strings.TrimSpace(content)
			return md.String("~" + content + "~")
		},
	},
	// more rules
)
```

For more information, have a look at the [add_rules](/examples/add_rules/main.go) example.

## Using Plugins

If you want plugins (GitHub Flavored Markdown features such as strikethrough, tables, ...), you can pass them to `Use`.

```go
import "github.com/JohannesKaufmann/html-to-markdown/plugin"

// Use the `GitHubFlavored` plugin from the `plugin` package.
converter.Use(plugin.GitHubFlavored())
```

If you only want the `Strikethrough` plugin, pass it on its own. The first argument sets the character that marks the crossed-out text (for example "~~" instead of "~").

```go
converter.Use(plugin.Strikethrough(""))
```

For more information, have a look at the [github_flavored](/examples/github_flavored/main.go) example.

---

These are the plugins located in the [plugin folder](/plugin), which you can use by importing "github.com/JohannesKaufmann/html-to-markdown/plugin".

| Name                  | Description                                                                                 |
| --------------------- | ------------------------------------------------------------------------------------------- |
| GitHubFlavored        | GitHub Flavored Markdown contains `TaskListItems`, `Strikethrough` and `Table`.             |
| TaskListItems         | (Included in `GitHubFlavored`). Converts `<input>` checkboxes into `- [x] Task`.            |
| Strikethrough         | (Included in `GitHubFlavored`). Converts `<strike>`, `<s>`, and `<del>` to the `~~` syntax. |
| Table                 | (Included in `GitHubFlavored`). Convert a `<table>` into something like this...             |
| TableCompat           |                                                                                             |
|                       |                                                                                             |
| VimeoEmbed            |                                                                                             |
| YoutubeEmbed          |                                                                                             |
|                       |                                                                                             |
| ConfluenceCodeBlock   | Converts `<ac:structured-macro>` elements that are used in Atlassian's Wiki "Confluence".   |
| ConfluenceAttachments | Converts `<ri:attachment ri:filename=""/>` elements.                                        |

These are the plugins in other repositories:

| Name                         | Description         |
| ---------------------------- | ------------------- |
| \[Plugin Name\]\(Your Link\) | A short description |

If you write a plugin, feel free to open a PR that adds your plugin to this list.

## Writing Plugins

Have a look at the [plugin folder](/plugin) for a reference implementation. The most basic one is [Strikethrough](/plugin/strikethrough.go).

## Security

This library produces markdown that is readable and can be changed by humans.

Once you convert this markdown back to HTML (e.g. using [goldmark](https://github.com/yuin/goldmark) or [blackfriday](https://github.com/russross/blackfriday)) you need to be careful of malicious content.

This library does NOT sanitize untrusted content. Use an HTML sanitizer such as [bluemonday](https://github.com/microcosm-cc/bluemonday) before displaying the HTML in the browser.

## Other Methods

[Godoc](https://godoc.org/github.com/JohannesKaufmann/html-to-markdown)

### `func (c *Converter) Keep(tags ...string) *Converter`

Determines which elements are to be kept and rendered as HTML.

### `func (c *Converter) Remove(tags ...string) *Converter`

Determines which elements are to be removed altogether i.e. converted to an empty string.

## Escaping

Some characters have a special meaning in markdown. For example, the character "\*" can be used for lists, emphasis and dividers. By placing a backslash before that character (e.g. "\\\*") you can "escape" it. Then the character will render as a raw "\*" without the _"markdown meaning"_ applied.

But why is "escaping" even necessary?

<!-- prettier-ignore -->
```md
Paragraph 1
-
Paragraph 2
```

The markdown above doesn't seem that problematic. But "Paragraph 1" (with only one hyphen below) will be recognized as a _setext heading_.

```html
<h2>Paragraph 1</h2>
<p>Paragraph 2</p>
```

A well-placed backslash character would prevent that...

<!-- prettier-ignore -->
```md
Paragraph 1
\-
Paragraph 2
```

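To make the idea concrete, here is a deliberately simplified sketch of line-based escaping that uses only the standard library. It is not the library's implementation: `escapeLine` and `inlineEscaper` are hypothetical names, and only a handful of characters are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// inlineEscaper puts a backslash in front of a few characters that
// markdown would otherwise interpret as markup.
var inlineEscaper = strings.NewReplacer(
	`\`, `\\`,
	`*`, `\*`,
	`_`, `\_`,
)

// escapeLine is a hypothetical helper, not the library's implementation.
func escapeLine(line string) string {
	// A line made up only of hyphens would turn the paragraph above it
	// into a setext heading, so the first hyphen gets a backslash.
	if line != "" && strings.Trim(line, "-") == "" {
		return `\` + line
	}
	return inlineEscaper.Replace(line)
}

func main() {
	for _, line := range []string{"Paragraph 1", "-", "2 * 3 = 6"} {
		fmt.Println(escapeLine(line))
	}
}
```
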
---

How do you configure escaping? Depending on the `EscapeMode` option, the markdown output is going to be different.

```go
opt := &md.Options{
	EscapeMode: "basic", // default
}
```

Let's try it out with this HTML input:

|          |                                                       |
| -------- | ----------------------------------------------------- |
| input    | `<p>fake **bold** and real <strong>bold</strong></p>` |
|          |                                                       |
|          | **With EscapeMode "basic"**                           |
| output   | `fake \*\*bold\*\* and real **bold**`                 |
| rendered | fake \*\*bold\*\* and real **bold**                   |
|          |                                                       |
|          | **With EscapeMode "disabled"**                        |
| output   | `fake **bold** and real **bold**`                     |
| rendered | fake **bold** and real **bold**                       |

With **basic** escaping, we get some escape characters (the backslash "\\") but it renders correctly.

With escaping **disabled**, the fake and real bold can't be distinguished in the markdown. That means both are going to render as bold.

---

So now you know the purpose of escaping. However, if you encounter some content where the escaping breaks, you can manually disable it. But please also open an issue!

## Issues

If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!

## Contributing & Testing

Please first discuss the change you wish to make by opening an issue. I'm also happy to guide you to where a change is most likely needed.

_Note: The public API should not change, to preserve backwards compatibility._

You don't have to be afraid of breaking the converter, since there are many "Golden File Tests":

Add your problematic HTML snippet to one of the `input.html` files in the `testdata` folder. Then run `go test -update` and have a look at which `.golden` files changed in Git.

You can now change the internal logic and inspect what impact your change has by running `go test -update` again.

_Note: Before submitting your change as a PR, make sure that you run those tests and check the files into Git._

## Related Projects

- [turndown (js)](https://github.com/domchristie/turndown), a very good library written in JavaScript.
- [lunny/html2md](https://github.com/lunny/html2md), which uses [regex instead of goquery](https://stackoverflow.com/a/1732454). I came across a few edge cases when using it (leftover HTML comments, ...) so I wrote my own.

## License

This project is licensed under the terms of the MIT license.