How do you convert Html to plain text?


Question

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that Html as text only, no formatting, on a given page (actually just the first 30 - 50 characters but that's the easy bit).

How do I place the "text" within that Html into a string as straight text?

So this piece of code.

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

Becomes:

Hello World. Is there anyone out there?

1
82
5/27/2012 7:05:34 PM

Accepted Answer

If you are talking about tag stripping, it is relatively straight forward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about <script> tags and the like then you'll need something a bit more powerful then regular expressions because you need to track state, omething more like a Context Free Grammar (CFG). Althought you might be able to accomplish it with 'Left To Right' or non-greedy matching.

If you can use regular expressions there are many web pages out there with good info:

If you need the more complex behaviour of a CFG I would suggest using a third party tool, unfortunately I don't know of a good one to recommend.

22
7/13/2009 7:15:25 PM

The free and open source HtmlAgilityPack has in one of its samples a method that converts from HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(string html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello world!

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon