C# Remove HTML/XML Tags


Removing HTML tags from strings

Input:    <p>The <b>dog</b> is <i>cute</i>.</p>
Output:   The dog is cute.

Performance test for HTML removal

HtmlRemoval.StripTagsRegex:         2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray:      287 ms [fastest]
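
The post gives the timings but not the code that produced them. Below is a minimal sketch of how such a comparison could be timed with Stopwatch. The class name StripBenchmark, the helper TimeMs, the sample input, and the iteration count are all assumptions for illustration; they are not the original test harness, and the numbers it prints will differ from the table above.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class StripBenchmark
{
    // Hypothetical helper: calls 'strip' on 'input' the given number of
    // times and returns the elapsed wall-clock time in milliseconds.
    public static long TimeMs(Func<string, string> strip, string input, int iterations)
    {
        // One untimed call so JIT compilation and regex setup are not measured.
        strip(input);

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string html = "<p>The <b>dog</b> is <i>cute</i>.</p>";
        const int iterations = 100000; // assumed count, not from the original test

        Regex compiled = new Regex("<.*?>", RegexOptions.Compiled);

        long plainMs = TimeMs(s => Regex.Replace(s, "<.*?>", string.Empty), html, iterations);
        long compiledMs = TimeMs(s => compiled.Replace(s, string.Empty), html, iterations);

        Console.WriteLine("Regex:          {0} ms", plainMs);
        Console.WriteLine("Regex compiled: {0} ms", compiledMs);
    }
}
```

Any method with a string-in, string-out signature can be passed to TimeMs the same way, so HtmlRemoval.StripTagsCharArray could be timed by passing it as the first argument.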

File length test for HTML removal

File tested:                        Real-world HTML file
File length before:                 8085 chars
HtmlRemoval.StripTagsRegex:         4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray:     4382 chars

HtmlRemoval static class [C#]

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // Output buffer; the result can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while scanning between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                // Copy only characters that are outside a tag.
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}

Program that tests HTML removal [C#]

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

Output

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
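
The three methods agree on the input above, but they are not equivalent on every input. In .NET regular expressions the dot does not match a newline unless RegexOptions.Singleline is set, so the pattern "<.*?>" leaves a tag in place when it wraps across lines, while the char array loop removes it. The sketch below illustrates the difference; the class name MultiLineTagDemo and its two helper methods are hypothetical and not part of HtmlRemoval.

```csharp
using System;
using System.Text.RegularExpressions;

static class MultiLineTagDemo
{
    // Same pattern and options as StripTagsRegex: '.' stops at '\n'.
    public static string StripDefault(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Singleline lets '.' match '\n', so tags that span lines are removed.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    static void Main()
    {
        // An opening tag whose attribute wraps onto a second line.
        const string html = "<a\nhref=\"x\">link</a>";

        Console.WriteLine(StripDefault(html).Contains("<")); // prints True
        Console.WriteLine(StripSingleline(html));            // prints link
    }
}
```

If the input may contain tags broken across lines, passing RegexOptions.Singleline to the two Regex methods would be one way to bring them in line with the char array version.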
