C# Remove HTML/XML Tags


Removing HTML tags from strings

Input:    <p>The <b>dog</b> is <i>cute</i>.</p>
Output:   The dog is cute.

Performance test for HTML removal

HtmlRemoval.StripTagsRegex:         2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray:      287 ms [fastest]
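
The timings above can be reproduced in spirit with a small Stopwatch loop. The sketch below is only an illustration: the StripBenchmark class, the TimeMs helper, and the iteration count are assumptions, since the original harness and its settings are not shown. Absolute numbers will differ by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class StripBenchmark
{
    // Hypothetical helper: times `iterations` calls of `strip` on `input`.
    public static long TimeMs(Func<string, string> strip, string input, int iterations)
    {
        strip(input); // warm-up call so JIT and regex construction are not timed

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        const string html = "<p>The <b>dog</b> is <i>cute</i>.</p>";
        const int iterations = 100000; // assumed; the original count is not stated

        Regex compiled = new Regex("<.*?>", RegexOptions.Compiled);

        long plain = TimeMs(s => Regex.Replace(s, "<.*?>", string.Empty), html, iterations);
        long fast = TimeMs(s => compiled.Replace(s, string.Empty), html, iterations);

        Console.WriteLine("Regex:          " + plain + " ms");
        Console.WriteLine("Regex compiled: " + fast + " ms");
    }
}
```

Any of the HtmlRemoval methods can be passed to TimeMs the same way, for example TimeMs(HtmlRemoval.StripTagsCharArray, html, iterations).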

File length test for HTML removal

File tested:                        Real-world HTML file
File length before:                 8085 chars
HtmlRemoval.StripTagsRegex:         4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray:     4382 chars

HtmlRemoval static class [C#]

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // The output can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while the scan is between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                // Copy only characters that are outside a tag.
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
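
One difference between the methods is worth checking before relying on them interchangeably. In .NET, the "." in "<.*?>" does not match a newline by default, so a tag that spans two lines is left in place by the two regex methods, while the char array method removes it. The sketch below shows this with a hypothetical MultilineTagDemo class; adding RegexOptions.Singleline is one possible adjustment, not something the original code does.

```csharp
using System;
using System.Text.RegularExpressions;

static class MultilineTagDemo
{
    // Same pattern as the article, plus Singleline so '.' also matches '\n'.
    static readonly Regex SinglelineTags =
        new Regex("<.*?>", RegexOptions.Compiled | RegexOptions.Singleline);

    // The article's pattern with default options.
    public static string StripDefault(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Assumed variant: identical pattern with RegexOptions.Singleline.
    public static string StripSingleline(string source)
    {
        return SinglelineTags.Replace(source, string.Empty);
    }

    static void Main()
    {
        // The opening tag contains a line break.
        string html = "<a\nhref=\"x\">link</a>";

        Console.WriteLine(StripDefault(html).Contains("<a")); // True
        Console.WriteLine(StripSingleline(html));              // link
    }
}
```

With the default options only the closing tag is removed from this input, so the three methods can return different lengths on files that wrap tags across lines.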

Program that tests HTML removal [C#]

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

Output

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
