NOTE: This method is no longer preferred. Please see Microsoft Anti-Cross Site Scripting Library V4.2
http://www.microsoft.com/download/en/details.aspx?id=28589
Long time no update! I’m shocked to see I’m still getting over 100 posts a day considering I haven’t updated in months.
Well I wrote a little script to help everyone out who is using the HTMLeditor that ships with asp.net’s AJAX Control Toolkit. Hope you enjoy!
Imports System.Text.RegularExpressions

Function HTMLStream(ByVal InputValue As String, Optional ByVal WhiteList As String = "p|span|ol|li|ul|hr|div|i|b|h1|h2|h3|h4|a|br|img|font") As String
    Dim ReturnValue As String
    ' Remove everything from the first tag that is not in WhiteList through the last such tag.
    ReturnValue = Regex.Replace(InputValue, "<(?!(" & WhiteList & ")\b)[^>]+>([^.]|[.])*(<(?!/?(" & WhiteList & ")\b)[^>]+>)", "", RegexOptions.IgnoreCase)
    ' Strip quoted on...= event-handler attributes, one per match, until none remain.
    While (Regex.IsMatch(ReturnValue, "(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", RegexOptions.Compiled Or RegexOptions.IgnoreCase))
        ReturnValue = Regex.Replace(ReturnValue, "(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", _
            Function(match As Match) [String].Concat(match.Groups(1).Value, match.Groups(3).Value), RegexOptions.Compiled Or RegexOptions.IgnoreCase)
    End While
    ' Drop double-quoted href attributes that do not start with http:// or www.
    ReturnValue = Regex.Replace(ReturnValue, "(?<=<.*)href=""(?!http://|www\.)[^""]*""", "", RegexOptions.IgnoreCase)
    Return ReturnValue
End Function
Now if you want to know how this script works you can continue reading. As a warning I will be assuming that you know regex and intermediate VB.Net code (If you want C# there are a lot of conversion applications online.)
Part 1
The function starts off with two parameters: InputValue, which is self-explanatory, and the optional WhiteList. WhiteList is a pipe-separated list of HTML tag names which will be accepted. By default it’s pretty generous.
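Because WhiteList is spliced straight into a regular expression, it has to be a pipe-separated alternation of tag names. A minimal sketch of building such a string from individual names follows; BuildWhiteList is a hypothetical helper for illustration and is not part of the original script.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module WhiteListFormatDemo
    ' Hypothetical helper: joins tag names with "|" so the result can be
    ' passed as the WhiteList argument. Regex.Escape guards against a name
    ' that accidentally contains a regex metacharacter.
    Function BuildWhiteList(ByVal ParamArray tags() As String) As String
        Return String.Join("|", tags.Select(Function(t As String) Regex.Escape(t.Trim())).ToArray())
    End Function

    Sub Main()
        ' Prints: p|b|i|br
        Console.WriteLine(BuildWhiteList("p", "b", "i", "br"))
    End Sub
End Module
```

The result could then be passed as the second argument, for example HTMLStream(userHtml, BuildWhiteList("p", "b", "i", "br")), where userHtml stands for whatever string you are sanitizing.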
Part 2
ReturnValue = Regex.Replace(InputValue, "<(?!(" & WhiteList & ")\b)[^>]+>([^.]|[.])*(<(?!/?(" & WhiteList & ")\b)[^>]+>)", "", RegexOptions.IgnoreCase)
This line looks for HTML tags whose names are not in the WhiteList group. When it finds one, it clears out the tag and ALL of its contents. This is set up to be greedy! Why greedy? Because it’s for security! I don’t want to remove just the tag, I want to remove EVERYTHING inside of the tag. In practice the match runs from the first tag that fails the WhiteList check through the last one in the input, so any allowed markup sitting between them is removed too. So be WARNED, altering the WhiteList tags may result in loss of user input.
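To see the greedy behavior in isolation, here is a small sketch that wraps only the Part 2 pattern. StripDisallowed and the sample inputs are made up for illustration; the pattern itself is copied from the function above.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module WhiteListDemo
    ' Hypothetical wrapper around the Part 2 pattern only.
    Function StripDisallowed(ByVal html As String, ByVal whiteList As String) As String
        Dim pattern As String = "<(?!(" & whiteList & ")\b)[^>]+>([^.]|[.])*(<(?!/?(" & whiteList & ")\b)[^>]+>)"
        Return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' A single disallowed element is removed together with its contents.
        ' Prints: Hello  world   (two spaces remain)
        Console.WriteLine(StripDisallowed("Hello <script>alert(1)</script> world", "b|i"))

        ' Greedy: allowed markup between two disallowed elements is lost too.
        ' Prints an empty line.
        Console.WriteLine(StripDisallowed("<script>a</script><b>kept?</b><script>b</script>", "b|i"))

        ' The first half of the pattern has no "/?", so an allowed closing tag
        ' can also start a match. Prints: <b>x
        Console.WriteLine(StripDisallowed("<b>x</b><script>y</script>", "b|i"))
    End Sub
End Module
```

The last case suggests that a closing tag for an allowed element can be swallowed when a disallowed element follows it, and a match always needs a second failing tag to end on, so it is worth testing with your own markup before relying on the output.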
Part 3
While (Regex.IsMatch(ReturnValue, "(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", RegexOptions.Compiled Or RegexOptions.IgnoreCase))
ReturnValue = Regex.Replace(ReturnValue, "(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)", _
Function(match As Match) [String].Concat(match.Groups(1).Value, match.Groups(3).Value), RegexOptions.Compiled Or RegexOptions.IgnoreCase)
End While
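This next part is a bit confusing. It goes the extra step most scripts don’t bother to take, which is a shame, since those scripts fail to remove pesky JavaScript event handlers. Each match captures the text before a quoted on...= attribute (group 1) and the text after it up to the closing > (group 3), then glues the two together so the attribute disappears. Because one match consumes the rest of the tag, the While loop repeats the replacement until no handler is left.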
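The need for the loop is easier to see on a tag with two handlers. The sketch below reuses the Part 3 pattern unchanged; StripHandlersOnce and StripHandlers are hypothetical names used only for this illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EventHandlerDemo
    ' Pattern copied from Part 3.
    Const HandlerPattern As String = "(<[\s\S]*?) on.*?\=(['""])[\s\S]*?\2([\s\S]*?>)"

    ' One replacement pass: keeps group 1 and group 3, dropping the handler between them.
    Function StripHandlersOnce(ByVal html As String) As String
        Return Regex.Replace(html, HandlerPattern,
            Function(m As Match) String.Concat(m.Groups(1).Value, m.Groups(3).Value),
            RegexOptions.IgnoreCase)
    End Function

    ' Repeats the pass until the pattern no longer matches, as the While loop does.
    Function StripHandlers(ByVal html As String) As String
        While Regex.IsMatch(html, HandlerPattern, RegexOptions.IgnoreCase)
            html = StripHandlersOnce(html)
        End While
        Return html
    End Function

    Sub Main()
        Dim tag As String = "<img src=""x.png"" onerror=""a()"" onload=""b()"">"
        ' Prints: <img src="x.png" onload="b()">
        Console.WriteLine(StripHandlersOnce(tag))
        ' Prints: <img src="x.png">
        Console.WriteLine(StripHandlers(tag))
    End Sub
End Module
```

Note that the pattern requires a quote character immediately after the equals sign, so an unquoted handler such as onclick=steal() would not be matched by it.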
Part 4
ReturnValue = Regex.Replace(ReturnValue, "(?<=<.*)href=""(?!http://|www\.)[^""]*""", "", RegexOptions.IgnoreCase)
The final part removes JavaScript injected through the href attribute of anchor tags. Only links starting with “www.” or “http://” are kept; any other double-quoted href attribute is deleted, which as written includes https:// and relative links. You can modify this if you want to allow others such as ftp etc. Obviously this is to protect against those href="javascript:….." injections.
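One way to allow more link types is to make the prefix list a parameter. This is only a sketch: StripUnsafeHrefs and allowedPrefixes are hypothetical names, and the default value reproduces the Part 4 pattern exactly.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HrefFilterDemo
    ' Hypothetical variant of the Part 4 line. allowedPrefixes is a regex
    ' alternation of link prefixes that are allowed to stay.
    Function StripUnsafeHrefs(ByVal html As String,
                              Optional ByVal allowedPrefixes As String = "http://|www\.") As String
        Dim pattern As String = "(?<=<.*)href=""(?!" & allowedPrefixes & ")[^""]*"""
        Return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' Prints: <a >bad</a>
        Console.WriteLine(StripUnsafeHrefs("<a href=""javascript:alert(1)"">bad</a>"))
        ' With the default prefixes an https link is removed as well.
        ' Prints: <a >x</a>
        Console.WriteLine(StripUnsafeHrefs("<a href=""https://example.com"">x</a>"))
        ' Prints: <a href="https://example.com">x</a>
        Console.WriteLine(StripUnsafeHrefs("<a href=""https://example.com"">x</a>", "https?://|ftp://|www\."))
    End Sub
End Module
```

The pattern only looks at double-quoted href values, so a single-quoted href='...' passes through untouched; treat that as a gap to cover separately.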
So now that you’ve got the basics you can go through and figure out the nitty gritty! Remember, as one developer wrote in a blurb, DO NOT ever let the attacker know if they failed or passed. Otherwise you’re basically inviting them to try to figure out your code. You don’t want to do that!
Please read:
While I put a great deal of effort into this script, I did not write everything from scratch. A lot of people around the web have helped write the code you see above. I simply tweaked what they had and combined it into a far more secure function. So thanks to everyone who posted the original code that helped me write this. Sadly there are too many to name offhand.
So I didn’t know anything about JavaScript injection until recently, when someone used this technique to affect our site. Now I’m doing research on this topic and came across your posting. I’m familiar with regex and understand what’s happening in your function above. What I don’t understand is how you use this function. When does it get called? etc. Thx
Comment by Nick A — February 2, 2010 @ 6:06 pm
Nvm I just noticed that you intially wrote “a little script to help everyone out who is using the HTMLeditor that ships with asp.net’s AJAX Control Toolkit.”
- my bad
Comment by Nick A — February 2, 2010 @ 6:08 pm
It does not really matter what control the value is coming from. The function accepts a string, attempts to sanitize it, and returns the clean string.
In its current form you would use this in the code-behind of a VB.NET site.
Comment by Anonymous — April 8, 2010 @ 2:18 am