phrants.net
14Jan/110

Stripping MS Word Garbage from HTML form

I recently noticed that my mysql database was filling up with a lot of strange junk code, like StartFragment and <w:WordDocument>, and other miscellaneous xml garbage. I quickly discovered that the problem stemmed from some users copying and pasting text from Microsoft Word into the HTML form editor. This was creating havoc in the database, as frequently the tags would be broken, and then formatting would suffer elsewhere. I couldn't find any decent solutions online, so I created my own, albeit imperfect, solution.

Step 1 : Recognize problematic code in the string-- assign variables for these code fragments.

$bad_fragment = "<w:";
$bad_fragment2 = "StartFragment";

Step 2: Check to see if the form submitted contains either bad code fragment.

$contains_junk_code = strpos($submitted_form, $bad_fragment);
$contains_junk_code2 = strpos($submitted_form, $bad_fragment2);

Step 3:  If the bad code was found, go ahead and strip out ALL formatting.

$submitted_form = strip_tags($submitted_form);

Of course, you could just always strip the tags off every single string formatted using the above function, but I want to keep HTML formatting if it's not problematic. If the bad MS Word xml code fragments are found, they are stripped and I alert the user.

echo "Your submission has been saved as text only. This occurs when copying and pasting from MS Word. To add formatting, edit the text by clicking below.";

While this solution is not perfect, it was the most simple and straightforward approach I could think of for my web app. Hopefully it will help you or inspire you to create a better fix.

$bad_tag = "w:";
$bad_tag2 = "!--";
$bad_tag3 = "StartFragment";