Markup normalization

A small selection of the Vim commands required to normalize every possible variant of shitty markup that copy-and-paste online editors entered into 35,000 articles over the last eight years:

:%s/,|||,/\=nr2char(11)/g
:%s/,|||/\=nr2char(11)/g
:%s/""/\=nr2char(21)/g
:%s/"//g
:exe '%s/' . nr2char(11) . '/","/g'
:exe '%s/' . nr2char(21) . '/"/g'
:%s/"$//g " add a
:%s/^/"/g
:%s#<br /><p><br />#</p><p>#g
:%s#<p></p><p>#<p>#g
:%s#<BR><br />#<br />#g
:%s#<br /><p></p><p>#<p>#g
:%s#<p><br />#<p>#g
:%s#</p></p><p>#</p><p>#g
:%s#</p><br /><p>#</p><p>#g
:%s#<br /><p>#</p><p>#g
:%s#<p><p>#<p>#g
:%s#<P>#</p><p>#g
:%s#</p></p>#</p>#g
:%s#<br /></p><p>#</p><p>#g
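
For anyone who would rather script this outside the editor, here is a rough Python sketch of the placeholder trick the first six commands rely on: park the delimiter and the doubled quotes on control characters, strip whatever stray quotes are left, then put the placeholders back. It is a port for illustration, not what I actually ran; the function name and the sample line are made up.

```python
# Control characters used as temporary stand-ins, matching
# nr2char(11) and nr2char(21) in the Vim commands.
DELIM_MARK = chr(11)
QUOTE_MARK = chr(21)


def normalize_delimiters(line: str) -> str:
    """Hypothetical helper mirroring the first six substitutions."""
    # Park both spellings of the delimiter on one control character.
    line = line.replace(",|||,", DELIM_MARK).replace(",|||", DELIM_MARK)
    # Park doubled quotes so the next step cannot destroy them.
    line = line.replace('""', QUOTE_MARK)
    # Drop every remaining stray quote.
    line = line.replace('"', "")
    # Restore the placeholders as a quoted-CSV separator and a single quote.
    line = line.replace(DELIM_MARK, '","')
    return line.replace(QUOTE_MARK, '"')


# Made-up sample line, not taken from the real export.
sample = 'Big "win",|||,She said ""go"",|||Sports'
print(normalize_delimiters(sample))
# prints: Big win","She said "go"","Sports
```

The order matters: if the stray quotes were stripped before the doubled ones were parked, the doubled quotes would be wiped out along with them.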

Two more things: 1) Anyone who’s ever tried to tell you to use find and replace in BBEdit for large files is dead wrong. 2) College Publisher, you suck ****. ‘,|||,’ is not a valid delimiter. Quit being malicious.

Lastly, if I’d thought ahead, I would’ve tracked invalid markup against prevalence and date range. That would’ve made for a fascinating anthropological study.
