Markup normalization
A small selection of the Vim statements required to normalize every possible variant of shitty markup that copy-and-paste online editors put into 35,000 articles over the last eight years:
:%s/,|||,/\=nr2char(11)/g
:%s/,|||/\=nr2char(11)/g
:%s/""/\=nr2char(21)/g
:%s/"//g
:exe '%s/' . nr2char(11) . '/","/g'
:exe '%s/' . nr2char(21) . '/"/g'
:exe '%s/"$//g'
" add a
:%s/^/"/g
:%s#<br /><p><br />#</p><p>#g
:%s#<p></p><p>#<p>#g
:%s#<BR><br />#<br />#g
:%s#<br /><p></p><p>#<p>#g
:%s#<p><br />#<p>#g
:%s#</p></p><p>#</p><p>#g
:%s#</p><br /><p>#</p><p>#g
:%s#<br /><p>#</p><p>#g
:%s#<p><p>#<p>#g
:%s#<P>#</p><p>#g
:%s#</p></p>#</p>#g
:%s#<br /></p><p>#</p><p>#g
Two more things: 1) Anyone who’s ever told you to use find and replace in BBEdit for large files is dead wrong. 2) College Publisher, you suck ****. ‘,|||,’ is not a valid delimiter. Quit being malicious.
Lastly, if I’d thought ahead, I would’ve tracked each kind of invalid markup against its prevalence and date range. That would’ve made for a fascinating anthropological study.
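For what it's worth, that tracking probably wouldn't have taken much. Here is a rough Python sketch (not Vim, and not anything that was actually run on the archive) that counts how often a few of the invalid sequences from the search patterns above show up per publication year. The function name tally_invalid_markup, the (date, body) input shape, and the sample articles are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import date

# A few invalid sequences, copied from the search patterns in the Vim
# commands above. The list is illustrative, not exhaustive.
INVALID_PATTERNS = [
    "<br /><p><br />",
    "<p></p><p>",
    "<BR><br />",
    "</p></p>",
    "<p><p>",
]


def tally_invalid_markup(articles):
    """Count occurrences of each invalid sequence per publication year.

    `articles` is an iterable of (published, body) pairs, where `published`
    is a datetime.date and `body` is the raw article markup (an assumed
    input shape, not the real export format).
    Returns a Counter keyed by (year, pattern).
    """
    counts = Counter()
    for published, body in articles:
        for pattern in INVALID_PATTERNS:
            hits = body.count(pattern)  # non-overlapping occurrences
            if hits:
                counts[(published.year, pattern)] += hits
    return counts


# Hypothetical sample data, invented purely to show the call.
sample_articles = [
    (date(2003, 5, 1), "<p><p>Hello</p></p>"),
    (date(2003, 9, 9), "<BR><br />text<p><p>"),
    (date(2007, 1, 15), "<p>Clean</p>"),
]

for (year, pattern), hits in sorted(tally_invalid_markup(sample_articles).items()):
    print(year, repr(pattern), hits)
```

Running something like this before the cleanup, rather than after, is the whole point: once the substitutions have been applied, the evidence is gone.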