Alternate title for this post: Let the exodus continue. The Python conversion script CoPress used to migrate over 50 student publications to the glorious free and open source WordPress is now itself licensed under GPL version 2. It’s optimized for College Publisher 4 and College Publisher 5 databases, but will also work with almost any database you can turn into a flat CSV file. You can fork it on GitHub or download the brand new 1.0 release.
Right off the bat, I’d like to say that the most awesome bit about the conversion script is its ease of use. Granted, you do have to run it on the command line and it does often throw mythical, unintelligible errors if your data is screwy, but it’s about 100 to 1,000 times easier than what Sean Blanda or Brian Schlansky had to go through. Furthermore, it spits out WordPress eXtended RSS files that WordPress imports natively. Depending on the size of your archives, you could even do the entire migration in less than a half hour.
There are detailed instructions in the README I encourage you to read thoroughly but, in screenshots, here’s how you’d migrate your site.
Back up your database using Sequel Pro. This is a critically important step, as you’ll definitely want a clean version to revert to if the import goes awry.
Place the conversion script and your archives in a folder you can access from the command line. Both College Publisher 4 and College Publisher 5 migrants should receive an articles file that will need to be renamed “stories.csv.” Publications migrating from the former will have all of their image references stored in a file that will need to be renamed “media.csv.” Navigate to that directory from your terminal prompt and run “python CoPress-Convert.py.”
Once the script is running, you’ll be asked a series of questions to configure the conversion process. Most options are self-explanatory, and all are explained fully in the README file packaged with the script. The most important thing I’d like to note in this post is that, unless you have fewer than 500 authors in your archives, I’d highly, highly recommend importing your authors as custom fields instead of users. WordPress is not optimized to add a large number of new users through its import process. We learned this the hard way migrating CM Life’s database last summer.
When the script is done, you’ll have a series of WordPress eXtended RSS files you can easily upload into WordPress.
Mad props go to Miles Skorpen for the long hours he spent on the conversion script, and to Albert Sun, Will Davis, and Max Cutler for their later contributions.
Feel free to send along any suggestions for improvement, bugs, fixes or general comments. I intend to maintain it for the indefinite future (it’s good Python practice when everything else I’m working on is PHP), but code contributions are always welcome. There is a short list of upgrades under consideration at the top of the script.
24 Comments
Awesome. Awesome. Awesome.
Hmm… I seem to be getting an error right from the start. Do you have any clue what it can be? I have all of the images as well as stories.csv in the same directory as the python file. http://pastebin.com/HyPeB2cR
I think they changed the format of the CSV on you…
Yeah, it looks like they did. They seem to be using ≤|||≤ as the delimiter. Do you know what the second least painful way of doing this is? I got the data into Excel and I’ve been cleaning it up a bit: http://i.imgur.com/aZA8F.png
To be honest, I think the last time I had to deal with this I just asked them to send an actual CSV. If you can open it in Excel, you should be able to then export it again as standard CSV. The technical problem is that the Python library doesn’t support parsing files with multi-character delimiters.
I’m gonna have to call and complain tomorrow. The text file they sent me starts with a line break and ends mid-sentence…
They sent me a .txt file too. I don’t have excel. Is there another way to convert it to a CSV?
A CSV is actually very similar to a TXT file… it’s just a way of designating formatting. You might just be able to change the line ending.
Can you post some of the content into http://pastebin.com/ and share that here?
Sure! Here’s some.
http://pastebin.com/91jxTysP
They suck, don’t they? I spent four days last week trying to clean that mess up!
So, the challenge with this is that it’s a multi-character delimiter which my conversion script doesn’t handle. In fact, the Python CSV library won’t handle it.
If you want to use the conversion script, you have a couple of options:
Ask College Publisher for a properly formatted CSV
Convert the file you have into a properly formatted CSV
For the latter, csvkit might be of use. I haven’t used it before though.
Hey, csvkit won’t handle multi-char delimiters either, but I suspect there are no legitimate “|||” strings in your data, so you can probably just replace it with a single delimiter. In vim this would be something like:
:%s/|||/,/g
But you could just as easily do it with Find/Replace in whatever your text editor of choice is. Once you’ve got it down to a single delimiter you should be able to process it with Excel or whatever.
That being said, who knows what other land mines are in that data. (quoting?) If you can get a clean file, that’s a much better solution.
Cheers,
Chris
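For anyone who would rather script the find-and-replace described above, here is a minimal Python 3 sketch. It assumes one record per line and that the delimiter never appears inside real content; the function name convert_delimited_text and the sample rows are made up for illustration and are not part of the conversion script. It splits on the multi-character delimiter itself and lets the csv module handle quoting, so commas inside a field survive.

```python
import csv
import io


def convert_delimited_text(text, delimiter="|||"):
    """Split each line on a multi-character delimiter and return standard CSV text.

    Assumes one record per line; fields with embedded line breaks are not handled.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, such as a stray leading line break
        # str.split accepts a delimiter of any length, unlike csv.reader
        writer.writerow(line.split(delimiter))
    return out.getvalue()


# Made-up sample data, only to show the quoting behaviour.
sample = "Headline|||Body|||Author\nHello, world|||Some text|||Jane Doe\n"
print(convert_delimited_text(sample))
```

For a real export you would read the file into a string first and write the result to stories.csv; the right encoding for that read depends on the file you were sent.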
OpenOffice Calc supports multi-char delimiters, so it can be used to clean up the .csv.
After many hours I have successfully converted! Here’s what I did:
*Application: vim (the file was too big for anything else)*
First of all, CP totally messed my data up. The file they sent me was 130,000+ lines long. I did “:sort u” in vim (sort and remove duplicates) and it left me with about 30,000+ lines (that’s a lot of duplicates). After that, I would definitely suggest using Excel to clean things up, it makes things MUCH easier.
*Application: Excel*
I imported it into Excel using the pipe “|” as the delimiter and just told it to ignore consecutive delimiters (I’m willing to lose a few stories that contain a “|” character… Worth it). Then I did a find on all ² characters (they were used as quotes in mine) and replaced it with an empty string. At this point I still had a lot of really messed up data. Since it was nicely sorted, I noticed there were triple duplicates of many stories with only 1 containing valid data (the other two would have the columns all scrambled). At this point I realized unscrambling data wasn’t helping, I just needed to delete the messed up rows. To do this I filtered certain columns for things that didn’t make sense, for example, the title or body being blank or a category that didn’t start with a colon. With all that bad data in view I was able to delete thousands of rows at a time. After lots and lots and lots of filtering and deleting (down to about 20,000+ lines now), I decided to save the CSV and try the script. This part was weird, to get Excel to save a CSV with the “|” as a delimiter, I needed to go into Windows Control Panel > Regional and Language Settings > Additional Settings > and change the “List separator” to a “|” instead of a comma. So with that being done I saved it as stories.csv from Excel. Yay, pipe delimited. Now for the script.
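The filtering step above (drop rows with a blank title or body, or a category that doesn’t start with a colon) could also be sketched in Python 3. The column names "title", "body" and "category" and the sample rows are hypothetical; a real export will use whatever header names the file actually has.

```python
def is_valid_row(row):
    """Return True if a row passes the sanity checks described above.

    The keys "title", "body" and "category" are hypothetical column names.
    """
    title = (row.get("title") or "").strip()
    body = (row.get("body") or "").strip()
    category = (row.get("category") or "").strip()
    # Keep only rows with a title, a body, and a colon-prefixed category.
    return bool(title) and bool(body) and category.startswith(":")


# Made-up rows: one good, one with a blank title, one with scrambled columns.
rows = [
    {"title": "Council votes", "body": "Text...", "category": ":News"},
    {"title": "", "body": "Orphaned text", "category": ":News"},
    {"title": "Scrambled", "body": "Text...", "category": "2012-02-10"},
]
clean = [row for row in rows if is_valid_row(row)]
print(len(clean))
```

With csv.DictReader, each row already arrives as a dictionary keyed by the header row, so the same check can be applied while streaming through a large file.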
*Application: Your favorite text editor (Notepad++)*
Some edits to the script I had to make included changing the delimiter to a “|” as well as editing the date string. CP saved my dates as Fri, Feb 10, 2012. The script didn’t like the weekday in there so at the beginning of that date parsing function I did datestring = datestring[5:] which removed the day of the week. There were some other tweaks I made along the way but I forget the details… Sorry.
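As an alternative to slicing off the first five characters, a date parser can try the weekday-prefixed format first and fall back to the plain one. This is a Python 3 sketch, not the script’s actual date function: parse_cp_date is a hypothetical name, and the fallback format is an assumption about what the script expects once the weekday is removed.

```python
from datetime import datetime


def parse_cp_date(datestring):
    """Parse dates like "Fri, Feb 10, 2012" or "Feb 10, 2012" into a datetime."""
    datestring = datestring.strip()
    # Try the weekday-prefixed form first, then the form without a weekday.
    for fmt in ("%a, %b %d, %Y", "%b %d, %Y"):
        try:
            return datetime.strptime(datestring, fmt)
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: %r" % datestring)


print(parse_cp_date("Fri, Feb 10, 2012"))
```

Note that %a and %b match abbreviated day and month names for the current locale, so this assumes an English locale.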
I’m really excited I got this to work, I don’t know what I would have done without this script, thank you so much!!! If anyone has any questions about what I did I’d be happy to help. I spent a lot of time Googling and experimenting, hopefully I can save someone else the time and frustration.
Awesome, thanks Patrick
I think I’m having problems getting the header row setup correctly. Here are the first 50+ rows (out of 37,000+): http://bitly.com/DN-StoriesCSV-Sample
Any idea if this looks correct? It’s a CP5 export.
From what I remember, the headers in your CSV seem like they should be usable. What errors do you see when you try to run the script?
Here’s the error I’m getting:
Beginning to read in the stories database.
Traceback (most recent call last):
  File "CoPress-Convert.py", line 847, in <module>
    main()
  File "CoPress-Convert.py", line 820, in main
    version,stories,images = importStories(verbose)
  File "CoPress-Convert.py", line 740, in importStories
    line[cp5_map['Categorization']],
KeyError: 'Categorization'
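The KeyError above suggests the script went looking for a column named Categorization that it could not find in the header row (an inference from the traceback, not something confirmed against the script). A quick Python 3 check of the header before running the conversion might look like the sketch below. missing_columns is a hypothetical helper, and apart from Categorization the column names in the sample are made up.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    # next(reader, []) yields an empty header if the text has no rows at all.
    header = [name.strip() for name in next(reader, [])]
    return [name for name in required if name not in header]


# Made-up header; only "Categorization" comes from the traceback above.
sample = "Headline,Body,Category\nCouncil votes,Text,:News\n"
print(missing_columns(sample, ["Headline", "Categorization"]))
```

Any name this reports is a candidate for renaming in the header row of stories.csv.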
Things are improving. I had to rewrite the header row, rewrite the dates and comment out Subheadline in the Python file. Now I’m going to try the larger file.
Any tips on dealing with image-related issues:
Error on 286 image
…[Many other errors]…
Error on 9515 image
Traceback (most recent call last):
  File "CoPress-Convert.py", line 847, in <module>
    main()
  File "CoPress-Convert.py", line 820, in main
    version,stories,images = importStories(verbose)
  File "CoPress-Convert.py", line 727, in importStories
    if content_id[0] == "m":
IndexError: string index out of range
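An IndexError on content_id[0] means content_id was an empty string for at least one row, so there was no first character to index. One way to guard that comparison, sketched in Python 3 with a hypothetical helper name (this is not a patch taken from the script):

```python
def starts_with_m(content_id):
    """Return True if content_id begins with "m", without raising on empty values."""
    # str.startswith is safe on an empty string; bool() also covers None.
    return bool(content_id) and content_id.startswith("m")


print(starts_with_m("m123"))
print(starts_with_m(""))
```

Rows with an empty ID would then be skipped by that branch instead of stopping the whole run, though it is still worth finding out why the IDs are blank in the export.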
Ping me on Skype and I can help you debug. I’m ‘danielbachhuber’
I wrote at length about moving Pipe Dream from College Publisher to WordPress. Can’t say I remember much about the process (not that I want to), but I’m glad to help.
http://blog.danoc.me/post/18714910407/from-college-publisher-to-wordpress