College Publisher to WordPress conversion script is now open source

Alternate title for this post: Let the exodus continue. The Python conversion script CoPress used to migrate over 50 student publications to the glorious free and open source WordPress is now itself licensed under GPL version 2. It’s optimized for College Publisher 4 and College Publisher 5 databases, but will also work with almost any database you can turn into a flat CSV file. You can fork it on GitHub or download the brand new 1.0 release.

Right off the bat, I’d like to say that the most awesome bit about the conversion script is its ease of use. Granted, you do have to run it on the command line and it does often throw mythical, unintelligible errors if your data is screwy, but it’s about 100 to 1,000 times easier than what Sean Blanda or Brian Schlansky had to go through. Furthermore, it spits out WordPress eXtended RSS files that WordPress imports natively. Depending on the size of your archives, you could even do the entire migration in less than a half hour.

There are detailed instructions in the README I encourage you to read thoroughly but, in screenshots, here’s how you’d migrate your site.

Back up your database using Sequel Pro. This is a critically important step, as you’ll definitely want a clean version to revert to if the import goes awry.

Place the conversion script and your archives in a folder you can access from the command line. Both College Publisher 4 and College Publisher 5 migrants should receive an articles file that will need to be renamed “stories.csv.” Publications migrating from the former will have all of their image references stored in a file that will need to be renamed “media.csv.” Navigate to that directory from your terminal prompt and run “python CoPress-Convert.py.”

Once the script is running, you’ll be asked a series of questions to configure the conversion process. Most options are self-explanatory, and all are explained fully in the README file packaged with the script. The most important thing I’d like to note in this post is that, unless you have fewer than 500 authors in your archives, I’d highly, highly recommend importing your authors as custom fields instead of users. WordPress is not optimized to add a large number of new users through its import process. We learned this the hard way migrating CM Life’s database last summer.
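If you aren't sure which side of the 500-author line your archives fall on, a quick count before running the script can help. This is a stand-alone Python 3 sketch, not part of the conversion script: the column name "Author", the comma delimiter, and both function names are assumptions you would need to adjust to match your own stories.csv.

```python
import csv
import io

AUTHOR_LIMIT = 500  # threshold suggested in the post above


def count_authors(csv_text, author_column="Author", delimiter=","):
    """Count distinct, non-empty author names.

    "Author" is a hypothetical column name; check your own header row.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    authors = set()
    for row in reader:
        name = (row.get(author_column) or "").strip()
        if name:
            authors.add(name)
    return len(authors)


def recommend_author_import(author_count, limit=AUTHOR_LIMIT):
    """Fewer than `limit` authors: users. Otherwise: custom fields."""
    return "users" if author_count < limit else "custom fields"


# Tiny made-up sample; a real export would be read from disk instead.
sample = (
    "Headline,Author\n"
    "First story,Jane Doe\n"
    "Second story,John Roe\n"
    "Third story,Jane Doe\n"
)
total = count_authors(sample)
print(total, recommend_author_import(total))
```

For a real export you would pass the contents of stories.csv instead of the sample string, for example `count_authors(open("stories.csv", newline="").read())`.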

When the script is done, you’ll have a series of WordPress eXtended RSS files you can easily upload into WordPress.

Mad props go to Miles Skorpen for the long hours he spent on the conversion script, and to Albert Sun, Will Davis, and Max Cutler for their later contributions.

Feel free to send along any suggestions for improvement, bugs, fixes or general comments. I intend to maintain it for the indefinite future, since it’s good Python practice when everything else I’m working on is PHP, but code contributions are always welcome. There is a short list of upgrades under consideration at the top of the script.

24 thoughts on “College Publisher to WordPress conversion script is now open source”

  1. Hey, csvkit won’t handle multi-char delimiters either, but I suspect there are no legitimate “|||” strings in your data, so you can probably just replace it with a single delimiter. In vim this would be something like:

    :%s/|||/,/g

    But you could just as easily do it with Find/Replace in whatever your text editor of choice is. Once you’ve got it down to a single delimiter you should be able to process it with Excel or whatever.

    That being said, who knows what other land mines are in that data. (quoting?) If you can get a clean file, that’s a much better solution.

    Cheers,
    Chris
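The same delimiter swap can be scripted if the file is too large to open comfortably. This Python 3 sketch is one possible approach, not something from the conversion script: it splits each line on “|||” and writes the fields back out through the standard csv module, so any field that already contains a comma gets quoted rather than silently split. The function name is made up for illustration.

```python
import csv
import io


def convert_delimiter(text, old="|||"):
    """Rewrite `old`-delimited lines as comma-separated CSV.

    Assumes one record per line; fields with embedded newlines
    would need more careful handling than this sketch provides.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        writer.writerow(line.split(old))
    return out.getvalue()


# Made-up sample with a comma inside the first field of row two.
sample = "Headline|||Body\nHello, world|||Some text\n"
print(convert_delimiter(sample))
```

This does not address the quoting land mines mentioned above; it only keeps commas inside a field from being mistaken for delimiters.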

  2. Patrick says:

    After many hours I have successfully converted! Here’s what I did:

    *Application: vim (the file was too big for anything else)*
    First of all, CP totally messed my data up. The file they sent me was 130,000+ lines long. I did “:sort u” in vim (sort and remove duplicates) and it left me with about 30,000+ lines (that’s a lot of duplicates). After that, I would definitely suggest using Excel to clean things up, it makes things MUCH easier.
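For anyone without vim handy, the effect of “:sort u” (sort the lines and drop exact duplicates) can be approximated in a few lines of Python 3. This is only a sketch: the ordering Python uses may differ slightly from vim's for non-ASCII text, and a header row, if present, will be sorted in with the data unless you set it aside first.

```python
def sort_unique(lines):
    """Return the distinct lines in sorted order."""
    return sorted(set(lines))


# Made-up pipe-delimited lines with exact duplicates.
sample = ["b|2", "a|1", "b|2", "a|1", "c|3"]
print(sort_unique(sample))
```

On a real file you would feed it the result of `open("stories.csv").read().splitlines()` and write the returned list back out.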

    *Application: Excel*
    I imported it into Excel using the pipe “|” as the delimiter and just told it to ignore consecutive delimiters (I’m willing to lose a few stories that contain a “|” character… Worth it). Then I did a find on all ² characters (they were used as quotes in mine) and replaced them with an empty string. At this point I still had a lot of really messed up data. Since it was nicely sorted, I noticed there were triple duplicates of many stories with only 1 containing valid data (the other two would have the columns all scrambled). At this point I realized unscrambling data wasn’t helping, I just needed to delete the messed up rows. To do this I filtered certain columns for things that didn’t make sense, for example, the title or body being blank or a category that didn’t start with a colon. With all that bad data in view I was able to delete thousands of rows at a time. After lots and lots and lots of filtering and deleting (down to about 20,000+ lines now), I decided to save the CSV and try the script. This part was weird: to get Excel to save a CSV with the “|” as a delimiter, I needed to go into Windows Control Panel > Regional and Language Settings > Additional Settings > and change the “List separator” to a “|” instead of a comma. So with that being done I saved it as stories.csv from Excel. Yay, pipe delimited. Now for the script.
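The filtering rules described here (drop rows with a blank title or body, or a category that does not start with a colon) can also be expressed as a small Python 3 sketch. The column names “Title”, “Body” and “Category” and both function names are assumptions for illustration; a real College Publisher export will likely use different headers.

```python
import csv
import io


def is_valid(row):
    """Keep rows with a title, a body, and a category starting with ':'."""
    title = (row.get("Title") or "").strip()
    body = (row.get("Body") or "").strip()
    category = (row.get("Category") or "").strip()
    return bool(title) and bool(body) and category.startswith(":")


def filter_rows(text, delimiter="|"):
    """Parse pipe-delimited text and return only the rows that look sane."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if is_valid(row)]


# Made-up sample: one good row, one with no title, one with scrambled columns.
sample = (
    "Title|Body|Category\n"
    "Budget vote|Council met Tuesday.|:News\n"
    "|Orphaned body|:News\n"
    "Scrambled|:Sports|Some body text\n"
)
kept = filter_rows(sample)
print([row["Title"] for row in kept])
```

Writing the surviving rows back out with `csv.DictWriter(..., delimiter="|")` would also avoid the Windows “List separator” workaround, since the delimiter is set in code.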

    *Application: Your favorite text editor (Notepad++)*
    Some edits to the script I had to make included changing the delimiter to a “|” as well as editing the date string. CP saved my dates as Fri, Feb 10, 2012. The script didn’t like the weekday in there so at the beginning of that date parsing function I did datestring = datestring[5:] which removed the day of the week. There were some other tweaks I made along the way but I forget the details… Sorry.
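As an alternative to slicing off the first five characters, the weekday can be included in the format string so that dates with and without it both parse. This is a stand-alone Python 3 sketch: the function name is made up, and it does not reproduce the conversion script's own date parsing function, whose details aren't shown here.

```python
from datetime import datetime

# Try the weekday-prefixed form first, then the bare form.
DATE_FORMATS = ("%a, %b %d, %Y", "%b %d, %Y")


def parse_cp_date(datestring):
    """Parse dates such as 'Fri, Feb 10, 2012' or 'Feb 10, 2012'."""
    cleaned = datestring.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    raise ValueError("Unrecognized date: %r" % datestring)


print(parse_cp_date("Fri, Feb 10, 2012"))
# The slice from the comment above removes the 5-character "Fri, " prefix.
print(parse_cp_date("Fri, Feb 10, 2012"[5:]))
```

Note that %a and %b match English abbreviations only when the locale is English or the default C locale.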

    I’m really excited I got this to work, I don’t know what I would have done without this script, thank you so much!!! If anyone has any questions about what I did I’d be happy to help. I spent a lot of time Googling and experimenting, hopefully I can save someone else the time and frustration.

      • jeremygilbert says:

        Here’s the error I’m getting:

        Beginning to read in the stories database.
        Traceback (most recent call last):
          File "CoPress-Convert.py", line 847, in <module>
            main()
          File "CoPress-Convert.py", line 820, in main
            version,stories,images = importStories(verbose)
          File "CoPress-Convert.py", line 740, in importStories
            line[cp5_map['Categorization']],
        KeyError: 'Categorization'
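A KeyError like this one suggests the script looked up a column name it did not find, so checking the header row of the CSV before running the conversion may save a round trip. The Python 3 sketch below is an assumption-laden illustration: only “Categorization” is taken from the traceback, and the function name and the idea of a required-column list are hypothetical, not taken from the script.

```python
import csv
import io

# Only 'Categorization' comes from the traceback; extend this
# hypothetical list with whatever names your copy of the script expects.
REQUIRED_COLUMNS = ["Categorization"]


def missing_columns(text, required=REQUIRED_COLUMNS, delimiter=","):
    """Return the required column names absent from the header row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Made-up header that uses 'Category' where 'Categorization' is expected.
sample = "Headline,Category\nBudget vote,:News\n"
print(missing_columns(sample))
```

An empty list means every listed name appears in the header; anything else names the columns to rename or add.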

      • jeremygilbert says:

        Things are improving. I had to rewrite the header row, rewrite the dates and comment out Subheadline in the Python file. Now I’m going to try the larger file.

      • jeremygilbert says:

        Any tips on dealing with image-related issues:

        Error on 286 image
        …[Many other errors]…
        Error on 9515 image
        Traceback (most recent call last):
          File "CoPress-Convert.py", line 847, in <module>
            main()
          File "CoPress-Convert.py", line 820, in main
            version,stories,images = importStories(verbose)
          File "CoPress-Convert.py", line 727, in importStories
            if content_id[0] == "m":
        IndexError: string index out of range
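“string index out of range” on `content_id[0]` is what Python raises when `content_id` is an empty string, which would happen if a row has a blank value in that column. One possible guard, shown here as a stand-alone Python 3 sketch with a made-up function name, is to use `startswith`, which returns False on an empty string instead of raising. What the “m” prefix signifies isn't stated in the traceback, so the sketch only tests for it.

```python
def has_m_prefix(content_id):
    """Safe version of `content_id[0] == "m"` for blank or missing IDs."""
    return (content_id or "").strip().startswith("m")


# Made-up IDs, including the empty string that triggers the IndexError.
for value in ["m123", "s9", "", None]:
    print(repr(value), has_m_prefix(value))
```

Swapping a check like this into the script would skip blank IDs rather than crash, but the rows with blank IDs are probably still worth inspecting in the source data.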
