I recently had the opportunity to do a very large PCO People import for another church. Since it had been a few years since I did the import for my own church, this was a good opportunity to revisit the importer, considerations for planning, things to think about, and how to organize and prep for the import.
Source data
Getting all of your data out of your current system – whether that’s Excel spreadsheets, Fellowship One, Church Community Builder, Exelio, or whatever – is the first and most important step in your migration process. If at all possible, try to get all of your data into one consistently formatted CSV file. By consistently formatted, I mean:
- One line per record with all fields (i.e. no custom or “pretty” Excel formatting) — normalized
- All data consistently formatted per column — i.e. having some rows with “first name last name” and others with “last name, first name”, or dates formatted as both mm/dd/yy and yyyy/mm/dd, makes things difficult, and all of it needs to be cleaned up before importing
- Get all possible fields out of your current system, even if you don’t currently plan to import them — you can always ignore data you have but don’t want to import, but getting data you (later) determine you want is impossible once the current system contract expires
- Since the target format to be imported into PCO People is CSV (i.e. comma separated values), commas in the middle of text fields become a problem, breaking the formatting — depending on the field, an extra comma can cause a number of issues, like shifting everything to its right over by one column. In many cases, if it’s a name field (e.g. Bob Smith, MD) the comma is not critical (i.e. Bob Smith MD) and can be removed to eliminate the issue. Another method is to quote the string to force the importer to treat the column (with a comma) as a single string. However, this is totally dependent on the importer (see the sketch below).
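If you end up doing this cleanup yourself, a small script can handle both cases at once. Here is a minimal sketch in Python (the file names and column names are placeholders, not anything PCO-specific) that strips cosmetic commas out of name fields and lets the csv module quote any remaining fields that still contain commas:

```python
import csv

# Hypothetical file names -- substitute your own export/import paths.
SOURCE_FILE = "old_system_export.csv"
CLEAN_FILE = "pco_import_clean.csv"

# Columns where an embedded comma is cosmetic (e.g. "Bob Smith, MD")
# and can simply be dropped. Adjust to match your own export.
STRIP_COMMA_COLUMNS = {"first_name", "last_name", "nickname"}

with open(SOURCE_FILE, newline="", encoding="utf-8") as src, \
     open(CLEAN_FILE, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    for row in reader:
        for col in STRIP_COMMA_COLUMNS:
            if col in row and row[col]:
                # "Bob Smith, MD" -> "Bob Smith MD"
                row[col] = row[col].replace(",", " ").replace("  ", " ").strip()
        # Any other field that still contains a comma gets quoted by
        # the csv module, so the column boundaries stay intact.
        writer.writerow(row)
```

Since the importer’s handling of quoted fields may vary, it’s worth running a small test file through first; if quoting doesn’t survive, stripping the commas is the safer route.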
Remote ID field
Another important consideration when importing is properly associating the unique records from your old system with unique records in PCO — for example, associating a CCB or F1 person profile with a PCO person profile. As I’ve said before, the PCO People importer is phenomenal and does a great job, but it is still ultimately relying on matching, for the most part, the person’s first name, last name, and email address. If your church has more than a few thousand people, it’s more than likely that you have two or more John Smith records. Hopefully the email address will be able to distinguish them, but I have run into situations where we have a John Smith, a John Smith Junior, and a John Smith III, several of whom use the same email address within a family. This can certainly make it difficult for the importer to differentiate them into separate records properly.
To address this, Planning Center has included a field in every PCO People record called remote_id, which is meant to be populated with an external unique ID, such as the profile ID from Fellowship One, CCB, Exelio, etc. Using the remote_id field not only ensures that the importer properly distinguishes the records when initially imported, but it can also be used for follow-up synchronization imports to ensure that new information is treated as an update to the correct record, rather than creating a new record.
As an example, let’s say you initially imported 5,000 records into PCO People and included the remote_id field in the import. If you were in the middle of transitioning between the old system and PCO, and updates were still occurring in the old system, you could do a follow-on import of those records using the remote_id field, and the importer would properly update the existing records. It can be used in other ways, which I will get to shortly.
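As a concrete sketch of what carrying the old ID over looks like, here is a small Python example that builds an import file with a remote_id column populated from the old system’s profile ID. The file and column names (old_system_export.csv, old_profile_id, etc.) are assumptions; substitute whatever your export actually uses:

```python
import csv

# Assumed file names for illustration only.
EXPORT_FILE = "old_system_export.csv"   # export from F1/CCB/etc.
IMPORT_FILE = "pco_people_import.csv"   # file you will feed the importer

with open(EXPORT_FILE, newline="", encoding="utf-8") as src, \
     open(IMPORT_FILE, "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    fieldnames = ["remote_id", "first_name", "last_name", "email"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow({
            # The old system's unique profile ID becomes remote_id, so
            # later imports update this record instead of creating a
            # duplicate "John Smith".
            "remote_id": row["old_profile_id"],
            "first_name": row["first_name"],
            "last_name": row["last_name"],
            "email": row["email"],
        })
```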
Too big to chew
Recently I worked with a church that needed to import over 60,000 records into PCO People. In their case, they had dozens of custom fields added per person, so the overall size of the import was huge. The PCO People importer, at least at the time of this writing, was not able to handle more than about 10,000 of these specific records at a time — it would time out during the initial CSV upload and analysis, generating an error message. To be fair, the issue was not really about the CSV being 60,000 records long, but the dozens of custom fields that were included in every record.
Example:
- 60,000 rows x 90 columns = 5.4 million fields
- 60,000 rows x 10 columns (first_name, last_name, email, address, city, state, zip, phone, household, remote_id) = 600,000 fields (about 10% of the total size)
One approach to address this — the obvious one — would be to break the import into sets of 10,000 rows or less. However, this triggered a different problem. As the documentation mentions, the importer allows you to include a Household ID field to associate records into families. This Household ID can be anything — characters and numbers — as long as it is unique to each family set. The thing is that the importer, when it encounters this, creates a new household object in PCO People, associates it with those people, and then throws away the Household ID value you used in the import. This means that breaking up the import into sets of 10,000, where family associations are likely split across those imports as well, will break the intended family associations, since the Household ID in the import is only used per import set. A secondary problem compounds this – PCO People supports multiple emails, multiple physical addresses, multiple households per record, etc., so trying to go back and fix up the household ID later via another import can inadvertently add additional households/emails/addresses rather than update them.
So the first solution to try, as noted above, is to split the import into groups of fewer than 10,000; however, you will want to be sure to make these splits at boundaries where household IDs are kept together per set:
- Sort the import file by the Household ID column
- Break the file up into groups of (roughly) 10,000, being sure to keep household IDs together per set, saving each set as a separate file (a sketch of this split follows below).
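Here is a rough sketch of that split in Python. It assumes the aggregate file has a household_id column (adjust the name to whatever your file uses), sorts by that column, and only starts a new output file once the current household is complete, so no household is ever divided across two imports:

```python
import csv
from itertools import groupby

# Assumed names: the aggregate import file and its household column.
SOURCE_FILE = "pco_import_all.csv"
HOUSEHOLD_COLUMN = "household_id"
CHUNK_SIZE = 10_000   # rough per-file limit the importer seemed to tolerate

with open(SOURCE_FILE, newline="", encoding="utf-8") as src:
    reader = csv.DictReader(src)
    fieldnames = reader.fieldnames
    rows = list(reader)

# Sort so every household's members sit next to each other.
rows.sort(key=lambda r: r[HOUSEHOLD_COLUMN])

def flush(chunk, chunk_num):
    """Write the current chunk out as its own import file."""
    out_name = f"pco_import_part_{chunk_num:02d}.csv"
    with open(out_name, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(chunk)

chunk, chunk_num = [], 1
# groupby yields one complete household at a time, so a household is
# never split across two output files.
for _, household in groupby(rows, key=lambda r: r[HOUSEHOLD_COLUMN]):
    members = list(household)
    if chunk and len(chunk) + len(members) > CHUNK_SIZE:
        flush(chunk, chunk_num)
        chunk, chunk_num = [], chunk_num + 1
    chunk.extend(members)
if chunk:
    flush(chunk, chunk_num)
```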
A second solution to address this would be to use the general info for all records for the first import, then do an update with all of the other details. The import was split initially into two sets, both with the remote_id included:
- The first set contains the general information (first name, last name, remote_id), as well as the record information that can get inadvertently duplicated (email, address, city, state, zip, phone, household) – all other columns were removed except for remote_id, first name, middle name, last name, nickname, the address fields, household ID, and email address. With just this set, even at 60,000 records the import would only be about 600K fields.
- The second set contains all of the extended data, including the custom field data. For this set, we leave the general data (first name, last name, remote_id), remove the email, address, city, state, zip, phone, and household data, and leave all other columns, including the custom field data. With the initial set imported, the family associations should be locked in, so this set can be broken into sets of 10K (or whatever the importer can handle without errors) without risk of the households breaking.
- This second set (or sets) is essentially an update to the records already posted in the first set, so when you get to the import staging page (the importer telling you what it’s *going to do*), everything should show as record updates, rather than new records created.
- As noted, it’s important to keep the remote_id field in both sets (see? I told you I would come back to this). Using one aggregate CSV file with all of your import data, one that contains a unique remote_id for every single record, make two copies of the file, one named general and one named custom. Open the general CSV in a spreadsheet and remove all of the custom field columns; in the custom copy, remove the email, address, and household columns instead (a sketch of this split follows below).
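Here is a minimal sketch of that two-file split in Python. The column names in GENERAL_COLUMNS are illustrative and should be adjusted to match your actual export; everything not in that list (i.e. the custom fields) goes into custom.csv, and remote_id is kept in both files:

```python
import csv

AGGREGATE_FILE = "pco_import_all.csv"  # the one aggregate CSV (assumed name)

# Columns for the first import: identity fields plus the data that can be
# inadvertently duplicated on a re-import. Adjust names to your export.
GENERAL_COLUMNS = [
    "remote_id", "first_name", "middle_name", "last_name", "nickname",
    "email", "address", "city", "state", "zip", "phone", "household_id",
]

with open(AGGREGATE_FILE, newline="", encoding="utf-8") as src:
    reader = csv.DictReader(src)
    all_columns = reader.fieldnames
    rows = list(reader)

# Everything not in the general set (custom fields, etc.) goes in the
# second import, but remote_id stays in both so the second pass is
# treated as an update to the same records.
custom_columns = ["remote_id"] + [
    c for c in all_columns if c not in GENERAL_COLUMNS
]

for out_name, columns in (("general.csv", GENERAL_COLUMNS),
                          ("custom.csv", custom_columns)):
    with open(out_name, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for row in rows:
            writer.writerow({c: row.get(c, "") for c in columns})
```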
I also wanted to add a rant here about how badly Excel for Mac handles large-ish data sets, but I will leave that for a later time.
Please let me know any questions, comments, or corrections in the comments below.