I have been working on my family tree for a while now. During that time I have tried lots of different tools: online, offline, free, paid and everything in between, all in an effort to make researching and checking easier and the overall experience better.
MacFamilyTree is awesome genealogical software. If you are using a Mac this is the software you should be using in my opinion. This is where I keep my master tree. I also use ancestry.com, which I believe has the best research tools and content.
Using both together is a challenge to say the least. Ultimately I want to:
- Use ancestry.com to discover new individuals and events through sources.
- Use FamilySearch.org (integrated into MacFamilyTree) to discover new individuals and events, as well as to push newly found information back up.
- Easily and visually compare these changes between MacFamilyTree and ancestry.com so I can move individuals, events and facts between the trees.
The first two points don't require any work since that's how they already operate, yay! However, the third one is tricky. I have tried a few tools but none of them even came close to my expectations. I set out to build a real solution.
The one consistent thing between all of these platforms is the exchange format, GEDCOM. GEDCOM is a very old, text-based file format. It is the de facto standard for representing genealogical trees and is supported by basically everything that deals with genealogical data.
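To give a feel for what "text-based" means here: each GEDCOM line starts with a level number, optionally followed by a pointer such as @I1@, then a tag and an optional value. The sketch below parses a single line of that shape. It is a minimal illustration only; the Line type and parseLine function are names I made up for this example and are not the API of the gedcom package.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Line is one parsed GEDCOM line. The type and its field names are
// hypothetical and exist only for this sketch.
type Line struct {
	Level   int
	Pointer string // e.g. "@I1@"; empty when the line has no pointer
	Tag     string
	Value   string
}

// parseLine splits a line of the form "LEVEL [POINTER] TAG [VALUE]".
func parseLine(s string) (Line, error) {
	s = strings.TrimSpace(s)
	parts := strings.SplitN(s, " ", 2)
	level, err := strconv.Atoi(parts[0])
	if err != nil {
		return Line{}, fmt.Errorf("invalid level in %q: %w", s, err)
	}
	if len(parts) < 2 {
		return Line{}, fmt.Errorf("missing tag in %q", s)
	}

	line := Line{Level: level}
	rest := parts[1]

	// A pointer, when present, comes before the tag and starts with "@".
	if strings.HasPrefix(rest, "@") {
		p := strings.SplitN(rest, " ", 2)
		if len(p) < 2 {
			return Line{}, fmt.Errorf("missing tag in %q", s)
		}
		line.Pointer = p[0]
		rest = p[1]
	}

	// Whatever follows the tag is the value, spaces included.
	tv := strings.SplitN(rest, " ", 2)
	line.Tag = tv[0]
	if len(tv) == 2 {
		line.Value = tv[1]
	}
	return line, nil
}

func main() {
	for _, raw := range []string{
		"0 @I1@ INDI",
		"1 NAME John /Chance/",
		"1 BIRT",
		"2 DATE 3 MAR 1850",
	} {
		line, err := parseLine(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", line)
	}
}
```

A real decoder also has to handle character encodings, continuation lines and nesting by level, which this sketch deliberately ignores.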
I started by writing a GEDCOM decoder and encoder package, github.com/elliotchance/gedcom. From there I built out more advanced functionality and other binary tools. The pinnacle of these, gedcomdiff, is what I really set out to achieve: comparing GEDCOM files.
Comparing GEDCOM Files
The gedcom package contains (amongst other things) algorithms for comparing dates, individuals, groups of individuals, families and more. Comparing GEDCOM documents is surprisingly complicated for a few reasons:
- You cannot rely on individuals having the same pointers, IDs, names, dates and so on. Accurately matching the same individual in two documents requires looking at their names, births, deaths and other similar events, then doing the same in-depth analysis on their parents, spouses and children.
- Often the same names and similar dates appear on many individuals. This makes it especially important to consider the similarity of surrounding individuals when deciding whether two John Chances born one year apart (or even in the same year) are the same person or not.
- Knowing the similarity of attributes (such as the numerical similarity of a name) is only useful when all attributes of the individual and surrounding individuals have appropriate weights. For example, the similarity of the individual's birth date carries far more weight than the similarity of their respective fathers' birth dates. Weights can be configured manually; however, sensible defaults have been calculated with gedcomtune, a tool for finding the best weights by calculating against real trees.
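The weighting idea in the points above can be sketched in a few lines of Go. Everything here is an assumption made for illustration: the Person and Weights types, the Levenshtein-based name similarity, the ten-year window for dates and the 0.8/0.2 weights are all mine, and none of them are the algorithms or the defaults that gedcom or gedcomtune actually use.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Person is a hypothetical, heavily simplified individual.
type Person struct {
	Name                 string
	BirthYear, DeathYear int
}

// Weights says how much the individual counts versus their father.
// The values used in main are made up and should sum to 1.
type Weights struct {
	Individual, Parents float64
}

func minInt(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = minInt(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// nameSimilarity maps edit distance onto 0..1 (1 means identical).
func nameSimilarity(a, b string) float64 {
	a, b = strings.ToLower(a), strings.ToLower(b)
	longest := math.Max(float64(len([]rune(a))), float64(len([]rune(b))))
	if longest == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/longest
}

// yearSimilarity is 1 for equal years and falls to 0 at ten years apart.
func yearSimilarity(a, b int) float64 {
	const maxDiff = 10.0
	diff := math.Abs(float64(a - b))
	if diff >= maxDiff {
		return 0
	}
	return 1 - diff/maxDiff
}

// individualSimilarity averages the name, birth and death similarities.
func individualSimilarity(a, b Person) float64 {
	return (nameSimilarity(a.Name, b.Name) +
		yearSimilarity(a.BirthYear, b.BirthYear) +
		yearSimilarity(a.DeathYear, b.DeathYear)) / 3
}

// score combines the individuals' similarity with their fathers' similarity.
func score(a, b, fatherA, fatherB Person, w Weights) float64 {
	return w.Individual*individualSimilarity(a, b) +
		w.Parents*individualSimilarity(fatherA, fatherB)
}

func main() {
	w := Weights{Individual: 0.8, Parents: 0.2}
	john1 := Person{"John Chance", 1850, 1910}
	john2 := Person{"John Chance", 1851, 1910}
	william := Person{"William Chance", 1820, 1880}
	thomas := Person{"Thomas Chance", 1790, 1860}

	fmt.Printf("same father:      %.3f\n", score(john1, john2, william, william, w))
	fmt.Printf("different father: %.3f\n", score(john1, john2, william, thomas, w))
}
```

The two John Chances look the same on their own, but the score drops once their fathers differ. That is the effect the surrounding individuals are meant to have, and the weights decide how strongly they pull the result.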
When all of these are put together the result is a very powerful diffing system that is both resilient and surprisingly accurate at finding matching individuals, even if critical information is missing or different.
gedcomdiff is a package and binary that brings all of these layers of algorithms together to generate an HTML report of the comparison between two trees.
At the top of the page is an index that provides an overview of which individuals are the same (white), different (cyan), or only exist in one of the documents (yellow or blue):
For each individual there is a more detailed comparison (same color rules):
The gedcomdiff program provides many options to configure what should or should not be included in the output.