An all too common scenario…
At work we use the PSR-2 standard for code style when writing PHP. This works great for new code but creates a new problem… If you’re editing an old file that does not use this style, you have three options:
- Use the existing style of the file; this creates the nicest diff for code review and for looking back at annotations, but moves you even further away from a nicely formatted codebase.
- Format the whole file; this incremental approach makes sense but can create an enormous diff, which is bad for code review and adds your name to lots of lines you didn’t actually modify (you only changed the whitespace).
- Format a portion of the file; this may be just the functions you have worked on. This may be the worst solution: it minimises the noise in the diff but creates multiple styles in the file for the next person to be confused and frustrated by.
Wouldn’t it be great if everyone used PSR from the start?
Enter git filter-branch!
Git has tonnes of powerful tools, and filter-branch is one of them. It can be used for lots of things we won’t explore here; we will use it for one very specific job: filtering through the tree of every commit.
--tree-filter will effectively replay the whole history of commits (all other branches as well, as we will see), apply an arbitrary command and commit whatever is generated to a new history using the same commit metadata. Fantastic! This is like having all the original commit authors format all their code before they commit.
So what’s the catch? From a historical standpoint you will get an all new history: every rewritten commit gets a new ID, so any systems you have that rely on existing commit IDs will now point to invalid commits. This will probably affect CI systems and the like, but hopefully only once, until they get their new bearings.
I can deal with the new history, is there anything else? Well, the other major problem is the amount of work it’s going to have to do if you have a large codebase. For this we are using phpcbf, but this would work for any language that provides a tool to automatically clean up code style. However, the formatting tool will try to format the entire codebase for every commit, which creates an enormous amount of unnecessary work.
It is still a good starting point so here’s the command:
git filter-branch --tree-filter 'phpcbf src' -- --all
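The --all covers all our branches, not just the history of the current HEAD. The -- before the --all is important: it separates filter-branch’s own options from the revision options, so --all is passed through as a revision selector rather than being parsed as a filter-branch option.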
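To see what the rewrite does without touching a real project, the command can be tried on a throwaway repository first. The sketch below is only an illustration: format.sh is a hypothetical stand-in for phpcbf (it just expands tabs to four spaces), and the repository, author and file names are made up. The `|| true` is an assumption worth testing against your own setup: filter-branch aborts when the filter exits non-zero, and some phpcbf versions return a non-zero status even after fixing files successfully.

```shell
#!/bin/sh
# Sketch: run a tree filter over a throwaway repository.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the advisory pause in newer Git

work=$(mktemp -d)

# Hypothetical stand-in for phpcbf: expands tabs in every .php file under $1.
fmt="$work/format.sh"
cat > "$fmt" <<'EOF'
#!/bin/sh
find "$1" -name '*.php' | while read -r f; do
  expand -t 4 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
EOF
chmod +x "$fmt"

mkdir "$work/repo" && cd "$work/repo"
git init -q .
# Made-up identity, used only to create the demo commit.
export GIT_AUTHOR_NAME=alice GIT_AUTHOR_EMAIL=alice@example.com
export GIT_COMMITTER_NAME=alice GIT_COMMITTER_EMAIL=alice@example.com

mkdir src
printf '<?php\n\techo "hi";\n' > src/a.php
git add . && git commit -q -m "Add a.php"
old_id=$(git rev-parse HEAD)
old_meta=$(git log -1 --format='%an|%ad|%s')

# `|| true` stops a non-zero formatter exit status from aborting the rewrite.
git filter-branch --tree-filter "$fmt src || true" -- --all >/dev/null 2>&1

new_id=$(git rev-parse HEAD)
new_meta=$(git log -1 --format='%an|%ad|%s')
echo "old commit: $old_id"
echo "new commit: $new_id"
echo "metadata:   $new_meta"
```

The commit ID changes because the tree changed, while the author, date and message carry over. filter-branch also keeps the pre-rewrite refs under refs/original/, which is handy if the result needs to be compared or thrown away.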
It’s Time to Get Tricky
We can do a lot to speed this up:
- Move the git repository to an in-memory file system like tmpfs. This will certainly make the I/O faster, but the real problem is that it’s doing a lot of identifiable extra work we can cut out.
- Run all the file formatting in parallel. For a given commit all the files could be formatted independently and at the same time. I originally tried this in bash but it was more trouble than it’s worth, especially when it hits a commit with a lot of changes and spawns thousands of processes.
- Only focus on files that have been changed. This is easily the largest performance boost, and used in conjunction with the two above it should turn hours of formatting into minutes.
I won’t explain here how to do the first point; you might want to try the commands below first before going to that extra work.
git filter-branch --tree-filter 'phpcbf $(git show "$GIT_COMMIT" --name-status | grep -E "^[AM]" | grep "\.php$" | cut -f2)' -- --all
Just to give a quick overview of what it’s doing:
- git show "$GIT_COMMIT" --name-status returns all the files changed by that commit, each prefixed with a status letter and a tab.
- grep -E "^[AM]" filters down the statuses to Added and Modified only. No need to try and format files that are being Deleted.
- grep "\.php$" keeps only PHP files. The dot is escaped and the pattern anchored so that it matches the .php extension rather than any path containing "php".
- cut -f2 removes the status prefix from the list so we just get the raw file paths.
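The pipeline above takes its file list from git show, whose default output also contains header lines such as "Author:" that can slip through a filter on lines starting with A or M. One alternative, sketched here on a made-up throwaway repository, is git diff-tree, which prints only paths and can do the status and extension filtering itself. GIT_COMMIT is set by hand in the sketch; inside a tree filter it is exported by filter-branch.

```shell
#!/bin/sh
# Sketch: list the PHP files a commit added or modified, using plumbing only.
set -eu

demo=$(mktemp -d)
cd "$demo"
git init -q .
# Made-up identity, used only to create the demo commits.
export GIT_AUTHOR_NAME=alice GIT_AUTHOR_EMAIL=alice@example.com
export GIT_COMMITTER_NAME=alice GIT_COMMITTER_EMAIL=alice@example.com

mkdir src
printf '<?php\n' > src/kept.php
printf '<?php\n' > src/gone.php
printf 'notes\n' > README.txt
git add . && git commit -q -m "Initial files"

printf '<?php // edited\n' > src/kept.php   # modified
printf '<?php\n' > src/new.php              # added
printf 'more notes\n' >> README.txt         # modified, but not PHP
git rm -q src/gone.php                      # deleted
git add . && git commit -q -m "Edit, add and delete"

GIT_COMMIT=$(git rev-parse HEAD)

# --root          also report files for a commit that has no parent
# --diff-filter   keep Added and Modified entries only
# '*.php'         pathspec that restricts the list to PHP files
changed=$(git diff-tree --root --no-commit-id -r --name-only \
            --diff-filter=AM "$GIT_COMMIT" -- '*.php')
echo "$changed"
```

In the tree filter, this list could be handed to phpcbf in place of the git show pipeline. One caveat worth testing on a scratch clone whichever pipeline is used: the tree filter checks out each original commit before running the command, so a file is only reformatted in the commits that list it as added or modified.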
Verifying the Result
Use git blame to look at a file that was not formatted previously. You should see that this file is now formatted nicely and has all the original authors and dates on the left of the output.
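The same check can be scripted. This is a rough sketch on a throwaway repository with two made-up authors; format.sh is again a hypothetical stand-in for phpcbf. It relies on filter-branch keeping the pre-rewrite refs under refs/original/, which lets the old and new histories be compared side by side.

```shell
#!/bin/sh
# Sketch: confirm that a rewrite kept authorship and dates intact.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the advisory pause in newer Git

work=$(mktemp -d)

# Hypothetical stand-in for phpcbf: expands tabs in every .php file under $1.
fmt="$work/format.sh"
cat > "$fmt" <<'EOF'
#!/bin/sh
find "$1" -name '*.php' | while read -r f; do
  expand -t 4 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
EOF
chmod +x "$fmt"

mkdir "$work/repo" && cd "$work/repo"
git init -q .

commit_as() {  # commit_as <author> <message>
  GIT_AUTHOR_NAME="$1" GIT_AUTHOR_EMAIL="$1@example.com" \
  GIT_COMMITTER_NAME="$1" GIT_COMMITTER_EMAIL="$1@example.com" \
    git commit -q -m "$2"
}

mkdir src
printf '<?php\n\techo 1;\n' > src/a.php
git add . && commit_as alice "Add a.php"
printf '\techo 2;\n' >> src/a.php
git add . && commit_as bob "Extend a.php"

git filter-branch --tree-filter "$fmt src || true" -- --all >/dev/null 2>&1

# Author, author date and subject of every commit, before and after.
branch=$(git symbolic-ref --short HEAD)
old_log=$(git log --format='%an|%ad|%s' "refs/original/refs/heads/$branch")
new_log=$(git log --format='%an|%ad|%s' HEAD)

# Distinct authors that git blame reports for the reformatted file.
blame_authors=$(git blame --line-porcelain src/a.php |
                  sed -n 's/^author //p' | sort -u | paste -sd, -)

echo "history metadata unchanged: $([ "$old_log" = "$new_log" ] && echo yes || echo no)"
echo "blame authors: $blame_authors"
```

On a real project it is also worth running the style checker over the tip of each branch after the rewrite, to confirm that every file ended up formatted.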