Isn't this a fundamental difficulty with any flat text-file representation for code, though? Many programming languages don't lend themselves to easy semantic analysis based on even a single file, never mind a combination of files. We usually don't work with canonical textual representations of even quite simple programming ideas such as the example in the blog post of replacing a nested condition with a guard clause.
At our current level of expressiveness, the refactoring tools that make these semantic changes mostly aren't a huge step up from brute force text editing, and the mechanical changes we can automate are usually trivial compared to the way a developer might conceive a required change in the behaviour of his code. Trying to reverse the process to identify the semantic significance of changes after the fact is far beyond us today.
In fact, it's hard to see how we could ever move beyond that level without developing some new and much more semantically rich way to represent our programs. At that point, the idea of diffing raw text code side-by-side might seem like it comes from the dark ages anyway...
There have been a thousand graduate-school and commercial projects trying to develop program representations and languages superior to the flat text file. Obviously, none of them has received wide usage.
What has succeeded for many is scriptable text editors, cscope-like tools, autocomplete-like features, and refactoring support in editors. But as you point out, these are better tools to work with program text.
I think text is safe for the foreseeable future. Partly because we have thousands of years' experience with it, and partly because it's a winning combination of an extremely powerful representation format and a KISS solution.
Chunk format Smalltalk change logs are a text file, but they're combined with a mechanism for treating code changes like db transaction logs, in a language with very little syntax, with very fine grained codebase change (method level, with generally small methods) built into the language and environment. It's also as rock solid as traditional flat text files, and even more robust in some regards.
Code syntax which requires all information relevant to a class to be defined together interferes with a change log scheme. But such syntactic structures are helpful when dealing with code in traditional flat files. In a way, it's like an Evolutionarily Stable Strategy: it's mostly a flat-file world, so languages are mostly implemented with that in mind. This general situation makes it almost impossible for anything else to develop its own ecosystem.
I think forward progress will be made by leveraging "code folding" schemes. Eventually, this will amount to the same functionality people are aiming for when they discuss non-text representations.
I agree that, for better or worse, we'll be using text-based formats for a while.
I don't think the hypothetical alternative representation I mentioned is going to happen off the back of a single grad school research project, or anything even close to that scale. It's more likely something that will take the R&D lab of an industry heavyweight a decade to develop and refine, and then launch into widespread industrial usage with the backing of at least one major platform developer when it's ready for prime time.
Sadly, as long as text-based (or, perhaps more accurately, line-based) programming languages are good enough to produce acceptable software, there isn't a truly compelling motivation to develop a completely new model. I wonder whether increasing pressure on the software industry to provide quality and security as a backlash against the current cheap-and-nasty trend will drive us toward more radical programming models first, and perhaps get us close enough to make the quantum leap to an entirely new kind of representation from there.
IMHO the limiting factor can be described as "semantic density". If you look at non-LCS diff algorithms like patience diff [1] and think about where they fall down, it's on cases where it's very arguable that the code is poor: too much duplication, and therefore low semantic density. I think (and am attempting to further this at work for non-code) that the key is to do something like patience diff on characters, but looking at the context of the individual characters when determining whether they are unique.
The benefit here is that you don't need to know about the parse tree, don't depend on special characters to delimit elements (newlines), and don't get LCS type diffs that aren't based on any sort of semantic intuition beyond that the minimal change is probably what was meant.
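For readers who haven't seen it, the heart of patience diff is easy to sketch. Below is a rough, illustrative Python version of just the anchoring step (the function name and structure are mine, not taken from any particular implementation): it keeps only the lines unique to both files, then matches them with a longest-increasing-subsequence pass so their relative order agrees on both sides. A real diff would then recurse between consecutive anchors.

```python
import bisect
from collections import Counter

def patience_anchors(a, b):
    """Find anchor lines a patience-style diff would use: lines that
    occur exactly once in BOTH sequences, matched so their relative
    order agrees on both sides (longest increasing subsequence)."""
    ca, cb = Counter(a), Counter(b)
    unique = {ln for ln in ca if ca[ln] == 1 and cb.get(ln) == 1}
    a_pos = {ln: i for i, ln in enumerate(a)}
    cand = [(a_pos[ln], ln) for ln in b if ln in unique]  # in b's order

    # Patience sorting: longest increasing subsequence of a-positions.
    tails, links = [], [-1] * len(cand)
    for i, (pos, _) in enumerate(cand):
        j = bisect.bisect_left([cand[t][0] for t in tails], pos)
        if j > 0:
            links[i] = tails[j - 1]
        if j == len(tails):
            tails.append(i)
        else:
            tails[j] = i

    out, k = [], (tails[-1] if tails else -1)
    while k != -1:
        out.append(cand[k][1])
        k = links[k]
    return out[::-1]  # the real diff recurses between these anchors
```

Note how duplicated lines never become anchors, which is exactly why heavy copy-paste degrades the result, as described above.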
I've read about Patience diff, but could not find good, real-world examples where LCS diff choked but Patience succeeded.
Just last week, however, I had a small code change that produced a crazy diff that would give my code reviewer an unnecessary headache. I tried Patience diff on a whim and it produced a clean diff. Admittedly, it just entirely -removed the old code block and +added the new one, but at least it was readable.
I work on a code review tool, and we've been shipping patience diff for a few years now. It has really cut down on complaints about the resulting diff being incomprehensible, and on the one or two occasions people have complained, it's been really easy to point at the data and explain why - and that has always been because there was a "missed opportunity to write better code" or a big pile of copy-pasta.
I know it's sort of a silly argument to make, but LCS's algorithmic intuition leaves me cold.
Very interesting. Is patience diff your tool's default algorithm? Do you support diff algorithms (perhaps for different file types) other than patience or LCS?
If by "default" you mean "only available", yes :) Non-code review is really what is driving my delving into non-CRLF-delimited diff elements. Breaking prose into chunks big enough to diff well while still finding small copies is my current target. Once it works for prose I might try to use it on code.
I've been itching to figure out how to combine perceptual hashing, computer vision like OpenSURF, and stable marriages to do something really interesting for image diffs, but until I can shed some less sexy responsibilities I don't think I'll get to play with that.
You'd probably be better off working on the revision control side of things. As a simple example, develop a more semantically rich way to represent the diff for human review: perhaps a tree graph in both panes, with highlighting based on what has changed, as well as on nodes that will be affected by the change (e.g. functions that call the changed function).
I say this because such a tool may be easily adopted into existing workflows. This is key.
Diff works best when things are single idea per line, and when control structures don't get in the way.
One example: using the ternary (?:) operator as a replacement for an if/else assignment can be quick and easy when programming. The problem is that diffs with it can look like a total mess, because a whole lot of things are happening in that one line.
Similarly, certain languages where program flow or control structures incline people to make many things happen in one line (inline regexes, Lisp- or Scheme-syntax languages) can diff in a confusing manner.
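The ternary point is easy to demonstrate with Python's difflib, using a Python conditional expression as a stand-in for ?:. When only the else-branch value changes, the one-line form forces the diff to re-state the condition and both branches, while the expanded form isolates the change to a single line (the variable names here are made up for illustration):

```python
import difflib

# One-line conditional expression: the whole statement is one diff unit.
dense_old = ["status = 'open' if count > 0 else 'closed'"]
dense_new = ["status = 'open' if count > 0 else 'archived'"]

# Expanded if/else: each piece sits on its own line.
loose_old = ["if count > 0:", "    status = 'open'", "else:", "    status = 'closed'"]
loose_new = ["if count > 0:", "    status = 'open'", "else:", "    status = 'archived'"]

def changed(old, new):
    """Return only the +/- lines of a unified diff (skip headers)."""
    return [l for l in difflib.unified_diff(old, new, lineterm="")
            if l.startswith(("-", "+")) and not l.startswith(("---", "+++"))]

# Dense form: the condition and both branches are re-stated in the diff.
print(changed(dense_old, dense_new))
# Loose form: only the else-branch assignment shows up as changed.
print(changed(loose_old, loose_new))
```

Both diffs are one removal and one addition, but in the expanded form the unchanged condition and then-branch survive as context, so the reviewer sees exactly what moved.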
Diff is great when everyone is using a coding/whitespace standard, and things tend to atomically happen on one line.
I'd encourage that when refactoring code that's not to the standard, you do two passes - one to clean up the code to the standard, then another to make the actual changes.
> Diff works best when things are single idea per line, and when control structures don't get in the way.
Has "coding to a diff" changed your coding style?
When I know that I will have to prepare a diff for code review, I find myself writing code in a way that produces a cleaner diff. I use more vertical whitespace to collect related code into logical "paragraphs" and keep from stepping on other code when diffing.
I definitely agree with doing formatting/nonfunctional changes independently from other changes. In general all commits should be as focused as possible: have one purpose, don't just change whatever else you happened to think of while editing the same file.
A "diff" is also incredibly useful prior to a commit to make sure that you changed only what you thought you did. Programmers should go to great lengths to adopt a style that is "diff compatible" so they aren't likely to miss anything important.
Good point on making separate commits for "whitespace" versus "content" changes.
We currently have pep8 (python style checker) as a git commit hook, and the first few times we used it, we'd have 10 lines of legitimately modified code and then 100 lines of "style corrections."
Usually I commit my changes first, and then do another pass with style changes. Maybe it would be easier if I did it the other way around - whitespace pull request first, followed by the actual change.
I've used environments with quick and reliable Undo/Redo stacks. To be completely compatible, one would have to treat code as data in a "nondestructive" editing scheme and save off the actual refactoring steps, which can then be committed automatically.
Or maybe, a "quick snapshot" facility could be developed to make it easier to save off intermediate steps and commit the series of them automatically.
The point is fair enough, but to call this a tyranny seems a bit much.
If one believes (as, say, the Literate Programming movement does) that code alone isn't sufficient to convey an author's intent, why should diffs alone, which mark changes in a codebase, be any different?
Comprehensible commit messages and comments explaining the purpose of a method itself (and perhaps the mechanism around which a block of code functions if necessary) are no vice and go a long way to mitigate or counteract any such "tyranny".
Seriously: in most cases it's possible to split up a 30-line patch into 5 separate patches that, applied in order, monotonically improve your code base and are easier to review, in total, than one larger patch. Changesets are cheap, and we should be optimizing them for easy review, so any eyeball we get can see what's going on.
When I'm ready to commit, I usually have solved the problem in code. In order to create five separate patches that illustrate the line of thinking, wouldn't I have to go back in time and re-create the intermediary steps?
I suppose I could create patches as I am actively solving the problem, but at that point my code may very well be a mess that I'd have to clean up each time.
Of course all of that is moot if your problem naturally segments into several patches, if nothing else then simply by virtue of being larger than a 30-line patch or involving several mostly independent components.
If you've completed a big change and can't stage it in separate commits that build upon each other ("telling a story" of the feature development), I recommend at least splitting non-overlapping chunks that can stand (compile/test) independently into their own commits.
git-cola is a good GUI to visually stage chunks into separate commits. I haven't really used git-cola's other functionality, but I really like its visual staging features.
Basically, yes. It does mean more work for you, for the benefit of the reviewers of the code. Arguably, it is more important to optimise for the readers of the change (many people, spread over time) than for the writer (one person, once).
If you use, say, git, you might commit every single, tiny change separately, on a private (local) branch; then use `git rebase --interactive` to reorder/merge the commits as necessary. This is the easiest way I have found, but it still involves more work.
If you wanted to take it further, you could have your editor automatically commit on every save, with a post-commit hook that takes your changes into a staging area, compiles/tests, and provides a report for each commit.
It isn't necessarily easier to understand five simpler patches than one more complex patch. That is the same argument that makes people believe that short functions are always better. It's a bookkeeping trick: allocate part of the cost to something you consider irrelevant (e.g. the incremental cost of a commit or of defining and calling a function) and it seems like total complexity has gone down when in reality it has often gone up somewhat.
It's easier to understand five simpler patches if 'diff' is too stupid to render the more complex patch. "Here I renamed this variable, then I moved this hunk over there, then I made the change."
Even aside from the artifact of diff, I don't think this situation is analogous to short functions at all. The big difference is that commit logs are much more serial in nature. If I break one commit down to a sequence of 5 simpler commits they will almost always be read in that order, with context preserved.
Largely agree, but it's still a tradeoff. Having many fine-grained commits makes individual steps more intelligible but the overall path harder to make out.
Does anybody actually use commit histories to eyeball the evolution of a project? I have some experience browsing them to find simpler-to-understand snapshots, but that's a much more incremental idea, and even that is usually met with blank looks.
If there were people doing this, I'd see the odd little tool to make it easier, to annotate logs and so on. But there are no signs of this.
If nobody is looking to commit histories for narrative, perhaps it's because the commit history is the wrong place for it. Since history is immutable it's really hard to have a coherent narrative of a project. Trying to do that ends up with commit messages like "final version", "really final version", "final version this time for sure", etc. My attitude has been to leave the narrative up to the reader to reconstruct. All I can do is talk about this particular point in time.
Hmm, even if the globally coherent narrative is impossible, perhaps it's worth trying to keep a piecewise-coherent history. I tend to have 'section boundaries' where I start a new feature/subsystem/narrative[1]. Perhaps I should demarcate them with "=== " or something so they're easier to see in the log.
[1] Again, I never attempt to demarcate where a feature or narrative ends, because that's impossible to judge without hindsight.
I don't know if this is what you mean, but I use version-control diff, log, and annotate ("blame") extensively for finding the commit that introduced a particular feature, in the hope that the commit message or the diff for the entire changeset will shed some light.
Emacs has tools that make this fairly easy with any version-control system supported by Emacs.[1]
Presumably some of the web-based browsers provided by version-control tools offer the same functionality, but every one I have seen lacks the crucial feature of re-running "annotate" again starting from the revision just before the revision that last changed line X. Otherwise you're looking at "annotate" output for a whitespace change, and to get to something useful you must manually get the log for the file, find the previous revision to what "annotate" was reporting, and run "annotate" again from there.
I've also toyed with storing documentation in commit messages themselves. For example, I wrote a blog post[2] where all the code samples in the article reflect files tracked in a version-control system, evolutions in the code samples are different commits, and the text of the article itself is taken from specially-annotated text in the commit messages. Turns out (surprise surprise!) that this is totally unmaintainable, but it was a fun exercise.
Do you have any concrete examples/writeups of your approach?
Your emacs guide looks awesome. I didn't realize just how powerful emacs can be for this version control browsing. I'm going to go through it in its entirety.
I've tried to put some tools together several times (trying to integrate with vim) without success. Mostly my approach boils down to being more aware of the commit history as I navigate. When I find myself in a new codebase I start with browsing the initial commits. Then as I go over the codebase I aggressively use git log <path>, and might drill down to look at specific, tantalizing commits.
One of my project ideas is a 'wikipedia for open source' where anybody can browse the code for open source projects both in space and in time (like emacs seems to allow), and add annotations to specific revisions. While reading, annotations from previous revisions are rendered as well, but every annotation would give some indication of age (like '350 days ago' on HN) which would help the reader gauge if it might be out of date. This would allow readers to collaboratively say, "read this snapshot first if you're new to the project" and so on.
IMO the big reason lots of great hackers mistrust code comments is that when you add a comment it hangs around forever by default, unless someone takes the time to decide to delete it. Attaching comments to the commit log or to annotations on a specific revision helps with this.
---
"I wrote a blog post where all the code samples in the article reflect files tracked in a version-control system, evolutions in the code samples are different commits, and the text of the article itself is taken from specially-annotated text in the commit messages."
You can make your approach more maintainable if you give up on keeping the prose coherent. The biggest problem with documentation is that it doesn't get written most of the time. I focus on mechanisms that make it more likely I will provide that one key sentence for future readers. And -- like in wikipedia -- I assume the reader is reading critically enough to be able to handle glitches.
I tend to have different techniques for different situations. For instance, when wrapping an existing chunk of code in an "if" statement or "while" loop, I'll check in the initial change without re-indenting the existing code. Maybe I'll half-indent the new parts or maybe not indent them at all. Either way, that makes the diff readable. Then I'll do a re-indent and check that in, but to me that's ok since it's just a "whitespace" checkin.
I wonder if there's a sort of perverse incentive at work, where devs try to avoid checkins that are "too simple" or "not meaty enough". It seems like a phenomenon similar to avoiding the extract-method refactoring, even down to the limit of one-line methods.
A lot of people have mentioned ugly diffs when lines are considered the basic unit of granularity, often giving examples that are much cleaner if words are considered fundamental (e.g. latex, or lists of files in makefiles). But there are many tools that can handle word-based diffs, and most have options to change what's considered a word.
git diff can use --word-diff(=color) and --word-diff-regex=...
There is also the venerable "wdiff" program.
I've only found one program that can do "word patches" though: "wiggle" http://freecode.com/projects/wiggle . It works well, though I've found the interface to be slightly confusing. But turning it into a git diff and merge driver isn't that hard.
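For anyone curious what these tools do conceptually, a word-level diff is just an ordinary diff with a different tokenizer. Here is a minimal sketch (illustrative only; the output markers mimic `git diff --word-diff=plain`, but this is not how wdiff or git actually implement it):

```python
import difflib
import re

def word_diff(old, new):
    """Word-level diff sketch: tokenize on whitespace instead of
    newlines, then compare tokens with SequenceMatcher. Deletions
    are wrapped in [-...-], insertions in {+...+}."""
    a, b = re.findall(r"\S+", old), re.findall(r"\S+", new)
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op in ("replace", "delete"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if op in ("replace", "insert"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
        if op == "equal":
            out.extend(a[i1:i2])
    return " ".join(out)

# Adding one file to a one-line makefile list stays readable:
print(word_diff("OBJS = foo.o bar.o", "OBJS = foo.o baz.o bar.o"))
# OBJS = foo.o {+baz.o+} bar.o
```

A line-based diff of the same change would replace the entire line, repeating the whole file list with one word different.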
If you use vim, try the fugitive plugin[1]. You can do :Gdiff to see side-by-side diff between what is staged and the working tree. It provides some context, but if you need more you can unfold it.
More generically, many diff viewers exist, and they certainly help you understand what is going on with a diff in a side-by-side comparison. However, in a multi-file refactoring, this can still be a bit tricky. Say, for instance, you split one method into 3 smaller methods that are somewhat inter-dependent (you had x(), but now it is a(), b(), c(), and you sometimes call a(); c(), sometimes b(); c(), and sometimes just c()); in that case you have a complex diff in which it's hard to make sure everything is correct. Even if you can turn it into a series of changes keeping x() as a wrapper around a(), b() and c(), you'll frequently end up with several multi-file diffs to look at as you migrate.
I somewhat agree with the author, that maybe there is some better change viewing paradigm we aren't seeing for these complex cases, that could benefit everyone.
I realized that I didn't want to look at the diff any more. I just wanted to see the full body of the affected method before and after my changes. Oh, and I also wanted to see whether I added new tests or changed existing ones.
Well, there are multiple ways to do that. This doesn't really have anything to do with the diff itself, but with how you choose to view it.
Absolutely, I was just thinking that. Many version control systems support an arbitrary diff command, e.g. via environment variables. In that case, one might be able to try tkdiff, for example.
That would be an interesting feature to add to diffs - if there are huge structural changes in a block, just show before and after instead of trying to show differences.
No idea how you would define the criteria or implement it though.
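One crude criterion: measure how much the two versions of a hunk still have in common, and fall back to a plain before/after view below some similarity threshold. A hedged sketch using difflib (the function and the 0.5 cutoff are my invention, not an existing tool's behaviour):

```python
import difflib

def smart_diff(old, new, threshold=0.5):
    """If two versions of a block share too little content, skip the
    interleaved diff and show the full before/after instead. The 0.5
    cutoff is an arbitrary guess, not a tuned value."""
    if difflib.SequenceMatcher(None, old, new).ratio() < threshold:
        return ["--- before ---", *old, "--- after ---", *new]
    return list(difflib.unified_diff(old, new, lineterm=""))
```

A real tool would probably apply this per hunk rather than per file, and might make the threshold language-aware, but the basic trigger is just a similarity ratio.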
I'd like an option to collapse additions or removals of entire functions into just one line saying "foo() added" or "foo() removed", instead of the 50 lines of the function's body scrolling past. However that does have to be language specific.
Most languages and file formats don't enforce an inherently "undiffable" layout, so it's really an issue of programming style. People tend to lazily do what's easiest to write instead of thinking about what will be easiest to read later (in "diff" or otherwise).
A common example I see is something like a one-line list of files to build in a makefile. It is certainly possible to put every file on its own line and backslash-escape each line ending, and doing so produces a very readable diff: if someone adds a file you see "+ xyz.c" (or whatever) and that's it, instead of a mangled mess where the whole file list is repeated with one word different.
Simple suggestion: Don't change the old code. Copy the old method and then rename the old method. Refactor the copied code. Diff will only mark your 'newly added' code which will check in without conflicts. Later remove the old method.
For the reasons described in this post I tend to separate in different commits those changes that alter functionality from those that are refactorings.
Sometimes I refactor before making my changes, and sometimes I do it afterwards. In any case I try not to mix a change in the functionality and some refactoring in the same commit. That way, in retrospect, it's easier to me to understand each commit: the ones related to changes in functionality have simple, easy to understand diffs, and the ones related to refactorings have messy diffs but at least I know that they don't change any functionality.
I have the same problem when maintaining LaTeX files in an RCS. After any "big" change (a rewritten sentence, etc.), I customarily reformat the text in emacs so that it looks nice on screen, which also totally messes up the diff.
emacs has a mode which allows one "logical" line to wrap and be edited as many "physical" lines, but when I tested it a few years ago, it was rather broken. Fortunately, for editing LaTeX and such, I don't really need the diffs; I'm just interested in archival.