Lexi's Blog

Using UTF-16 Files in git – or How You Teach Linguists the Basics of Version Control

Intro

This post is a bit of a story about how I came to know quite a lot about git, the version control system. The intended audience is two-fold: technical folks like developers, who will likely be interested in the detailed section How To Actually Handle UTF-16 Files In git, and not-quite-as-technical folks, who might be generally interested in technology/IT topics, might work as project managers without deep technical knowledge, or are just interested in whatever I write.

I will indicate the amount of techiness in the first sentences of the relevant section; so if you feel like you belong in the second category, you’ll know when it’s safe to skip certain parts. (If you happen to stumble over this, and you are one of the people who have worked with me, do say hello!)

Alright, so here we go!

Linguists Aren’t Developers, So They’re Not Using Developer Tools

Several years ago I worked in a linguistic project where we had to build grammar recognition rules for certain languages – in my case German.1 A GUI, built specifically for development of those language rules – an IDE, if you will – was provided to us.2 Trouble was, it was impossible to work on this stuff in parallel – there was a source folder, hosted on a shared cloud directory, and everyone on the team had to access this directory. If you and somebody else on the team were trying to make changes at the same time, one of you would inevitably win that race, and overwrite the changes of the other without them noticing (until it was too late, that is).

Management would try to employ people from across different time zones, in the hope that this would make it easier for people to work on the source directory together. There were also instructions about how to contact the team via Skype – yes, that was the communication tool of our “choice” (in other words, management had said so) – and where in our shared OneNote you had to write that you were “taking Main” (our name for the source, which I found surprisingly well-named for… whatever the heck kind of situation that was). Then you had to wait a couple of minutes to make sure that it really was safe for you to grab it, copy the directory for your own work, and work in there for as long as you needed. After that you’d do the whole “hey, anyone working on Main right now? I’m about to copy my changes in!” shebang again, wait a couple of minutes… you get the idea.

It didn’t take me long to approach my manager and ask him “hey, have you ever heard about git?” I gave him a bit of info about what version control is, and asked whether I could spend a little time trying to figure out if we could actually work with git, instead of all the copying and hoping we hadn’t just destroyed 2 hours of work by accident. Luckily my manager agreed, and so I started figuring out how to set up a repo for that.3

At that point I had already found out that the actual files that were being processed were just text files containing a domain-specific language (that I didn’t bother learning because that would have been really overkill. I’m 100% sure that this is the only project on the planet to use this schema, so no. Just no.) I started getting it all set up, but soon discovered that there was a problem I was repeatedly running into with some of the project’s files: git didn’t recognise changes made in the text files, and treated them as binary (I wasn’t out as non-binary back then, otherwise I might have been upset about this… ;).

After a lot of searching online, and some shenanigans on the command line, I found that these project files were encoded in UTF-16LE. Git expects text files to be in an ASCII-compatible encoding such as UTF-8. UTF-16 text is full of NUL bytes, so git classifies such files as binary and will not show a diff for them, or make it easy to merge them.4 Needless to say, this made the whole process of migrating to a git-based workflow a bit unwieldy.

How To Actually Handle UTF-16 Files In git

For non-techies, this is the part you’ll want to skip.

So, now we’re getting into the part that you, the techie, are really interested in. Kind of like how you always want to “jump to recipe” in any online recipe, because they’re infested with SEO stuff (which they need, because otherwise they won’t show up in big corporate global search engines…you know, nobody is a winner here).

So. UTF-16 is text, and there’s a way to tell git how to treat these. You kind of just have to make a deal with an infernal entity, in which you sell them years of your life in exchange for this arcane secret of getting git to accept your humble offer files.
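Before any of that, it helps to confirm what you are actually dealing with. Here is a minimal sketch of the kind of command-line check that reveals UTF-16LE. The file name sample.txt and the temporary directory are made up for the demo, and real files may additionally start with a byte order mark (FF FE), which this sample doesn't have.

```shell
# Create a small UTF-16LE file to inspect (hypothetical stand-in for a real project file).
tmpdir=$(mktemp -d)
printf 'Hallo\n' | iconv -f UTF-8 -t UTF-16LE > "$tmpdir/sample.txt"

# Dump the first four bytes as hex. For ASCII-range text in UTF-16LE every
# second byte is 00, and those NUL bytes are what make git say "binary".
first_bytes=$(od -An -tx1 -N4 "$tmpdir/sample.txt" | tr -d ' \n')
echo "$first_bytes"
```

For this sample the dump reads 48006100: “H”, a zero byte, “a”, another zero byte.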

What you will need is a .gitattributes file, and an entry in your git config. The .gitattributes needs to reside in your git repo, just like .gitignore would, and contains information about how git should treat special file types: the Pro Git example is Microsoft-Word-created doc or docx files, that aren’t text files, but wouldn’t it be nice to diff them because they contain mostly text? And there is actually tooling available for that, but since that’s amply described in the respective chapter in Pro Git, I’m not going to delve into that here.

Now for the actual work!

First I created a .gitattributes file, and added this to it:

my_file_name.txt diff=utf16
my_other_file_name.txt diff=utf16

By using the exact names of the files we avoid unintended side effects when UTF-8 encoded “normal” text files are tracked. That is, we don’t want all txt files to be treated with our “utf16” diff driver, just these two (and surprisingly, no, these were not the actual file names). The string “utf16” is something I chose, and yours could be called differently; you’ll define what this means in your git config. Note that the attribute has to be diff, not filter: a diff=utf16 entry is what makes git look up the diff "utf16" section of the config, whereas filter=utf16 would point at a filter section that we never define.

Now we need to add the following lines to .git/config:

[diff "utf16"]
    textconv = iconv -f UTF-16LE -t UTF-8

(I explain the iconv parameters in this StackOverflow answer.)

As you can see here, the “utf16” is really just what I decided to call it. I could have called this config key “hellish contraption to make my life harder”, but that would have been annoying to type more than once, and I’ve just typed it here, so that’ll have to do.
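To see the two pieces working together, here is a small end-to-end sketch in a throwaway repository. It pairs the diff "utf16" config section with a diff=utf16 attribute, which is the attribute git consults for textconv. The file name rules.txt, its contents and the demo identity are all invented for illustration; this only covers diffing, not merging.

```shell
# Throwaway repository; all names here are made up for the demo.
repo=$(mktemp -d)
git init -q "$repo"

# Attribute and config must use the same driver name ("utf16").
echo 'rules.txt diff=utf16' > "$repo/.gitattributes"
git -C "$repo" config diff.utf16.textconv 'iconv -f UTF-16LE -t UTF-8'

# Commit a UTF-16LE file, then change it in the working tree.
printf 'alt\n' | iconv -f UTF-8 -t UTF-16LE > "$repo/rules.txt"
git -C "$repo" add .gitattributes rules.txt
git -C "$repo" -c user.name=demo -c user.email=demo@example.invalid commit -q -m 'Add rules'
printf 'neu\n' | iconv -f UTF-8 -t UTF-16LE > "$repo/rules.txt"

# With the driver in place this shows -alt / +neu instead of a
# "Binary files ... differ" message.
diff_output=$(git -C "$repo" diff -- rules.txt)
echo "$diff_output"
```

The conversion only affects what git shows you; the file in the repository and on disk stays UTF-16LE.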

There’s more to git attributes and UTF-16LE than just textconv; I can only recommend you read man gitattributes thoroughly, and then man git-config and search for “encoding” in both. It’s a real nightmare, honestly. I’m glad I don’t have to work on UTF-16 files anymore.
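One thing that a search for “encoding” turns up in man gitattributes is the working-tree-encoding attribute. Whether it was part of the combination that eventually worked in this project is unknown, so treat the following purely as a sketch of what the man page describes: git keeps the file as UTF-16LE on disk but stores it as UTF-8 internally. It assumes a git version recent enough to know the attribute and built with iconv support; the repository, file name and content are made up.

```shell
# Throwaway repository; file name and content are made up for the demo.
wte_repo=$(mktemp -d)
git init -q "$wte_repo"

# Declare the on-disk encoding; git re-encodes to UTF-8 when staging.
echo 'rules.txt text working-tree-encoding=UTF-16LE' > "$wte_repo/.gitattributes"
printf 'Hallo\n' | iconv -f UTF-8 -t UTF-16LE > "$wte_repo/rules.txt"
git -C "$wte_repo" add .gitattributes rules.txt
git -C "$wte_repo" -c user.name=demo -c user.email=demo@example.invalid commit -q -m 'Add rules'

# The committed blob is plain UTF-8, so git can treat it as ordinary text.
stored=$(git -C "$wte_repo" cat-file -p HEAD:rules.txt)
echo "$stored"
```

Unlike textconv, this attribute lives entirely in the tracked .gitattributes file, so nothing has to be added to each person's local config. Whether the tool that reads the files copes with what git writes back on checkout is something you would have to test.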

Now, here comes the part where I feel a bit sheepish about posting this, because I cannot offer a proper solution. The thing is, I can’t even say for sure which combination worked in the end: it was that long ago, and I don’t have access to the repo anymore, as happens when you leave a job behind. I wish I had more to offer you, but it turns out the whole thing is so complex that I can’t even reproduce it in a test repo. I fiddled with this for another hour or so, but gave up eventually. If you do figure it out, I invite you to write your own blog post about it! That way, everybody wins :)

When You Build a Thing, Expect Trouble, Or: How I Almost Introduced a Giant Vulnerability

Non-techies, you may breathe in relief, because the overly technical part is over – there’s still a bit of detail in here that might go over your head, but just skip those sentences :-)

Now, this conversion (which I did manage to get working, both for diffing and for merging, eventually, after literally weeks of investigation and trial and error) introduced another problem: every single one of my team members, and everyone who’d ever have to use this, would have to add that to their repo config, or they’d not get the benefits of the conversion, and git would still report their files as binary. I could obviously write a Readme file and tell everyone that they have to run git config diff.utf16.textconv "iconv -f UTF-16LE -t UTF-8" before they do anything else. I could write a setup script that executes this for them, which still relies on them reading the Readme – and weirdly that is something you can almost never rely on; people will forget that there’s a Readme, or they don’t have access to it, or a thousand other reasons why you absolutely shouldn’t have to leave this in your users’ hands.5
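The setup script mentioned above could be as small as this. A minimal sketch, assuming it is run once per clone; the function name setup_utf16_diff is invented here, and the textconv command is the one from the config shown earlier.

```shell
# Hypothetical helper: register the "utf16" diff driver in one repository's
# local config (.git/config). Safe to run more than once.
setup_utf16_diff() {
    git -C "$1" config diff.utf16.textconv 'iconv -f UTF-16LE -t UTF-8'
}

# Usage: try it on a throwaway repository and read the setting back.
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
setup_utf16_diff "$demo_repo"
configured=$(git -C "$demo_repo" config --get diff.utf16.textconv)
echo "$configured"
```

It still only helps the people who actually run it, which is the whole problem.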

Well, it turns out that this – writing about it in the Readme or some other way of informing your team members – is actually the safe way! Ever wondered why the git config is not committed? That would be a big security risk: settings like textconv are commands that git runs on your machine, so a config that arrived with a clone could execute whatever its author wanted. Back then I didn’t even consider that there could be other options than just talking to people, so I ended up telling everyone individually anyway. I’m honestly glad I didn’t accidentally add a huge security risk to this project, or they might not have been as happy with me as they were. (I’ll also add that talking to your colleagues at regular intervals is good for you. When was the last time you saw them all for lunch or coffee? Might want to do that ;)

When You Build a Thing, Expect Trouble, Part Two, Or: How I Accidentally Made Myself Invaluable To This Project

There were more problems than just the diffing, of course. We ran into all kinds of trouble, like line endings being the nightmare they are on Windows,6 and then some random issues that seemed to pop up now and then without us ever figuring out why exactly. I ended up telling people to re-clone from our main repo quite a lot. It wasn’t perfect, but it did work out well for us.

No more “is anybody using Main?”.

No more 5-minute waits to ensure you don’t accidentally fry another developer’s work.

No more “oh, could you do something else? XYZ is just working on Main, I’m sure it won’t take long.”

Instead, I got to have fun with rebasing, which taught me the value of the rerere.enabled config. I also learned how to plan and manage branches and integration from multiple people, which I hadn’t had to do before, and had no idea about. Another colleague, who next to me felt most comfortable using the command line, and I became “integration managers” – people’s changes had to be reviewed by us before they could be merged. We did that to ensure that everything worked out, but it was only necessary because now we had complexity in the process we didn’t actually want. We still considered the boon of actually getting work done during normal office hours, without waiting until someone else was finished, worth the occasional problem with merging, or a broken repo. Again: not perfect, but better than before. Progress can be nice!

I also accidentally caused some trouble myself a couple of times. In figuring out the exact combination of attributes and config and what have you, I rendered the file completely unusable once. Another time I managed to have everything work fine – on my machine, that is, while other people couldn’t open the file from the GUI anymore.

The thing is, though, the benefits of this change were actually really great for the project. After we managed to get it to work (mostly) in our German team, my manager asked me if I could help introduce that to some of the other languages they were writing rules for as well.

I ended up helping a lot of people with their first forays into git; most of these people were not in fact developers, but linguists. Some of them were more tech-inclined than others, but I’m proud to say that, after my mentoring, even the less tech-y folks were confident in using the limited amount of git commands they needed to do their jobs.

For the QAs that meant a combination of git pull, git checkout, git log and git show, for the developers (I ended up calling all the linguists developers, because we were developing, only not writing code but grammar7) it was a bit more – I wrote a lot of documentation back then, with detailed step-by-step guides, and most of them included a “call me if you tried all this and are still stuck” NB at the end. I also did end up in a lot of calls, but I usually got the people out of their dilemmas quickly, so everybody was grateful.

Final Thoughts

To this day, this is one of my favourite experiences as a developer. The gratitude of those people, who had to learn a tiny bit of a more “tech-y” thing that ended up helping them immensely in their day-to-day work, was so refreshing. The feeling of accomplishment after burrowing through what felt like hundreds of StackOverflow questions and answers, forum threads, man pages, the odd blog post (I absolutely can’t remember what I read where, but I do believe it’s likely that if you search for “git utf-16 gitattributes” you might end up reading the same pages as I did back then) and what have you, until it finally worked, and I had done this, I had managed to learn all this stuff and “build” a working solution (it’s more configuration, if you want to be technical, but that didn’t make it any less impressive for me back then)! The amount of stuff I had learned about git – to this day, in most projects, I quickly realise that I’m one of the most knowledgeable people in the room when it comes to git things, if not the most knowledgeable. Really makes me think that Ludic is right: you need to read one book to become good at something. I now know that I am good at git, and that’s something I am proud of.


  1. For use in a well-known text processing tool of an equally well-known company. ↩︎

  2. For non-techies: GUI: Graphical User Interface, in other words, you don’t have to use a command line. Like you’re probably not reading this blog post in a terminal but a browser. IDE: Integrated Development Environment, basically a very fancy code editor that can do more than just let you type in code. ↩︎

  3. For non-techies: repo = repository, that is: where the code resides. ↩︎

  4. For non-techies: “diffing” means “viewing the differences in your versions”, and “merging” is the process of putting these differences together. I’m not going to go into the differences between UTF-8 and UTF-16 here, but just be assured that the latter probably only has a good reason to exist in legacy systems. If you use UTF-16 today when you could be using UTF-8, you have earned your place in development hell, where a nerdy demon teases you with your inability to use tar flags correctly. ↩︎

  5. There is a way to have parts of your git config tracked, but you still have to be careful what you put in there: include and includeIf. ↩︎

  6. For non-techies: There are special characters that indicate that a line in a text file ends. And for whatever reason, these are different on different operating systems. You don’t really need to know more than that Windows is the odd one out. Don’t search it, you’ll regret it, I promise! For techies: You might of course already know this, but in case you’re not familiar, here goes: Windows uses \r\n, which is “carriage return” and “line feed” (I keep forgetting, but the term “newline” actually means all of those: See Wikipedia), while unixoid systems like Linux, BSD etc. only use \n (“line feed”, for whatever unholy reason). Mac has adapted to the unixoid version, but used to only use \r. See also XKCD #927. ↩︎

  7. Chomsky would agree that these are mostly interchangeable. ↩︎