December 29th 2011

My gitolite set-up

I’m paranoid, but also poor. I use gitolite to control access to my git repositories, because github wanted $200/month to meet half of my requirements, and weren’t interested in negotiating (I tried).

Like github, I have two types of git repositories: public repositories, which show up on gitweb, git-daemon, etc. and which everyone can access; and private repositories, which contain my bank details.

My conf file consists of:

A set of user groups: While gitolite supports multiple keys for one user, I prefer to treat my various machines as separate users, for reasons that’ll become apparent later.
@faux = admin fauxanoia fauxhoki fauxtak
@trust = @faux alice
@semi = fauxcodd fauxwilf bob

A set of repositories, both public and private:
@pubrepo = canslations
@pubrepo = coke
@pubrepo = cpptracer
...
@privrepo = bank-details
@privrepo = alices-bank-details

Descriptions for all the public repositories, so they show up in gitweb:
repo coke
     coke = "Coke prices website"

repo cpptracer
     cpptracer = "aj's cppraytracer, now with g++ support"

And permissions:
repo @pubrepo
     RW+ = @trust
     RW  = @semi
     R   = @all daemon gitweb
     config core.sharedRepository = 0664

repo @privrepo
     RW+ = @trust

This allows trusted keys to do anything, and semi-trusted keys (i.e. ones on machines where there are other people with root) to only append data (i.e. they can’t destroy anything, and can’t make any un-auditable changes).

Next, to protect against non-root users on the host itself, I have $REPO_UMASK = 0027; in my .gitolite.rc. This makes the repositories themselves inaccessible to other users. However, gitweb needs to be able to read public repositories; the above config core.sharedRepository = 0664 does this.

This leaves only /var/lib/gitolite/projects.list (which is necessary as non-git users can’t ls /var/lib/gitolite/repositories/, so gitweb can’t discover the project list itself), and repositories/**/description, again for gitweb.

For this, I have a gitolite-admin.git/hooks/post-update.secondary of:

#!/bin/sh
chmod a+r /var/lib/gitolite/projects.list
find /var/lib/gitolite -name description -exec chmod a+r {} +

Now, gitweb can display public projects fine, and local users can’t discover or steal private repositories.


October 25th 2011

Diagnosing character encoding issues

Natural language is horrible. Unicode is an attempt to make it fit inside computers.

I’m going to make up some terms:

  • Symbol: A group of related lines representing what English people would call a letter
  • Glyph: A group of related lines that might be stand-alone, or might be combined to make a symbol

And use some existing, well defined terms. If you use one of these wrongly, people will get hurt:

  • Code point: A number between 0 and ~1.1 million that uniquely identifies a glyph. They are written like U+0000, and also have names,
  • Encoding: A way of converting information to and from a stream of bytes,
  • Byte: The basic unit of storage on basically every computer; an octet of 8 bits. This has nothing to do with letters, or characters, or… etc.

Let’s start at the top:

  • Here’s a lower case ‘a’ with a grave accent: à. This is a symbol.
    • It could be represented by a single glyph, the code point numbered “U+00E0” and named “Latin Small Letter A With Grave”, like above,
    • It could be represented by two glyphs, like: à.
      • This looks identical (unless your browser can’t cope, which is plausible), but is actually two glyphs; an ‘a’ (U+0061: “Latin Small Letter A”) followed by a U+0300: “Combining Grave Accent”. These two glyphs combine to make an identical symbol.
      • This is, of course, pointless in this case, but there are many symbols that can only be made with combining characters. Normalisation is the process of simplifying these cases.
      • Don’t believe me? Good. Not believing what you see is an important stage of debugging. Copy the above into Notepad, or any other Unicode-safe editor, and press backspace. It’ll remove just the accent, and leave you with a plain ‘a’. Do this with the first à and it’ll delete the entire thing, leaving nothing. See? Different.
  • So, let’s assume we’re going with the complex representation of the symbol, the two code points: U+0061 followed by U+0300. We want to write them to any kind of storage, be it a file, or a network, etc. We need to convert them to bytes: Encoding time.
    • Encodings generate “some” bytes to represent a code point. It can be anywhere between zero and infinity. There’s really no way to tell. Common encodings, however, will generate between one and four bytes per code point. Basically everyone uses one of the following three encodings:
    • UTF-8: generates between one and four bytes, depending on the number of the code point (the original design allowed up to six, but the current standard stops at four). Low-numbered code points use fewer bytes, and are common in English and European languages. Other languages will generally get longer byte sequences. Common around the Internet and on Linux-style systems.
    • UTF-16: generates either two or four bytes per code point. The vast majority of all real languages available today fit in two bytes. Common in Windows, Java and related heavy APIs.
    • I have no idea what I’m doing: Anyone using anything else is probably doing so by mistake. The most common example of this is ISO-8859-*, which means you don’t care about non-Western-European people, i.e. 80% of the people in the world. These generate one byte for every code point, i.e. junk for everything except ~250 selected code points.
  • Let’s look at UTF-8.
    • First code point: U+0061. This happens to be below U+007F, so is encoded in a single byte in UTF-8. This happens to align with low-ASCII, a really old encoding that’s a subset of ISO-8859. The single byte is 0x61, 0b01100001. Note that the first bit is ‘0’.
    • Second code point: U+0300. This is not below U+007F, so goes through the normal UTF-8 encoding process. In one sentence, UTF-8 puts the number of bytes needed in the first byte, and starts every other byte with the bits “10”. In this case, we need two bytes, which are 0xCC, 0x80; 0b11001100, 0b10000000. Note how the first byte starts with “110”, indicating that there are two bytes (two ones, followed by a zero), and the second byte starts with “10”.
    • Note: All valid UTF-8 data can easily be validated and detected; if a byte has its left-most bit set to ‘1’, it must be part of a multi-byte sequence. No exceptions.
    • Consider:
      $ xxd -c1 -b some.txt | grep -1 ': 1'
      0000046: 01100001  a
      0000047: 11001100  .
      0000048: 10000000  .

      This shows the byte patterns outlined: the ‘a’ with the leading 0, then the two bytes of the combining character. xxd is the only tool you should trust when diagnosing character encoding issues. Everything else will try to harm you, especially your text editor and terminal.

  • UTF-16 is much easier to recognise and much harder to confuse with other things as, for most Western text (including XML and..), it’ll be over 40% nulls (0x00, 0b00000000).
  • Now you’ve done your conversion, you can write the bytes, to your file or network, and ensure that whoever is on the other end has enough information to work out what format the bytes are in. If you don’t tell them, they’ll have to guess, and will probably get it wrong.
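The walk-through above can be checked from Python rather than by eye. This is only a minimal sketch: the variable names and the `bits` helper are made up for illustration, and it assumes Python 3.8 or later (for `bytes.hex` with a separator). It builds both representations of à, shows that normalisation converts between them, and encodes the two-code-point form to bytes.

```python
import unicodedata

precomposed = "\u00e0"  # U+00E0 Latin Small Letter A With Grave
combining = "a\u0300"   # U+0061 followed by U+0300 Combining Grave Accent

# Same symbol on screen, different code points underneath.
assert precomposed != combining
assert len(precomposed) == 1 and len(combining) == 2

# Normalisation converts between the two forms (NFC composes, NFD decomposes).
assert unicodedata.normalize("NFC", combining) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combining


def bits(data):
    """Render each byte as eight binary digits, like xxd -b (hypothetical helper)."""
    return [format(b, "08b") for b in data]


utf8 = combining.encode("utf-8")
utf16 = combining.encode("utf-16-le")  # little-endian, no byte-order mark

print(utf8.hex(" "), bits(utf8))  # 61 cc 80 ['01100001', '11001100', '10000000']
print(utf16.hex(" "))             # 61 00 00 03
```

Note that the UTF-16 output is half nulls even for this tiny example, which is the pattern described above.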

In summary:

  • Have some data, in bytes? Don’t pretend it’s text, even if it looks like it is; find out or work out what encoding it’s in and convert it into something you can process first. It’s easy to detect if it’s UTF-8, UTF-16 or if you have a serious problem.
  • Have some textual information from somewhere? Find an appropriate encoding, preferably UTF-8 or UTF-16, to use before you send it anywhere. Don’t trust your platform or language, it’ll probably do it wrong.
  • Can’t work out what’s in a file? Run it through xxd and look for nulls and bytes starting with ‘1’. This’ll quickly tell you if it’s UTF-16, UTF-8 or effectively corrupt.
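The “look for nulls and bytes starting with 1” check can also be roughed out in code. The function below is a hypothetical sketch, not a real detection library: the name `guess_encoding` is invented, the 40% threshold is borrowed from the UTF-16 rule of thumb above, and it will misjudge short or unusual inputs.

```python
def guess_encoding(data):
    """Very rough guess: 'ascii', 'utf-8', 'utf-16' or 'unknown'."""
    if not data:
        return "ascii"
    # Western text in UTF-16 is full of 0x00 bytes.
    if data.count(0) / len(data) > 0.4:
        return "utf-16"
    # No byte has its left-most bit set: plain ASCII, which is also valid UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Bytes with the left-most bit set must form valid UTF-8 sequences.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


print(guess_encoding("a\u0300".encode("utf-8")))    # utf-8
print(guess_encoding("hello".encode("utf-16-le")))  # utf-16
print(guess_encoding(b"caf\xe9"))                   # unknown (probably ISO-8859-1)
```

“unknown” here is the “serious problem” case: the bytes are in some other encoding, and you’ll have to find out which from whoever produced them.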

Hopefully that’s enough information for you to know what you don’t know.

For more, try Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).


October 9th 2011

PuTTY Tray

I’ve released an updated version of PuTTY Tray to puttytray.goeswhere.com (direct download: putty.exe p0.61-t004). Please see the site for the latest version and details.

This is a fork of Barry Haanstra’s PuTTY Tray, which is abandoned.

Main advantages:

  • Now built against PuTTY 0.61, getting features like Windows 7 Jumplist and Aero support, and four years of core PuTTY development
  • Ctrl+mousewheel zoom support
  • URL detection works on URLs ending with close-brackets
  • Much easier to continue developing: the build script generator works, and source, issue and pull-request tracking are provided by github.

Please raise a bug if you have any problems or requests!


September 4th 2011

Addition is free, static compilation is expensive

Summing the integers from 1 to 10,000,000?

  • Perl: 1.01 seconds
  • Python: 2.04 seconds
  • Java, including javac: 0.76 seconds
  • Java, including ecj: 0.61 seconds
  • Java, including javac, with -Xint: 1.01 seconds.
  • Java, including javac, with -Xint on the compiler too: 1.16 seconds.
  • -Xint disables practically all optimisations that Java offers, forcing the JVM into interpretation mode, so it’ll operate much like perl and python do.
  • ecj is Eclipse’s compiler for Java, a faster and cleaner implementation of javac that can run standalone.

I.e., even including compilation time and with every optimisation disabled, Java is still nearly twice the speed of Python (1.16 vs. 2.04 seconds).

(This isn’t really interesting or surprising to me, but the question comes up often enough that I’d like to have these here to link WRONG people to.)
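For anyone wanting to reproduce the Python figure, the measurement might look something like the sketch below. This is an assumption about the shape of the test, not the code that produced the numbers above; `sum_to` is a made-up name, and the timing will vary with machine and Python version.

```python
import time


def sum_to(n):
    """Sum the integers from 1 to n with a plain loop (no closed-form shortcut)."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


start = time.perf_counter()
result = sum_to(10_000_000)
elapsed = time.perf_counter() - start

print(result)  # 50000005000000
print(f"{elapsed:.2f} seconds")
```

This times only the loop; timing the whole process from the shell would include interpreter start-up as well.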


August 2nd 2011

Windows XP End of Support Countdown Gadget

The Windows XP End of Support Countdown Gadget gives you a nice countdown until Windows XP, and, more importantly, IE6 will actually finally be unsupported.

It, however, leaks memory. A lot of memory; about 1kb/second. Given that it runs all the time and isn’t important, this is rather inconvenient.

FTFY. Can’t redistribute a patched “binary” as the original is not redistributable.


June 5th 2011

Java stacktraces straw man

It’s a sad fact of life that many developers spend a good deal of time staring at stack traces.

My personal favorite situation is when you get to:
Exception in thread "main" java.lang.NullPointerException
    at com.goeswhere.dmnp.linenos.B.foo(B.java:13)

..and, line 13 is:

  System.out.println(first.substring(1) + second.toUpperCase() + third.toLowerCase());

Basically, the end of any happiness.


LineNos can fix this:

$ java -Xbootclasspath/p:linenos.jar -javaagent:linenos.jar=com/goeswhere com.goeswhere.dmnp.linenos.B
Exception in thread "main" java.lang.NullPointerException
    at com.goeswhere.dmnp.linenos.B.foo(B.java:13), attempting to invoke toUpperCase #4
    at com.goeswhere.dmnp.linenos.B.run(B.java:9)
    at com.goeswhere.dmnp.linenos.B.main(B.java:5), attempting to invoke run #2

This is implemented entirely as a Java Agent; it requires no VM modifications and is portable anywhere that supports instrumenters transforming classes (i.e. everywhere that matters).


It works by adding extra line numbers for each call on a line. Currently it adds call-number*1000 to the line number, so that it’s debuggable and easier to do in two phases, but this makes it much less efficient.

i.e., assuming Java allowed labels for line-numbers, it does:

13:
8013: System.out.println(
1013: [secret StringBuilder construction (used to implement String concatenation)]
2013: first.substring(1)
3013: +
4013: second.toUpperCase()
5013: +
6013: third.toLowerCase()
7013: [secret StringBuilder#toString()]);

Thus, the real stacktrace looks like:
Exception in thread "main" java.lang.NullPointerException
    at com.goeswhere.dmnp.linenos.B.foo(B.java:4013)
    at com.goeswhere.dmnp.linenos.B.run(B.java:9)
...

It additionally overrides StackTraceElement#toString() to decompile the named class and report the invocation that’s on that line, i.e. look up line 4013 in the above bytecode, and report that it’s an invocation of toUpperCase, thus giving the intended result.
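The numbering trick itself is just arithmetic. Here’s a small sketch of it in Python, for illustration only: `CALL_STRIDE`, `encode_line` and `decode_line` are invented names, and the real agent does this while rewriting line-number tables in bytecode, not like this.

```python
CALL_STRIDE = 1000  # from the scheme above: call-number * 1000 is added to the line


def encode_line(real_line, call_number):
    """Fold a call's position on the line into the reported line number."""
    return call_number * CALL_STRIDE + real_line


def decode_line(encoded):
    """Split a reported line number back into (real line, call number)."""
    call_number, real_line = divmod(encoded, CALL_STRIDE)
    return real_line, call_number


print(encode_line(13, 4))  # 4013
print(decode_line(4013))   # (13, 4): line 13, call #4
print(decode_line(9))      # (9, 0): an untouched line number
```

One consequence of this encoding, assuming it is exactly as described, is that it becomes ambiguous for source files with 1000 or more lines.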


It has no run-time performance penalty beyond the extra time to load classes, and the increase in the size of the line-number table (actually, I have no idea how this affects performance, but I can’t imagine it’s much). Printing a stacktrace is slower, however (although it probably wouldn’t be much slower in a real implementation). Also, it’s a straw-man, so it leaks memory, but this isn’t an important part of the implementation.

A much better implementation would store the [Class<>, line-number] -> hint mapping, instead of keeping the whole file and decompiling it; or replace the entire line-number table with a bytecode number table (i.e. 1->1, 2->2, 3->..), and do it all at print-time. Patches welcome.


In summary: Dear Oracle, please make the JVM do this by default. Lots of Love, Faux.
