Wednesday, November 25, 2009

Greasemonkey API Usage

I've been intending to write this post for months, but various things got in the way. Well, it's finally ready! Some of my ideas for Greasemonkey 1.0 would involve major changes to the way that Greasemonkey runs user scripts. The goals would be to make user script authoring easier, by removing some of the quirks, limitations, and problems that Greasemonkey's current security architecture imposes.

To begin, an aside: why does Greasemonkey have a security architecture that imposes limitations and problems on script authors? It's basically history now, but in short: Greasemonkey provides the powerful-but-dangerous capability for user scripts to break the same-origin policy for AJAX requests. Lots of useful scripts have been created that hinge on this capability. Unfortunately, it is indeed powerful, and Greasemonkey by nature mashes itself and the user scripts up with any old web page that you might visit. If Greasemonkey and/or a script it is running presents a vulnerability that the content page can leverage, all sorts of nasty things could result, from stealing your bank account, creating false ecommerce purchases, stealing the content of your private files or site data, and so on.

The point of this post, then, is to examine the landscape for user scripts today. Discover what scripts are out there, what they are like, and how they operate. What kind of changes to Greasemonkey would make these scripts stop working? What kinds of changes could we make with minimal impact? Toward that end, I've got three graphs to show you (with the raw data below).

To perform this analysis, I downloaded over thirty six thousand scripts from userscripts.org. This is by no means the entire population of user scripts out there, but I believe it is a good representative sample. I wrote a python script to read their source and (a bit crudely, but well enough) parse their contents and metadata. The first thing I was interested in seeing is how common the usage of the various GM_ apis are.

The first thing that we can quickly see is that well over half the scripts, 58.87%, use no special API calls at all. No matter what happens to the GM_ APIs, they'll keep working just fine. The most common API call is the get/set value call, at 16.50%. The cross-domain AJAX call is a close second at 15.51%, with GM_addStyle next at 12.95%. From here things trail off rapidly, but we see how common unsafeWindow and eval are, both with the potential to be very dangerous.

Browsers are progressing rapidly, however. Instead of get/set value, one could use DOM Storage, and HTTP Access Control standards, for making cross-domain requests, are being standardized. What's important to know is if the extra power provided by these APIs is actually being used, or whether these sorts of stand-ins could be a viable replacement. To investigate that, I examined how many different domains scripts are @included into when making these calls, and which URLs the AJAX calls are being made to.

The vast majority of get/set value calls (76.33%) are made by scripts that are only ever @include'd into a single domain. For these scripts, DOM Storage would work perfectly. Some execute on two, and almost none on more than two. Some also execute on every page, and this starts to be a problem. The AJAX patterns are very different.

Note importantly that my script was a bit naive with AJAX domain gathering. It used simple string manipulation to find URLs inside GM_xmlhttpRequest calls. If the URL was set in a variable, elsewhere, then the script did not find it. So of 5600 scripts that call GM_xmlhttpRequest, only 2693 were "understood" by my script -- and this may be a bad sample. Scripts that exclusively set their URLs in variables/constants may be more likely to make cross-domain requests, or even perhaps less likely.

That said: an obvious pattern emerges: plenty of scripts do "@include *" then AJAX off, likely to some other, fixed, site (20.16%). (Note: lots of these appear to be update checkers, which should hopefully be unnecessary before 1.0.) Plenty also seem to operate fully within one site (20.87%). By far the most, however, operate on one site and call another (46.79% or 1260 distinct scripts). Larger combinations of sites are minimal. Part of this group is oversimplification in my script, an @include of "*flickr.com" and an AJAX call to "flickr.com" are counted separately. Most though are the especially useful scripts that, for example, include IMDB data on Netflix, or vice versa. So, this is far too large a use case to break. Whatever we do, it seems cross-domain AJAX is going to have to remain.

Finally, I also took a look at the usage of metadata imperatives: both the "official" ones that actually affect how Greasemonkey works, and the others that are used in other tools, or added for the author's own purposes. That looks like:

Generally what I expected. Most everyone has an @name and an @include, nearly as many include an @description and @namespace. Things fall off rapidly from there, but the unofficial @version is next, and an unusual (to me) @author. From there we fall twoard the single-digit range, finding that @require and @resource are still very rarely used.

Conclusions: Over half of user scripts use no privileged APIs. All of Greasemonkey's security mechanisms are a pure hindrance to all these scripts. If it went away, they would benefit greatly. It may be possible to remove get/set value in favor of DOM Storage, but the potential damage of these APIs is so small that the cost likely outweighs the benefit. Although a minority (15.51%) of scripts use GM_xhr, it's still too many to consider removal.




Edit: Fixed GM_getResourceURL count, I first searched for "Url" and not "URL", explaining the zero found, before.



To those that are interested: the script that I used to generate these numbers is available for inspection, in case it perhaps contains a serious bug. The data that I generated with it, and the charts above, are also available to check.