Wednesday, November 25, 2009

Greasemonkey API Usage

I've been intending to write this post for months, but various things got in the way. Well, it's finally ready! Some of my ideas for Greasemonkey 1.0 would involve major changes to the way that Greasemonkey runs user scripts. The goal would be to make user script authoring easier by removing some of the quirks, limitations, and problems that Greasemonkey's current security architecture imposes.

To begin, an aside: why does Greasemonkey have a security architecture that imposes limitations and problems on script authors? It's basically history now, but in short: Greasemonkey provides the powerful-but-dangerous capability for user scripts to break the same-origin policy for AJAX requests. Lots of useful scripts have been created that hinge on this capability. Unfortunately, it is indeed powerful, and Greasemonkey by nature mashes itself and the user scripts up with any old web page that you might visit. If Greasemonkey and/or a script it is running presents a vulnerability that the content page can leverage, all sorts of nasty things could result: stealing your bank account, creating false ecommerce purchases, stealing the content of your private files or site data, and so on.

The point of this post, then, is to examine the landscape for user scripts today: to discover what scripts are out there, what they are like, and how they operate. What kinds of changes to Greasemonkey would make these scripts stop working? What kinds of changes could we make with minimal impact? Toward that end, I've got three graphs to show you (with the raw data below).

To perform this analysis, I downloaded over thirty-six thousand scripts from userscripts.org. This is by no means the entire population of user scripts out there, but I believe it is a good representative sample. I wrote a Python script to read their source and (a bit crudely, but well enough) parse their contents and metadata. The first thing I was interested in seeing was how common usage of the various GM_ APIs is.
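As a rough illustration of the kind of crude parsing described above, here is a minimal sketch. It is not the script I actually used: the function names (`parse_metadata`, `apis_used`) and the exact list of identifiers searched for are assumptions made for this example. It reports whether each identifier appears at all in a script, which is what a "percentage of scripts" figure needs, and it will happily match inside comments and strings.

```python
import re

# Assumed list of identifiers to look for; the real analysis may have used a different set.
GM_APIS = [
    "GM_getValue", "GM_setValue", "GM_xmlhttpRequest", "GM_addStyle",
    "GM_log", "GM_openInTab", "GM_registerMenuCommand",
    "GM_getResourceText", "GM_getResourceURL", "unsafeWindow",
]

# Matches lines such as "// @include http://example.com/*"
META_LINE = re.compile(r"^\s*//\s*@(\S+)\s*(.*?)\s*$")


def parse_metadata(source):
    """Return {key: [values, ...]} read from the ==UserScript== block."""
    meta = {}
    in_block = False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("// ==UserScript=="):
            in_block = True
            continue
        if stripped.startswith("// ==/UserScript=="):
            break
        if in_block:
            match = META_LINE.match(line)
            if match:
                meta.setdefault(match.group(1), []).append(match.group(2))
    return meta


def apis_used(source):
    """Return the set of special identifiers that appear anywhere in the source."""
    found = set()
    for name in GM_APIS:
        if re.search(r"\b" + re.escape(name) + r"\b", source):
            found.add(name)
    # eval is only counted when it looks like a call
    if re.search(r"\beval\s*\(", source):
        found.add("eval")
    return found
```

A script for which `apis_used` returns an empty set would land in the "no special API calls" bucket.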

The first thing that we can quickly see is that well over half the scripts, 58.87%, use no special API calls at all. No matter what happens to the GM_ APIs, they'll keep working just fine. The most common API call is the get/set value call, at 16.50%. The cross-domain AJAX call is a close second at 15.51%, with GM_addStyle next at 12.95%. From here things trail off rapidly, but we see how common unsafeWindow and eval are, both with the potential to be very dangerous.

Browsers are progressing rapidly, however. Instead of get/set value, one could use DOM Storage, and HTTP Access Control, a mechanism for making cross-domain requests, is being standardized. What's important to know is whether the extra power provided by these APIs is actually being used, or whether these sorts of stand-ins could be a viable replacement. To investigate that, I examined how many different domains scripts are @included into when making these calls, and which URLs the AJAX calls are being made to.

The vast majority of get/set value calls (76.33%) are made by scripts that are only ever @include'd into a single domain. For these scripts, DOM Storage would work perfectly. Some execute on two, and almost none on more than two. Some also execute on every page, and this starts to be a problem. The AJAX patterns are very different.

An important caveat: my script was a bit naive about gathering AJAX domains. It used simple string manipulation to find URLs inside GM_xmlhttpRequest calls. If the URL was set in a variable elsewhere, the script did not find it. So of 5600 scripts that call GM_xmlhttpRequest, only 2693 were "understood" by my script, and this may be a bad sample. Scripts that exclusively set their URLs in variables/constants may be more likely to make cross-domain requests, or perhaps less likely.
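To make that limitation concrete, here is a minimal sketch of this style of URL gathering. It is an illustration rather than the code I ran: the name `literal_xhr_hosts`, the regular expressions, and the 400-character lookahead are all assumptions chosen for the example.

```python
import re
from urllib.parse import urlparse

# A url key (optionally quoted) followed by a quoted http(s) string literal.
URL_LITERAL = re.compile(r"""\burl["']?\s*:\s*(["'])(https?://.*?)\1""")


def literal_xhr_hosts(source, lookahead=400):
    """Return hosts of URL literals written shortly after a GM_xmlhttpRequest( call.

    The lookahead is an arbitrary window, so a call that takes its URL from a
    variable yields nothing, or can pick up a literal from a nearby later call.
    """
    hosts = set()
    for call in re.finditer(r"\bGM_xmlhttpRequest\s*\(", source):
        window = source[call.end():call.end() + lookahead]
        match = URL_LITERAL.search(window)
        if match:
            host = urlparse(match.group(2)).hostname
            if host:
                hosts.add(host)
    return hosts
```

A script that writes `var u = "http://..."; GM_xmlhttpRequest({url: u})` produces an empty set here, which is exactly the blind spot described above.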

That said, an obvious pattern emerges: plenty of scripts do "@include *" then AJAX off, likely to some other, fixed, site (20.16%). (Note: lots of these appear to be update checkers, which should hopefully be unnecessary before 1.0.) Plenty also seem to operate fully within one site (20.87%). By far the most, however, operate on one site and call another (46.79%, or 1260 distinct scripts). Larger combinations of sites are minimal. Part of this last group is an artifact of oversimplification in my script: an @include of "*flickr.com" and an AJAX call to "flickr.com" are counted as different sites. Most, though, are the especially useful scripts that, for example, include IMDB data on Netflix, or vice versa. So, this is far too large a use case to break. Whatever we do, it seems cross-domain AJAX is going to have to remain.
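The buckets above, and the "*flickr.com" versus "flickr.com" miscount, can be sketched as follows. This is a hypothetical reconstruction, not my original code: the bucket labels and the helpers `strip_www`, `include_domain`, and `classify` are invented for this example, and the normalization is deliberately simple (it only drops wildcards, the scheme, and a leading "www.").

```python
import re


def strip_www(host):
    """Lowercase a host name and drop a leading 'www.' (a crude normalization)."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def include_domain(pattern):
    """Reduce an @include pattern to a bare domain, or '*' if it matches everything."""
    rest = re.sub(r"^[a-z*]+://", "", pattern.strip(), flags=re.IGNORECASE)
    host = rest.split("/", 1)[0].strip("*.")
    return strip_www(host) if host else "*"


def classify(include_patterns, xhr_hosts):
    """Place one script into a bucket based on where it runs and where it calls."""
    includes = {include_domain(p) for p in include_patterns}
    targets = {strip_www(h) for h in xhr_hosts}
    if "*" in includes:
        return "any-site"    # runs everywhere, calls some fixed site
    if len(includes) == 1 and targets and targets <= includes:
        return "same-site"   # operates fully within one site
    if len(includes) == 1 and len(targets) == 1:
        return "cross-site"  # runs on one site, calls another
    return "other"
```

With this normalization, `classify(["*flickr.com*"], {"flickr.com"})` lands in "same-site" rather than being counted as a cross-domain script, while something like Netflix calling IMDB stays in "cross-site".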

Finally, I also took a look at the usage of metadata imperatives: both the "official" ones that actually affect how Greasemonkey works, and the others that are used in other tools, or added for the author's own purposes. That looks like:

Generally what I expected. Almost everyone has an @name and an @include, and nearly as many include an @description and @namespace. Things fall off rapidly from there, but the unofficial @version is next, followed by an unusual (to me) @author. From there we fall toward the single-digit range, finding that @require and @resource are still very rarely used.

Conclusions: Over half of user scripts use no privileged APIs. All of Greasemonkey's security mechanisms are a pure hindrance to all these scripts. If it went away, they would benefit greatly. It may be possible to remove get/set value in favor of DOM Storage, but the potential damage of these APIs is so small that the cost likely outweighs the benefit. Although a minority (15.51%) of scripts use GM_xhr, it's still too many to consider removal.




Edit: Fixed the GM_getResourceURL count. I first searched for "Url" and not "URL", which explains the zero found before.



To those that are interested: the script that I used to generate these numbers is available for inspection, in case it perhaps contains a serious bug. The data that I generated with it, and the charts above, are also available to check.

8 comments:

Anonymous said...

"All of Greasemonkey's security mechanisms are a pure hindrance to all these scripts. If it went away, they would benefit greatly."

Thank you! It's about time!

Anonymous said...

I think users should know what they are doing when they install scripts, rather than having these APIs that hinder the power of user scripts in Greasemonkey. Besides, I think that Greasemonkey should provide an API that allows users to install third-party JavaScript libraries rather than having the @require metadata in the script; it would be better for script testing. Also, if you do provide an API like I mentioned above, it would be nice to have some useful JavaScript libraries like jQuery, Adobe Spry, etc. installed into Greasemonkey.

Unknown said...

Maybe on installation a check can run on the script to alert the user of possible security concerns.

Also, if @version is so popular why is there no built-in update checking mechanism in Greasemonkey?

More options for including javascript libraries, css, etc. would be appreciated as well.

maxxyme said...

Isn't the last sentence truncated in the article???

Shopping Cart Software said...

I think so, Maxx. But the author should take care of it.

Anonymous said...

It would be great to have an efficient Greasemonkey mode that is enabled when no GM_ APIs are called.

Also I hope you can make an intermediate version that only breaks the lesser needed stuff and then release a branch, say greasemonkey2 that exclusively uses DOM and other features for the next generation.

Anonymous said...

Foreseeing what could happen (you are describing it exactly), and possible ports to other browsers, I enclosed all GM_ API calls in functions so as to easily port any script.
If many developers have done so, and if you count occurrences of the API calls, you will not have a proper estimate of their usage. Each one appears only once in each of my scripts.

Chad Hutchins said...

I think GM 1.0 should include jQuery. Why should you guys worry about writing an API when you could just piggy back on a popular library? In the GM scripts I've written to date, I include jQuery at the beginning. However, if the GM API is a better alternative, I guess I need to learn more about it.