Sunday, June 1, 2008

Goggling Google

This is an extract from what was originally posted on the larryhsfriends@yahoogroups.com mailing list on Tuesday, 27 May 2008. It's posted here at the request of a mutual friend, Charles.

I've deliberately avoided replacing the extravagant URLs with Tinyurl.com abbreviations, both because these URLs would normally be used only once and because the full forms illustrate all the dirty work involved in this branch of screen scraping and accessibility issues.

My blind friend Larry was trying to access a video (to listen to, smart alec), located by going to http://video.google.com,

  • clicking the radio button to limit the search to Google hosted videos, and
  • searching on the term 'debate'.

Lynx numbered link #69 was an MSNBC debate from October 31, 2007, the hosting page being at

http://video.google.com/videoplay?docid=8023519711099005229&q=debate&ei=IYM7SOuXEKDk4AL5473bAw

Dallas:

You are correct that there was no link for a .mp4 on that particular page / video, so they don't currently all have that option. I have to agree with that.

However (grin) by a process too complicated to explain right now, I was able to get a bloated, rather complicated URL for the video in question.

Try this. You'll have to cut and paste it, making sure the whole thing, 4 lines, ends up on one line. Quote it as well, because an unquoted & tells the shell to run whatever precedes it as a background job, so pieces of the URL would be launched as bogus batch jobs:

http://vp.video.google.com/videodownload?version=0&secureurl=QwAAAD0W5d-wPcVVeTl7QrGr5zTKridzSGf41ASu20PecohoZi2sOgFpCHjq8L-P4O1pMCjFMcitHUDMkPHMpztjyF2Gfx5zzGZQK-3bM6BN4oWB&sigh=NrWxm3ARGWrIAA7a4p9rRb4PcMs&begin=0&len=6395300&docid=8023519711099005229
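The quoting advice can also be applied mechanically. The following is a minimal Python sketch, not part of the original procedure: the shortened URL is a stand-in for the long one above, and mplayer (the player used later in this post) is only there to show the URL arriving as a single argument.

```python
import shlex

# Shortened stand-in for the long download URL above (illustrative only).
url = ("http://vp.video.google.com/videodownload"
       "?version=0&begin=0&docid=8023519711099005229")

# shlex.quote wraps the string in single quotes because it contains
# characters the shell treats specially, such as ? and &.
quoted = shlex.quote(url)
print(quoted)

# Round trip: split like a POSIX shell would, and the URL is one argument.
argv = shlex.split("mplayer " + quoted)
print(argv)
```

Without the quotes, the shell would cut the command at each & instead of passing the whole URL along.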

Since I stepped through this once, in theory it can be automated. I've decided to try to recreate it and record it for posterity. Looking at the page you cited, I noted the simplified URL for embedding, given at the bottom of the page in some suggested HTML:

http://video.google.com/googleplayer.swf?docid=8023519711099005229

I then fed this into gnash with this command:

$ gnash -vr 2 'http://video.google.com/googleplayer.swf?docid=8023519711099005229'

-v is for verbose

-r 2 is to play sound only, no video.

This cranked out about a screen of hard-to-understand, very technical messages and eventually stalled out, but before that it spit out a message:

17430] 22:13:13: SECURITY: Loading XML file from url: 'http://video.google.com/videofeed?fgvns=1&fai=1&docid=8023519711099005229&hl=undefined'

before I lost patience and hit Control-C. This URL contains the same all-numeric ID string (the docid) as the original URL, so everything up to here could be arrived at by a shortcut, knowing which part of the original URL is crucial and the form of this final URL. That is to say, gnash is not essential to get this far!
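That shortcut can be sketched in a few lines of Python. This is only an illustration under the assumption that the videofeed URL always has the form gnash logged above; the function name feed_url_from_page is made up for this example.

```python
from urllib.parse import parse_qs, urlparse


def feed_url_from_page(page_url: str) -> str:
    """Build the videofeed URL from a videoplay page URL by reusing its docid."""
    query = parse_qs(urlparse(page_url).query)
    docid = query["docid"][0]  # raises KeyError if the URL has no docid
    # The other parameter values are copied verbatim from the URL gnash logged.
    return ("http://video.google.com/videofeed"
            f"?fgvns=1&fai=1&docid={docid}&hl=undefined")


page = ("http://video.google.com/videoplay"
        "?docid=8023519711099005229&q=debate&ei=IYM7SOuXEKDk4AL5473bAw")
feed = feed_url_from_page(page)
print(feed)
```

The q and ei parameters of the original URL are simply ignored; only the docid carries over.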

I then dumped that URL with the command:

$ lynx -source 'http://video.google.com/videofeed?fgvns=1&fai=1&docid=8023519711099005229&hl=undefined'

and studying the output I noticed a string:

vidurl=http%3A%2F%2Fvideo.google.com%2Fvideoplay%3Fdocid%3D8023519711099005229%26hl%3Den&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ"

and then fed that into a tool I have, 'dex' (de-hex encode), to convert the %-escaped, hex-encoded characters back into plain characters:

$ dex <<< 'http%3A%2F%2Fvideo.google.com%2Fvideoplay%3Fdocid%3D8023519711099005229%26hl%3Den&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ'

and it output:

http://video.google.com/videoplay?docid=8023519711099005229&hl=en&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ
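For readers without dex, the same conversion can be approximated with Python's standard library. This is a stand-in sketch, not the dex tool itself (whose source isn't shown here); it assumes dex does nothing more than ordinary percent-decoding.

```python
from urllib.parse import unquote


def dex(text: str) -> str:
    """Replace each %XX escape with the character it encodes."""
    return unquote(text)


encoded = ("http%3A%2F%2Fvideo.google.com%2Fvideoplay"
           "%3Fdocid%3D8023519711099005229%26hl%3Den"
           "&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ")
decoded = dex(encoded)
print(decoded)
```

Characters that were never escaped, such as the & before usg, pass through unchanged.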

I then did another:

$ lynx -source 'http://video.google.com/videoplay?docid=8023519711099005229&hl=en&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ'

and saw in the output of it the string:

videoUrl\x3dhttp://vp.video.google.com/videodownload%3Fversion%3D0%26secureurl%3DQwAAAD0W5d-wPcVVeTl7QrGr5zTKridzSGf41ASu20PecohoZi2sOgFpCHjq8L-P4O1pMCjFMcitHUDMkPHMpztjyF2Gfx5zzGZQK-3bM6BN4oWB%26sigh%3DNrWxm3ARGWrIAA7a4p9rRb4PcMs%26begin%3D0%26len%3D6395300%26docid%3D8023519711099005229\x26

and feeding the string between the \x escaped hex numbers into dex again:

$ dex <<< 'http://vp.video.google.com/videodownload%3Fversion%3D0%26secureurl%3DQwAAAD0W5d-wPcVVeTl7QrGr5zTKridzSGf41ASu20PecohoZi2sOgFpCHjq8L-P4O1pMCjFMcitHUDMkPHMpztjyF2Gfx5zzGZQK-3bM6BN4oWB%26sigh%3DNrWxm3ARGWrIAA7a4p9rRb4PcMs%26begin%3D0%26len%3D6395300%26docid%3D8023519711099005229'

got the final URL:

http://vp.video.google.com/videodownload?version=0&secureurl=QwAAAD0W5d-wPcVVeTl7QrGr5zTKridzSGf41ASu20PecohoZi2sOgFpCHjq8L-P4O1pMCjFMcitHUDMkPHMpztjyF2Gfx5zzGZQK-3bM6BN4oWB&sigh=NrWxm3ARGWrIAA7a4p9rRb4PcMs&begin=0&len=6395300&docid=8023519711099005229
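Picking the string out from between the two \x escapes is the step most worth automating. Here is a minimal Python sketch, assuming the page source always wraps the URL as videoUrl\x3d ... \x26 the way the output above does; the function name and the shortened sample string are made up for illustration.

```python
import re
from typing import Optional
from urllib.parse import unquote


def extract_video_url(page_source: str) -> Optional[str]:
    """Return the decoded download URL from the page source, or None if absent."""
    # The source contains the literal characters backslash, x, 3, d (not an
    # actual '='), so the backslashes are escaped in the pattern.
    match = re.search(r"videoUrl\\x3d(.*?)\\x26", page_source)
    return unquote(match.group(1)) if match else None


# Shortened stand-in for the string quoted above (secureurl and sigh omitted).
sample = (r"junk videoUrl\x3dhttp://vp.video.google.com/videodownload"
          r"%3Fversion%3D0%26begin%3D0%26len%3D6395300"
          r"%26docid%3D8023519711099005229\x26 junk")
print(extract_video_url(sample))
```

The lazy `(.*?)` stops at the first \x26, which matters if the page has more \x-escaped text after the URL.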

This is ridiculously complicated, and obviously needs to be automated (as it usually is via JavaScript! :-) ). But it produced a working result and is describable.

(after a trip to Food for Less)

However there is an easier way to do this. I just fed the original URL into usnatch:

$ usnatch 'http://video.google.com/videoplay?docid=8023519711099005229&q=debate&ei=IYM7SOuXEKDk4AL5473bAw' -u

and it output this url, probably by way of scraping it from KeepVid.com:

http://vp.video.google.com/videodownload?version=0&secureurl=twAAAD0W5d-wPcVVeTl7QrGr5zTKridzSGf41ASu20PecohoD0a7UlL0hOryryfecm0kR0Az1TAZjqmcK4Jhzww767-M--b5VXs0aG2FyEksUHG7jZMWLv12yp10ahgVqupjVDS1ehay8IuXr_K5CJYVeSwkYqKv5owxTDiGz7X7xKbrgQVNx-7ue4RTDjur5LWmoqryoLSCkqAgx6UteEa8LIwTCBSDJhB3jzak8cwIcF70G2Np9NfuVtZ8OmJwuiMRsg

which played when I used it with mplayer, making sure to enclose the URL in single quotes after pasting it into the command. So it could have been played from Lynx, and I'm downloading the file from Google right now.

The trick is to back up from the page where they want you to view it, and be on top of the video.google.com search results link when you run usnatch. Alternatively, from the video page itself you can run usnatch on that page's own URL by using the comma key instead of the period key to invoke the usnatch external program. Period calls externals for the currently active link; comma calls externals for the current page.


2008 June 1 Afterword

...And of course the ultimate goal of this long exercise is to build the algorithm described above into usnatch, with the idea of making it less dependent on scraping information from sites like http://KeepVid.com.
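As a rough idea of what that might look like, here is a Python sketch that chains the manual steps together. It is not usnatch's code, and it assumes the feed and page formats are exactly as quoted earlier in this post. The function names, the fetch parameter and the canned pages are all made up for illustration; the canned pages leave out the long secureurl and sigh parameters.

```python
import re
from typing import Callable
from urllib.parse import parse_qs, unquote, urlparse
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Fetch a URL and return its body as text, roughly what lynx -source prints."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_download_url(page_url: str, fetch: Callable[[str], str] = http_get) -> str:
    """Chain the manual steps: docid -> videofeed -> videoplay page -> download URL."""
    # Step 1: reuse the docid from the original URL (no gnash needed).
    docid = parse_qs(urlparse(page_url).query)["docid"][0]
    feed = fetch("http://video.google.com/videofeed"
                 f"?fgvns=1&fai=1&docid={docid}&hl=undefined")

    # Step 2: the feed holds a percent-encoded videoplay URL after 'vidurl='.
    vid = re.search(r'vidurl=([^"\s]+)', feed)
    if vid is None:
        raise ValueError("no vidurl= string found in the feed")
    page = fetch(unquote(vid.group(1)))

    # Step 3: the page holds the download URL between literal \x3d and \x26.
    video = re.search(r"videoUrl\\x3d(.*?)\\x26", page)
    if video is None:
        raise ValueError("no videoUrl string found in the page")
    return unquote(video.group(1))


# Offline demonstration with canned pages, so nothing is fetched from the network.
FEED = ("http://video.google.com/videofeed"
        "?fgvns=1&fai=1&docid=8023519711099005229&hl=undefined")
PLAY = ("http://video.google.com/videoplay"
        "?docid=8023519711099005229&hl=en&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ")
canned = {
    FEED: 'vidurl=http%3A%2F%2Fvideo.google.com%2Fvideoplay'
          '%3Fdocid%3D8023519711099005229%26hl%3Den'
          '&usg=AL29H2354Bu9OKKGOkt8CkFi7UeioVIIgQ"',
    PLAY: r"videoUrl\x3dhttp://vp.video.google.com/videodownload"
          r"%3Fversion%3D0%26begin%3D0%26docid%3D8023519711099005229\x26",
}
original = ("http://video.google.com/videoplay"
            "?docid=8023519711099005229&q=debate&ei=IYM7SOuXEKDk4AL5473bAw")
result = find_download_url(original, fetch=canned.__getitem__)
print(result)
```

Passing the fetcher in as a parameter keeps the scraping logic testable offline; leaving it out makes the function use http_get and go to the network.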
