
Parsing A Web page

Dec 23, 2011 at 1:51 PM

I might be missing the boat here, but I have been struggling to find a JS parser into which I can pass a web page (downloaded via an HttpWebRequest) and have the JavaScript embedded in that page parsed.  I have tried both Jurassic and Noesis, but neither offers any functionality I can use.  I have tried parsing the incoming page line by line, extracting the JS functions and running those against the content itself, but that does not do anything.

These libraries seem to be great if you know and control the JS up front, but I have no idea what is being downloaded, and I need to mimic the functionality of a web browser without resorting to the massive memory leaks found in the WinForms embedded web browser control.

Any ideas?

Dec 23, 2011 at 9:47 PM

I think you are somewhat confused.  Jurassic is a JavaScript engine, which is only a small part of a modern web browser.  Other parts include the DOM, an HTML parser, a CSS/styling engine, a renderer, a plugin engine, a network stack and various miscellaneous technologies like SVG, canvas, video/audio, web workers, forms, etc.  In general, in order to execute the JavaScript within a web page you need all of these.  Of course the exact requirements depend on the script itself.
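To make the point concrete, here is a rough sketch of what happens when page script runs in an engine that has no browser around it.  It uses Node's built-in vm module as a stand-in for a bare engine, which is an assumption for illustration only and not Jurassic's API; the runInBareEngine helper and the two sample scripts are likewise made up for this example.

```javascript
// Node's built-in vm module stands in for a bare JavaScript engine here
// (an assumption for illustration; this is not Jurassic's API).
const vm = require("node:vm");

// Hypothetical helper: run a script against an empty global scope (no DOM).
function runInBareEngine(source) {
  try {
    return { ok: true, value: vm.runInNewContext(source, {}) };
  } catch (err) {
    return { ok: false, error: err.name + ": " + err.message };
  }
}

// Pure computation needs nothing beyond the language itself, so it succeeds.
const pure = runInBareEngine(
  "[1, 2, 3].map(function (n) { return n * 2; }).join(',')"
);

// A typical page script expects a `document` object that only a browser
// (or a DOM implementation) supplies, so it fails with a ReferenceError.
const pageScript = runInBareEngine(
  "document.getElementById('content').innerHTML = 'hi'"
);

console.log(pure);
console.log(pageScript);
```

The first script only needs the language, so the engine alone is enough.  The second one fails before it does anything useful, because nothing has defined document; that is roughly the situation when scripts pulled out of a downloaded page are handed to a standalone engine.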

In order to mimic the functionality of a web browser, you need an actual web browser - something Jurassic explicitly does not aim to provide.  If the built-in WebBrowser control does not work, then you need to try embedding another browser engine instead (I suggest WebKit, as it is commonly embedded into other software, though I haven't done it myself).  Note that browser engines are huge and complex and will unavoidably use lots of memory.