Extracting Data from Web Pages with AgentQL and BoxLang

I discovered AgentQL a few weeks ago and have been thinking about it quite a bit. In a nutshell, it lets you perform queries against a web page. They’ve got a simple query language that kinda reminds me of GraphQL, but simpler. So for example, consider the page you are on right now – if I wanted to get the tags, I could use this query:

    { tags[] }

And it would return:

    { "tags": [ "#development", "#boxlang" ] }

What if I wanted the links? I could change my query to express this:

    { tags[] { label url } }

And then get:

    { "tags": [ { "label": "#development", "url":...
Posted in: JavaScript
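
To make the two query shapes in the excerpt above a bit more concrete, here is a minimal sketch in plain JavaScript. It does not call AgentQL at all. `buildListQuery` and `tagLabels` are hypothetical helper names made up for illustration, and the only data used is the flat `tags` response quoted in the excerpt. The assumption is that a list of sub-fields is written inside braces after the `[]`, as the second query in the excerpt suggests.

```javascript
// Hypothetical helpers for illustration only; they are not part of AgentQL.

// Build a query string for a list field, optionally with sub-fields,
// following the syntax shown in the excerpt.
function buildListQuery(name, fields = []) {
  const inner = fields.length > 0 ? ` { ${fields.join(" ")} }` : "";
  return `{ ${name}[]${inner} }`;
}

// Read tag labels from either response shape: a list of plain strings,
// or a list of objects that carry a `label` property.
function tagLabels(response) {
  return (response.tags ?? []).map((tag) =>
    typeof tag === "string" ? tag : tag.label
  );
}

// The flat response quoted in the excerpt.
const flatResponse = JSON.parse('{ "tags": [ "#development", "#boxlang" ] }');

console.log(buildListQuery("tags"));
console.log(buildListQuery("tags", ["label", "url"]));
console.log(tagLabels(flatResponse));
```

Handling both shapes in `tagLabels` means the calling code would not need to change if the query is later widened from `{ tags[] }` to include `label` and `url`.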

Using Generative AI to Parse Web Pages into Data

A few months back, I took a look at using JSON-LD to turn a recipe web page into pure data: Scraping Recipes Using Node.js, Pipedream, and JSON-LD. This relied on a recipe actually using JSON-LD in the header to describe itself, which is pretty common for SEO purposes. Still, I was curious as to how well generative AI could solve this problem. In theory, this could be a good ‘backup’ in cases where a site wasn’t using JSON-LD and a general exploration of ‘parsing’ a web page into data. I’ll be using Google Gemini again, but in theory, this demo would work in other services as well. Here’s what I found.

Converting a Web Page into Structured Data

In order...
Posted in: JavaScript