How do I get the text shown on websites with HTTPService instead of everything else?

Asked by

8iw 10

8 years ago

From wiki:

-- Simple example of a GET request sent to example.com
hs = game:GetService("HttpService")
test = hs:GetAsync("http://example.com", true)
-- test now contains the html text representation of example.com.

When printing test it prints all kinds of stuff and not just the text on the website.

I want it to print just the text shown on the website. Is there a way to do this?

You don't. You have to parse it yourself. XAXA 1569 — 8y

How so? 8iw 10 — 8y

Depends. What exactly from the website do you want to extract? You'll need some experience with string manipulation (e.g. string.match, string.gsub) XAXA 1569 — 8y

I'm just trying to get the text shown on example.com so I know how to get JUST the text from websites with HTTPService, currently it just returns like all sorts of css and html stuff and all I'm looking for is the text. 8iw 10 — 8y

Which text from http://example.com ? XAXA 1569 — 8y

Any, but for now lets just say the header( <h1> ) 8iw 10 — 8y

? 8iw 10 — 8y

1 answer

Answered by

BlueTaslem 18071

8 years ago

HTTP is a protocol that sends files from a server to a client.

The vast majority of the web is over HTTP, and the vast majority of the web uses HTML.

HTML is a language which embeds several other languages (primarily CSS and JS) to encompass the content, behavior, and appearance of webpages.

It does this using tags. HTML has changed somewhat dramatically over different versions, but here's how HTML5 (the most recent) works.

Tags can be void tags. These do not contain content. For example, an image does not have any "content" except its source, so it looks like one of the following:

<img src="http://image-url">
<img src="http://image-url" />

Note that the final slash is optional (depending on the doctype)

Tags can be comments which look like this:

<!-- this is a comment-->

Tags that can have content have an open tag and a close tag. They look like this:

<open>
</open>

Be aware that tags have attributes and that these attributes may have <, > and other symbols in them:

<open class=">" data-attribute="this is not content">CONTENT CONTENT</open>

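Be aware also that attribute values can be quoted with either ' or ", or left unquoted. The state machine below additionally treats \ as an escape character inside quoted values (so \" does not end the value, and \\ is a literal backslash), although HTML itself does not define backslash escapes; a literal quote is normally written as an entity such as &quot;.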

State Machine

Here's a state machine that can probably be used to parse HTML. I'm not going to pretend I know all of the details of the HTML spec, but this should at least be able to separate tags from non-tag stuff.

Note: Several tags, e.g., <script> and <style>, contain code. You'd probably want to filter that out too, but I won't because it will make this implementation too complex.

-- (untested, likely to have bugs)
-- input is the raw HTML source of the page
local input = game:GetService("HttpService"):GetAsync("http://example.com")
local state = "text"
local output = ""

for i = 1, #input do
    local c = input:sub(i, i)
    if state == "text" then
        if c == "<" then
            state = "tag"
        else
            output = output .. c
        end
    elseif state == "tag" then
        if c == ">" then
            state = "text"
        elseif c == "=" then
            state = "attributewait"
        end
    elseif state == "attributewait" then -- waiting for ', ", or a word
        if c == "'" then
            state = "attributesingle"
        elseif c == "\"" then
            state = "attributedouble"
        elseif c == ">" then -- tag ended before any value appeared
            state = "text"
        elseif c:match("%S") then
            state = "attributeword"
        end
    elseif state == "attributeword" then -- unquoted value ends at whitespace or >
        if c == ">" then
            state = "text"
        elseif c:match("%s") then
            state = "tag"
        end
    elseif state == "attributesingle" then
        if c == "\\" then
            state = "attributesingleescape"
        elseif c == "'" then
            state = "tag"
        end
    elseif state == "attributesingleescape" then
        -- the escaped character is skipped, whatever it is
        state = "attributesingle"
    elseif state == "attributedouble" then
        if c == "\\" then
            state = "attributedoubleescape"
        elseif c == "\"" then
            state = "tag"
        end
    elseif state == "attributedoubleescape" then
        -- the escaped character is skipped, whatever it is
        state = "attributedouble"
    end
end

print(output)
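
The note about <script> and <style> could be handled as a preprocessing pass that runs before the state machine. What follows is only a minimal sketch, assuming plain Lua patterns are good enough for the page at hand. stripBlocks and stripCode are hypothetical helper names, not part of any Roblox API, and the approach will be fooled by things like a literal "</script>" inside a JavaScript string.

```lua
-- Hypothetical helper: removes every <name ...>...</name> block,
-- matching the tag name case-insensitively.
local function stripBlocks(html, name)
    -- Lua patterns have no case-insensitive flag, so turn "script"
    -- into "[sS][cC][rR][iI][pP][tT]".
    local ci = name:gsub("%a", function(ch)
        return "[" .. ch:lower() .. ch:upper() .. "]"
    end)
    -- ".-" is a lazy match, so each block ends at the nearest closing tag.
    local pattern = "<" .. ci .. "[^>]*>.-</" .. ci .. "%s*>"
    return (html:gsub(pattern, ""))
end

-- Hypothetical helper: drops the two kinds of embedded code mentioned above.
local function stripCode(html)
    return stripBlocks(stripBlocks(html, "script"), "style")
end
```

Under these assumptions you would call stripCode(input) once and feed the result to the loop above instead of the raw input.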

One final thing to be aware of is HTML entity codes. In order to represent non-ASCII or special characters (like <) in HTML, entity codes are used.

These can look like this: &amp; where a name is used, or &#60; where a number is used. These are used in both attributes and in tag content.
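
Decoding entity codes can be done on the extracted text after the tags are gone. The following is a rough sketch, not a complete decoder: decodeEntities and the NAMED table are hypothetical names, only a handful of named entities are listed (the real list is far longer), and numeric codes outside ASCII are deliberately left untouched.

```lua
-- A few named entities only; anything not listed here is left as-is.
local NAMED = { amp = "&", lt = "<", gt = ">", quot = '"', apos = "'" }

-- Hypothetical helper: decodes named, decimal, and hexadecimal entities
-- in a single pass, so "&amp;lt;" becomes "&lt;" rather than "<".
local function decodeEntities(s)
    return (s:gsub("&(#?%w+);", function(body)
        if body:sub(1, 1) == "#" then
            local digits = body:sub(2)
            local code
            if digits:match("^[xX]%x+$") then
                code = tonumber(digits:sub(2), 16) -- e.g. &#x3C;
            elseif digits:match("^%d+$") then
                code = tonumber(digits)            -- e.g. &#60;
            end
            if code and code < 128 then
                return string.char(code)
            end
            return nil -- returning nil keeps the original text
        end
        return NAMED[body] -- nil for unknown names, which keeps the original
    end))
end
```

Running this on the final output, rather than on the raw HTML, avoids turning an encoded &lt; back into something the state machine would mistake for the start of a tag.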

Aware of Tags

Here is a modified version of the state machine which keeps track of tag names.

This machine calls TAG whenever a tag name has been read. By paying attention to opening and closing tags, you can select only pieces of the HTML. For example, by recording text from the moment an h1 tag is seen until a /h1 tag is seen, you can get the body of the heading:

local input = game:GetService("HttpService"):GetAsync("http://www.example.com/")
-- (untested, likely to have bugs)
local state = "text"
local output = ""

local tagname = nil

-- true while we are between <h1> and </h1>
local writing = false

local function TAG(name)
    name = name:lower()
    -- strip the trailing slash of self-closing tags such as <br/>
    while name:sub(-1) == "/" do
        name = name:sub(1, -2)
    end
    if name == "h1" then
        writing = true
    end
    if name == "/h1" then
        writing = false
    end
end

local function WRITE(s)
    if writing then
        output = output .. s
    end
end

for i = 1, #input do
    local c = input:sub(i, i)
    if state == "text" then
        if c == "<" then
            state = "tagname"
            tagname = ""
        else
            WRITE(c)
        end
    elseif state == "tagname" then
        if c:match("%s") then
            TAG(tagname)
            state = "tag"
        elseif c == ">" then
            TAG(tagname)
            state = "text"
        else
            tagname = tagname .. c
        end
    elseif state == "tag" then
        if c == ">" then
            state = "text"
        elseif c == "=" then
            state = "attributewait"
        end
    elseif state == "attributewait" then -- waiting for ', ", or a word
        if c == "'" then
            state = "attributesingle"
        elseif c == "\"" then
            state = "attributedouble"
        elseif c == ">" then -- tag ended before any value appeared
            state = "text"
        elseif c:match("%S") then
            state = "attributeword"
        end
    elseif state == "attributeword" then -- unquoted value ends at whitespace or >
        if c == ">" then
            state = "text"
        elseif c:match("%s") then
            state = "tag"
        end
    elseif state == "attributesingle" then
        if c == "\\" then
            state = "attributesingleescape"
        elseif c == "'" then
            state = "tag"
        end
    elseif state == "attributesingleescape" then
        state = "attributesingle"
    elseif state == "attributedouble" then
        if c == "\\" then
            state = "attributedoubleescape"
        elseif c == "\"" then
            state = "tag"
        end
    elseif state == "attributedoubleescape" then
        state = "attributedouble"
    end
end

print(output)

Errata

I have neglected to properly parse comments. That will cause big problems in pages that use them.
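
One way to work around this without touching the state machine might be to strip comments from the input first. This is only a sketch under that assumption: stripComments is a hypothetical helper name, and an unterminated comment is simply left in place here, which is not how a browser would treat it.

```lua
-- Hypothetical helper: drops <!-- ... --> comments before parsing.
local function stripComments(html)
    -- "%-" escapes the hyphen; ".-" is a lazy match, so each comment
    -- ends at the first "-->" that follows its "<!--".
    return (html:gsub("<!%-%-.-%-%->", ""))
end
```

Calling stripComments(input) before the loop means any tags written inside a comment never reach the state machine.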

How would I use that to get the header text from example.com? 8iw 10 — 8y

You would need to amend this to keep track of which open and closing tags you have seen. BlueTaslem 18071 — 8y

For certain tags that is not so easy, because some open/close tags can be implied. For something like <h1> it should be relatively straightforward BlueTaslem 18071 — 8y

Edited to include that code BlueTaslem 18071 — 8y
