From wiki:
--Simple Example of a GET request sent to example.com
hs = game:GetService("HttpService")
test = hs:GetAsync("http://example.com", true)
-- test now contains the html text representation of example.com.
When printing test it prints all kinds of stuff and not just the text on the website.
I want it to print just the text shown on the website. Is there a way to do this?
HTTP is a protocol that sends files from a server to a client.
The vast majority of the web is over HTTP, and the vast majority of the web uses HTML.
HTML is a language which includes several other languages (primarily CSS and JS) to encompass the content, behavior, and appearance of webpages.
It does this using tags. HTML has changed somewhat dramatically over different versions, but here's how HTML5 (the most recent) works.
Tags can be void tags. These do not contain content. For example, an image does not have any "content" except its source, so it looks like one of the following:
<img src="http://image-url">
<img src="http://image-url" />
Note that the final slash is optional (depending on the doctype).
Tags can be comments which look like this:
<!-- this is a comment-->
Tags that can have content have an open tag and a close tag. They look like this:
<open> </open>
Be aware that tags have attributes and that these attributes may have <, > and other symbols in them:
<open class=">" data-attribute="this is not content">CONTENT CONTENT</open>
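Be aware that attribute values can be quoted with ' or ", or written bare (unquoted). The state machines below also treat \ as an escape character inside quoted values, with \\ as an escaped backslash, although the HTML specification itself defines no backslash escapes: a quote inside a value is written as an entity such as &quot;.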
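To see why attribute values matter, here is a small sketch in plain Lua (no Roblox APIs; the variable names are made up for illustration) that strips tags with a single naive pattern. Because the pattern stops at the first >, the > inside the class attribute ends the match early and attribute text leaks into the output:

```lua
-- Naive approach: delete everything from "<" up to the next ">".
-- Illustrative only; the names here are not from the original answer.
local html = [[<open class=">" data-attribute="this is not content">CONTENT CONTENT</open>]]

-- gsub returns the new string and a match count; only the string is kept.
local naive = html:gsub("<.->", "")

print(naive)
-- prints: " data-attribute="this is not content">CONTENT CONTENT
```

This is the reason a parser has to track whether it is currently inside a quoted attribute value.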
Here's a state machine that can probably be used to parse HTML. I'm not going to pretend I know all of the details of the HTML spec, but this should at least be able to separate tags from non-tag stuff.
Note: Several tags, e.g., <script> and <style>, contain code. You'd probably want to filter that out too, but I won't because it will make this implementation too complex.
-- (untested, likely to have bugs)
-- input is the HTML text to strip
local input = game:GetService("HttpService"):GetAsync("http://example.com")
local state = "text"
local output = ""
for i = 1, #input do
    local c = input:sub(i, i)
    if state == "text" then
        if c == "<" then
            state = "tag"
        else
            output = output .. c
        end
    elseif state == "tag" then
        if c == ">" then
            state = "text"
        elseif c == "=" then
            state = "attributewait"
        end
    elseif state == "attributewait" then
        -- waiting for ', ", or a word
        if c == "'" then
            state = "attributesingle"
        elseif c == "\"" then
            state = "attributedouble"
        elseif c == ">" then
            state = "text"
        elseif c:match("%S") then
            state = "attributeword"
        end
    elseif state == "attributeword" then
        -- a bare value ends at whitespace or at the end of the tag
        if c == ">" then
            state = "text"
        elseif c:match("%s") then
            state = "tag"
        end
    elseif state == "attributesingle" then
        if c == "\\" then
            state = "attributesingleescape"
        elseif c == "'" then
            state = "tag"
        end
    elseif state == "attributesingleescape" then
        -- skip the escaped character
        state = "attributesingle"
    elseif state == "attributedouble" then
        if c == "\\" then
            state = "attributedoubleescape"
        elseif c == "\"" then
            state = "tag"
        end
    elseif state == "attributedoubleescape" then
        -- skip the escaped character
        state = "attributedouble"
    end
end
print(output)
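As for the <script> and <style> note above, one way to keep the state machine simple is to delete those elements in a separate pass first. The following is only a sketch in plain Lua: dropElement is a hypothetical helper, not part of the original answer, and it assumes the opening tag has no > inside its attribute values and that the code inside the element never contains its own closing tag as text.

```lua
-- Hypothetical helper (not from the original answer): remove an element and
-- everything inside it, e.g. <script>...</script>, before stripping tags.
local function dropElement(html, name)
    -- Lua patterns have no case-insensitive flag, so turn "script" into
    -- "[sS][cC][rR][iI][pP][tT]".
    local ci = name:gsub("%a", function(ch)
        return "[" .. ch:lower() .. ch:upper() .. "]"
    end)
    -- Opening tag (with any attributes), shortest possible body, closing tag.
    -- Assumption: no ">" appears inside the opening tag's attribute values.
    local pattern = "<" .. ci .. "[^>]*>.-</" .. ci .. "%s*>"
    return (html:gsub(pattern, ""))
end

local page = "a<script>var x = '<b>';</script>b<STYLE type='text/css'>p{}</STYLE>c"
local cleaned = dropElement(dropElement(page, "script"), "style")
print(cleaned)  -- prints: abc
```

The result of such a pass could then be fed to the state machine as its input string.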
One final thing to be aware of is HTML entity codes. In order to represent non-ASCII or special characters (like <) in HTML, entity codes are used.
These can look like this: &amp; where a name is used, or &#60; where a number is used. These are used in both attributes and in tag content.
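Text taken out of a page will still contain these codes, so a decoding step is usually wanted afterwards. Here is a minimal sketch in plain Lua. decodeEntities and the NAMED table are hypothetical names, only a handful of named entities are listed (the real table is far longer), and numeric codes above 127 are deliberately left untouched because turning them into text depends on the encoding you want.

```lua
-- Hypothetical helper, not part of the original answer.
-- A tiny subset of the named entities; extend as needed.
local NAMED = { amp = "&", lt = "<", gt = ">", quot = "\"", apos = "'" }

local function decodeEntities(s)
    -- One pass over "&...;" so that decoded text is never decoded twice.
    return (s:gsub("&(#?%w+);", function(body)
        if body:sub(1, 1) == "#" then
            local code
            local hex = body:match("^#[xX](%x+)$")
            local dec = body:match("^#(%d+)$")
            if hex then
                code = tonumber(hex, 16)
            elseif dec then
                code = tonumber(dec, 10)
            end
            if code and code < 128 then
                return string.char(code)
            end
            return nil  -- nil keeps the original text
        end
        return NAMED[body]  -- unknown names yield nil and are kept as-is
    end))
end

print(decodeEntities("1 &lt; 2 &amp;&amp; 3 &#62; 2"))  -- prints: 1 < 2 && 3 > 2
```

Decoding should happen after the tags have been removed, otherwise a decoded < would be mistaken for the start of a tag.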
Here is a modified version of the state machine which keeps track of tag names.
This machine calls TAG whenever a tag is processed. By paying attention to opening and closing tags, you can select only pieces of the HTML. For example, by recording whenever an h1 tag is received until a /h1 tag is received, you can get the body of the heading:
local input = game:GetService("HttpService"):GetAsync("http://www.example.com/")
-- (untested, likely to have bugs)
local state = "text"
local output = ""
local tagname = nil
local writing = false

-- called once per tag with its name, e.g. "h1", "/h1" or "br/"
local function TAG(name)
    name = name:lower()
    -- drop the trailing slash of tags written like <br/>
    while name:sub(-1) == "/" do
        name = name:sub(1, -2)
    end
    if name == "h1" then
        writing = true
    end
    if name == "/h1" then
        writing = false
    end
end

-- keeps text only while inside the selected element
local function WRITE(s)
    if writing then
        output = output .. s
    end
end

for i = 1, #input do
    local c = input:sub(i, i)
    if state == "text" then
        if c == "<" then
            state = "tagname"
            tagname = ""
        else
            WRITE(c)
        end
    elseif state == "tagname" then
        if c:match("%s") then
            TAG(tagname)
            state = "tag"
        elseif c == ">" then
            TAG(tagname)
            state = "text"
        else
            tagname = tagname .. c
        end
    elseif state == "tag" then
        if c == ">" then
            state = "text"
        elseif c == "=" then
            state = "attributewait"
        end
    elseif state == "attributewait" then
        -- waiting for ', ", or a word
        if c == "'" then
            state = "attributesingle"
        elseif c == "\"" then
            state = "attributedouble"
        elseif c == ">" then
            state = "text"
        elseif c:match("%S") then
            state = "attributeword"
        end
    elseif state == "attributeword" then
        -- a bare value ends at whitespace or at the end of the tag
        if c == ">" then
            state = "text"
        elseif c:match("%s") then
            state = "tag"
        end
    elseif state == "attributesingle" then
        if c == "\\" then
            state = "attributesingleescape"
        elseif c == "'" then
            state = "tag"
        end
    elseif state == "attributesingleescape" then
        state = "attributesingle"
    elseif state == "attributedouble" then
        if c == "\\" then
            state = "attributedoubleescape"
        elseif c == "\"" then
            state = "tag"
        end
    elseif state == "attributedoubleescape" then
        state = "attributedouble"
    end
end
print(output)
I have neglected to properly parse comments. That will cause big problems in pages that use them.
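One low-effort way around the comment problem is to delete comments before the state machine runs. This is only a sketch in plain Lua: stripComments is a hypothetical helper, not part of the original answer, and it leaves an unterminated comment untouched.

```lua
-- Hypothetical pre-pass (not from the original answer): delete HTML comments
-- so the state machine never sees them.
local function stripComments(html)
    -- "<!--", then the shortest run of any characters, then "-->".
    -- "-" is a magic character in Lua patterns, hence the "%-" escapes.
    return (html:gsub("<!%-%-.-%-%->", ""))
end

print(stripComments("before<!-- a > b, x = 'y -->after"))  -- prints: beforeafter
```

The example comment contains both a > and an unbalanced quote, the two things that would otherwise knock the state machine out of step.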