So I'm working on a project that will load HTML into a frame, if I got the HTML using GetAsync(), how would I load this into a frame or textlabel?
This question is way too massive.
Here is the spec for HTML5. More accurately, that's the table of contents for the spec of HTML5.
I began to examine this problem at this place, though the system displayed there is very, very far from complete (and also prone to crashing -- I'm not sure if you will get to see a result if you visit)
In general, HTML5[1] has three main different structures[2]:

    <open>content</close>
    <void>
    <void/>
    &entity;
    <!-- a comment -->
    <!doctype blah>
where `content` can be any other HTML string, and `open` and `close` should match. The things in the `<>` are called tags. I will skip entities; they are basically things that start with `&` and end with `;`. (This makes the ampersand on its own invalid almost everywhere.)
Void tags are tags without a body (e.g., `<br>` for line break, `<hr>` for (what used to be) "horizontal rule"). The `/` at the end is optional[3] and is merely a matter of style. Even without the closing slash, they don't "contain" any other HTML tags.
Tags are void depending on the tag name, not based on context/usage. E.g., `script` is never void, and `link` is always void.
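Since voidness depends only on the tag name, a parser can get away with a plain lookup table. A minimal sketch in Python (the same idea ports to any language); `VOID_TAGS` and `is_void` are names I made up, and the set is the list of void elements from the HTML standard as I remember it, so double-check it against the spec:

```python
# Void element names (assumed from the HTML standard; verify before relying on it).
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


def is_void(tag_name: str) -> bool:
    """Return True if the tag never has a body, whatever the context."""
    # Tag names are case-insensitive, so normalize before the lookup.
    return tag_name.lower() in VOID_TAGS
```

With this, `is_void("BR")` and `is_void("link")` are true, while `is_void("script")` is false.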
If you're going to assume a particular document type (e.g., HTML5) you can ignore the special doctype tag.
Comments begin with `<!--` and end with `-->` (and are not allowed to contain any other `--` in the middle).
Case in tag names is ignored. Tag names are always a single word using only Latin letters (EDIT) and numbers[4].
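To make the structures so far concrete, here is a rough tokenizer sketch in Python. It is only an illustration under big assumptions: `TOKEN` and `tokenize` are hypothetical names, a `>` inside a quoted attribute value will confuse it, and it knows nothing about the special handling of `<script>` contents.

```python
import re

# One alternative per structure; order matters (comments before doctype,
# close tags before open tags).
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<doctype><![^>]*>)"
    r"|(?P<close></[A-Za-z][A-Za-z0-9]*\s*>)"
    r"|(?P<open><[A-Za-z][A-Za-z0-9]*[^>]*>)"
    r"|(?P<entity>&[A-Za-z0-9#]+;)"
    r"|(?P<text>[^<&]+|[<&])",  # plain text, or a stray '<' / '&'
    re.DOTALL,
)


def tokenize(html):
    """Split an HTML string into (kind, raw_text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(html)]


for kind, raw in tokenize("<p>a &amp; b<br/><!-- c --></p>"):
    print(kind, repr(raw))
```

Void tags such as `<br/>` come out as `open` tokens here; deciding that they have no body is a separate step based on the tag name.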
Tags (except for `</close>`) can also have attributes. Attributes take two[5] forms:

    attribute=value
    attribute

where the second is equivalent to `attribute=""` (EDIT: this originally incorrectly stated this was equivalent to `attribute=attribute`). Again, attribute names are case-insensitive and one word[6].
The `value` in an attribute may be quoted using `"` or `'`, or it may be left bare (and a space ends the attribute value).
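The attribute rules above can be sketched with one regular expression. This is a simplified illustration in Python, not a spec-complete parser: `ATTR` and `parse_attributes` are hypothetical names, the input is assumed to be the text between the tag name and the closing `>`, and keeping the first of any duplicated attribute is my own choice.

```python
import re

ATTR = re.compile(
    r"""([^\s=/>"']+)            # attribute name
        (?:\s*=\s*
            (?:"([^"]*)"         # double-quoted value
              |'([^']*)'         # single-quoted value
              |([^\s>"']+)       # bare value, ended by whitespace
            )
        )?""",
    re.VERBOSE,
)


def parse_attributes(text):
    """Parse the attribute part of a tag into a dict of lowercase names."""
    attrs = {}
    for m in ATTR.finditer(text):
        name = m.group(1).lower()  # attribute names are case-insensitive
        # A bare `attribute` with no `=` gets the empty string as its value.
        value = next((g for g in m.group(2, 3, 4) if g is not None), "")
        attrs.setdefault(name, value)  # keep the first occurrence
    return attrs


print(parse_attributes('href="a b" CLASS=x disabled'))
```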
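Not all tags can contain all other tags. E.g.,

    <ul> <li>Blah <li>Blah </ul>

is NOT the same thing as

    <ul> <li>Blah <li>Blah</li> </li> </ul>

because an `li` cannot directly contain another `li`: the second `<li>` implicitly closes the first one, so you get two sibling items.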
Look up a reference like MDN to see which tags can contain others. The main ones which will be "automatically closed" are `li`, `p`, `body`, and `head`.
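One way to picture the automatic closing is a stack of open tags plus a small rule table. The Python sketch below is heavily simplified and every name in it (`CLOSED_BY`, `add_implied_end_tags`) is made up for illustration; the table covers only a tiny, assumed subset of the real rules, and void tags are not handled.

```python
# Assumed, simplified rules: if the innermost open tag is the key, opening
# any tag in the set closes it first.
CLOSED_BY = {
    "li": {"li"},
    "p": {"p", "ul", "ol", "div", "h1", "h2", "h3", "h4", "h5", "h6"},
}


def add_implied_end_tags(tags):
    """Take names like ["ul", "li", "li", "/ul"] and write out omitted end tags."""
    out, stack = [], []
    for tag in tags:
        if tag.startswith("/"):
            name = tag[1:]
            if name not in stack:
                continue  # stray end tag: ignore it
            # Closing a parent also closes everything still open inside it.
            while stack:
                top = stack.pop()
                out.append("/" + top)
                if top == name:
                    break
        else:
            if stack and tag in CLOSED_BY.get(stack[-1], ()):
                out.append("/" + stack.pop())
            stack.append(tag)
            out.append(tag)
    # End of input closes whatever is left open.
    while stack:
        out.append("/" + stack.pop())
    return out


print(add_implied_end_tags(["ul", "li", "li", "/ul"]))
```

For the `ul` example above, this yields two sibling `li` items rather than one nested inside the other.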
[1] HTML 5 is the most recent standardization; older specs (particularly HTML 4 and XHTML 1.x) have more rigid / different rules than HTML 5.
[2] The `<script>` tag disables most of the HTML parsing rules (so, e.g., `<` doesn't need to be escaped). The keyword that starts parsing rules again is `</script` (which also means that if JS wants to include that as text, it needs to break it up somehow).
[3] Prior to HTML5, various document types required a particular choice. For instance, XHTML 1.x requires the `/`, while HTML 4 requires that it not be present.
[4] I had forgotten about `h1`, `h2`, ..., `h6`. Tag names can contain numbers.
[5] Again, HTML5 is lenient here. Including an attribute without an `=` implies an empty value. E.g., `<blah info>` is the same as `<blah info="">`. For boolean attributes, `<blah info=info>` is also accepted and means the same thing. XHTML doesn't allow this sugar and requires the long form; the boolean attributes added in HTML5 make the short form more common.
[6] I think some may contain numbers, but all definitely start with Latin letters.