Parsing HTML to frame?

Asked by 9 years ago

So I'm working on a project that will load HTML into a frame. If I fetch the HTML using GetAsync(), how would I load it into a Frame or TextLabel?

1 answer

Answered by
BlueTaslem 18071 Moderation Voter Administrator Community Moderator Super Administrator
9 years ago

This question is way too massive.

Here is the spec for HTML5. More accurately, that's the table of contents for the spec of HTML5.

I began to examine this problem at this place, though the system displayed there is very, very far from complete (and also prone to crashing; I'm not sure you will get to see a result if you visit).


Overview of HTML

In general, HTML5 [1] has a few main structures [2]:

<open>content</close>
<void>
<void/>
&entity;
<!-- a comment -->
<!doctype blah>

where content can be any other HTML string, and open and close should match. The things in the <> are called tags. I will skip entities; they are basically things started with & and ended with ;. (This makes the ampersand on its own invalid almost everywhere).
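As a rough illustration of how these structures could be told apart, here is a minimal regex-based sketch in Python (not Roblox Lua; the same idea would have to be ported to Lua string patterns). The names TOKEN_RE and tokenize are made up for this example, and it deliberately ignores hard cases such as a > inside a quoted attribute value or the special handling of <script>.

```python
import re

# One alternative per structure listed above; anything else is plain text.
# Simplification: an open tag ends at the first ">", even inside quotes.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<doctype><![^>]*>)"
    r"|(?P<close></[A-Za-z][A-Za-z0-9]*\s*>)"
    r"|(?P<open><[A-Za-z][A-Za-z0-9]*[^>]*>)"
    r"|(?P<entity>&#?[A-Za-z0-9]+;)"
    r"|(?P<text>[^<&]+|[<&])",  # a stray < or & is kept as text
    re.DOTALL,
)


def tokenize(html):
    """Return a list of (kind, raw_text) pairs covering the whole input."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(html)]


if __name__ == "__main__":
    for kind, raw in tokenize("<!doctype html><p>a &amp; b<br/></p><!-- c -->"):
        print(kind, repr(raw))
```

Void tags (with or without the trailing /) come out as "open" tokens here; deciding whether an open tag is void is a separate lookup by tag name.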

Void tags are tags without a body (e.g., <br> for line break, <hr> for (what used to be) "horizontal rule"). The / at the end is optional [3] and is merely a matter of style. Even without the closing slash, they don't "contain" any other HTML tags.

Tags are void depending on the tag name, not based on context/usage. E.g., script is never void, and link is always void.
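Since voidness depends only on the tag name, a lookup table is enough. A minimal Python sketch, assuming the void element list from the current HTML standard (older drafts listed a few more, such as param and keygen); VOID_TAGS and is_void are names invented for this example.

```python
# Void element names as listed in the current HTML standard (assumption:
# check the spec if you need to support obsolete elements as well).
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def is_void(tag_name):
    """True if the tag never has content or a closing tag."""
    return tag_name.lower() in VOID_TAGS


if __name__ == "__main__":
    print(is_void("BR"), is_void("link"), is_void("script"))
```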

If you're going to assume a particular document type (e.g., HTML5) you can ignore the special doctype tag.

Comments begin with <!-- and end with --> (and are not allowed to contain any other -- in the middle)

Case in tag names is ignored. Tag names are always a single word using only Latin letters (EDIT) and numbers [4].
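Because case is ignored, it is convenient to normalize tag names once, when the tag is read. A small Python sketch of that idea; TAG_NAME_RE and tag_name are hypothetical names, and the pattern assumes the "letter first, then letters or digits" rule described here.

```python
import re

# Optional "/" for closing tags, then a letter followed by letters/digits.
TAG_NAME_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")


def tag_name(raw_tag):
    """Return the lowercased tag name of a raw tag, or None if it has none."""
    m = TAG_NAME_RE.match(raw_tag)
    return m.group(1).lower() if m else None


if __name__ == "__main__":
    print(tag_name("<H1 class=x>"), tag_name("</DIV>"), tag_name("<!-- x -->"))
```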

Tags (except for </close>) can also have attributes. Attributes take two [5] forms:

attribute=value
attribute

where the second is equivalent to attribute="" (EDIT this originally incorrectly stated this was equivalent to attribute=attribute). Again, attribute names are case-insensitive and one word [6].

The value in an attribute may be quoted using ", ', or it may be left bare (and a space ends the attribute value).
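The attribute rules above can be sketched with one regular expression. This is an illustrative Python version, not a complete implementation: ATTR_RE and parse_attributes are made-up names, the input is assumed to be the part of a start tag after the tag name, and entities inside values are not decoded.

```python
import re

ATTR_RE = re.compile(
    r"""([^\s"'<>/=]+)          # attribute name
        (?:\s*=\s*
            (?:"([^"]*)"        # double-quoted value
              |'([^']*)'        # single-quoted value
              |([^\s"'=<>`]+)   # bare value, ended by whitespace
            )
        )?                      # the whole "=value" part is optional
    """,
    re.VERBOSE,
)


def parse_attributes(text):
    """Parse the attribute part of a start tag into a dict of name -> value."""
    attrs = {}
    for m in ATTR_RE.finditer(text):
        name = m.group(1).lower()  # attribute names are case-insensitive
        # Pick whichever value form matched; a bare attribute gets "".
        value = next((g for g in m.groups()[1:] if g is not None), "")
        attrs.setdefault(name, value)  # keep the first occurrence of a name
    return attrs


if __name__ == "__main__":
    print(parse_attributes(' href="a b" class=\'x\' id=main disabled'))
```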


Not all tags can contain all other tags. E.g.,

<ul>
    <li>Blah
    <li>Blah
</ul>

Is NOT the same thing as

<ul>
    <li>Blah
        <li>Blah</li>
    </li>
</ul>

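In the first snippet, each <li> is closed automatically when the next <li> opens, so it produces two sibling items, not the nesting written out in the second. Look up a reference like MDN to see which tags can contain others. The main ones which will be "automatically closed" are li, p, body, and head.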
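To show what "automatically closed" means for a parser, here is a heavily simplified Python sketch that builds a tree from a token list. IMPLIED_CLOSE and build_tree are hypothetical names, the token format is invented for the example, and the rule table covers only li and p; the real rules are much longer (for instance, many block-level tags also close an open <p>) and look further up the stack than just the innermost element.

```python
# Opening the tag on the left implicitly closes an innermost open element
# whose name is in the set on the right (tiny illustrative subset only).
IMPLIED_CLOSE = {"li": {"li"}, "p": {"p"}}


def build_tree(tokens):
    """Build nested dicts from (kind, value) tokens; kind is open/close/text."""
    root = {"tag": "#root", "children": []}
    stack = [root]  # currently open elements, innermost last
    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            if stack[-1]["tag"] in IMPLIED_CLOSE.get(value, ()):
                stack.pop()  # e.g. a new <li> ends the previous <li>
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "close":
            # Close the nearest matching open element; ignore stray closers.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == value:
                    del stack[i:]
                    break
    return root


if __name__ == "__main__":
    tokens = [("open", "ul"), ("open", "li"), ("text", "Blah"),
              ("open", "li"), ("text", "Blah"), ("close", "ul")]
    ul = build_tree(tokens)["children"][0]
    print([child["tag"] for child in ul["children"]])
```

Void tags are not handled here; a fuller version would append them to the current element without pushing them on the stack.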


  1. HTML 5 is the most recent standardization; older specs (particularly HTML 4 and XHTML 1.x) have more rigid / different rules than HTML 5. 

  2. The <script> tag disables most of the HTML parsing rules (so, e.g., < doesn't need to be escaped). The keyword that starts parsing rules again is </script (which also means if JS wants to include that as text it needs to break it up somehow) 

  3. Prior to HTML5, various document types required a particular choice. For instance, XHTML 1.x requires the / while HTML 4 expects it to be absent. 

  4. I had forgotten about h1, h2, ..., h6. Tag names can contain numbers. 

  5. Again, HTML5 has a special syntax here. Including an attribute without an = gives it the empty string as its value. E.g., <blah info> is the same as <blah info="">. XHTML does not allow this sugar, and HTML 4 allowed it only for boolean attributes such as selected or checked (where it stood for attrname=attrname). 

  6. I think some may contain numbers, but all definitely start with Latin letters 
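As a companion to note 2, here is a small Python sketch of how a parser might read the body of a <script> element: it skips ahead to the next </script, ignoring case, without interpreting anything in between. SCRIPT_END_RE and read_script_body are invented names, and the sketch does not check what follows </script (the real rule expects whitespace, / or > there).

```python
import re

SCRIPT_END_RE = re.compile(r"</script", re.IGNORECASE)


def read_script_body(html, start):
    """Return (body, end) where body is html[start:end] and end is the
    index of the next "</script", or len(html) if there is none."""
    m = SCRIPT_END_RE.search(html, start)
    end = m.start() if m else len(html)
    return html[start:end], end


if __name__ == "__main__":
    page = '<script>if (a < b) { x = "<p>"; }</SCRIPT><p>'
    body, end = read_script_body(page, len("<script>"))
    print(repr(body), repr(page[end:]))
```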
