Why is the parser so memory hungry?

Short answer

DOM parsers generally require a lot of memory to represent the document tree and its attributes in memory. If memory is a concern, consider using a SAX parser instead.

Answer

The parser loads the entire document tree and its attributes into memory. This is called the Document Object Model (DOM).

The DOM is not just a copy of the source document. It represents each element in the source document by an object in memory. The result looks like a tree, which is why its called the document tree:


            html
           /    \
       head      body
      /    \         \
 title      meta      div
                     /   \
                    ul    a
                   /  \
                 li    li

Note: Attributes, contents and closing tags were omitted for simplicity.

In this example, for each node the parser needs to store

the name of the node ('html', 'head', 'body', 'title', ...),
a reference to the parent node (i.e. 'div' points to 'body' which points to 'html') and
a list of references to its child nodes (i.e. 'html' points to 'head' and 'body').

Here is a simplified representation:

object
  > node_name
  > parent_node
  > child_nodes[]

While the source document only stores the node name and the opening and closing brackets (i.e. <html>), a node stores the node name as well as references to the parent and child nodes. Each of which require memory.

Example

Let's take the 'head' element and compare the source data with the object data.

This is the source data: <head> (6 Bytes)

The equivalent node (including references to parent and child nodes) has following data:

Node Object (40 Bytes for the base object + 3 x 16 Bytes for properties = 88 Bytes) ¹
Node Name "head" (4 Bytes)
Parent Node (unknown number of Bytes)
Child Nodes (8 x 36 Bytes) ²

This amounts to 380 Bytes per object. A factor of 63 compared to the source data. With larger datasets this factor will be smaller, especially when taking content data into account.

A factor of ~30 compared to the source data is realistic for DOM parsers ³. If memory is a concern, consider using a SAX parser instead.

Objects in PHP 7 by nikic ↩
PHP's new hashtable implementation by nikic ↩
Htlm Agility Pack Issue #77 by aktzpn ↩