---
title: Why is the parser so memory hungry?
---

## Short answer

DOM parsers generally need a lot of memory because they keep the entire document tree and its attributes in memory. If memory is a concern, consider using a SAX parser instead.

## Answer

The parser loads the entire document tree and its attributes into memory. This in-memory representation is called the Document Object Model (DOM).

The DOM is not just a copy of the source document. It represents each element in the source document by an object in memory. The result looks like a tree, which is why it's called the document tree:

```

            html
           /    \
       head      body
      /    \         \
 title      meta      div
                     /   \
                    ul    a
                   /  \
                 li    li

```

*Note*: Attributes, contents and closing tags were omitted for simplicity.

In this example, for each node the parser needs to store

* the name of the node ('html', 'head', 'body', 'title', ...),
* a reference to the parent node (e.g. 'div' points to 'body', which points to 'html') and
* a list of references to its child nodes (e.g. 'html' points to 'head' and 'body').

Here is a simplified representation:

```
object
  > node_name
  > parent_node
  > child_nodes[]
```

The source document only stores the node name and the opening and closing brackets (e.g. `<html>`). A node, by contrast, stores the node name as well as references to its parent and child nodes, each of which requires memory.
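The simplified representation above can be sketched in a few lines of code. The following is a minimal illustration in Python, not the parser's actual implementation: the `Node` class, its `append` method and the `path_to_root` helper are hypothetical names chosen to mirror `node_name`, `parent_node` and `child_nodes`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; the tree contains reference cycles
class Node:
    """Hypothetical node type mirroring the simplified representation."""
    node_name: str
    parent_node: Optional["Node"] = field(default=None, repr=False)
    child_nodes: List["Node"] = field(default_factory=list)

    def append(self, name: str) -> "Node":
        """Create a child node and link it in both directions."""
        child = Node(name, parent_node=self)
        self.child_nodes.append(child)
        return child


def path_to_root(node: Optional[Node]) -> List[str]:
    """Follow parent references from a node up to the root."""
    names = []
    while node is not None:
        names.append(node.node_name)
        node = node.parent_node
    return names


# Build the document tree from the diagram above.
html = Node("html")
head = html.append("head")
body = html.append("body")
head.append("title")
head.append("meta")
div = body.append("div")
ul = div.append("ul")
div.append("a")
ul.append("li")
ul.append("li")

print(path_to_root(div))                       # 'div' -> 'body' -> 'html'
print([c.node_name for c in html.child_nodes])  # children of 'html'
```

Every node carries two kinds of references in addition to its name, which is where the extra memory goes.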

## Example

Let's take the 'head' element and compare the source data with the object data.

This is the source data: `<head>` (6 Bytes)

The equivalent node (including references to parent and child nodes) has the following data:

* Node Object (40 Bytes for the base object + 3 x 16 Bytes for properties = 88 Bytes) [^1]
* Node Name "head" (4 Bytes)
* Parent Node (unknown number of Bytes)
* Child Nodes (8 x 36 Bytes) [^2]

This amounts to 380 Bytes per object (88 + 4 + 288, not counting the parent node), a factor of about 63 compared to the source data. With larger datasets this factor will be smaller, especially when taking content data into account.
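The arithmetic behind these figures can be reproduced with a short script. This is only a sketch, written in Python for convenience: the constant names are made up for illustration, and the byte counts are simply the ones listed above.

```python
# Byte counts as listed above; the constant names are illustrative only.
BASE_OBJECT = 40        # base object
PROPERTY_SLOT = 16      # per property
PROPERTIES = 3          # node name, parent node, child nodes
CHILD_SLOTS = 8         # child node list: 8 x 36 Bytes
CHILD_SLOT_BYTES = 36

node_object = BASE_OBJECT + PROPERTIES * PROPERTY_SLOT  # 40 + 3 * 16
node_name = len("head")                                 # 4 Bytes
child_nodes = CHILD_SLOTS * CHILD_SLOT_BYTES            # 8 * 36

# The parent node itself is not counted (unknown number of Bytes).
total = node_object + node_name + child_nodes
source = len("<head>")                                  # 6 Bytes
factor = total / source

print(node_object, child_nodes, total, round(factor))
```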

A factor of ~30 compared to the source data is realistic for DOM parsers [^3]. If memory is a concern, consider using a SAX parser instead.
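To illustrate the difference, here is a minimal sketch of event-driven (SAX-style) processing. It uses Python's standard `html.parser.HTMLParser` as a stand-in for a SAX parser; the `TagCounter` class and `count_tags` function are hypothetical names. The handler is called once per start tag and only a small counter is kept, so no document tree is ever built.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags as parse events arrive; no tree is stored."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called by the parser for every opening tag it encounters.
        self.counts[tag] += 1


def count_tags(chunks):
    """Feed the document piece by piece, as if read from a stream."""
    parser = TagCounter()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.counts


sample = [
    "<html><head><title>t</title><meta></head>",
    "<body><div><ul><li>a</li><li>b</li></ul><a href='#'>x</a></div></body></html>",
]
counts = count_tags(sample)
print(counts["li"], counts["div"])
```

The trade-off is that the handler only ever sees one event at a time: anything that depends on parent or sibling elements has to be tracked manually.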

[^1]: [Objects in PHP 7](https://nikic.github.io/2015/06/19/Internal-value-representation-in-PHP-7-part-2.html#objects-in-php-7) by nikic
[^2]: [PHP's new hashtable implementation](https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html#memory-utilization) by nikic
[^3]: [Html Agility Pack Issue #77](https://github.com/zzzprojects/html-agility-pack/issues/77) by aktzpn