Linux ip-172-26-7-228 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64
Your IP : 3.147.27.152
---
title: Why is the parser so memory hungry?
---
## Short answer
DOM parsers generally require a lot of memory to represent the document tree and its attributes in memory. If memory is a concern, consider using a SAX parser instead.
## Answer
The parser loads the entire document tree and its attributes into memory. This is called the Document Object Model (DOM).
The DOM is not just a copy of the source document. It represents each element in the source document by an object in memory. The result looks like a tree, which is why its called the document tree:
```
html
/ \
head body
/ \ \
title meta div
/ \
ul a
/ \
li li
```
*Note*: Attributes, contents and closing tags were omitted for simplicity.
In this example, for each node the parser needs to store
* the name of the node ('html', 'head', 'body', 'title', ...),
* a reference to the parent node (i.e. 'div' points to 'body' which points to 'html') and
* a list of references to its child nodes (i.e. 'html' points to 'head' and 'body').
Here is a simplified representation:
```
object
> node_name
> parent_node
> child_nodes[]
```
While the source document only stores the node name and the opening and closing brackets (i.e. `<html>`), a node stores the node name as well as references to the parent and child nodes. Each of which require memory.
## Example
Let's take the 'head' element and compare the source data with the object data.
This is the source data: `<head>` (6 Bytes)
The equivalent node (including references to parent and child nodes) has following data:
* Node Object (40 Bytes for the base object + 3 x 16 Bytes for properties = 88 Bytes) [^1]
* Node Name "head" (4 Bytes)
* Parent Node (unknown number of Bytes)
* Child Nodes (8 x 36 Bytes) [^2]
This amounts to 380 Bytes per object. A factor of 63 compared to the source data. With larger datasets this factor will be smaller, especially when taking content data into account.
A factor of ~30 compared to the source data is realistic for DOM parsers [^3]. If memory is a concern, consider using a SAX parser instead.
[^1]: [Objects in PHP 7](https://nikic.github.io/2015/06/19/Internal-value-representation-in-PHP-7-part-2.html#objects-in-php-7) by nikic
[^2]: [PHP's new hashtable implementation](https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html#memory-utilization) by nikic
[^3]: [Htlm Agility Pack Issue #77](https://github.com/zzzprojects/html-agility-pack/issues/77) by aktzpn
|