java - DOM vs SAX XML parsing for large files - Stack Overflow

Background:

I have a large OWL (Web Ontology Language) file (approximately 125MB, or 1.5 million lines) that I would like to parse into a set of tab-delimited values. I have been researching the SAX and DOM XML parsers, and found the following:

  • SAX allows for the document to be read node by node, so the whole document is not in memory.
  • DOM allows for the whole document to be placed in memory at once, but has a ridiculous amount of overhead.

SAX vs DOM for large files:

As far as I understand it,

  • If I use SAX, I would have to iterate through 1.5 million lines, node by node.
  • If I use DOM, I would have a big overhead, but then the results would be returned rapidly.

Problem:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know any good parsers for JavaScript? I realize many are made for Java, but I am much more comfortable with JavaScript.

Asked Jun 26, 2013 at 2:21 by Shrey Gupta; edited Jun 20, 2020 at 9:12.
  • I think DOM vs SAX is a somewhat narrow comparison, because there are so many other libs that are much better than either one of them. – vtd-xml-author Commented Jul 19, 2013 at 2:41

3 Answers

Meet StAX

Just like SAX, StAX follows a streaming programming model for parsing XML. However, it is a cross between DOM's bidirectional read/write support and ease of use, and SAX's CPU and memory efficiency.

SAX is read-only and does push parsing, forcing you to handle events and errors right there and then while parsing the input. StAX, on the other hand, is a pull parser that lets the client call methods on the parser when needed. This also means that the application can read multiple XML files simultaneously.

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even better performance than SAX.
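To illustrate the point that a pull parser need not read the whole document, here is a minimal sketch using the StAX cursor API (`XMLStreamReader`). The element name `label` and the class and method names are assumptions made up for the example, not anything from the OWL file in the question. Because the caller drives the loop, returning early simply stops pulling events.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxEarlyExit {
    // Returns the text of the first <label> element, or null if there is none.
    // "label" is a hypothetical element name chosen for this sketch.
    static String firstLabel(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "label".equals(reader.getLocalName())) {
                        // getElementText() reads up to the matching end tag;
                        // returning here means no further events are pulled.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><item><label>first</label></item>"
                   + "<item><label>second</label></item></root>";
        System.out.println(firstLabel(xml));
    }
}
```

For a real file, the `StringReader` would be replaced by a `FileInputStream` or similar; the loop itself stays the same.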

You want SAX, most likely.

DOM is not necessarily faster; it might well be slower, if it works at all, and, as you say, you would need to hold a LOT in memory, probably needlessly.
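As a rough sketch of what the SAX route looks like, the handler below emits one tab-delimited row per element while holding only the current element's text in memory. The element name `record`, its `id` attribute, and the class names are hypothetical placeholders; a real OWL file would need its own element and attribute names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxToTsv {
    // Builds one "id<TAB>text" row per <record> element (hypothetical name).
    static final class RowHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private final StringBuilder text = new StringBuilder();
        private String id;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("record".equals(qName)) {
                id = atts.getValue("id");
                text.setLength(0); // start collecting text for this record
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("record".equals(qName)) {
                out.append(id).append('\t').append(text.toString().trim()).append('\n');
            }
        }
    }

    static String toTsv(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            RowHandler handler = new RowHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(toTsv(
            "<data><record id=\"a1\">Alpha</record><record id=\"b2\">Beta</record></data>"));
    }
}
```

For a 125MB input, the handler would more likely write each row straight to a `Writer` instead of accumulating a `StringBuilder`, and the parser would be given a `File` or `InputStream`.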

OWL XML syntax is reasonably flat, but contains lots of cross-references.

If you need to resolve the cross-references, then a streaming approach (like SAX or StAX) isn't feasible; you will need to build a data structure in memory that holds the whole tree. If you're going to use an in-memory tree, don't use DOM, use one of the more modern models such as JDOM2 or XOM - they are more efficient and more usable.

If a streaming approach is feasible - that is, if there's a very direct correspondence between your input and output - then StAX is easier to work with than SAX because you can save the current state in variables on the Java stack, rather than needing complex data structures to maintain state between calls.
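The "state in variables on the Java stack" point can be sketched as follows: a helper method consumes one element and keeps everything it has seen so far in local variables, so no handler fields are needed. The element and attribute names (`Class`, `name`, `subClassOf`, `ref`, `label`) are simplified stand-ins invented for this example, not actual OWL/RDF syntax, and the sketch assumes `Class` elements are not nested.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxToTsv {
    static String toTsv(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "Class".equals(r.getLocalName())) {
                    readClass(r, out);
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return out.toString();
    }

    // Entered with the reader on <Class>; returns with it on the matching </Class>.
    // All per-element state lives in local variables.
    private static void readClass(XMLStreamReader r, StringBuilder out)
            throws XMLStreamException {
        String name = r.getAttributeValue(null, "name");
        String parent = "";
        String label = "";
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if ("subClassOf".equals(r.getLocalName())) {
                    parent = r.getAttributeValue(null, "ref");
                } else if ("label".equals(r.getLocalName())) {
                    label = r.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "Class".equals(r.getLocalName())) {
                break;
            }
        }
        out.append(name).append('\t').append(parent).append('\t').append(label).append('\n');
    }

    public static void main(String[] args) {
        System.out.print(toTsv(
            "<ontology><Class name=\"Dog\"><subClassOf ref=\"Animal\"/>"
            + "<label>A dog</label></Class></ontology>"));
    }
}
```

The equivalent SAX handler would need fields for `name`, `parent`, and `label` plus logic to reset them between elements, which is the bookkeeping the answer refers to.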

However, there's an alternative; you could write the whole thing in streaming XSLT 3.0. To be honest, this is bleeding edge and your learning time would probably be a lot greater; and it's not open-source; but you might well end up with a solution in 10 lines of code rather than 300.

There are other streaming technologies I haven't tried, like XStream.
