US Patent 9053206 Method and system of extracting web page information

A method of extracting web page information includes analyzing a document object model (DOM) structure of a sample page to obtain a position of information to be extracted. The node corresponding to that position in the DOM structure is designated as a target node. Starting from the target node, relative position information is traversed recursively until the root node is reached, creating candidate paths. The candidate paths are collected into a path set. A DOM structure of a page to be extracted is analyzed, information is located in the DOM structure of the page starting from the root node of each path in the path set, and an extracted node candidate set is obtained. The node having the highest robustness in the extracted node candidate set is selected as the final extracted node, and the extracted information is obtained using that node.
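The steps in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the helper names (steps_to_root, candidate_paths, extract), the three path styles, and the sample pages are all hypothetical. The abstract does not define "robustness", so the sketch assumes a simple stand-in: the node matched by the most candidate paths wins. Python's xml.etree.ElementTree stands in for a DOM parser, so the pages must be well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def steps_to_root(root, target):
    """Walk from the target node up to the root, recording relative position info."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps, node = [], target
    while node is not root:
        parent = parents[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # (tag, 1-based position among same-tag siblings, class attribute or None)
        steps.append((node.tag, same_tag.index(node) + 1, node.get("class")))
        node = parent
    return steps[::-1]  # reorder so the path reads root -> target

def candidate_paths(steps):
    """Build a path set; the three path styles are assumptions for this sketch."""
    def by_class(tag, cls):
        return f"{tag}[@class='{cls}']" if cls else tag
    positional = "./" + "/".join(f"{tag}[{pos}]" for tag, pos, _ in steps)
    class_based = "./" + "/".join(by_class(tag, cls) for tag, _, cls in steps)
    tag, _, cls = steps[-1]
    tail_only = ".//" + by_class(tag, cls)
    return [positional, class_based, tail_only]

def extract(page_root, paths):
    """Locate candidates from the root; pick the node matched by the most paths."""
    votes = Counter()
    for path in paths:
        for node in page_root.findall(path):
            votes[node] += 1  # assumed robustness score: number of agreeing paths
    if not votes:
        return None
    best, _ = votes.most_common(1)[0]
    return best

# Hypothetical sample page and a later page whose layout has shifted.
SAMPLE = ("<html><body><div class='nav'>Home</div>"
          "<div class='item'><span class='price'>10</span></div></body></html>")
NEW = ("<html><body><div class='ad'>Buy</div><div class='nav'>Home</div>"
       "<div class='item'><span class='price'>25</span></div></body></html>")

sample_root = ET.fromstring(SAMPLE)
target = sample_root.find(".//span[@class='price']")  # position assumed known
paths = candidate_paths(steps_to_root(sample_root, target))
result = extract(ET.fromstring(NEW), paths)
print(result.text)
```

In this example the purely positional path no longer matches on the new page because an extra div was inserted, while the two class-based paths still agree on the same node. That is the motivation for keeping several candidate paths rather than a single one.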
