Disclosed herein are a system and method for Natural Language Processing (NLP) of real-world documents. The system and method combine models not previously combined and overcome the challenges of that combination. The models include an encoder-decoder model, a spatial model, and a multi-modal model. An iterative training process receives documents and generates outputs, wherein the iterative training process enables information retrieval from the documents without training data.