Friday, February 6, 2015

Parsing Unstructured Data Using Data Processor Transformation in Informatica - PDF to XML


The Data Processor transformation processes unstructured and semi-structured file formats in a mapping. We can configure it to process HTML pages, XML, JSON, and PDF documents. We can also convert structured formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT. For example, if we have customer invoices in Microsoft Word files, we can configure a Data Processor transformation to parse the data from each Word file and extract the customer data to a Customer table and the order information to an Orders table.

The Data Processor transformation has the following components:
Parser converts source documents to XML. The output of a Parser is always XML. The input can have any format, such as text, HTML, Word, PDF, or HL7.
Serializer converts an XML file to an output document of any format. The output of a Serializer can be any format, such as a text document, an HTML document, or a PDF.
Mapper converts an XML source document to another XML structure or schema. You can convert the same XML documents as in an XMap.
Transformer modifies the data in any format. It adds, removes, converts, or changes text. Use Transformers with a Parser, Mapper, or Serializer. You can also run a Transformer as a stand-alone component.
Streamer splits large input documents, such as multi-gigabyte data streams, into segments. The Streamer processes documents that have multiple messages or records in them, such as HIPAA or EDI files.

In this blog, we will see how to extract data from a PDF document and create an XML file using the Data Processor transformation. Source documents that have a fixed page layout, such as bills, invoices, and account statements, can be parsed using the positional format to find the data fields. An anchor is a signpost that you place in a document, indicating the position of the data. The most commonly used anchors are Marker and Content anchors, and they are often used as a pair:
Marker anchor labels a location in a document.
Content anchor retrieves text from the location. It stores the text that it extracts from a source document in a data holder.
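Outside of Informatica, the Marker/Content pair can be pictured with a small sketch. This plain-Python function is only an illustration of the idea, not the Data Processor engine: the marker locates a label in the page text, and the content anchor captures the value that follows it.

```python
# Illustrative sketch of a Marker/Content anchor pair (plain Python,
# not Informatica's engine).
def extract_field(page_text: str, marker: str) -> str:
    start = page_text.index(marker) + len(marker)  # Marker: find the label
    end = page_text.find("\n", start)              # Content ends at the line break
    if end == -1:
        end = len(page_text)
    return page_text[start:end].strip()            # Content: the extracted value

page = "FirstName: Chris\nLastname: Boyd\nDepartment: HR\nStartDate: 2009-10-11"
print(extract_field(page, "Department:"))  # -> HR
```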
 
I have a PDF document with employee data as shown below:

FirstName: Chris
Lastname: Boyd
Department: HR
StartDate: 2009-10-11

You will also need an XML Schema Definition (XSD) file that contains the target XML schema. The XSD file looks like this:

<?xml version="1.0" encoding="Windows-1252"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="EmpPdf">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FirstName" type="xs:string" />
        <xs:element name="LastName" type="xs:string" />
        <xs:element name="Department" type="xs:string" />
        <xs:element name="StartDate" type="xs:date" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
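To see what the Parser's output should look like, here is a sketch that builds a document conforming to the schema above with Python's standard xml.etree.ElementTree. The field values are the sample employee data; this only illustrates the target shape and is not the transformation itself.

```python
import xml.etree.ElementTree as ET

# Build an XML instance matching the EmpPdf schema above from the
# fields parsed out of the sample PDF.
fields = {"FirstName": "Chris", "LastName": "Boyd",
          "Department": "HR", "StartDate": "2009-10-11"}

root = ET.Element("EmpPdf")
for name in ("FirstName", "LastName", "Department", "StartDate"):
    ET.SubElement(root, name).text = fields[name]

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)  # a single <EmpPdf> element containing the four fields
```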

Using the above data, my video explains how to create a schema object and a Data Processor transformation and use them in a mapping.



 * In case of any questions, feel free to leave comments on this page and I will get back to you as soon as I can.

40 comments:

  1. I asked this question on your YouTube video as well: "If my source is a PDF file, how do I create a data object for the PDF file from which I can read and feed the Data Processor transformation?" I am able to create the Data Processor transformation, but when I create a mapping for it, I am not able to create the XML/PDF input file that I have to connect to the Data Processor transformation.

  2. First, for the PDF source file:
    Place the PDF file in the Source folder of Informatica.
    Then create a *.dat file in the same folder and give the path of the PDF in that file, as I specified in the video, for example .\xyz.pdf
    Then create a flat-file data object for the *.dat file as a comma-delimited file.

    For the XML target:
    Create a flat-file data object in Informatica, choose the "Create as Empty" option, and finish.
    Then open the flat-file object that you created and choose the Write option. Select the target transformation and add a column of string data type. In the Input transformation, go to the run-time properties and, in the Output File Name, change the *.out file to a *.xml file.

    Use the above two as source and target, and it should work.
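    The pointer-file step above can be sketched like this (the file names are only examples):

```shell
# The *.dat file is just a one-line pointer whose content is the
# relative path of the PDF (file names here are examples).
printf '.\\xyz.pdf\n' > emp_source.dat
cat emp_source.dat
```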

    1. Thanks a lot. It's working. Ran it successfully.

    2. Hi Rajeev, can you explain in more detail how you created the PDF source/XML target? I know the steps above are detailed, but I am not able to run the mapping. Can you please help?

  3. Hi Hema, the Data Processor is working fine, but my source PDF and target XML are not working. I don't see any error, output, or anything. Can you please elaborate in detail?

  4. Hi Hema,

    Can you help with a scenario of extracting detail records from an invoice?

    Ex:

    Item Item_Price
    abc 10
    xyz 15
    cbg 30

    I would like to have the tags in the XML file as
    abc
    10
    xyz
    15
    cbg
    30

  7. Hi,
    I replicated the transformation per the video, but when the mapping is run I get the error "server variable :[$PMExtProcDir], NULL override value not valid. The override value is set to empty string.".

    Can someone help on this, please?

    Thanks

  8. Hi,

    Looking for options to write table data into a PDF file. Can that be achieved using a Serializer or any other Data Transformation component?

  9. Hi Hema,

    If we have multiple records, do we have to mark the Marker and Content anchors that many times, once per record?

  10. This video is really good, but I am interested in the JSON format. Through the Parser we can convert JSON to XML, so the Data Processor just converts from one semi-structured format to another. Is there any way to extract data from JSON and store it into an RDBMS?

  11. Hi Hema,

    In my scenario, we are generating JSON files by reading relational data. We are able to generate them, but when I open my JSON output file I am getting an empty line between each row. Do we have any option to remove the empty lines? We are using the Data Processor transformation to generate the JSONs.

    1. Can you explain how you achieved it? What is your Informatica version?

  12. I'm getting my output as below:
    [kistareddy_jageerapu@ausplcdhedge02 c360_test]$ head -10 FF_AEHUB_CNTCT_ADDR.json
    {"RootEntity":"Party","SourceKey":"10100058998799793","Demographics":{"Address":[{"AddressType":"1000001","Key":"101100037127270092","AddressLineOne":"PULCHOWK","CityName":"LALITPUR","State":"BAGMATI","Country":"NP","StatusCode":"S6000","QualityCode":"Q4","FaultCode":"3000","PostalCode":"1000","ExtrcDts":"2016-05-20 00:02:00.000000"}]}}

    {"RootEntity":"Party","SourceKey":"10100129158360493","Demographics":{"Address":[{"AddressType":"1000002","Key":"187100068680458694","AddressLineOne":"RETAILER WAL MART STORES","AddressLineTwo":"PO BOX 500787","CityName":"SAINT LOUIS","State":"MO","Country":"US","UndeliverableIndicator":"F","StatusCode":"S00000","PostalCode":"63150","ZipPlusFour":"0787","ExtrcDts":"2016-05-21 00:02:40.000000"}]}}

    I am expecting it like below:

    [kistareddy_jageerapu@ausplcdhedge02 c360_test]$ head -10 FF_AEHUB_CNTCT_ADDR.json
    {"RootEntity":"Party","SourceKey":"10100058998799793","Demographics":{"Address":[{"AddressType":"1000001","Key":"101100037127270092","AddressLineOne":"PULCHOWK","CityName":"LALITPUR","State":"BAGMATI","Country":"NP","StatusCode":"S6000","QualityCode":"Q4","FaultCode":"3000","PostalCode":"1000","ExtrcDts":"2016-05-20 00:02:00.000000"}]}}
    {"RootEntity":"Party","SourceKey":"10100129158360493","Demographics":{"Address":[{"AddressType":"1000002","Key":"187100068680458694","AddressLineOne":"RETAILER WAL MART STORES","AddressLineTwo":"PO BOX 500787","CityName":"SAINT LOUIS","State":"MO","Country":"US","UndeliverableIndicator":"F","StatusCode":"S00000","PostalCode":"63150","ZipPlusFour":"0787","ExtrcDts":"2016-05-21 00:02:40.000000"}]}}
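    One way to drop those empty lines is a small post-processing step after the session writes the file; a minimal Python sketch (the file names are hypothetical):

```python
# Strip empty lines from a JSON-lines output file so each record sits
# on its own line with no blank line in between (names are examples).
def strip_blank_lines(in_path: str, out_path: str) -> None:
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():          # keep only non-empty lines
                dst.write(line)
```

    Alternatively, a one-liner such as `sed -i '/^$/d' file.json` does the same on the command line.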

  13. Hi,

    Can you help me with how to read a JSON file and write it to a flat file?

  14. Hi, can you please tell me if it is possible to read a PDF file residing in a directory as binary and store it in an Oracle column as a BLOB data type using the Unstructured Data transformation?

  22. I want the Informatica BDM course. Can you send me your mail ID?

  25. Hi, please can anyone help with loading multiple rows using the Parser transformation?

  33. Hello,

    Thank you for sharing such valuable information. You have made the Data Processor transformation so easy to read about.
