Friday, February 6, 2015

Parsing Unstructured Data Using Data Processor Transformation in Informatica - PDF to XML


The Data Processor transformation processes unstructured and semi-structured file formats in a mapping. We can configure it to process HTML pages, XML, JSON, and PDF documents. We can also convert structured formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT. For example, if we have customer invoices in Microsoft Word files, we can configure a Data Processor transformation to parse the data from each Word file and extract the customer data to a Customer table and the order information to an Orders table.

The Data Processor transformation has the following components:
Parser converts source documents to XML. The output of a Parser is always XML. The input can have any format, such as text, HTML, Word, PDF, or HL7.
Serializer converts an XML file to an output document of any format. The output of a Serializer can be any format, such as a text document, an HTML document, or a PDF.
Mapper converts an XML source document to another XML structure or schema. You can convert the same XML documents as in an XMap.
Transformer modifies the data in any format. It adds, removes, converts, or changes text. Use Transformers with a Parser, Mapper, or Serializer. You can also run a Transformer as a stand-alone component.
Streamer splits large input documents, such as multi-gigabyte data streams, into segments. The Streamer processes documents that have multiple messages or records in them, such as HIPAA or EDI files.

In this blog, we will see how to extract data from a PDF document and create an XML file using the Data Processor transformation. Source documents that have a fixed page layout, such as bills, invoices, and account statements, can be parsed using the positional format to find the data fields. An anchor is a signpost that you place in a document, indicating the position of the data. The most commonly used anchors are Marker and Content anchors, and they are often used as a pair:
Marker anchor labels a location in a document.
Content anchor retrieves text from the location. It stores the text that it extracts from a source document in a data holder.
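Outside of Informatica, the Marker/Content pair can be pictured with a small sketch. This plain-Python function is only an illustration of the idea, not the Data Processor engine: the marker locates a label in the page text, and the content anchor captures the value that follows it.

```python
# Illustrative sketch of a Marker/Content anchor pair (plain Python,
# not Informatica's engine).
def extract_field(page_text: str, marker: str) -> str:
    start = page_text.index(marker) + len(marker)  # Marker: find the label
    end = page_text.find("\n", start)              # Content ends at the line break
    if end == -1:
        end = len(page_text)
    return page_text[start:end].strip()            # Content: the extracted value

page = "FirstName: Chris\nLastname: Boyd\nDepartment: HR\nStartDate: 2009-10-11"
print(extract_field(page, "Department:"))  # -> HR
```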
 
I have a PDF document with employee data as shown below:

FirstName: Chris
Lastname: Boyd
Department: HR
StartDate: 2009-10-11

You will also need an XML Schema Definition (XSD) file that contains the target XML schema. The XSD file looks like this:

<?xml version="1.0" encoding="Windows-1252"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="unqualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="EmpPdf">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FirstName" type="xs:string" />
        <xs:element name="LastName" type="xs:string" />
        <xs:element name="Department" type="xs:string" />
        <xs:element name="StartDate" type="xs:date" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
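To see what the Parser's output should look like, here is a sketch that builds a document conforming to the schema above with Python's standard xml.etree.ElementTree. The field values are the sample employee data; this only illustrates the target shape and is not the transformation itself.

```python
import xml.etree.ElementTree as ET

# Build an XML instance matching the EmpPdf schema above from the
# fields parsed out of the sample PDF.
fields = {"FirstName": "Chris", "LastName": "Boyd",
          "Department": "HR", "StartDate": "2009-10-11"}

root = ET.Element("EmpPdf")
for name in ("FirstName", "LastName", "Department", "StartDate"):
    ET.SubElement(root, name).text = fields[name]

xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)  # a single <EmpPdf> element containing the four fields
```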

Using the above data, my video explains how to create a schema object and a Data Processor transformation and use them in a mapping.



 * In case of any questions, feel free to leave comments on this page and I will get back to you as soon as I can.

40 comments:

  1. I asked this question on your YouTube video as well: "If my source is a PDF file, how do I create a data object for the PDF file from which I can read and feed the Data Processor transformation?" I am able to create the Data Processor transformation, but when I create a mapping for it, I am not able to create the XML/PDF input file that I have to connect to the Data Processor transformation.

  2. First, for the PDF source file:
    Place the PDF file in the Source folder of Informatica.
    Then create a *.dat file in the same folder and give the path of the PDF in that file, as I specified in the video, for example .\xyz.pdf
    Then create a flat-file data object for the *.dat file as a comma-delimited file.

    For the XML target:
    Create a flat-file data object in Informatica, choose the "Create as Empty" option, and finish.
    Then open the flat-file object that you created and choose the Write option. Select the target transformation and add a column of string data type. In the Input transformation, go to the run-time properties and, in the Output File Name, change the *.out file to a *.xml file.

    Use the above two as source and target, and it should work.
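    The pointer-file step above can be sketched like this (the file names are only examples):

```shell
# The *.dat file is just a one-line pointer whose content is the
# relative path of the PDF (file names here are examples).
printf '.\\xyz.pdf\n' > emp_source.dat
cat emp_source.dat
```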

    1. Thanks a lot. It's working. Ran it successfully.

    2. Hi Rajeev, can you explain in more detail how you created the PDF source/XML target? I know the steps above are detailed, but I am not able to run the mapping. Can you please help?

  3. Hi Hema, the Data Processor is working fine, but my source PDF and target XML are not working. I don't see any error, output, or anything. Can you please elaborate in detail?

  4. Hi Hema,

    Can you help with a scenario of extracting detail records from an invoice?

    Ex:

    Item Item_Price
    abc 10
    xyz 15
    cbg 30

    I would like to have the tags in the XML file as
    abc
    10
    xyz
    15
    cbg
    30

  7. Hi,
    I replicated the transformation per the video, but when the mapping is run I get the error "server variable :[$PMExtProcDir], NULL override value not valid. The override value is set to empty string.".

    Can someone help on this, please?

    Thanks

  8. Hi,

    Looking for options to write table data into a PDF file. Can that be achieved using a Serializer or any other Data Transformation component?

  9. Hi Hema,

    If we have multiple records, do we have to mark the Marker and Content anchors that many times, once per record?

  10. This video is really good, but I am interested in the JSON format. Through the Parser we can convert JSON to XML, so the Data Processor just converts from one semi-structured format to another. Is there any way to extract data from JSON and store it into an RDBMS?

  11. Hi Hema,

    In my scenario, we are generating JSON files by reading relational data. We are able to generate them, but when I open my JSON output file I am getting an empty line between each row. Do we have any option to remove the empty lines? We are using the Data Processor transformation to generate the JSONs.

    1. Can you explain how you achieved it? What is your Informatica version?

  12. I'm getting my output as below:
    [kistareddy_jageerapu@ausplcdhedge02 c360_test]$ head -10 FF_AEHUB_CNTCT_ADDR.json
    {"RootEntity":"Party","SourceKey":"10100058998799793","Demographics":{"Address":[{"AddressType":"1000001","Key":"101100037127270092","AddressLineOne":"PULCHOWK","CityName":"LALITPUR","State":"BAGMATI","Country":"NP","StatusCode":"S6000","QualityCode":"Q4","FaultCode":"3000","PostalCode":"1000","ExtrcDts":"2016-05-20 00:02:00.000000"}]}}

    {"RootEntity":"Party","SourceKey":"10100129158360493","Demographics":{"Address":[{"AddressType":"1000002","Key":"187100068680458694","AddressLineOne":"RETAILER WAL MART STORES","AddressLineTwo":"PO BOX 500787","CityName":"SAINT LOUIS","State":"MO","Country":"US","UndeliverableIndicator":"F","StatusCode":"S00000","PostalCode":"63150","ZipPlusFour":"0787","ExtrcDts":"2016-05-21 00:02:40.000000"}]}}

    I am expecting it like below:

    [kistareddy_jageerapu@ausplcdhedge02 c360_test]$ head -10 FF_AEHUB_CNTCT_ADDR.json
    {"RootEntity":"Party","SourceKey":"10100058998799793","Demographics":{"Address":[{"AddressType":"1000001","Key":"101100037127270092","AddressLineOne":"PULCHOWK","CityName":"LALITPUR","State":"BAGMATI","Country":"NP","StatusCode":"S6000","QualityCode":"Q4","FaultCode":"3000","PostalCode":"1000","ExtrcDts":"2016-05-20 00:02:00.000000"}]}}
    {"RootEntity":"Party","SourceKey":"10100129158360493","Demographics":{"Address":[{"AddressType":"1000002","Key":"187100068680458694","AddressLineOne":"RETAILER WAL MART STORES","AddressLineTwo":"PO BOX 500787","CityName":"SAINT LOUIS","State":"MO","Country":"US","UndeliverableIndicator":"F","StatusCode":"S00000","PostalCode":"63150","ZipPlusFour":"0787","ExtrcDts":"2016-05-21 00:02:40.000000"}]}}
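    One way to drop those empty lines is a small post-processing step after the session writes the file; a minimal Python sketch (the file names are hypothetical):

```python
# Strip empty lines from a JSON-lines output file so each record sits
# on its own line with no blank line in between (names are examples).
def strip_blank_lines(in_path: str, out_path: str) -> None:
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():          # keep only non-empty lines
                dst.write(line)
```

    Alternatively, a one-liner such as `sed -i '/^$/d' file.json` does the same on the command line.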

  13. Hi,

    Can you help me with how to read a JSON file and write it to a flat file?

  14. Hi, can you please tell me if it is possible to read a PDF file residing in a directory as binary and store it in an Oracle column as a BLOB data type using the Unstructured Data transformation?

  22. I want the Informatica BDM course. Can you send me your mail ID?

  25. Hi, please can anyone help with loading multiple rows using the Parser transformation?

  33. Hello,

    Thank you for sharing such valuable information. You have made the Data Processor transformation so easy to read about.
