The Data Processor transformation processes unstructured and semi-structured file formats in a mapping. We can configure it to process HTML pages, XML, JSON, and PDF documents. We can also convert structured formats such as ACORD, HIPAA, HL7, EDI-X12, EDIFACT, AFP, and SWIFT. For example, if we have customer invoices in Microsoft Word files, we can configure a Data Processor transformation to parse the data from each Word file and extract the customer data to a Customer table and the order information to an Orders table.
The Data Processor transformation has the following options:
- Parser: converts source documents to XML. The output of a Parser is always XML. The input can have any format, such as text, HTML, Word, PDF, or HL7.
- Serializer: converts an XML file to an output document of any format. The output of a Serializer can be any format, such as a text document, an HTML document, or a PDF.
- Mapper: converts an XML source document to another XML structure or schema. You can convert the same XML documents as in an XMap.
- Transformer: modifies the data in any format. It adds, removes, converts, or changes text. Use Transformers with a Parser, Mapper, or Serializer. You can also run a Transformer as a stand-alone component.
- Streamer: splits large input documents, such as multi-gigabyte data streams, into segments. The Streamer processes documents that have multiple messages or records in them, such as HIPAA or EDI files.
In this blog, we will see how to extract data from a PDF document and create an XML file using the Data Processor transformation. Source documents that have a fixed page layout, such as bills, invoices, and account statements, can be parsed using the positional format to find the data fields. An anchor is a signpost that you place in a document, indicating the position of the data. The most commonly used anchors are Marker and Content anchors, and they are often used as a pair, as the sample document below illustrates:
- Marker anchor: labels a location in a document.
- Content anchor: retrieves text from the location. It stores the text that it extracts from a source document in a data holder.
I have a PDF document with employee data, as shown below:
FirstName: Chris
LastName: Boyd
Department: HR
StartDate: 2009-10-11
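In this document, each field label can serve as a Marker anchor, and the text that follows it is what the paired Content anchor retrieves: for example, the Marker anchor "FirstName:" locates the field, and its Content anchor extracts "Chris" into the FirstName data holder.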
You will also need an XML Schema Definition (XSD) file, which contains the target XML schema. The XSD file looks like this:
<?xml version="1.0" encoding="Windows-1252"?>
<xs:schema attributeFormDefault="unqualified"
           elementFormDefault="unqualified"
           xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="EmpPdf">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="FirstName" type="xs:string" />
        <xs:element name="LastName" type="xs:string" />
        <xs:element name="Department" type="xs:string" />
        <xs:element name="StartDate" type="xs:date" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
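For this schema, the Parser output for the sample employee document above should look roughly like the following (a sketch of the expected XML, with the values taken from the sample PDF):
<?xml version="1.0" encoding="Windows-1252"?>
<EmpPdf>
  <FirstName>Chris</FirstName>
  <LastName>Boyd</LastName>
  <Department>HR</Department>
  <StartDate>2009-10-11</StartDate>
</EmpPdf>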
Using the above data, I have explained in my video how to create a schema object and a Data Processor transformation, and how to use them in a mapping.
In case of any questions, feel free to leave comments on this page, and I will get back as soon as I can.
I asked this question on your YouTube video as well: "If my source is a PDF file, how do I create a data object for the PDF file from which I can read and feed the Data Processor transformation?" I am able to create the Data Processor transformation, but when I create a mapping for it, I am not able to create the XML/PDF input file that I have to connect to the Data Processor transformation.
First, for the PDF source file:
Place the PDF file in the Source folder of Informatica. Then create a *.dat file in the same folder and put the path to the PDF in that file, as I specified in the video, for example .\xyz.pdf. Then create a flat file data object for the *.dat file as a comma-delimited file.
For the XML target:
Create a flat file data object in Informatica, choose the "Create as Empty" option, and finish. Then open the flat file object that you created and choose the Write option. Select the target transformation and add a column of string data type. In the Input transformation, go to the run-time properties, and in the Output File Name change the *.out file to a *.xml file.
Use the above two as source and target; it should work.
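To illustrate, the entire *.dat pointer file contains just the single line below (using the example name from above); the flat file data object reads this path and passes it on to the Data Processor transformation, which should then open and parse the PDF it points to:
.\xyz.pdf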
Thanks a lot. It's working; I ran it successfully.
Hi Rajeev, can you help me with how you created the PDF source/XML target in more detail? I know the above is already detailed, but I am not able to run the mapping. Can you please help?
Hi Hema, the Data Processor is working fine, but my source PDF and target XML are not working. I don't see any error/output/anything. Can you please elaborate in detail?
Hi Hema,
Can you help with a scenario of extracting detail records from an invoice?
Ex:
Item Item_Price
abc 10
xyz 15
cbg 30
I would like to have the tags in the XML file as:
<Item>abc</Item> <Item_Price>10</Item_Price>
<Item>xyz</Item> <Item_Price>15</Item_Price>
<Item>cbg</Item> <Item_Price>30</Item_Price>
This is an awesome explanation of parsing unstructured data using the Data Processor transformation.
Hi,
I replicated the transformation per the video, but when the mapping is run I am getting an error: "server variable :[$PMExtProcDir], NULL override value not valid. The override value is set to empty string."
Can someone help on this, please?
Thanks
Hi,
I'm looking for options to write table data into a PDF file. Can that be achieved using a Serializer or any other Data Transformation component?
Hi Hema,
If we have multiple records, do we have to mark the Marker and Content anchors that many times, once per record?
This video is really good, but I am interested in the JSON format. Through the Parser we can convert JSON to XML, so the Data Processor is just converting one semi-structured format into another. Is there any way to extract data from JSON and store it in an RDBMS?
Hi Hema,
In my scenario we are generating JSON files by reading relational data. We are able to generate them, but when I open my JSON output file I am getting an empty line between the rows. Do we have any option to remove the empty lines? We are using the Data Processor transformation to generate the JSONs.
Can you explain how you achieved it? What is your Informatica version?
I'm getting my output as below:
[kistareddy_jageerapu@ausplcdhedge02 c360_test]$ head -10 FF_AEHUB_CNTCT_ADDR.json
{"RootEntity":"Party","SourceKey":"10100058998799793","Demographics":{"Address":[{"AddressType":"1000001","Key":"101100037127270092","AddressLineOne":"PULCHOWK","CityName":"LALITPUR","State":"BAGMATI","Country":"NP","StatusCode":"S6000","QualityCode":"Q4","FaultCode":"3000","PostalCode":"1000","ExtrcDts":"2016-05-20 00:02:00.000000"}]}}

{"RootEntity":"Party","SourceKey":"10100129158360493","Demographics":{"Address":[{"AddressType":"1000002","Key":"187100068680458694","AddressLineOne":"RETAILER WAL MART STORES","AddressLineTwo":"PO BOX 500787","CityName":"SAINT LOUIS","State":"MO","Country":"US","UndeliverableIndicator":"F","StatusCode":"S00000","PostalCode":"63150","ZipPlusFour":"0787","ExtrcDts":"2016-05-21 00:02:40.000000"}]}}
I am expecting it like below, without the empty line between rows:
[kistareddy_jageerapu@ausplcdhedge02 c360_test]$ head -10 FF_AEHUB_CNTCT_ADDR.json
{"RootEntity":"Party","SourceKey":"10100058998799793","Demographics":{"Address":[{"AddressType":"1000001","Key":"101100037127270092","AddressLineOne":"PULCHOWK","CityName":"LALITPUR","State":"BAGMATI","Country":"NP","StatusCode":"S6000","QualityCode":"Q4","FaultCode":"3000","PostalCode":"1000","ExtrcDts":"2016-05-20 00:02:00.000000"}]}}
{"RootEntity":"Party","SourceKey":"10100129158360493","Demographics":{"Address":[{"AddressType":"1000002","Key":"187100068680458694","AddressLineOne":"RETAILER WAL MART STORES","AddressLineTwo":"PO BOX 500787","CityName":"SAINT LOUIS","State":"MO","Country":"US","UndeliverableIndicator":"F","StatusCode":"S00000","PostalCode":"63150","ZipPlusFour":"0787","ExtrcDts":"2016-05-21 00:02:40.000000"}]}}
Hi,
Can you help me with how to read a JSON file and write it to a flat file?
Hi, can you please tell me if it is possible to read a PDF file residing in a directory as binary and store it in an Oracle column as a BLOB data type using the unstructured data transformation?
ReplyDeletei want informatica bdm course can u send me ur mail id
Hi, can anyone please help with loading multiple rows using the Parser transformation?
Hello,
Thank you for sharing such valuable information. You have made the Data Processor transformation very easy to understand.