apache pig - How can I modify my following Pig Latin script to perform set intersection efficiently?


I am new to Pig and am trying to run the following Pig script on our 5-node Hadoop cluster. The script gives me the set intersection of two columns in a relation.

register '/home/workspace/pig/setintersecudf.jar';
define inter com.cs.pig.setintersection();
a = load '/home/pig/pig-0.12.0/input/location.txt' as (location:chararray);
b = load '/home/pig/pig-0.12.0/input/location.txt' as (location:chararray);
c = cross a, b parallel 10;
c = distinct c;
d = foreach c generate $0, $1, inter($0, $1) as intersection;
e = filter d by intersection != '[]' parallel 10;
e = filter e by $0 != $1 parallel 10;
store e into '/home/documents/pig_output';

I have a 6 MB file that contains locations such as "san diego ca" or "san d ca". I want the third column to be the intersection of both, i.e. [san, ca]. The file has 321,372 records, and I have to take the cross of the two columns so that I can process each tuple one at a time.

As was pointed out to me, after the cross the 6 MB file translates to around 1.9 TB, and hence the job fails because of insufficient disk space.
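For a sense of where a figure like 1.9 TB can come from, here is a rough back-of-the-envelope sketch in plain Python (not Pig Latin). It assumes that each of the 321,372 records ends up paired with one full copy of the 6 MB file. The function names are made up for this illustration, and the real on-disk size depends on how Pig serializes tuples, so treat it as an order-of-magnitude estimate only.

```python
# Figures taken from the question.
RECORDS = 321_372   # lines in location.txt
FILE_MB = 6         # approximate size of the file in MB


def cross_pairs(n):
    """Number of tuples produced by crossing an n-row relation with itself."""
    return n * n


def cross_size_tb(n, file_mb):
    """Rough size in decimal TB if every row is paired with one full copy of the file."""
    return n * file_mb / 1_000_000  # 1 TB = 1,000,000 MB


if __name__ == "__main__":
    print(f"{cross_pairs(RECORDS):,} tuples")                  # 103,279,962,384 tuples
    print(f"about {cross_size_tb(RECORDS, FILE_MB):.1f} TB")   # about 1.9 TB
```

Since each crossed tuple carries both columns, the real intermediate data is probably larger still.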

What changes can I make to the script to make it run efficiently?

This is the error I am getting:

java.io.ioexception: org.apache.hadoop.ipc.remoteexception: java.io.ioexception: file /tmp/temp-10926921/tmp-1823693600/_temporary/_attempt_201401171541_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.fsnamesystem.getadditionalblock(fsnamesystem.java:1639)
    at org.apache.hadoop.hdfs.server.namenode.namenode.addblock(namenode.java:736)
    at sun.reflect.generatedmethodaccessor29.invoke(unknown source)
    at sun.reflect.delegatingmethodaccessorimpl.invoke(delegatingmethodaccessorimpl.java:25)
    at java.lang.reflect.method.invoke(method.java:597)
    at org.apache.hadoop.ipc.rpc$server.call(rpc.java:578)
    at org.apache.hadoop.ipc.server$handler$1.run(server.java:1393)
    at org.apache.hadoop.ipc.server$handler$1.run(server.java:1389)
    at java.security.accesscontroller.doprivileged(native method)
    at javax.security.auth.subject.doas(subject.java:396)
    at org.apache.hadoop.security.usergroupinformation.doas(usergroupinformation.java:1149)
    at org.apache.hadoop.ipc.server$handler.run(server.java:1387)
    at org.apache.pig.backend.hadoop.executionengine.mapreducelayer.piggenericmapreduce$reduce.runpipeline(piggenericmapreduce.java:469)
    at org.apache.pig.backend.hadoop.executionengine.mapreducelayer.piggenericmapreduce$reduce.processonepackageoutput(piggenericmapreduce.java:432)
    at org.apache.pig.backend.hadoop.executionengine.mapreducelayer.piggenericmapreduce$reduce.reduce(piggenericmapreduce.java:404)
    at org.apache.pig.backend.hadoop.executionengine.mapreducelayer.piggenericmapreduce$reduce.reduce(piggenericmapreduce.java:256)
    at org.apache.hadoop.mapreduce.reducer.run(reducer.java:176)
    at org.apache.hadoop.mapred.reducetask.runnewreducer(reducetask.java:650)
    at org.apache.hadoop.mapred.reducetask.run(reducetask.java:418)
    at org.apache.hadoop.mapred.child$4.run(child.java:255)
    at java.security.accesscontroller.doprivileged(native method)
    at javax.security.auth.subject.doas(subject.java:396)
    at org.apache.hadoop.security.usergroupinformation.doas(usergroupinformation.java:1149)
    at org.apache.hadoop.mapred.child.main(child.java:249)
caused by: org.apache.hadoop.ipc.remoteexception: java.io.ioexception: file /tmp/temp-10926921/tmp-1823693600/_temporary/_attempt_201401171541_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.fsnamesystem.getadditionalblock(fsnamesystem.java:1639)
    at org.apache.hadoop.hdfs.server.namenode.namenode.addblock(namenode.java:736)
    at sun.reflect.generatedmethodaccessor29.invoke(unknown source)
    at sun.reflect.delegatingmethodaccessorimpl.invoke(delegatingmethodaccessorimpl.java:25)
    at java.lang.reflect.method.invoke(method.java:597)
    at org.apache.hadoop.ipc.rpc$server.call(rpc.java:578)
    at org.apache.hadoop.ipc.server$handler$1.run(server.java:1393)
    at org.apache.hadoop.ipc.server$handler$1.run(server.java:1389)
    at java.security.accesscontroller.doprivileged(native method)
    at javax.security.auth.subject.doas(subject.java:396)
    at org.apache.hadoop.security.usergroupinformation.doas(usergroupinformation.java:1149)
    at org.apache.hadoop.ipc.server$handler.run(server.java:1387)
    at org.apache.hadoop.ipc.client.call(client.java:1107)
    at org.apache.hadoop.ipc.rpc$invoker.invoke(rpc.java:229)
    at $proxy2.addblock(unknown source)
    at sun.reflect.generatedmethodaccessor4.invoke(unknown source)
    at sun.reflect.delegatingmethodaccessorimpl.invoke(delegatingmethodaccessorimpl.java:25)
    at java.lang.reflect.method.invoke(method.java:597)
    at org.apache.hadoop.io.retry.retryinvocationhandler.invokemethod(retryinvocationhandler.java:85)
    at org.apache.hadoop.io.retry.retryinvocationhandler.invoke(retryinvocationhandler.java:62)
    at $proxy2.addblock(unknown source)
    at org.apache.hadoop.hdfs.dfsclient$dfsoutputstream.locatefollowingblock(dfsclient.java:3686)
    at org.apache.hadoop.hdfs.dfsclient$dfsoutputstream.nextblockoutputstream(dfsclient.java:3546)
    at org.apache.hadoop.hdfs.dfsclient$dfsoutputstream.access$2600(dfsclient.java:2749)
    at org.apache.hadoop.hdfs.dfsclient$dfsoutputstream$datastreamer.run(dfsclient.java:2989)

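Firstly, the DISTINCT after the CROSS is unlikely to do anything. DISTINCT respects order, so it will only get rid of tuples that are exactly the same: (x, y) and (y, x) are both kept. Exact duplicates should only occur if you have multiples of the same line in your input file.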
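The point about order can be illustrated with a small Python sketch (plain Python standing in for the CROSS and DISTINCT steps, not Pig itself). The helper name `cross_distinct` and the sample rows are made up for the illustration.

```python
from itertools import product


def cross_distinct(rows):
    """Cross rows with itself, then drop tuples that are exactly identical."""
    return set(product(rows, rows))


rows = ["san diego ca", "san d ca"]
pairs = cross_distinct(rows)
# Both orderings survive, because the tuples differ as ordered pairs.
assert ("san diego ca", "san d ca") in pairs
assert ("san d ca", "san diego ca") in pairs

# Only a repeated input line produces exact duplicates for distinct to remove.
with_dupe = rows + ["san d ca"]
assert len(list(product(with_dupe, with_dupe))) == 9
assert len(cross_distinct(with_dupe)) == 4
```

With unique input lines, the distinct step removes nothing and only adds an extra pass over the crossed data.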

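The filter that removes rows where $0 equals $1 (the second e = filter ... statement) should be done before you find the intersections of the lines. You are going to throw those rows out anyway, so you are going to want to do it as early as possible.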
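A minimal Python sketch of that ordering follows. The `intersect` function is a hypothetical stand-in for the `inter` UDF, whose source is not shown in the question; it is assumed here to return the words common to both strings. The sketch also returns how many times the stand-in was called, to show that self-pairs never reach it.

```python
from itertools import product


def intersect(left, right):
    """Hypothetical stand-in for the UDF: words of `left` that also occur in `right`."""
    right_words = set(right.split())
    return [word for word in left.split() if word in right_words]


def intersections(rows):
    """Drop self-pairs first, then compute the intersection for the remaining pairs."""
    result = []
    udf_calls = 0
    for left, right in product(rows, rows):
        if left == right:          # same check as $0 != $1, applied before the UDF
            continue
        udf_calls += 1
        common = intersect(left, right)
        if common:                 # same idea as intersection != '[]'
            result.append((left, right, common))
    return result, udf_calls


if __name__ == "__main__":
    found, calls = intersections(["san diego ca", "san d ca", "foo ct"])
    print(calls)   # 6 calls instead of 9
    for row in found:
        print(row)
```

The saving is small compared with the size of the cross itself, but the check is cheap and avoids calling the UDF on rows that would be discarded.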

To solve this, you are going to need to limit the CROSS somehow. I don't know what your data looks like, or what you are expecting as output, but grouping the data by state beforehand should dramatically cut down on the number of tuples the CROSS will make. However, this will not return the intersection for things like: foo ca and foo ct.
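As a rough illustration of the grouping idea, here is a Python sketch that only pairs locations sharing a state. It assumes the state is the last whitespace-separated word of each non-empty line, which may not hold for the real data; `state_of` and `pairs_within_state` are names made up for this example.

```python
from collections import defaultdict
from itertools import permutations


def state_of(location):
    """Assumption: the state is the last whitespace-separated word of a non-empty line."""
    return location.split()[-1]


def pairs_within_state(rows):
    """Build ordered pairs only between rows that share a state."""
    groups = defaultdict(list)
    for row in rows:
        groups[state_of(row)].append(row)
    pairs = []
    for members in groups.values():
        # Ordered pairs of different positions, so self-pairs are never produced.
        pairs.extend(permutations(members, 2))
    return pairs


if __name__ == "__main__":
    rows = ["san diego ca", "san d ca", "foo ca", "foo ct"]
    pairs = pairs_within_state(rows)
    print(len(pairs))                      # 6, versus 16 for the full cross
    print(("foo ca", "foo ct") in pairs)   # False: different states are never paired
```

The trade-off is the one already mentioned: pairs that straddle two states, such as foo ca and foo ct, are never compared.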

