apache pig - How can I modify my Pig Latin script to perform set intersection efficiently?
I am new to Pig and am trying to run the following Pig script on our 5-node Hadoop cluster. The script gives me the set intersection of two columns in a relation:
    register '/home/workspace/pig/setintersecudf.jar';
    define inter com.cs.pig.setintersection();
    a = load '/home/pig/pig-0.12.0/input/location.txt' as (location:chararray);
    b = load '/home/pig/pig-0.12.0/input/location.txt' as (location:chararray);
    c = cross a, b parallel 10;
    c = distinct c;
    d = foreach c generate $0, $1, inter($0, $1) as intersection;
    e = filter d by intersection != '[]' parallel 10;
    e = filter e by $0 != $1 parallel 10;
    store e into '/home/documents/pig_output';
I have a 6 MB file that contains locations such as "san diego ca" or "san d ca". I want a third column with the intersection of the two, i.e. [san, ca]. The file has 321,372 records, and I have to take the cross of the two columns so that I can process each pair of tuples at a time.
As was pointed out to me, the cross of that 6 MB file with itself translates to around 1.9 TB (each of the 321,372 records gets paired with the full 6 MB file), and hence the job fails because of insufficient disk space.
What changes can I make to the script to make it run efficiently?
The following is the error I am getting:
    java.io.IOException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/temp-10926921/tmp-1823693600/_temporary/_attempt_201401171541_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1639)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:736)
        at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:469)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:404)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
    Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/temp-10926921/tmp-1823693600/_temporary/_attempt_201401171541_0001_r_000000_0/part-r-00000 could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1639)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:736)
        at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
        at org.apache.hadoop.ipc.Client.call(Client.java:1107)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
        at $Proxy2.addBlock(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
        at $Proxy2.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3686)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3546)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2749)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2989)
Firstly, the distinct after the cross is unlikely to do anything. distinct respects order, so it only gets rid of tuples that are exactly the same. That should only be the case if you have multiples of the same line in your input file.
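If the point of the distinct is to deduplicate input lines, a cheaper option is to deduplicate before the cross, so duplicates are never multiplied. A sketch, reusing the aliases from the question:

```
-- Deduplicate the input lines first; the cross then only ever sees
-- unique locations instead of multiplying any duplicates into the
-- huge intermediate output.
a = load '/home/pig/pig-0.12.0/input/location.txt' as (location:chararray);
a = distinct a;
b = foreach a generate location;  -- second copy of the deduplicated relation
c = cross a, b parallel 10;
```

This keeps the semantics for distinct inputs while removing the useless distinct on the cross output.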
The e = filter e by $0 != $1 ... should be done before you find the intersections of the lines. Since you are going to throw those tuples out anyways, you want to do it as early as possible.
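Concretely, the reordering could look like this (a sketch using the aliases and the `inter` UDF from the question):

```
c = cross a, b parallel 10;
-- Drop identical pairs *before* invoking the UDF, so the expensive
-- intersection is never computed for tuples that would be discarded.
c = filter c by $0 != $1 parallel 10;
d = foreach c generate $0, $1, inter($0, $1) as intersection;
e = filter d by intersection != '[]' parallel 10;
```

The output is the same; only the per-tuple work on doomed pairs is avoided.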
To solve this, you are going to need to limit the cross somehow. I don't know exactly what your data looks like, or what you are expecting as output, but grouping the data by state beforehand should cut down the number of tuples the cross makes dramatically. However, it will not return the intersection of things like foo ca and foo ct.
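As a sketch of the grouping idea, assuming the state is the last token of each location (as in "san diego ca"): extract the state as a key and self-join on it instead of taking a full cross, then apply the UDF. Only locations within the same state are compared, which is exactly why pairs like foo ca / foo ct are lost:

```
-- Assumes the 'inter' UDF is registered and defined as in the question.
a = load '/home/pig/pig-0.12.0/input/location.txt' as (location:chararray);
-- Extract the state (assumed to be the last word) as a join key.
a2 = foreach a generate location,
        REGEX_EXTRACT(location, '.*\\s(\\w+)$', 1) as state;
b2 = foreach a2 generate location, state;
-- A self-join on state produces only same-state pairs, a tiny
-- fraction of the full 321,372 x 321,372 cross.
j = join a2 by state, b2 by state parallel 10;
d = foreach j generate a2::location, b2::location,
        inter(a2::location, b2::location) as intersection;
e = filter d by a2::location != b2::location and intersection != '[]';
```

The state-extraction regex is an assumption about the data format; any key that partitions the records into small groups would serve the same purpose.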