Pig doesn’t support scalar variable assignment. That is, you cannot write a statement like this:
var = 3
The smallest unit an alias can hold is a relation with a single tuple containing a single value, written informally as
var = {3}
So, say that you have a relation X containing two columns,
(word1,1) (word2,4) (word3,14)
and you need to do some math on the second column using the value stored in var above.
The following statement won’t work:
result = FOREACH X GENERATE $1*var;
Instead, you need to join the two relations so that every row of X gains an additional column containing the value from var. You need to produce the following data before proceeding with your calculation:
(word1,1,3) (word2,4,3) (word3,14,3)
To accomplish this, you need to do the following:
temp = JOIN X BY 1, var BY 1 USING 'replicated';
Now you can do your math operation:
result = FOREACH temp GENERATE $1*$2;
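Put together, the steps above might look like the sketch below. The file names (words.txt, var.txt), the field names and the comma delimiter are assumptions made up for illustration; only the JOIN and the final FOREACH come from the statements above. I have also kept the word column in the output, which the one-line version above does not do.

```pig
-- Hypothetical input files: replace the paths and schemas with your own.
-- words.txt is assumed to hold comma-separated rows such as: word1,1
X = LOAD 'words.txt' USING PigStorage(',') AS (word:chararray, num:int);
-- var.txt is assumed to hold one row with the single value 3
var = LOAD 'var.txt' AS (val:int);

-- Join on the constant 1 so the single row of var is appended to every row of X.
-- With 'replicated', Pig keeps the last relation listed (var) in memory.
temp = JOIN X BY 1, var BY 1 USING 'replicated';

-- $0 is the word, $1 is the number from X, $2 is the value that came from var.
result = FOREACH temp GENERATE $0, $1 * $2;
DUMP result;
```

The replicated join suits this case because var is tiny: it only needs to fit in memory on each map task, and X can be as large as you like.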
Something like,
result = FOREACH X GENERATE $1*var.$0 …
should work I guess.
Thanks a lot for this article. You saved my day… Will share it on my blog as well.