ggplot2 - R, ggplot, separate mean by range of x value
I have a set of data that looks like this:

      chrom   pos gt diff
    1 chr01 14653 ct  254
    2 chr01 14907 ag  254
    3 chr01 14930 ag   23
    4 chr01 15190 ga  260
    5 chr01 15211 tg   21
    6 chr01 16378 tc 1167

where pos ranges from 1xxxx to 1xxxxxxx, and chrom is a categorical variable that contains the values "chr01" to "chr22" and "chrx".
I want to plot a scatterplot:
- y (diff) vs. x (pos)
- with panels separated by chrom
- grouped by gt (different colors for each gt)
I'm creating a ggplot with a running average (though this is not time series data).
What I want is the average of diff over every 1,000,000 range of pos, for each gt.
For example,
for x in range(1 ~ 1,000,000), diff average = _____
for x in range(1,000,001 ~ 2,000,000), diff average = _____
and I want to plot these averages as horizontal lines on the ggplot (coloured by gt).
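What I have so far, before applying the function:
[figure: scatterplot before applying the function]
After applying the function:
[figure: scatterplot after applying the function]
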
I tried to apply a solution I have, but here are the problems:
- There are different panels, and the mean values differ from panel to panel, but when I apply the code, the horizontal mean lines in every panel are identical to those of the first panel.
- The panels have different x-axis ranges, and when I apply the function, it automatically fills out the rest of the range with the previous horizontal mean line.
Here is my code from before:

    library(ggplot2)
    library(scales)  # provides trans_breaks() and pretty_breaks()

    ggplot(data1, aes(x = pos, y = diff, colour = gt)) +
      geom_point() +
      facet_grid(~ chrom, scales = "free_x", space = "free_x") +
      theme(strip.text.x = element_text(size = 40),
            strip.background = element_rect(color = 'lightblue', fill = 'lightblue'),
            legend.position = "top",
            legend.title = element_text(size = 40, colour = "darkblue"),
            legend.text = element_text(size = 40),
            legend.key.size = unit(2.5, "cm")) +
      guides(fill = guide_legend(title.position = "top",
                                 title = "legend:gt='ref'+'alt'"),
             shape = guide_legend(override.aes = list(size = 10))) +
      scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x, n = 10)) +
      scale_x_continuous(breaks = pretty_breaks(n = 3))

This was tougher than I expected! This should at least get you started, though:

    # saves a lot of headaches to make factors only when you need them
    options(stringsAsFactors = FALSE)
    library(ggplot2)
    library(plyr)

    # here's some made-up data - it helps if you can post a subset of
    # your real data, though. the dput() function is useful for that.
    dat <- data.frame(pos = seq(1, 1e7, by = 1e4))
    # add a random gt value
    dat$gt <- sample(x = c("ct", "ag", "ga", "tg", "tc"),
                     size = nrow(dat), replace = TRUE)
    # group by millions - there are several ways to do this that I can
    # never remember, but here's a simple way to split by millions
    dat$posgroup <- floor(dat$pos / 1e6)
    # add an arbitrary diff value
    dat$diff <- rnorm(n = nrow(dat), mean = 200 * dat$posgroup, sd = 300)

    # aggregate the data by gt and pos-group
    # ideally, you'd do this inside of the plot using stat_summary,
    # but I couldn't get that to work. using two datasets in a plot
    # is okay, though.
    datsum <- ddply(dat, .variables = "posgroup", .fun = function(x) {
      # calculate the mean diff value for each gt group in this posgroup
      meandiff <- ddply(x, .variables = "gt", .fun = summarise, ymean = mean(diff))
      # add the center of the posgroup range as the x position
      meandiff$center <- (x$posgroup[1] * 1e6) + 0.5e6
      # return the results
      meandiff
    })

    # on the plot, these results should be grouped by both pos and gt -
    # but ggplot will only accept one vector for grouping. so make a combination.
    datsum$combogroup <- paste(datsum$gt, datsum$posgroup)

    # plot it
    ggplot() +
      # first, a layer for the points
      # large numbers of points can be pretty slow - you might try getting
      # the plot to work with a subsample (~1000) and then add in the rest of
      # your data
      geom_point(data = dat, aes(x = pos, y = diff, color = as.factor(gt))) +
      # then a layer for the means. there are a variety of geoms you could
      # use here, but a crossbar with ymin and ymax set to the group mean
      # is a simple one
      geom_crossbar(data = datsum,
                    aes(x = center, y = ymean, ymin = ymean, ymax = ymean,
                        color = as.factor(gt), group = combogroup),
                    size = 1) +
      # some other niceties
      scale_x_continuous(breaks = seq(0, 1e7, by = 1e6)) +
      labs(x = "pos", y = "diff", color = "gt") +
      theme_bw()

which results in this:
[figure: scatterplot of diff vs. pos with a mean bar per gt in each posgroup]

There's probably a more straightforward way to do this, but I don't know it. Hope that helps.
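
One possible way around the two panel problems from the question is to include chrom in the aggregation, and to draw each mean as a segment that covers only its own bin. Because the summary data then carries a chrom column, facet_grid() should send each line to its own panel, and no line is drawn where a bin has no data. This is only a minimal sketch on made-up data, not something checked against the real data set. The column names chrom, pos, gt and diff follow the question; bin_width, bin, bin_means, xstart and xend are names introduced here for illustration.

```r
library(ggplot2)

# Made-up data with the same columns as the question (chrom, pos, gt, diff).
set.seed(1)
n <- 600
dat <- data.frame(
  chrom = sample(c("chr01", "chr02"), n, replace = TRUE),
  pos   = sample.int(5e6, n),
  gt    = sample(c("ct", "ag", "ga"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
dat$diff <- rexp(n, rate = 1 / 200) + 1

# Assumption taken from the question: bins are 1,000,000 wide.
bin_width <- 1e6
# Bin 0 covers 1..1,000,000, bin 1 covers 1,000,001..2,000,000, and so on.
dat$bin <- (dat$pos - 1) %/% bin_width

# Mean diff for every combination of chrom, gt and bin that has data.
bin_means <- aggregate(diff ~ chrom + gt + bin, data = dat, FUN = mean)
names(bin_means)[names(bin_means) == "diff"] <- "ymean"
# Start and end of each bin on the x axis.
bin_means$xstart <- bin_means$bin * bin_width + 1
bin_means$xend   <- (bin_means$bin + 1) * bin_width

p <- ggplot(dat, aes(x = pos, y = diff, colour = gt)) +
  geom_point(alpha = 0.4) +
  # One horizontal segment per chrom, gt and bin, limited to that bin.
  geom_segment(
    data = bin_means,
    aes(x = xstart, xend = xend, y = ymean, yend = ymean, colour = gt)
  ) +
  facet_grid(~ chrom, scales = "free_x", space = "free_x")
print(p)
```

Note that `(pos - 1) %/% bin_width` puts position 1,000,000 in the first bin, matching the ranges written in the question, whereas `floor(pos / 1e6)` in the answer above puts it in the second. The theme and scale settings from the question's code could be added to `p` in the usual way.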