Policy optimization in multi-agent settings under partial observability