112 Commits

Author SHA1 Message Date
lucidrains
b6aa19f31e complete multi-token prediction for actions, tackle loss balancing another day 0.0.38 2025-10-18 10:23:14 -07:00
lucidrains
bc629d78b1 inverse norm for continuous actions when sampling 0.0.37 2025-10-18 08:55:04 -07:00
lucidrains
0ee475d2df oops 0.0.36 2025-10-18 08:50:53 -07:00
lucidrains
8c88a33d3b complete multi-token prediction for the reward head 0.0.35 2025-10-18 08:33:06 -07:00
lucidrains
911a1a8434 oops 0.0.34 2025-10-18 08:07:06 -07:00
lucidrains
5fc0022bbf the function for generating the MTP targets, as well as the mask for the losses 2025-10-18 08:04:51 -07:00
lucidrains
83cfd2cd1b task conditioning when dreaming 0.0.33 2025-10-18 07:47:13 -07:00
lucidrains
22e13c45fc rename 0.0.32 2025-10-17 14:44:25 -07:00
lucidrains
c967404471 0.0.31 0.0.31 2025-10-17 08:55:42 -07:00
lucidrains
0c1b067f97 if an optimizer is passed into the learn-from-dreams function, take the optimizer steps; otherwise let the researcher handle it externally. also ready muon 2025-10-17 08:55:20 -07:00
lucidrains
cb416c0d44 handle the entropies during policy optimization 0.0.30 2025-10-17 08:47:26 -07:00
lucidrains
61773c8219 eventually we will need to learn from the outside stream of experience 0.0.29 2025-10-17 08:06:24 -07:00
lucidrains
0dba734280 start the learning in dreams portion 0.0.27 2025-10-17 08:00:47 -07:00
lucidrains
a0161760a0 extract the log probs and predicted values (symexp two hot encoded) for the phase 3 RL training 0.0.26 2025-10-16 10:40:59 -07:00
lucidrains
2d20d0a6c1 able to roll out actions from one agent within the dreams of a world model 0.0.25 2025-10-16 10:15:43 -07:00
lucidrains
d74f09f0b3 a researcher in discord pointed out that the tokenizer also uses the axial space time transformer. redo without the 3d rotary and block causal, greatly simplifying the implementation 0.0.24 2025-10-16 09:40:14 -07:00
lucidrains
2ccb290e26 pass the attend kwargs for the block causal masking in tokenizer 0.0.23 2025-10-16 08:33:26 -07:00
lucidrains
517ef6b94b oops 0.0.22 2025-10-16 07:03:51 -07:00
lucidrains
ec18bc0fa4 cleanup 2025-10-16 06:44:28 -07:00
lucidrains
2a902eaaf7 allow reward tokens to be attended to as state optionally, DT-esque. figure out multi-agent scenario once i get around to it 0.0.21 2025-10-16 06:41:02 -07:00
lucidrains
d28251e9f9 another consideration before knocking out the RL logic 0.0.20 2025-10-14 11:10:26 -07:00
lucidrains
ff81dd761b separate action and agent embeds 0.0.19 2025-10-13 11:36:21 -07:00
lucidrains
6dbdc3d7d8 correct a misunderstanding: the past action is a separate action token, while the agent token is used for predicting the next action, rewards, and values 0.0.18 2025-10-12 16:16:18 -07:00
lucidrains
9c78962736 sampling actions 0.0.17 2025-10-12 11:27:12 -07:00
lucidrains
c5e64ff4ce separate out the key from the value projections in attention for muon 0.0.16 2025-10-12 09:42:22 -07:00
lucidrains
ab5de6795f bring in muon 2025-10-12 09:35:06 -07:00
lucidrains
8a73a27fc7 add nested tensor way for getting log prob of multiple discrete actions 0.0.15 2025-10-11 10:53:24 -07:00
lucidrains
01bf70e18a 0.0.14 0.0.14 2025-10-11 09:24:58 -07:00
lucidrains
b2725d9b6e complete behavior cloning for one agent 2025-10-11 09:24:49 -07:00
lucidrains
02558d1f08 will organize the unembedding parameters under the actor optimizer 2025-10-11 06:55:57 -07:00
lucidrains
563b269f8a bring in hyper connections 0.0.12 2025-10-11 06:52:57 -07:00
lucidrains
5df3e69583 last commit for the day 0.0.11 2025-10-10 11:59:18 -07:00
lucidrains
9230267d34 handle subset of discrete action unembedding 0.0.10 2025-10-10 11:27:05 -07:00
lucidrains
c68942b026 cleanup 2025-10-10 10:42:54 -07:00
lucidrains
32aa355e37 prepare unembedding parameters in ActionEmbedder as well as the policy head, to allow for behavioral cloning before RL 0.0.9 2025-10-10 10:41:48 -07:00
lucidrains
9101a49cdd handle continuous value normalization if stats passed in 2025-10-09 08:59:54 -07:00
lucidrains
31f4363be7 must be able to do phase1 and phase2 training 2025-10-09 08:04:36 -07:00
lucidrains
e2d86a4543 add a complete action embedder that can accept any number of discrete actions with variable bins as well as any number of continuous actions, pooled and added to the agent token as described in the paper (seems like they fixed that horrendous hack in dreamer v3 with sticky action) 0.0.8 2025-10-09 07:53:42 -07:00
lucidrains
b62c08be65 fix task embed in presence of multiple agent tokens 2025-10-08 08:42:25 -07:00
lucidrains
4c2ed100a3 fix masking for multiple agent tokens 0.0.7 2025-10-08 08:26:44 -07:00
lucidrains
ed0918c974 prepare for evolution within dreams 2025-10-08 08:13:16 -07:00
lucidrains
892654d442 multiple agent tokens sharing the same state 2025-10-08 08:06:13 -07:00
lucidrains
c4e0f46528 for the value head, we will go for symexp encoding as well (following the "stop regressing" paper from Farebrother et al), also use layernormed mlp given recent papers 2025-10-08 07:37:34 -07:00
lucidrains
a50e360502 makes more sense for the noise to be fixed 2025-10-08 07:17:05 -07:00
Phil Wang
9c56ba0c9d Merge pull request #3 from lucidrains/pytest-shard (add pytest shard) 2025-10-08 07:03:11 -07:00
lucidrains
b5744237bf fix 2025-10-08 06:58:46 -07:00
lucidrains
63b63dfedd add shard 2025-10-08 06:56:03 -07:00
lucidrains
612f5f5dd1 a bit of dropout to rewards as state 2025-10-08 06:45:25 -07:00
lucidrains
c8f75caa40 although not in the paper, it would be interesting for each agent (will extend to multi-agent) to consider its own past rewards as state 2025-10-08 06:40:43 -07:00
lucidrains
187edc1414 all set for generating the perceived rewards once the RL components fall into place 2025-10-08 06:33:28 -07:00