HyPO: A Hybrid Reinforcement Learning Algorithm that Uses Offline Data for Contrastive-based Preference Optimization and Online Unlabeled Data for KL Regularization MarkTechPost
Recent Comments