Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser*  Sumith Kulal  Andreas Blattmann  Rahim Entezari  Jonas Müller  Harry Saini  Yam Levi  Dominik Lorenz  Axel Sauer  Frederic Boesel  Dustin Podell  Tim Dockhorn  Zion English  Kyle Lacey  Alex Goodwin  Yannik Marek  Robin Rombach*

Stability AI

*Equal contribution. stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

1. Introduction

Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022).

Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022). While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy in training and test distribution and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice. So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models. In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.

We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations.

Our largest models outperform state-of-the-art open models such as SDXL (Podell et al., 2023), SDXL-Turbo (Sauer et al., 2023), PixArt-α (Chen et al., 2023), and closed-source models such as DALL-E 3 (Betker et al., 2023) both in quantitative evaluation (Ghosh et al., 2023) of prompt understanding and human preference ratings.

The core contributions of our work are: (i) We conduct a large-scale, systematic study on different diffusion model and rectified flow formulations to identify the best setting. For this purpose, we introduce new noise samplers for rectified flow models that improve performance over previously known samplers. (ii) We devise a novel, scalable architecture for text-to-image synthesis that allows bi-directional mixing between text and image token streams within the network. We show its benefits compared to established backbones such as UViT (Hoogeboom et al., 2023) and DiT (Peebles & Xie, 2023). Finally, we (iii) perform a scaling study of our model and demonstrate that it follows predictable scaling trends. We show that a lower validation loss correlates strongly with improved text-to-image performance assessed via metrics such as T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2023) and human ratings. We make results, code, and model weights publicly available.

2. Simulation-Free Training of Flows

We consider generative models that define a mapping between samples x_1 from a noise distribution p_1 to samples x_0 from a data distribution p_0 in terms of an ordinary differential equation (ODE),

    dy_t = v_Θ(y_t, t) dt,    (1)

where the velocity v is parameterized by the weights Θ of a neural network. Prior work by Chen et al. (2018) suggested to directly solve Equation (1) via differentiable ODE solvers. However, this process is computationally expensive, especially for large network architectures that parameterize v_Θ(y_t, t). A more efficient alternative is to
directly regress a vector field u_t that generates a probability path between p_0 and p_1. To construct such a u_t, we define a forward process, corresponding to a probability path p_t between p_0 and p_1 = N(0, I), as

    z_t = a_t x_0 + b_t ε,  where ε ∼ N(0, I).    (2)

For a_0 = 1, b_0 = 0, a_1 = 0 and b_1 = 1, the marginals

    p_t(z_t) = E_{ε∼N(0,I)} p_t(z_t | ε)    (3)

are consistent with the data and noise distribution. To express the relationship between z_t, x_0 and ε, we introduce ψ_t and u_t as

    ψ_t(·|ε) : x_0 ↦ a_t x_0 + b_t ε    (4)
    u_t(z|ε) := ψ_t'(ψ_t^{-1}(z|ε)|ε)    (5)

Since z_t can be written as solution to the ODE z_t' = u_t(z_t|ε), with initial value z_0 = x_0, u_t(·|ε) generates p_t(·|ε). Remarkably, one can construct a marginal vector field u_t which generates the marginal probability paths p_t (Lipman et al., 2023) (see B.1), using the conditional vector fields u_t(·|ε):

    u_t(z) = E_{ε∼N(0,I)} [ u_t(z|ε) p_t(z|ε) / p_t(z) ]    (6)

While regressing u_t with the Flow Matching objective

    L_FM = E_{t, p_t(z)} || v_Θ(z, t) − u_t(z) ||_2^2    (7)

directly
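To make Equations (2)–(5) concrete, the following minimal NumPy sketch instantiates them for the straight-line rectified-flow schedule a_t = 1 − t, b_t = t (an assumed concrete choice for illustration; the function names and toy dimensions are not from the paper). It shows that Euler-integrating the conditional ODE z_t' = u_t(z_t|ε) reproduces the forward process of Equation (2), and that on this straight path the regression target u_t(z_t|ε) reduces to the constant ε − x_0:

```python
import numpy as np

# Rectified-flow schedule for Eq. (2): z_t = a_t * x0 + b_t * eps
# with a_t = 1 - t, b_t = t (straight line between data and noise).
# Illustrative sketch only -- not the paper's training code.

def psi(t, x0, eps):
    """Forward map psi_t(x0 | eps) = a_t x0 + b_t eps, Eq. (4)."""
    return (1.0 - t) * x0 + t * eps

def u_cond(t, z, eps):
    """Conditional vector field u_t(z | eps), Eq. (5):
    psi_t'(psi_t^{-1}(z | eps) | eps). For the straight-line schedule this
    simplifies to (eps - z) / (1 - t), which equals eps - x0 on the path."""
    return (eps - z) / (1.0 - t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))    # toy "data" sample x0 ~ p0 (stand-in)
eps = rng.normal(size=(4,))   # noise sample eps ~ N(0, I)

# Simulate dz_t = u_t(z_t | eps) dt with Euler steps from z_0 = x0.
# Because the path is straight, the conditional velocity is constant along
# it, so even coarse steps are exact up to floating-point error -- the
# property motivating rectified flows.
z = x0.copy()
T, n_steps = 0.9, 3
dt = T / n_steps
t = 0.0
for _ in range(n_steps):
    z = z + dt * u_cond(t, z, eps)
    t += dt

print(np.allclose(z, psi(T, x0, eps)))  # True: ODE solution matches Eq. (2)

# Conditional Flow Matching regresses v_Theta(z_t, t) onto u_t(z_t | eps),
# which here is simply eps - x0, independent of t.
print(np.allclose(u_cond(0.5, psi(0.5, x0, eps), eps), eps - x0))  # True
```

Reversing the sign of the velocity (or integrating from t = 1 towards t = 0) turns the same loop into a sampler that carries noise back to data once v_Θ has been trained.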
