A Trimaran Based Framework for Exploring the Design Space of VLIW ASIPs with Coarse Grain Functional Units

Bhuvan Middha (bhuvan@cse.iitd.ernet.in), Varun Raj (varun@cse.iitd.ernet.in), Anup Gangwar (anup@cse.iitd.ernet.in), Anshul Kumar (anshul@cse.iitd.ernet.in), M. Balakrishnan (mbala@cse.iitd.ernet.in)

Department of Computer Science and Engineering

Indian Institute of Technology Delhi, India

Paolo Ienne (Paolo.Ienne@epfl.ch)

Processor Architecture Laboratory

Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland

ABSTRACT

It is widely accepted that use of an Application Specific Instruction Set Processor (ASIP) in an embedded system can provide a solution which is much more flexible than ASICs and much more efficient than standard processors in terms of performance and power consumption. However, a lack of an acceptable design methodology and supporting tools for ASIPs limits their use even today. We present in this paper a methodology for design space exploration of high performance VLIW ASIPs by modeling Application Specific Functional Units in the Trimaran Compiler Infrastructure. To demonstrate the effectiveness of our strategy we consider two important applications, FFT and Kalman Filter, and perform compute intensive operations in these applications via special Functional Units. The results we obtain are very promising, with up to 2× speed improvement.

Categories and Subject Descriptors

C.1.1 [Processor Architectures]: VLIW architectures

General Terms

Performance

Keywords

Trimaran, VLIW, Performance, ASIP, Design Space Exploration

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ISSS'02, October 2–4, 2002, Kyoto, Japan.

Copyright 2002 ACM 1-58113-576-9/02/0010...$5.00.

1. INTRODUCTION AND MOTIVATION

With the cost of silicon area decreasing, it is becoming more and more attractive to trade off this area for the increased flexibility which an ASIP can provide. However, design methodologies and tools do not exist which can deliver ASIP designs suitable for embedded systems with challenging demands on performance. The work reported in this paper is aimed at developing a methodology for design space exploration and synthesis of high performance ASIPs.

In order to deliver high performance, an ASIP must exploit the instruction level parallelism (ILP) available in the given application. This points to VLIW architecture as a possible choice because it offers a better possibility of customization [1]. The number of Functional Units (FUs) and their organization into clusters (known as clustered VLIW architecture) is one dimension of architectural space which has already been explored [3]. The types of FUs and introduction of application specific FUs in the architecture is a relatively less explored dimension. Hence we consider a VLIW architecture which consists of a core set of FUs augmented with application specific coarse grain FUs. Specializing or customizing FUs for operations (or groups of operations) occurring in a given application can potentially lead to high performance gains because of the following considerations [4]: (a) If the operands of an operation have a limited resolution (bitwidth), FU hardware can be simplified and made faster. (b) By chaining a sequence of operations in an FU, the computation time can be reduced. (c) Concurrent operations within a group can be more easily parallelized than parallelization across the FUs. Further, by mapping a group of operations to an FU, access to the register file for the intermediate results is avoided. This reduction in register pressure has a beneficial effect on performance.

In order to evaluate the performance of an architecture for a specific application, one can follow an estimation based approach or a simulation based approach. The estimation based approach yields quick, but inaccurate results. Hence for fine grain performance comparison of various architectures one needs to simulate the architecture running the code of the application. For VLIW processors, this is impractical without a compiler. To support architecture exploration, the compiler as well as the simulator need to be retargetable. Trimaran [5], though limited in terms of architectural space, provides such tools.

In this paper, we present a framework built around the Trimaran infrastructure, which allows us to study the effect of application specific FUs on performance. We assume an executable specification in C and suggest a methodology to observe the effect of putting hardware accelerators for speeding up specific portions of the code. This paper is organized as follows: Section 2 describes the space of target architectures, Section 3 describes the performance evaluation framework, Section 4 describes the extensions to Trimaran for introduction of application specific FUs, Section 5 describes the case studies for validation of the work and finally Section 6 summarizes this work and also discusses the possible future directions.

2. SPACE OF TARGET ARCHITECTURES

[Figure 1: A typical VLIW ASIP Architecture. A register file (single or clustered) is connected through interconnection networks to a core set of FUs (FU1, FU2, ..., FUn) and to application specific FUs (afu, ..., afu).]

The target architecture which we consider for ASIP synthesis is as shown in Figure 1. It is essentially a VLIW processor with a core to support the usual fine grain operations like add, multiply, compare etc., augmented by application specific extensions. These extensions are centered around some medium or coarse grain FUs defining new instructions for implementing some critical functionality of the specific application. Actually the fine grain core may not be absolutely rigid, but generic in some limited sense. Further, it provides a default set of resources which are adequate to implement any part of the application.

Our focus in this paper is on exploring the use of coarse grain FUs for obtaining high performance ASIP architectures. We now look at the spectrum of custom FUs which we consider in the design space exploration. At one end of the spectrum there are multiple input single output units without any memory accesses and control, termed as MISOs [6]. This is the simplest generalization of basic fine grain operations which typically take one or two inputs and produce one result. The next conceivable generalization is to allow multiple outputs to be produced by an FU, making it a MIMO or a multiple input multiple output unit. The cycles in which various operands of a MIMO are input and results are output, relative to the beginning cycle, define the I/O time shape of such a MIMO [7]. If the cycles in which I/O occurs are fixed, the time shape is considered to be rigid. On the other hand, if the I/O operations are hold-able, that is, the cycles in which they occur could be delayed, the time shape is said to be flexible, which eases scheduling. A Basic MIMO takes all its operands from register files, therefore a further extension can be in terms of considering MIMOs with load/store which are capable of accessing the memory. An orthogonal dimension is whether conditionals are permitted within an FU or not. This further enhances its scope but it causes the latency of the FU to be variable. Such variable latency makes pure VLIW kind of scheduling difficult. One can even think of mapping loops to an FU with even the loop control inside the FU, but again this will require some run time control and synchronization, e.g. handshaking.


Name             Inputs and Sources            Outputs and Dests.            I/O Policy
MISO             Multiple (Regfile)            Single (Regfile)              Flexible or Rigid
MIMO             Multiple (Regfile)            Multiple (Regfile)            Flexible or Rigid
MIMO with LD/ST  Multiple (Regfile or Memory)  Multiple (Regfile or Memory)  Flexible or Rigid for Regfiles; LD/ST Block at beginning and end of operation

Table 1: Architectural spectrum of custom FUs

3. FRAMEWORK FOR PERFORMANCE EVALUATION

Here we consider the Trimaran Compiler Infrastructure as the framework for performance evaluation. The Trimaran system is based on the HPL-PD architecture, which is a parametric processor architecture conceived for research in instruction-level parallelism. The HPL-PD opcode repertoire, at its core, is similar to that of a RISC-like load/store architecture, with standard integer, floating point (including fused multiply-add type of operations) and memory operations. We map the core part of our target architecture to the HPL-PD architecture.

The Trimaran compiler infrastructure, as shown in Figure 2, consists of a compiler front-end, IMPACT, a compiler back-end, Elcor [2], and a simulator generator. The framework is parameterized using a machine description facility, HMDES [10]. We briefly describe each of these tools.

The IMPACT compiler system is used by the Trimaran system as its front-end. This front-end performs ANSI C parsing, code profiling and classical code optimizations along with block formation.

The High Level Machine Description Facility or HMDES is the machine description language used in the Trimaran system. This language describes a processor architecture from the compiler's point of view. To this end it specifies the instruction format, resource usages and reservation tables, latency information, operation information and some compiler specific information. The instruction format conveys what operands are allowed by each type of operation, resource usages specify how operations use processor resources as they execute and latency information specifies how to calculate dependence distances between operations. Finally, operation information specifies the operations supported by the architecture and describes each of them in terms of their Scheduling Alternatives, which include the format, resource usage and latency.

[Figure 2: The Trimaran Compiler Infrastructure. C Program -> IMPACT (ANSI C parsing, code profiling, classical machine independent optimizations, block formation) -> Bridge Code -> ELCOR (machine dependent code optimizations, code scheduling, register allocation) -> Elcor IR -> SIMULATOR GENERATOR (Elcor IR to low level C files, HPL-PD virtual machine, cache simulation) -> Generated Simulator. The HMDES Machine Description parameterizes the flow.]

Elcor is Trimaran's back-end for the HPL-PD architecture. It performs three tasks: (a) code selection and scheduling, (b) register allocation, and (c) machine dependent code optimizations. Elcor is parameterized by the machine description facility to a large extent. As shown in Figure 2, it takes as input the bridge code produced by the front-end along with a HMDES machine specification and produces an Elcor IR file. The IR is annotated with HPL-PD assembly instructions. The internal representation of Elcor IR consists of a set of C++ objects. All optimization modules in Elcor use the interface provided by these objects to carry out optimizations. Optimizations are simply IR to IR transformations.

The Trimaran framework also consists of a simulator which is used to generate various statistics such as compute cycles, total number of operations, etc.

The limitations of the Trimaran framework are that firstly, it is built around the HPL-PD architectural domain. Hence, it only supports operations which are a subset of HPL-PD operations. Secondly, the Trimaran framework does not completely support clustered VLIW architecture. It has a single register file of each type (e.g. integer regfile, floating point regfile etc). Each integer FU accesses the same integer regfile. Hence, we cannot evaluate performance for clustered architectures.

4. EXTENDING TRIMARAN INFRASTRUCTURE

In this section, we consider the problem of introducing coarse grain FUs in a compiler infrastructure. We assume that the compiler infrastructure consists of a machine description facility, a compiler front and back end and a retargetable simulator.

The first step involves defining a new machine operation O and a new resource R in the system. The operation O will be performed by R, which corresponds to a coarse grain FU in the architecture. The operation O will be defined in terms of the operation format, operation latency and the resource usage. After this the compiler needs to be modified so that it is able to generate code for this new operation. For this one requires a retargetable compiler parameterized with the machine description. The front end should be able to identify the pattern corresponding to O in the application source code and emit the appropriate Intermediate Representation (IR). The back end should be able to generate code corresponding to this IR. The former is in general a very hard problem, as not all the information that would enable the front end to identify the pattern in the source code corresponding to coarse operation O can be coded in the machine description.

Another approach could be that the front end remains unchanged. The application code is itself modified so that the desired computation (to be carried out by the coarse grain FU) is replaced by an external function call. The IR will consist of nodes corresponding to this function call. Then one can modify the IR itself to replace these nodes by a new node corresponding to the operation O. The back end will then treat this node as any other standard machine operation (e.g. ADD) and generate code for it. Finally one needs to define the operation semantics inside the retargetable simulator so that various statistics can be generated.

We take the latter approach to extend the Trimaran infrastructure. Each new operation is represented in terms of an external function call. The function name and coarse grain FU binding is implicit. The function name itself specifies to which FU it should be bound. We identify the function call in the IR of the code and replace it with a coarse grain operation in the IR. There is a one to one mapping between the coarse grain operation and the name of the function in the application code. The operation now propagates through the whole suite of optimizations done by the compiler. We also define the semantics of the new operation in the Trimaran simulator.

The Trimaran framework has no notion of register file ports. It assumes each regfile to have an unlimited number of ports. We incorporate the notion of regfile ports in the framework with a parameterized number of read/write ports corresponding to each regfile. This is an essential constraint in VLIW ASIP design as access time, area and power consumption sharply increase with the number of ports in a register file. The modified framework is shown in Figure 3. The shaded portion represents those parts of the framework which have been modified, with changes indicated along with each part. We have successfully modeled the three classes of FUs described above: MISOs, Basic MIMOs and MIMOs with LD/ST. In the following paragraphs we describe each of these.

4.1 Modeling MISOs

We have identified and successfully tested the following approach for introducing MISOs in the Trimaran Compiler Infrastructure. The application program in C consists of a prototype declaration of a function which the user wants to perform via a special functional unit. This is illustrated with the help of an example:

int main(void) {
    int a, b, c, d;
    a = 3; b = 4; c = 5;

    // The following computation
    // is to be done via special FU
    d = (a + b) * (b + c) * (a + c);
    return 0;
}

Let us define a new functional unit which takes in 3 inputs a, b and c and produces one output d. For defining the new functional unit we declare a prototype function in C; i.e., we do not define its functionality but only declare its interface.

int miso_fun(int a, int b, int c);

int main(void) {
    int a, b, c, d;
    a = 3; b = 4; c = 5;

    d = miso_fun(a, b, c);
    return 0;
}

Since the function is not completely defined inside the application, after passing through the front end it appears in the form of an external function call in the Trimaran bridge code along with the relevant annotations, which consist of the name of the function etc. After the first pass through Elcor it appears in the form of two Elcor Operations, PBRR (prepare to branch) and BRL (branch and link). We identify this combination of PBRR and BRL corresponding to the prototype function in the IR and replace this combination by a new node in the IR which corresponds to a new Elcor Operation and represents the FU which we want to introduce. The source and destination operands of this new Elcor Operation are the same as the source and destination operands of the prototype function call.

Finally the semantics of the new operation are defined in the simulator, which involves defining the destination as a function of the sources. As illustrated in Figure 4, we consider an example of a part of IR in which the PBRR and BRL operations corresponding to the miso_fun function call are replaced by New_Op, and other such combinations remain unchanged. The new operation is also defined in MDES (with opcode MNEWOP), which involves defining its Operation Format, Number of Resources (FUs), Operation Latency, Resource Usage and the Reservation Table.

[Figure 3: Modified Trimaran Framework. The original C program is instrumented before IMPACT; the bridge code passes to a modified ELCOR (IR transformation, register port constraints incorporated); the simulator generator carries the semantics of the new operation and the generated simulator produces the statistics; the HMDES machine description reflects the new operation.]

[Figure 4: Modifications in Elcor IR. Before: ADD_W, PBRR and BRL (Miso_Fun), SUB_W, PBRR and BRL (Printf). After: ADD_W, New_Op [d] [a b c] with s_time(1) and s_opcode(MNEWOP), SUB_W, PBRR and BRL (Printf).]

4.2 Modeling MIMOs

Since a function cannot return more than 1 value by definition, we take a slightly different approach here. Instead of a prototype function returning a value we consider a void function. We reserve some registers (through the C code itself, by giving some compiler directives) and the function returns values in those registers. This is illustrated by the following example:

#include <stdio.h>

int main(void) {
    int a, b, c, d;
    a = 4; b = 5;

    // The following computations
    // are to be done via special FU
    c = a + b;
    d = a - b;

    printf("%d %d\n", c, d);
    return 0;
}

Let us define a new functional unit which takes in 2 inputs a and b and returns outputs (a+b) and (a-b). Hence it is a MIMO. The application consists of a void prototype function declaration.

#include <stdio.h>

void mimo_fun(int a, int b);

int main(void) {
    int a, b, ret1, ret2;

    // Some code and directives
    // to reserve registers
    ........

    mimo_fun(a, b);

    // Values will be returned in ret1 & ret2
    printf("%d %d\n", ret1, ret2);
    return 0;
}

As in the previous case the prototype function call appears as a combination of PBRR and BRL in the Elcor IR. But now, in addition to replacing the above combination by a new Elcor Operation, we set the destinations of the operation as the registers reserved for this purpose (that is, the registers corresponding to variables ret1 and ret2 in the above example). The operation is also defined in MDES and its semantics are defined in the simulator.

4.3 Modeling MIMOs with load/store

To handle MIMOs with capability of interaction with memory we make modifications only in the MDES. Basically, in the reservation table corresponding to the operation we also reserve memory units in each time unit where interaction with the memory is required. We have multiple memory units in the system, so one of the units is reserved for performing this operation while others can handle normal load/store operations. The architecture assumes each LD/ST unit has ports to memory so they can be active simultaneously. While making the function call we also pass the addresses of the memory locations from which data is required. In the first few cycles of the operation memory resident data is accessed with the help of the LD/ST unit and stored in local buffers. Then the computation is performed and finally data is written into the memory, if required, again with the help of the LD/ST unit. The semantics are handled in the simulator in a similar way as for basic MIMOs.

4.4 Imposing Register Port Constraints

The Trimaran framework has no notion of regfile ports. It primarily has 1 regfile of each type (GPR, control regfile, floating point regfile, branch target and predicate regfile). The scheduler, for example, can schedule any number of integer operations in parallel depending on the availability of resources. This implies each regfile has an infinite number of ports in theory. So we impose these port constraints in the architectural framework because all the special FUs like MIMOs will have a large number of sources and many destinations. To incorporate these constraints we build a Time × Regport table for read as well as write ports in which, at each instant of time, the utilization of the read/write ports of each regfile is maintained. Before scheduling any operation the table is checked for availability of read/write ports along with the availability of resources. In the rigid I/O time shape model, if a particular FU has more sources than the number of read ports or more destinations than the number of write ports then the ports are reserved in the very next time instant, i.e., the cycles in which I/O occurs are fixed.

In the flexible I/O time shape model an FU is divided into various stages and the number of sources and destinations in each stage lies within the maximum number of read/write ports. A flow dependency edge is added between each stage to ensure each stage is scheduled after its predecessors are scheduled, depending on the availability of resources and ports. But it is flexible in the sense that stage i can be scheduled any time after stage i-1 has been scheduled. This is shown in Figure 5.

[Figure 5: Flexible I/O Timeshape. An AFU is split into the Elcor operations AFU_S0 and AFU_S1, connected by a flow dependency edge.]

5. CASE STUDIES

5.1 Fast Fourier Transform (FFT)

To illustrate the concept of MISOs and MIMOs and to evaluate the performance gain when special functional units are present in the system we consider a standard N-point FFT application. FFT forms the heart of many image transformation packages, and thus is an interesting application to consider for speedup. The heart of FFT is the repetitive computation of the butterfly operation. The butterfly operation is shown in Figure 6. In this application we replace the butterfly operation, which has 6 sources (as each of the 3 sources is a complex number) and 4 destinations, with a Multiple Input Multiple Output FU. The model conforms to the rigid I/O time shape model with each regfile having 4 read ports and 2 write ports. The latency of the butterfly operation is set to 8.

[Figure 6: Butterfly Operation. Inputs a, b and w produce a + bw and a - bw, where a = ar + i(ac), b = br + i(bc), w = wr + i(wc) and i = sqrt(-1).]

We implement the DFG shown in Figure 7 in the 6-4 MIMO corresponding to the butterfly operation. The highlighted portion represents one of the many critical paths. The length of the critical path is 8, assuming 4 reads and 2 writes are permitted in each cycle. The shaded portions represent the read and write operations. The latency of the multiplication operation in the base architecture is 3, whilst that of arithmetic operations is 1. The results of introduction of the new functional unit are shown in Table 2.

[Figure 7: Data flow graph of butterfly operation, spanning cycles 1 to 8: reads of br, bc, wr, wc, ac and ar, multiplications, additions and subtractions, and writes of o1, o3, o2 and o4.]

n-point FFT                  2    4    8     16
Without Special FU (cycles)  236  549  1209  [illegible: 29]
With Special FU (cycles)     209  341  583   1081

Table 2: Compute cycles for varying n in FFT

Table 2 shows that as the value of n increases the number of butterfly operations also increases and therefore the performance gain also increases. For n = 16 the speed improvement is almost 2.5×.

We have also implemented the flexible I/O time shape model corresponding to the FFT application. The results obtained are similar to those in the rigid case.

5.2 Kalman Filter

The Kalman filter [8] is a set of mathematical equations that provides an efficient computational (recursive) solution of the least-squares method. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown. The Kalman filter basically consists of two main functions: predictstate, which predicts the state of the system, and kalmanupdate, which updates the system. We built special FUs to perform various frequently occurring operations in these functions. Many of these operations involve manipulation of various arrays, which involved handling memory within the AFU. In all, we introduced 5 MISOs with load/store to handle the various operations. The description of each AFU is shown in Table 3. The semantics are specified in the form of destination as a function of sources, where si represents the i-th source of the AFU and di represents the i-th destination of the AFU. The latency of each FU conformed to the amount of computation involved and the model followed was the rigid I/O time shape model. The number of read ports in each regfile was 3 and the number of write ports was 2. The results obtained are shown in Table 4.

AFU No.  No. of Inputs  No. of Outputs  Semantics
1        5              1               d1 = (s1 + (s2 * (s3 + s4 + s5 * s2)))
2        3              1               d1 = (s1 + (s2 * s3))
3        5              1               d1 = (-s1 * s2 + s3 * s4) / s5
4        5              1               d1 = (s1 + s2 * s3 + s4 * s5)
5        5              1               d1 = (s1 - s2 * s3 - s4 * s5)

Table 3: Kalman Filter AFUs

Function      Without Special FU (cycles)  With Special FU (cycles)
PredictState  699                          498
KalmanUpdate  774                          342

Table 4: Kalman Filter Results

We have used 3 special FUs in the kalmanupdate function and 2 special FUs in the predictstate function. The performance is better in the case of the latter because there were more operations that could be mapped to these special FUs than in the former case. As can be observed from Table 4, the number of cycles has come down to less than half in the predictstate function, which implies a fairly large performance gain.

6. CONCLUSION AND FUTURE WORK

We have presented a framework to quickly evaluate the performance gain obtained when special application specific functional units are introduced in the architecture. The framework can be used for extensive design space exploration and the user can experiment by mapping various compute intensive parts of the application to special FUs in hardware and comparing the relative performance estimates under accurate implementation constraints. This can ease the synthesis of ASIPs corresponding to these sets of applications. Research is going on in the area of automatic topology based identification of instruction set extensions for embedded processors [9]. Currently the potential candidates are identified manually. A possible future work can be to create an automatic identification-evaluation framework, which automatically identifies the potential candidates based on some speedup factors associated with each instruction and then evaluates them using this extended Trimaran framework for possible gains. The modeling does not take into account the relative cost of the special FU. A possible extension can be the introduction of a cost model which evaluates the area corresponding to each special FU so that one can even evaluate the performance-area tradeoff. Besides, the FUs are of limited complexity; one can extend the framework to introduce FUs which are capable of handling conditionals and loops also. Currently, the changes required in the Trimaran framework for the introduction of any particular FU are manual. We are making efforts to automate it to the extent possible.

7. REFERENCES

[1] Shail Aditya, B. R. Rau and V. Kathail. Automatic architectural synthesis of VLIW and EPIC processors. In Proceedings of 12th ISSS. November, 1999.

[2] Shail Aditya, Vinod Kathail, and B. Ramakrishna Rau. Elcor's Machine Description System: Version 3.0. Technical Report HPL-1998-128, Hewlett-Packard Laboratories, October 1998.

[3] Margarida F. Jacome et al. Clustered VLIW architectures with predicated switching. In DAC, pages 696-701, 2001.

[4] Paolo Ienne, Laura Pozzi, M. Vuletic. On the Limits of Processor Specialisation by Mapping Dataflow Sections on Ad-hoc Functional Units. CS Technical Report 01/376, LAP, EPFL, Lausanne. December 2001.

[5] The Trimaran Compiler Infrastructure, http://www.trimaran.org.

[6] Cesare Alippi et al. A DAG based design approach for reconfigurable VLIW processors. In Proceedings of the DATE, pages 778-79, March 1999.

[7] N. G. Busa et al. Scheduling coarse grain operations for VLIW processors. In Proceedings of the 13th ISSS, pages 47-53, Madrid, September 2000.

[8] Greg Welch and Gary Bishop. An Introduction to the Kalman Filter. Technical Report, Department of Comp. Sc. and Engg., Univ. of North Carolina at Chapel Hill, March 2002.

[9] Pozzi, Laura and Vuletić, Miljan and Ienne, Paolo. Automatic Topology-Based Identification of Instruction-Set Extensions for Embedded Processors. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, Paris, March 2002.

[10] J. Gyllenhaal, B. Rau, and W. Hwu. HMDES version 2.0 specification, IMPACT, University of Illinois, Urbana, IL, Tech. Rep. IMPACT-96-03, 1996.
