Bhuvan Middha
bhuvan@cse.iitd.ernet.in
Varun Raj
varun@cse.iitd.ernet.in
Anup Gangwar
anup@cse.iitd.ernet.in
Anshul Kumar
anshul@cse.iitd.ernet.in
M. Balakrishnan
mbala@cse.iitd.ernet.in
Department of Computer Science and Engineering
Indian Institute of Technology Delhi, India

Paolo Ienne
Paolo.Ienne@epfl.ch
Processor Architecture Laboratory
Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland
ABSTRACT
It is widely accepted that use of an Application Specific Instruction Set Processor (ASIP) in an embedded system can provide a solution which is much more flexible than ASICs and much more efficient than standard processors in terms of performance and power consumption. However, a lack of an acceptable design methodology and supporting tools for ASIPs limits their use even today. We present in this paper a methodology for design space exploration of high performance VLIW ASIPs by modeling Application Specific Functional Units in the Trimaran Compiler Infrastructure. To demonstrate the effectiveness of our strategy we consider two important applications, FFT and Kalman Filter, and perform compute intensive operations in these applications via special Functional Units. The results we obtain are very promising, with up to 2× speed improvement.
Categories and Subject Descriptors
C.1.1 [Processor Architectures]: VLIW architectures

General Terms
Performance

Keywords
Trimaran, VLIW, Performance, ASIP, Design Space Exploration

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISSS'02, October 2–4, 2002, Kyoto, Japan.
Copyright 2002 ACM 1-58113-576-9/02/0010 ...$5.00.

1. INTRODUCTION AND MOTIVATION
With the cost of silicon area decreasing, it is becoming more and more attractive to trade off this area for the increased flexibility which an ASIP can provide. However, design methodologies and tools do not exist which can deliver ASIP designs suitable for embedded systems with challenging demands on performance. The work reported in this paper is aimed at developing a methodology for design space exploration and synthesis of high performance ASIPs.
In order to deliver high performance, an ASIP must exploit the instruction level parallelism (ILP) available in the given application. This points to VLIW architecture as a possible choice because it offers a better possibility of customization [1]. The number of Functional Units (FUs) and their organization into clusters (known as clustered VLIW architecture) is one dimension of architectural space which has already been explored [3]. The types of FUs and introduction of application specific FUs in the architecture is a relatively less explored dimension. Hence we consider a VLIW architecture which consists of a core set of FUs augmented with application specific coarse grain FUs. Specializing or customizing FUs for operations (or group of operations) occurring in a given application can potentially lead to high performance gains because of the following considerations [4]: (a) If the operands of an operation have a limited resolution (bitwidth), FU hardware can be simplified and made faster. (b) By chaining a sequence of operations in an FU, the computation time can be reduced. (c) Concurrent operations within a group can be more easily parallelized than parallelization across the FUs. Further, by mapping a group of operations to an FU, access to register file for the intermediate results is avoided. This reduction in register pressure has a beneficial effect on performance.
In order to evaluate the performance of an architecture for a specific application, one can follow an estimation based approach or a simulation based approach. The estimation based approach yields quick, but inaccurate results. Hence for fine grain performance comparison of various architectures one needs to simulate the architecture running the code of the application. For VLIW processors, this is impractical without a compiler. To support architecture exploration, the compiler as well as the simulator need to be retargetable. Trimaran [5], though limited in terms of architectural space, provides such tools.
In this paper, we present a framework built around the Trimaran infrastructure, which allows us to study the effect of application specific FUs on performance. We assume an executable specification in C and suggest a methodology to observe the effect of putting hardware accelerators for speeding up specific portions of the code. This paper is organized as follows: Section 2 describes the space of target architectures, Section 3 describes the performance evaluation framework, Section 4 describes the extensions to Trimaran for introduction of application specific FUs, Section 5 describes the case studies for validation of the work and finally Section 6 summarizes this work and also discusses the possible future directions.

2. SPACE OF TARGET ARCHITECTURES

Figure 1: A typical VLIW ASIP Architecture [A register file (single or clustered) connected through interconnection networks to a core set of FUs (FU1, FU2, ..., FUn) and to application specific FUs (afu).]

The target architecture which we consider for ASIP synthesis is as shown in Figure 1. It is essentially a VLIW processor with a core to support the usual fine grain operations like add, multiply, compare etc., augmented by application specific extensions. These extensions are centered around some medium or coarse grain FUs defining new instructions for implementing some critical functionality of the specific application. Actually the fine grain core may not be absolutely rigid, but generic in some limited sense. Further, it provides a default set of resources which are adequate to implement any part of the application.

Our focus in this paper is on exploring the use of coarse grain FUs for obtaining high performance ASIP architectures. We now look at the spectrum of custom FUs which we consider in the design space exploration. At one end of the spectrum there are multiple input single output units without any memory accesses and control, termed as MISOs [6]. This is the simplest generalization of basic fine grain operations which typically take one or two inputs and produce one result. The next conceivable generalization is to allow multiple outputs to be produced by an FU, making it a MIMO or a multiple input multiple output unit. The cycles in which various operands of a MIMO are input and results are output, relative to the beginning cycle, define the I/O time shape of such a MIMO [7]. If the cycles in which I/O occurs are fixed, the time shape is considered to be rigid. On the other hand, if the I/O operations are hold-able, that is the cycles in which they occur could be delayed, the time shape is said to be flexible, which eases scheduling. A Basic MIMO takes all its operands from register files, therefore a further extension can be in terms of considering MIMOs with load/store which are capable of accessing the memory. An orthogonal dimension is whether conditionals are permitted within an FU or not. This further enhances its scope but it causes the latency of the FU to be variable. Such variable latency makes pure VLIW kind of scheduling difficult. One can even think of mapping loops to an FU with even the loop control inside the FU, but again this will require some run time control and synchronization, e.g. handshaking.
Table 1: Architectural spectrum of custom FUs

Name              Inputs and Sources             Outputs and Dests.             I/O Policy
MISO              Multiple (Regfile)             Single (Regfile)               Flexible or Rigid
MIMO              Multiple (Regfile)             Multiple (Regfile)             Flexible or Rigid
MIMO with LD/ST   Multiple (Regfile or Memory)   Multiple (Regfile or Memory)   Flexible or Rigid for Regfiles, and Block LD/ST at beginning and end of operation
3. FRAMEWORK FOR PERFORMANCE EVALUATION
Here we consider the Trimaran Compiler Infrastructure as the framework for performance evaluation. The Trimaran system is based on the HPL-PD architecture, which is a parametric processor architecture conceived for research in instruction-level parallelism. The HPL-PD opcode repertoire, at its core, is similar to that of a RISC-like load/store architecture, with standard integer, floating point (including fused multiply-add type of operations) and memory operations. We map the core part of our target architecture to the HPL-PD architecture.
The Trimaran compiler infrastructure, as shown in Figure 2, consists of a compiler front-end, IMPACT, a compiler back-end, Elcor [2], and a simulator generator. The framework is parameterized using a machine description facility, HMDES [10]. We briefly describe each of these tools.
The IMPACT compiler system is used by the Trimaran system as its front end. This front-end performs ANSI C parsing, code profiling, and classical code optimizations along with block formation.
Figure 2: The Trimaran Compiler Infrastructure [C program -> IMPACT (ANSI C parsing, code profiling, classical machine independent optimizations, block formation) -> bridge code -> ELCOR (machine dependent code optimizations, code scheduling, register allocation; driven by the HMDES machine description) -> Elcor IR -> simulator generator (Elcor IR to low level C files, HPL-PD virtual machine, cache simulation) -> generated simulator.]

The High Level Machine Description Facility, or HMDES, is the machine description language used in the Trimaran system. This language describes a processor architecture from the compiler's point of view. To this end it specifies the instruction format, resource usages and reservation tables, latency information, operation information and some compiler specific information. The instruction format conveys what operands are allowed by each type of operation, resource usages specify how operations use processor resources as they execute and latency information specifies how to calculate dependence distances between operations. Finally, operation information specifies the operations supported by the architecture and describes each of them in terms of their Scheduling Alternatives, which include the format, resource usage and latency.
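The kinds of information a machine description records for each operation (format, latency, reservation table) can be pictured as a small record. The following C sketch only illustrates that idea. It is not HMDES syntax, and the names `ResUse`, `OpDesc` and `ops_conflict`, as well as the fixed table size, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_USES 8 /* hypothetical upper bound on reservation-table entries */

/* One reservation-table entry: resource `res` is busy at cycle offset `time`. */
typedef struct {
    int res;  /* resource index, e.g. an FU or a memory unit */
    int time; /* cycle offset relative to the issue cycle */
} ResUse;

/* Hypothetical record of what a description stores per operation. */
typedef struct {
    const char *opcode;    /* operation name */
    int n_src, n_dest;     /* operation format: operand counts */
    int latency;           /* cycles until the results are available */
    ResUse uses[MAX_USES]; /* reservation table */
    size_t n_uses;         /* number of valid entries in `uses` */
} OpDesc;

/* True if `b`, issued `delta` cycles after `a`, needs some resource in a
 * cycle in which `a` already holds it. */
bool ops_conflict(const OpDesc *a, const OpDesc *b, int delta)
{
    assert(a->n_uses <= MAX_USES && b->n_uses <= MAX_USES);
    for (size_t i = 0; i < a->n_uses; i++)
        for (size_t j = 0; j < b->n_uses; j++)
            if (a->uses[i].res == b->uses[j].res &&
                a->uses[i].time == b->uses[j].time + delta)
                return true;
    return false;
}
```

A scheduler could call such a check before placing two operations that share a resource; a coarse grain FU would simply appear as one more resource index with a longer reservation table.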
Elcor is Trimaran's back-end for the HPL-PD architecture. It performs three tasks: (a) code selection and scheduling, (b) register allocation, and (c) machine dependent code optimizations. Elcor is parameterized by the machine description facility to a large extent. As shown in Figure 2, it takes as input the bridge code produced by the front-end along with an HMDES machine specification and produces an Elcor IR file. The IR is annotated with HPL-PD assembly instructions. The internal representation of Elcor IR consists of a set of C++ objects. All optimization modules in Elcor use the interface provided by these objects to carry out optimizations. Optimizations are simply IR to IR transformations.
The Trimaran framework also consists of a simulator which is used to generate various statistics such as compute cycles, total number of operations, etc.
The limitations of the Trimaran framework are that firstly, it is built around the HPL-PD architectural domain. Hence, it only supports operations which are a subset of HPL-PD operations. Secondly, the Trimaran framework does not completely support clustered VLIW architecture. It has a single register file of each type (e.g. integer regfile, floating point regfile etc). Each integer FU accesses the same integer regfile. Hence, we cannot evaluate performance for clustered architectures.
4. EXTENDING TRIMARAN INFRASTRUCTURE

In this section, we consider the problem of introducing coarse grain FUs in a compiler infrastructure. We assume that the compiler infrastructure consists of a machine description facility, a compiler front end and back end, and a retargetable simulator.

The first step involves defining a new machine operation O and a new resource R in the system. The operation O will be performed by R, which corresponds to a coarse grain FU in the architecture. The operation O will be defined in terms of the operation format, operation latency and the resource usage. After this the compiler needs to be modified so that it is able to generate code for this new operation. For this one requires a retargetable compiler parameterized with the machine description. The front end should be able to identify the pattern corresponding to O in the application source code and emit the appropriate Intermediate Representation (IR). The back end should be able to generate code corresponding to this IR. The former is in general a very hard problem, as not all the information which would enable the front end to identify the pattern in the source code corresponding to the coarse operation O can be coded in the machine description.

Another approach could be that the front end remains unchanged. The application code is itself modified so that the desired computation (to be carried out by the coarse grain FU) is replaced by an external function call. The IR will consist of nodes corresponding to this function call. Then one can modify the IR itself to replace these nodes by a new node corresponding to the operation O. The back end will then treat this node as any other standard machine operation (e.g. ADD) and generate code for it. Finally one needs to define the operation semantics inside the retargetable simulator so that various statistics can be generated.

We take the latter approach to extend the Trimaran infrastructure. Each new operation is represented in terms of an external function call. The function name and coarse grain FU binding is implicit. The function name itself specifies to which FU it should be bound. We identify the function call in the IR of the code and replace it with a coarse grain operation in the IR. There is a one to one mapping between the coarse grain operation and the name of the function in the application code. The operation now propagates through the whole suite of optimizations done by the compiler. We also define the semantics of the new operation in the Trimaran simulator.

The Trimaran framework has no notion of register file ports. It assumes each regfile to have an unlimited number of ports. We incorporate the notion of regfile ports in the framework with a parameterized number of read/write ports corresponding to each regfile. This is an essential constraint in VLIW ASIP design, as access time, area and power consumption sharply increase with the number of ports in a register file. The modified framework is shown in Figure 3. The shaded portion represents those parts of the framework which have been modified, with changes indicated along with each part. We have successfully modeled the three classes of FUs described above: MISOs, Basic MIMOs and MIMOs with LD/ST. In the following paragraphs we describe each of these.

4.1 Modeling MISOs
We have identified and successfully tested the following approach for introducing MISOs in the Trimaran Compiler Infrastructure. The application program in C consists of a prototype declaration of a function which the user wants to perform via a special functional unit. This is illustrated with the help of an example:
    int main(void)
    {
        int a, b, c, d;
        a = 3; b = 4; c = 5;
        // The following computation
        // is to be done via special FU
        d = (a + b) * (b + c) * (a + c);
    }
Let us define a new functional unit which takes in 3 inputs a, b and c and produces one output d. For defining the new functional unit we declare a prototype function in C; i.e., we do not define its functionality but only declare its interface.

    int miso_fun(int a, int b, int c);

    int main(void)
    {
        int a, b, c, d;
        a = 3; b = 4; c = 5;
        d = miso_fun(a, b, c);
    }

Since the function is not completely defined inside the application, after passing through the front end it appears in the form of an external function call in the Trimaran bridge code along with the relevant annotations, which consist of the name of the function etc. After the first pass through Elcor it appears in the form of two Elcor Operations, PBRR (prepare to branch) and BRL (branch and link). We identify this combination of PBRR and BRL corresponding to the prototype function in the IR and replace this combination by a new node in the IR which corresponds to a new Elcor Operation and represents the FU which we want to introduce. The source and destination operands of this new Elcor Operation are the same as the source and destination operands of the prototype function call.

Finally the semantics of the new operation are defined in the simulator, which involves defining the destination as a function of the sources. As illustrated in Figure 4, we consider an example of a part of the IR in which the PBRR and BRL operations corresponding to the miso_fun function call are replaced by NEWOP and other such combinations remain unchanged. The new operation is also defined in MDES (with opcode MNEWOP), which involves defining its Operation Format, Number of Resources (FUs), Operation Latency, Resource Usage and the Reservation Table.

Figure 3: Modified Trimaran Framework [Original C program -> instrumented C program -> IMPACT -> bridge code -> modified ELCOR (IR transformation, incorporated register port constraints) -> Elcor IR -> simulator generator with semantics of new operation -> generated simulator -> stats; the HMDES machine description reflects the new operation.]

Figure 4: Modifications in Elcor IR [Before: ADD_W, PBRR (Miso_Fun), BRL, SUB_W, PBRR (Printf), BRL. After: ADD_W, New_Op [d] [a b c] s_time(1) s_opcode(MNEWOP), SUB_W, PBRR (Printf), BRL.]

4.2 Modeling MIMOs

Since a function cannot return more than 1 value by definition, we take a slightly different approach here. Instead of a prototype function returning a value we consider a void function. We reserve some registers (through the C code itself, by giving some compiler directives) and the function returns values in those registers. This is illustrated by the following example:
    #include <stdio.h>

    int main(void)
    {
        int a, b, c, d;
        a = 4; b = 5;
        // The following computations
        // are to be done via special FU
        c = a + b;
        d = a - b;
        printf("%d %d\n", c, d);
    }
Let us define a new functional unit which takes in 2 inputs a and b and returns outputs (a+b) and (a-b). Hence it is a MIMO. The application consists of a void prototype function declaration.
    #include <stdio.h>

    void mimo_fun(int a, int b);

    int main(void)
    {
        int a, b, ret1, ret2;
        // Some code and directives
        // to reserve registers
        ........
        mimo_fun(a, b);
        // Values will be returned in ret1 & ret2
        printf("%d %d\n", ret1, ret2);
    }
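In the framework the simulator is told how each new operation computes its destinations from its sources. The stand-alone C sketch below mirrors that idea for the two running examples. It is not Trimaran simulator code, and the helper names `miso_semantics` and `mimo_semantics` are hypothetical; the pointer arguments stand in for the registers reserved for the results.

```c
#include <assert.h>
#include <stddef.h>

/* MISO of Section 4.1: three sources, one destination. */
int miso_semantics(int a, int b, int c)
{
    return (a + b) * (b + c) * (a + c);
}

/* MIMO of Section 4.2: two sources, two destinations.
 * ret1 and ret2 play the role of the reserved destination registers. */
void mimo_semantics(int a, int b, int *ret1, int *ret2)
{
    assert(ret1 != NULL && ret2 != NULL);
    *ret1 = a + b;
    *ret2 = a - b;
}
```

With the values used in the examples, a = 3, b = 4, c = 5 gives 504 for the MISO, and a = 4, b = 5 gives 9 and -1 for the MIMO.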
As in the previous case the prototype function call appears as a combination of PBRR and BRL in the Elcor IR. But now, in addition to replacing the above combination by a new Elcor Operation, we set the destinations of the operation as the registers reserved for this purpose (that is, the registers corresponding to variables ret1 and ret2 in the above example). The operation is also defined in MDES and its semantics are defined in the simulator.
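The rewriting step shared by Sections 4.1 and 4.2 can be sketched as a pass over a linear list of operations. This is a simplified illustration in plain C and not Elcor code (Elcor's IR is a set of C++ objects). The `Op` record, the `OP_NEW` tag and the function `rewrite_calls` are hypothetical, and the sketch assumes that the callee name is recorded on the PBRR operation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { OP_ADD_W, OP_SUB_W, OP_PBRR, OP_BRL, OP_NEW } OpKind;

typedef struct {
    OpKind kind;
    const char *callee; /* function name carried by a PBRR, otherwise NULL */
} Op;

/* Replace every PBRR(fun_name) + BRL pair by a single OP_NEW node, in place.
 * Other PBRR/BRL pairs (e.g. a call to printf) are left untouched.
 * Returns the new number of operations. */
size_t rewrite_calls(Op *ops, size_t n, const char *fun_name)
{
    assert(ops != NULL && fun_name != NULL);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        int is_target = ops[i].kind == OP_PBRR && ops[i].callee != NULL &&
                        strcmp(ops[i].callee, fun_name) == 0 &&
                        i + 1 < n && ops[i + 1].kind == OP_BRL;
        if (is_target) {
            ops[out].kind = OP_NEW;     /* the new coarse grain operation */
            ops[out].callee = fun_name; /* keeps the implicit FU binding visible */
            out++;
            i++;                        /* skip the BRL of the matched pair */
        } else {
            ops[out++] = ops[i];        /* copy every other operation unchanged */
        }
    }
    return out;
}
```

Applied to the sequence of Figure 4 (ADD_W, PBRR for miso_fun, BRL, SUB_W, PBRR for printf, BRL), the pass yields five operations with the new node in second position. Setting the destinations to the reserved registers, as required for MIMOs, would be an extra field on the new node and is left out here.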
4.3 Modeling MIMOs with load/store
To handle MIMOs with capability of interaction with memory we make modifications only in the MDES. Basically, in the reservation table corresponding to the operation we also reserve memory units in each time unit where interaction with the memory is required. We have multiple memory units in the system, so one of the units is reserved for performing this operation while others can handle normal load/store operations. The architecture assumes each LD/ST unit has ports to memory so they can be active simultaneously. While making the function call we also pass the addresses of the memory locations from which data is required. In the first few cycles of the operation memory resident data is accessed with the help of the LD/ST unit and stored in local buffers. Then the computation is performed and finally data is written into the memory, if required, again with the help of the LD/ST unit. The semantics are handled in the simulator in a similar way as for basic MIMOs.
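The three phases described above (load into local buffers, compute, write back) can be mimicked in plain C. The sketch is an illustration under assumptions rather than simulator code: the function name `mimo_ldst_semantics`, the buffer length and the choice of a dot product as the computation are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_LEN 4 /* hypothetical size of the FU's local buffers */

/* Coarse grain operation with memory access. The addresses are passed as
 * operands, just as the function call passes them in the application. */
void mimo_ldst_semantics(const int *src_x, const int *src_y, int *dst)
{
    assert(src_x != NULL && src_y != NULL && dst != NULL);
    int bx[BUF_LEN], by[BUF_LEN];

    /* Phase 1: memory resident data is read through the LD/ST unit. */
    for (size_t i = 0; i < BUF_LEN; i++) {
        bx[i] = src_x[i];
        by[i] = src_y[i];
    }

    /* Phase 2: the computation works only on the local buffers. */
    int acc = 0;
    for (size_t i = 0; i < BUF_LEN; i++)
        acc += bx[i] * by[i];

    /* Phase 3: the result is written back to memory. */
    *dst = acc;
}
```

In a reservation table, phases 1 and 3 would correspond to the cycles in which one memory unit is held by the operation, while phase 2 occupies only the special FU.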
4.4 Imposing Register Port Constraints

The Trimaran framework has no notion of regfile ports. It primarily has 1 regfile of each type (GPR, control regfile, floating point regfile, branch target and predicate regfile). The scheduler, for example, can schedule any number of integer operations in parallel depending on the availability of resources. This implies each regfile has an infinite number of ports in theory. So we impose these port constraints in the architectural framework, because all the special FUs like MIMOs will have a large number of sources and many destinations. To incorporate these constraints we build a Time x Regport table for read as well as write ports, in which at each instant of time, corresponding to each regfile, the utilization of its read/write ports is maintained. Before scheduling any operation the table is checked for availability of read/write ports along with the availability of resources. In the rigid I/O time shape model, if a particular FU has more sources than the number of read ports or more destinations than the number of write ports, then the ports are reserved in the very next time instant, i.e., the cycles in which I/O occurs are fixed.

In the flexible I/O time shape model an FU is divided into various stages and the number of sources and destinations in each stage lies within the maximum number of read/write ports. A flow dependency edge is added between each stage to ensure each stage is scheduled after its predecessors are scheduled, depending on the availability of resources and ports. But it is flexible in the sense that stage i can be scheduled any time after stage i-1 has been scheduled. This is shown in Figure 5.

Figure 5: Flexible I/O Timeshape [An AFU split into the Elcor operations AFU_S0 and AFU_S1, linked by a flow dependency edge.]

5. CASE STUDIES

5.1 Fast Fourier Transform (FFT)

To illustrate the concept of MISOs and MIMOs and to evaluate the performance gain when special functional units are present in the system we consider a standard N-point FFT application. FFT forms the heart of many image transformation packages, and is thus an interesting application to consider for speedup. The heart of FFT is the repetitive computation of the butterfly operation. The butterfly operation is shown in Figure 6. In this application we replace the butterfly operation, which has 6 sources (as each of the 3 sources are complex numbers) and 4 destinations, with a Multiple Input Multiple Output FU. The model conforms to the rigid I/O time shape model with each regfile having 4 read ports and 2 write ports. The latency of the butterfly operation is set to 8.

Figure 6: Butterfly Operation [Inputs a, b and w; outputs a + bw and a - bw, where a = ar + i(ac), b = br + i(bc), w = wr + i(wc) and i = sqrt(-1).]

We implement the DFG shown in Figure 7 in the 6-4 MIMO corresponding to the butterfly operation. The highlighted portion represents one of the many critical paths. The length of the critical path is 8, assuming 4 reads and 2 writes are permitted in each cycle. The shaded portions represent the read and write operations. The latency of the multiplication operation in the base architecture is 3, whilst that of arithmetic operations is 1. The results of introduction of the new functional unit are shown in Table 2.

Figure 7: Data flow graph of butterfly operation [Reads of br, bc, wr, wc, ac and ar, multiplications, additions and subtractions, and writes of o1 to o4, laid out over cycles 1 to 8.]

Table 2: Compute cycles for varying n in FFT

n-point FFT                   2     4     8      16
Without Special FU (cycles)   236   549   1209   [illegible]
With Special FU (cycles)      209   341   583    1081

Table 2 shows that as the value of n increases the number of butterfly operations also increases and therefore the performance gain also increases. For n=16 the speed improvement is almost 2.5×.
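As a sketch of what the 6-4 MIMO computes, the butterfly can be written with the complex operands split into real and imaginary parts, which gives six scalar sources and four scalar destinations. The function name `butterfly` and the output parameter names are hypothetical, integer arithmetic is an assumption made for brevity, and the mapping of the results onto the outputs o1 to o4 of Figure 7 is not taken from the source.

```c
#include <assert.h>
#include <stddef.h>

/* Butterfly: plus = a + b*w and minus = a - b*w, with
 * a = ar + i*ac, b = br + i*bc and w = wr + i*wc. */
void butterfly(int ar, int ac, int br, int bc, int wr, int wc,
               int *plus_r, int *plus_c, int *minus_r, int *minus_c)
{
    assert(plus_r != NULL && plus_c != NULL);
    assert(minus_r != NULL && minus_c != NULL);

    /* Four multiplications, then one subtract and one add, form b*w. */
    int bw_r = br * wr - bc * wc;
    int bw_c = br * wc + bc * wr;

    /* A final add or subtract per component yields the two results. */
    *plus_r = ar + bw_r;
    *plus_c = ac + bw_c;
    *minus_r = ar - bw_r;
    *minus_c = ac - bw_c;
}
```

One reading that is consistent with the stated latency of 8: the first reads take 1 cycle, a multiplication takes 3 cycles, the two dependent add/subtract steps take 1 cycle each, and four results over 2 write ports need 2 cycles, i.e. 1 + 3 + 1 + 1 + 2 = 8.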
We have also implemented the flexible I/O time shape model corresponding to the FFT application. The results obtained are similar to those in the rigid case.
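The Time x Regport table of Section 4.4 can be sketched as a per-cycle counter of ports in use. The code below is a minimal illustration for a single register file and for read ports only. It is not the modified Elcor scheduler, and the names `PortTable`, `ports_free` and `reserve_rigid`, as well as the fixed horizon, are hypothetical. It follows the rigid model: sources that do not fit in one cycle spill into the very next cycle.

```c
#include <assert.h>
#include <stdbool.h>

#define HORIZON 32 /* hypothetical scheduling window in cycles */

typedef struct {
    int max_ports;     /* read ports of this register file */
    int used[HORIZON]; /* ports already reserved in each cycle */
} PortTable;

/* Rigid model: can all n_src reads start at `cycle` and continue back to back? */
bool ports_free(const PortTable *t, int cycle, int n_src)
{
    assert(t->max_ports > 0 && cycle >= 0 && n_src >= 0);
    int left = n_src;
    for (int c = cycle; left > 0; c++) {
        if (c >= HORIZON)
            return false;
        int want = left < t->max_ports ? left : t->max_ports;
        if (t->used[c] + want > t->max_ports)
            return false;
        left -= want;
    }
    return true;
}

/* Reserve the ports if possible; returns false and changes nothing otherwise. */
bool reserve_rigid(PortTable *t, int cycle, int n_src)
{
    if (!ports_free(t, cycle, n_src))
        return false;
    int left = n_src;
    for (int c = cycle; left > 0; c++) {
        int want = left < t->max_ports ? left : t->max_ports;
        t->used[c] += want;
        left -= want;
    }
    return true;
}
```

With 4 read ports, the 6 sources of the butterfly occupy all ports in the issue cycle and 2 ports in the next one. A flexible model would instead split the FU into stages and run such a check once per stage, at any cycle after the previous stage.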
5.2 Kalman Filter
The Kalman filter [8] is a set of mathematical equations that provides an efficient computational (recursive) solution of the least-squares method. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown. The Kalman filter basically consists of two main functions: predict state, which predicts the state of the system, and kalman update, which updates the system. We built special FUs to perform various frequently occurring operations in these functions. Many of these operations involve manipulation of various arrays, which involved handling memory within the AFU. In all, we introduced 5 MISOs with load/store to handle the various operations. The description of each AFU is shown in Table 3. The semantics are specified in the form of destination as a function of sources, where si represents the ith source of the AFU and di represents the ith destination of the AFU. The latency of each FU conformed to the amount of computation involved and the model followed was the rigid I/O time shape model. The number of read ports in each regfile was 3 and the number of write ports was 2. The results obtained are shown in Table 4.

Table 3: Kalman Filter AFUs

AFU No.   No. of Inputs   No. of Outputs   Semantics
1         5               1                d1 = (s1 + (s2 * (s3 + s4 + s5 * s2)))
2         3               1                d1 = (s1 + (s2 * s3))
3         5               1                d1 = (-s1 * s2 + s3 * s4) / s5
4         5               1                d1 = (s1 + s2 * s3 + s4 * s5)
5         5               1                d1 = (s1 - s2 * s3 - s4 * s5)

Table 4: Kalman Filter Results

Function        Without Special FU (cycles)   With Special FU (cycles)
Predict State   699                           498
Kalman Update   774                           342

We have used 3 special FUs in the kalman update function and 2 special FUs in the predict state function. The performance is better in the case of the former because there were more operations that could be mapped to these special FUs than in the latter case. As can be observed from Table 4, the number of cycles has come down to less than half in the kalman update function, which implies a fairly large performance gain.

6. CONCLUSION AND FUTURE WORK

We have presented a framework to quickly evaluate the performance gain obtained when special application specific functional units are introduced in the architecture. The framework can be used for extensive design space exploration and the user can experiment by mapping various compute intensive parts of the application to special FUs in hardware and comparing the relative performance estimates under accurate implementation constraints. This can ease the synthesis of ASIPs corresponding to these sets of applications. Research is going on in the area of automatic topology based identification of instruction set extensions for embedded processors [9]. Currently the potential candidates are identified manually. A possible future work can be to create an automatic identification-evaluation framework, which automatically identifies the potential candidates based on some speedup factors associated with each instruction and then evaluates them using this extended Trimaran framework for possible gains. The modeling does not take into account the relative cost of the special FU. A possible extension can be the introduction of a cost model which evaluates the area corresponding to each special FU so that one can even evaluate the performance-area tradeoff. Besides, the FUs are of limited complexity; one can extend the framework to introduce FUs which are capable of handling conditionals and loops also. Currently, the changes required in the Trimaran framework for the introduction of any particular FU are manual. We are making efforts to automate it to the extent possible.

7. REFERENCES
[1] Shail Aditya, B. R. Rau and V. Kathail. Automatic architectural synthesis of VLIW and EPIC processors. In Proceedings of 12th ISSS, November 1999.
[2] Shail Aditya, Vinod Kathail, and B. Ramakrishna Rau. Elcor's Machine Description System: Version 3.0. Technical Report HPL-1998-128, Hewlett-Packard Laboratories, October 1998.
[3] Margarida F. Jacome et al. Clustered VLIW architectures with predicated switching. In DAC, pages 696-701, 2001.
[4] Paolo Ienne, Laura Pozzi, M. Vuletic. On the Limits of Processor Specialisation by Mapping Dataflow Sections on Ad-hoc Functional Units. CS Technical Report 01/376, LAP, EPFL, Lausanne, December 2001.
[5] The Trimaran Compiler Infrastructure, http://www.trimaran.org.
[6] Cesare Alippi et al. A DAG based design approach for reconfigurable VLIW processors. In Proceedings of the DATE, pages 778-79, March 1999.
[7] N. G. Busa et al. Scheduling coarse grain operations for VLIW processors. In Proceedings of the 13th ISSS, pages 47-53, Madrid, September 2000.
[8] Greg Welch and Gary Bishop. An Introduction to the Kalman Filter. Technical Report, Department of Comp. Sc. and Engg., Univ. of North Carolina at Chapel Hill, March 2002.
[9] Laura Pozzi, Miljan Vuletić, and Paolo Ienne. Automatic Topology-Based Identification of Instruction-Set Extensions for Embedded Processors. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, Paris, March 2002.
[10] J. Gyllenhaal, B. Rau, and W. Hwu. HMDES version 2.0 specification, IMPACT, University of Illinois, Urbana, IL, Tech. Rep. IMPACT-96-03, 1996.