Pandas is the best-known data analysis tool in Python, and almost everyone working with datasets uses it. But as data size grows, some ways of performing an operation take far longer than others, so it is important to know and use the faster methods, especially on large datasets. This article covers several tips for speeding up Pandas when working with big data. I hope it helps you.

Data Generation

To make the demonstration easier, we generate some data. faker is a Python package for producing fake data, and we use it directly here:

```python
import random
from faker import Faker

fake = Faker()

car_brands = ["Audi", "Bmw", "Jaguar", "Fiat", "Mercedes", "Nissan", "Porsche", "Toyota", None]
tv_brands = ["Beko", "Lg", "Panasonic", "Samsung", "Sony"]

def generate_record():
    """Generates a fake row."""
    cid = fake.bothify(text="CID-###")
    name = fake.name()
    age = fake.random_number(digits=2)
    city = fake.city()
    plate = fake.license_plate()
    job = fake.job()
    company = fake.company()
    employed = fake.boolean(chance_of_getting_true=75)
    social_security = fake.boolean(chance_of_getting_true=90)
    healthcare = fake.boolean(chance_of_getting_true=95)
    iban = fake.iban()
    salary = fake.random_int(min=0, max=99999)
    car = random.choice(car_brands)
    tv = random.choice(tv_brands)
    record = [cid, name, age, city, plate, job, company, employed,
              social_security, healthcare, iban, salary, car, tv]
    return record

record = generate_record()
print(record)
# ['CID-753', 'Kristy Terry', 5877566, 'North Jessicaborough', '988 XEE',
#  'Engineer, control and instrumentation', 'Braun, Robinson and Shaw',
#  True, True, True, 'GB57VOOS96765461230455', 27109, 'Bmw', 'Beko']
```

We create a DataFrame with 1 million rows:

```python
import os
import pandas as pd
from multiprocessing import Pool

N = 1_000_000

if __name__ == "__main__":
    cpus = os.cpu_count()
    pool = Pool(cpus - 1)
    async_results = []
    for _ in range(N):
        async_results.append(pool.apply_async(generate_record))
    pool.close()
    pool.join()

    data = []
    for i, async_result in enumerate(async_results):
        data.append(async_result.get())

    df = pd.DataFrame(data=data, columns=["CID", "Name", "Age", "City", "Plate",
                                          "Job", "Company", "Employed", "Social Security",
                                          "Healthcare", "Iban", "Salary", "Car", "Tv"])
```

Disk IO

Pandas can save a DataFrame in several different formats. Let's compare their speeds.

Write:

```python
%%timeit
df.to_csv("df.csv")
# 3.77 s ± 339 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_pickle("df.pickle")
# 948 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_parquet("df")
# 2.77 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.to_feather("df.feather")
# 368 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

import datatable as dt

def write_table(df):
    dtf = dt.Frame(df)
    dtf.to_csv("df.csv")

%%timeit
write_table(df)
# 559 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Read:

```python
%%timeit
df = pd.read_csv("df.csv")
# 1.89 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df = pd.read_pickle("df.pickle")
# 402 ms ± 6.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df = pd.read_parquet("df")
# 480 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df = pd.read_feather("df.feather")
# 754 ms ± 8.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def read_table():
    dtf = dt.fread("df.csv")
    df = dtf.to_pandas()
    return df

%%timeit
df = read_table()
# 869 ms ± 29.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

CSV is the slowest format in this comparison. I did not include the Excel format (read_excel) because it is even slower and requires installing an extra package.

If you have to work with CSV, it is better to convert the pandas DataFrame to a datatable object first and perform the reads and writes on that object, which gives noticeably faster results. But if the data format is under your control, pickle is the better choice.

Data Types

On large datasets we can optimize memory usage by casting columns to tighter data types. For example, by checking the minimum and maximum of a numeric feature, we can downcast its data type from int64 to int8, which uses 8 times less memory.

```python
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1000000 entries, 0 to 999999
# Data columns (total 14 columns):
#  #   Column           Non-Null Count    Dtype
# ---  ------           --------------    -----
#  0   CID              1000000 non-null  object
#  1   Name             1000000 non-null  object
#  2   Age              1000000 non-null  int64
#  3   City             1000000 non-null  object
#  4   Plate            1000000 non-null  object
#  5   Job              1000000 non-null  object
#  6   Company          1000000 non-null  object
#  7   Employed         1000000 non-null  bool
#  8   Social Security  1000000 non-null  bool
#  9   Healthcare       1000000 non-null  bool
#  10  Iban             1000000 non-null  object
#  11  Salary           1000000 non-null  int64
#  12  Car              888554 non-null   object
#  13  Tv               1000000 non-null  object
# dtypes: bool(3), int64(2), object(9)
# memory usage: 86.8+ MB
```
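For the integer columns, pandas can also pick the smallest safe type automatically. Here is a minimal sketch (my addition, not from the original article) using pd.to_numeric with its downcast argument; below it, we do the same conversions by hand to see the effect column by column.

```python
import pandas as pd

def downcast_int_columns(df):
    # For each integer column, let pandas choose the smallest signed
    # integer dtype (int8/int16/int32/...) that still fits the values.
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
    return df

# On this dataset, Age (0-99) fits int8 and Salary (0-99999) fits int32.
df = downcast_int_columns(df)
```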
We convert each feature according to its value range. The Age feature, for example, lies between 0 and 99, so its data type can be converted to int8.

```python
# int
df["Age"].memory_usage(index=False, deep=False)
# 8000000

# convert
df["Age"] = df["Age"].astype("int8")
df["Age"].memory_usage(index=False, deep=False)
# 1000000

# float
df["Salary_After_Tax"] = df["Salary"] * 0.6
df["Salary_After_Tax"].memory_usage(index=False, deep=False)
# 8000000
df["Salary_After_Tax"] = df["Salary_After_Tax"].astype("float16")
df["Salary_After_Tax"].memory_usage(index=False, deep=False)
# 2000000

# categorical
df["Car"].memory_usage(index=False, deep=False)
# 8000000
df["Car"] = df["Car"].astype("category")
df["Car"].memory_usage(index=False, deep=False)
# 1000364
```

Or we can specify the data types directly while reading the file:

```python
dtypes = {
    "CID": "int32",
    "Name": "object",
    "Age": "int8",
    ...
}
dates = ["date_columns_here"]

df = pd.read_csv("df.csv", dtype=dtypes, parse_dates=dates)
```

Query Filtering

The regular filtering method:

```python
%%timeit
df_filtered = df[df["Car"] == "Mercedes"]
# 61.8 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

For categorical features, we can use pandas' groupby and get_group methods:

```python
%%timeit
df.groupby("Car").get_group("Mercedes")
# 92.1 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

df_grouped = df.groupby("Car")

%%timeit
df_grouped.get_group("Mercedes")
# 14.8 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Building the groups takes longer than the regular filter. But if the grouping is done once up front, as in this example, the get_group call by itself is actually faster than the regular method. In other words, for repeated filtering operations on a categorical feature, get_group is the method to prefer.

Counting

The value_counts method is faster than groupby followed by size:

```python
%%timeit
df["Car"].value_counts()
# 49.1 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Toyota      111601
# Porsche     111504
# Jaguar      111313
# Fiat        111239
# Nissan      110960
# Bmw         110906
# Audi        110642
# Mercedes    110389
# Name: Car, dtype: int64

%%timeit
df.groupby("Car").size()
# 64.5 ms ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Car
# Audi        110642
# Bmw         110906
# Fiat        111239
# Jaguar      111313
# Mercedes    110389
# Nissan      110960
# Porsche     111504
# Toyota      111601
# dtype: int64
```

Iteration

Iterating over a large dataset takes a long time, so it is worth choosing the fastest method. We can use pandas' iterrows and itertuples methods; let's compare them with a regular for loop:

```python
def foo_loop(df):
    total = 0
    for i in range(len(df)):
        total += df.iloc[i]["Salary"]
    return total

%%timeit
foo_loop(df)
# 34.6 s ± 593 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def foo_iterrows(df):
    total = 0
    for index, row in df.iterrows():
        total += row["Salary"]
    return total

%%timeit
foo_iterrows(df)
# 22.7 s ± 761 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def foo_itertuples(df):
    total = 0
    for row in df.itertuples():
        total += row[12]
    return total

%%timeit
foo_itertuples(df)
# 1.22 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

iterrows is faster than the plain for loop, but itertuples is the fastest by far.

The apply method lets us run any function over a Series of the DataFrame:

```python
def foo(val):
    if val > 50000:
        return "High"
    elif val <= 50000 and val > 10000:
        return "Mid Level"
    else:
        return "Low"

df["Salary_Category"] = df["Salary"].apply(foo)
print(df["Salary_Category"])
# 0              High
# 1              High
# 2         Mid Level
# 3              High
# 4               Low
#             ...
# 999995         High
# 999996          Low
# 999997         High
# 999998         High
# 999999    Mid Level
# Name: Salary_Category, Length: 1000000, dtype: object

%%timeit
df["Salary_Category"] = df["Salary"].apply(foo)
# 112 ms ± 50.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def boo():
    liste = []
    for i in range(len(df)):
        val = foo(df.loc[i, "Salary"])
        liste.append(val)
    df["Salary_Category"] = liste

%%timeit
boo()
# 5.73 s ± 130 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

The map method replaces each value of a Series according to a given mapping or function:

```python
print(df["Salary_Category"].map({"High": "H", "Mid Level": "M", "Low": "L"}))
# 0         H
# 1         H
# 2         M
# 3         H
# 4         L
#          ..
# 999995    H
# 999996    L
# 999997    H
# 999998    H
# 999999    M
# Name: Salary_Category, Length: 1000000, dtype: object

print(df["Salary_Category"].map("Salary Category is {}".format))
# 0              Salary Category is High
# 1              Salary Category is High
# 2         Salary Category is Mid Level
# 3              Salary Category is High
# 4               Salary Category is Low
#                       ...
# 999995         Salary Category is High
# 999996          Salary Category is Low
# 999997         Salary Category is High
# 999998         Salary Category is High
# 999999    Salary Category is Mid Level
# Name: Salary_Category, Length: 1000000, dtype: object

df["Salary_Category"] = df["Salary"].map(foo)
print(df["Salary_Category"])
# 0              High
# 1              High
# 2         Mid Level
# 3              High
# 4               Low
#             ...
# 999995         High
# 999996          Low
# 999997         High
# 999998         High
# 999999    Mid Level
# Name: Salary_Category, Length: 1000000, dtype: object
```
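Since foo only branches on simple thresholds, the same categorization can also be done without any per-row Python calls. A minimal vectorized sketch (my addition, not from the original article) using np.select:

```python
import numpy as np

# Conditions are evaluated on the whole column at once; the first
# matching condition wins, mirroring the branches of foo().
conditions = [df["Salary"] > 50000, df["Salary"] > 10000]
choices = ["High", "Mid Level"]
df["Salary_Category"] = np.select(conditions, choices, default="Low")
```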
Let's compare how long each iteration method takes to normalize the Salary column:

```python
import numpy as np

min_salary = df["Salary"].min()
max_salary = df["Salary"].max()

def normalize_for_loc(df, min_salary, max_salary):
    normalized_salary = np.zeros(len(df))
    for i in range(df.shape[0]):
        normalized_salary[i] = (df.loc[i, "Salary"] - min_salary) / (max_salary - min_salary)
    df["Normalized_Salary"] = normalized_salary
    return df

%%timeit
normalize_for_loc(df, min_salary, max_salary)
# 5.45 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def normalize_for_iloc(df, min_salary, max_salary):
    normalized_salary = np.zeros(len(df))
    for i in range(df.shape[0]):
        normalized_salary[i] = (df.iloc[i, 11] - min_salary) / (max_salary - min_salary)
    df["Normalized_Salary"] = normalized_salary
    return df

%%timeit
normalize_for_iloc(df, min_salary, max_salary)
# 13.8 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def normalize_for_iloc(df, min_salary, max_salary):
    normalized_salary = np.zeros(len(df))
    for i in range(df.shape[0]):
        normalized_salary[i] = (df.iloc[i]["Salary"] - min_salary) / (max_salary - min_salary)
    df["Normalized_Salary"] = normalized_salary
    return df

%%timeit
normalize_for_iloc(df, min_salary, max_salary)
# 34.8 s ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def normalize_for_iterrows(df, min_salary, max_salary):
    normalized_salary = np.zeros(len(df))
    i = 0
    for index, row in df.iterrows():
        normalized_salary[i] = (row["Salary"] - min_salary) / (max_salary - min_salary)
        i += 1
    df["Normalized_Salary"] = normalized_salary
    return df

%%timeit
normalize_for_iterrows(df, min_salary, max_salary)
# 21.7 s ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def normalize_for_itertuples(df, min_salary, max_salary):
    normalized_salary = list()
    for row in df.itertuples():
        normalized_salary.append((row[12] - min_salary) / (max_salary - min_salary))
    df["Normalized_Salary"] = normalized_salary
    return df

%%timeit
normalize_for_itertuples(df, min_salary, max_salary)
# 1.34 s ± 4.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def normalize_map(df, min_salary, max_salary):
    df["Normalized_Salary"] = df["Salary"].map(lambda x: (x - min_salary) / (max_salary - min_salary))
    return df

%%timeit
normalize_map(df, min_salary, max_salary)
# 178 ms ± 970 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def normalize_apply(df, min_salary, max_salary):
    df["Normalized_Salary"] = df["Salary"].apply(lambda x: (x - min_salary) / (max_salary - min_salary))
    return df

%%timeit
normalize_apply(df, min_salary, max_salary)
# 182 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

def normalize_vectorization(df, min_salary, max_salary):
    df["Normalized_Salary"] = (df["Salary"] - min_salary) / (max_salary - min_salary)
    return df

%%timeit
normalize_vectorization(df, min_salary, max_salary)
# 1.58 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

We can see that:

- loc is faster than iloc. If you have to use iloc, prefer the df.iloc[i, 11] form over df.iloc[i]["Salary"].
- itertuples beats loc, while iterrows is roughly on par with iloc.
- map and apply are the second-fastest options.
- Vectorized operations are the fastest of all.

Vectorization

To vectorize an arbitrary function, we define it and wrap it with np.vectorize, which takes a nested sequence of objects or numpy arrays as input and returns a single numpy array or a tuple of numpy arrays:

```python
def foo(val, min_salary, max_salary):
    return (val - min_salary) / (max_salary - min_salary)

foo_vectorized = np.vectorize(foo)

%%timeit
df["Normalized_Salary"] = foo_vectorized(df["Salary"], min_salary, max_salary)
# 154 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# conditional
%%timeit
df["Old"] = (df["Age"] > 80)
# 140 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# isin
%%timeit
df["Old"] = df["Age"].isin(range(80, 100))
# 17.4 ms ± 466 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# bins with digitize
%%timeit
df["Age_Bins"] = np.digitize(df["Age"].values, bins=[0, 18, 36, 54, 72, 100])
# 12 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

print(df["Age_Bins"])
# 0         3
# 1         5
# 2         4
# 3         3
# 4         5
#          ..
# 999995    4
# 999996    2
# 999997    3
# 999998    1
# 999999    1
# Name: Age_Bins, Length: 1000000, dtype: int64
```

Indexing

For accessing a single value, the .at method is faster than .loc:

```python
%%timeit
df.loc[987987, "Name"]
# 5.05 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df.at[987987, "Name"]
# 2.39 µs ± 23.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
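.at works with labels; its positional counterpart is .iat, which should likewise beat .iloc for single-value lookups (my addition, not timed in the original article):

```python
# .iat takes integer positions instead of labels; column 1 is "Name"
# in this DataFrame, so this reads the same cell as df.at[987987, "Name"].
df.iat[987987, 1]
```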
Swifter

Swifter is a Python package that applies any function to a DataFrame more efficiently than the regular apply method:

```python
!pip install swifter
import swifter

# apply
%%timeit
df["Normalized_Salary"] = df["Salary"].apply(lambda x: (x - min_salary) / (max_salary - min_salary))
# 192 ms ± 9.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# swifter.apply
%%timeit
df["Normalized_Salary"] = df["Salary"].swifter.apply(lambda x: (x - min_salary) / (max_salary - min_salary))
# 83.5 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Summary

If vectorization is possible, it should be the first choice for any operation. For operations that truly have to iterate, prefer methods such as itertuples, apply, or map. There are also separate Python packages such as dask, vaex, and koalas, which are built on top of pandas or offer similar functionality and are worth trying as well.

Author: Okan Yenigün