1. Build a DataFrame
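
As a minimal sketch of this step, the block below builds a small DataFrame named C with columns a and b, the names used later in this post. The values are made up for illustration: rows 0 and 2 are identical in both columns, while rows 1 and 4 match only in column a.

```python
import pandas as pd

# Hypothetical sample data (not from the original post).
# Row 2 repeats row 0 in both columns; row 4 repeats row 1 only in column "a".
C = pd.DataFrame({
    "a": [1, 2, 1, 3, 2],
    "b": ["x", "y", "x", "y", "z"],
})

print(C)
```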



2. Check whether there are duplicate rows

Use the duplicated() method. It returns a boolean Series in which True marks a row that repeats an earlier row.
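
A short sketch of the check, assuming a made-up DataFrame C with columns a and b. With the default settings, the first occurrence of a row is marked False and later repeats are marked True.

```python
import pandas as pd

# Hypothetical sample data: row 2 is an exact copy of row 0.
C = pd.DataFrame({
    "a": [1, 2, 1, 3, 2],
    "b": ["x", "y", "x", "y", "z"],
})

# Boolean Series: True where the whole row repeats an earlier row.
dup_mask = C.duplicated()
print(dup_mask.tolist())  # [False, False, True, False, False]

# A single yes/no answer, and the number of repeated rows.
has_duplicates = bool(dup_mask.any())
n_duplicates = int(dup_mask.sum())
print(has_duplicates, n_duplicates)  # True 1
```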



3. If there are duplicate rows, use drop_duplicates() to remove them
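
A minimal sketch of the removal, again on a made-up DataFrame C. Note that drop_duplicates() returns a new DataFrame and leaves C unchanged, and that the surviving rows keep their original index labels; reset_index(drop=True) is one way to renumber them.

```python
import pandas as pd

# Hypothetical sample data: row 2 is an exact copy of row 0.
C = pd.DataFrame({
    "a": [1, 2, 1, 3, 2],
    "b": ["x", "y", "x", "y", "z"],
})

# Drop rows that repeat an earlier row in every column.
deduped = C.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 3, 4]

# Optional: renumber the index from 0 after dropping rows.
deduped_renumbered = deduped.reset_index(drop=True)
print(deduped_renumbered.index.tolist())  # [0, 1, 2, 3]
```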



4. By default, duplicated() and drop_duplicates() compare all columns (in the example above, a row counts as a duplicate only if both a and b are repeated).

We can also check for duplicates in specific columns only.

 C.duplicated(['a'])      C.drop_duplicates(['a'])

 C.duplicated(['b'])      C.drop_duplicates(['b'])
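
The sketch below shows how restricting the check to one column changes the result. The data is made up so that the full-row check, the a-only check, and the b-only check each flag different rows.

```python
import pandas as pd

# Hypothetical sample data (not from the original post).
C = pd.DataFrame({
    "a": [1, 2, 1, 3, 2],
    "b": ["x", "y", "x", "y", "z"],
})

# All columns: only row 2 repeats an earlier row entirely.
all_cols = C.duplicated().tolist()
print(all_cols)  # [False, False, True, False, False]

# Column "a" only: the values 1 and 2 each appear a second time.
by_a = C.duplicated(["a"]).tolist()
print(by_a)  # [False, False, True, False, True]

# Column "b" only: the values "x" and "y" each appear a second time.
by_b = C.duplicated(["b"]).tolist()
print(by_b)  # [False, False, True, True, False]

# Dropping by a single column keeps the first row for each distinct value.
kept_by_a = C.drop_duplicates(["a"]).index.tolist()
kept_by_b = C.drop_duplicates(["b"]).index.tolist()
print(kept_by_a, kept_by_b)  # [0, 1, 3] [0, 1, 4]
```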



5.  norepeat_df = df.drop_duplicates(subset=['A_ID', 'B_ID'], keep='first')

# Remove rows whose values in the A_ID and B_ID columns (the columns listed in subset) are duplicated, keeping the first occurrence of each

Note:
When keep=False, all duplicated rows are removed (no occurrence is kept).
When keep='first', the first occurrence of each duplicated row is kept.
When keep='last', the last occurrence of each duplicated row is kept.
(Be careful: 'first' and 'last' are strings and must be written in quotation marks, while False is the boolean value and is not quoted.)
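
To make the three keep settings concrete, here is a small sketch. The column names A_ID and B_ID follow the call in step 5; the extra value column and all of the data are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share the same (A_ID, B_ID) pair.
df = pd.DataFrame({
    "A_ID": [1, 1, 2, 2, 3],
    "B_ID": [10, 10, 20, 21, 30],
    "value": [100, 101, 200, 201, 300],
})

cols = ["A_ID", "B_ID"]

# keep='first': row 0 survives, row 1 is dropped.
first_df = df.drop_duplicates(subset=cols, keep="first")
print(first_df.index.tolist())  # [0, 2, 3, 4]

# keep='last': row 1 survives, row 0 is dropped.
last_df = df.drop_duplicates(subset=cols, keep="last")
print(last_df.index.tolist())  # [1, 2, 3, 4]

# keep=False: every row that has a duplicate is dropped.
none_df = df.drop_duplicates(subset=cols, keep=False)
print(none_df.index.tolist())  # [2, 3, 4]
```

Rows 2 and 3 are never dropped because they share A_ID but differ in B_ID, so the pair as a whole is not repeated.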




