programming tip

How to read a 6 GB CSV file with pandas

itbloger 2020. 6. 7. 10:40



I am trying to read a large CSV file (about 6 GB) in pandas and I get the following memory error:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    450                     infer_datetime_format=infer_datetime_format)
    451 
--> 452         return _read(filepath_or_buffer, kwds)
    453 
    454     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    242         return parser
    243 
--> 244     return parser.read()
    245 
    246 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    693                 raise ValueError('skip_footer not supported for iteration')
    694 
--> 695         ret = self._engine.read(nrows)
    696 
    697         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1137 
   1138         try:
-> 1139             data = self._reader.read(nrows)
   1140         except StopIteration:
   1141             if nrows is None:

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader.read (pandas\parser.c:7145)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7369)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_rows (pandas\parser.c:8194)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_column_data (pandas\parser.c:9402)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_tokens (pandas\parser.c:10057)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10361)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser._try_int64 (pandas\parser.c:17806)()

MemoryError: 

Any help on this?


The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at once, one way to avoid the problem is to process the CSV in chunks by specifying the chunksize parameter:

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may, of course, contain fewer than chunksize rows.)
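For instance, a minimal sketch that sums a single column chunk by chunk, so that only one chunk ever lives in memory at a time. The column name 'rf' is taken from the asker's own code further down, and the sketch assumes the file has a header row naming that column:

import pandas as pd

chunksize = 10 ** 6
total = 0.0
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize):
    # Each chunk is an ordinary DataFrame of at most `chunksize` rows
    total += chunk['rf'].sum()
print(total)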


Chunking should not always be the first port of call for this problem, though:

  1. Is the file large due to repeated non-numeric data or unwanted columns?

    If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the usecols parameter of pd.read_csv (see the sketch after this list).

  2. Does your workflow require slicing, manipulating and exporting?

    If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

  3. If all else fails, read line by line via chunks.

    As a last resort, process the file chunk by chunk via pandas or the csv library.
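A minimal sketch of the first suggestion. Only 'lat', 'long' and 'rf' are taken from the asker's code, and the sketch assumes the file has a header row with those names (the asker's file may not); the 'station' column is made up purely for illustration:

import pandas as pd

# Read only the columns that are actually needed; a repetitive text column
# can additionally be stored as a category to shrink it further.
df = pd.read_csv(
    'aphro.csv', sep=';',
    usecols=['lat', 'long', 'rf'],
    # dtype={'station': 'category'}   # hypothetical column, for illustration only
)
print(df.memory_usage(deep=True).sum())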


I proceeded like this:

chunks=pd.read_table('aphro.csv',chunksize=1000000,sep=';',\
       names=['lat','long','rf','date','slno'],index_col='slno',\
       header=None,parse_dates=['date'])

df=pd.DataFrame()
%time df=pd.concat(chunk.groupby(['lat','long',chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
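One caveat worth flagging: concatenating per-chunk groupby results can leave duplicate group keys whenever a (lat, long, year) group is split across two chunks, so a final re-aggregation is usually needed. A minimal sketch of that extra step, applied to the concatenated frame above:

# Merge groups that were split across chunk boundaries;
# the three index levels are the groupby keys used above.
df = df.groupby(level=[0, 1, 2]).sum()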

For large data, I recommend using the library dask, e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
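A hedged follow-up sketch: dask only builds a lazy task graph, so a .compute() call is needed to actually materialise a result. The column names below reuse those from the asker's code and are otherwise an assumption about the schema:

import dask.dataframe as dd

# dask reads the file in chunks behind the scenes; .compute() triggers the work
df = dd.read_csv('aphro.csv', sep=';')
result = df.groupby('lat')['rf'].sum().compute()
print(result)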

The above answer already covers the topic. Anyway, if you do need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had a really good experience with it, but it is missing a lot of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.
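A rough sketch of the round trip, assuming bcolz's ctable.fromdataframe / todataframe interface; the column selection is illustrative and assumes the file has a header row:

import bcolz
import pandas as pd

df = pd.read_csv('aphro.csv', sep=';', usecols=['lat', 'long', 'rf'])

# Store the frame as a compressed in-memory ctable ...
ct = bcolz.ctable.fromdataframe(df)
print(ct.nbytes, ct.cbytes)   # uncompressed vs compressed size

# ... and convert back to pandas when needed.
df2 = ct.todataframe()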


You can read in the data as chunks and save each chunk as a pickle.

import pandas as pd 
import pickle

in_path = "" #Path where the large file is
out_path = "" #Path to save the pickle files to
chunk_size = 400000 #size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path,sep=separator,chunksize=chunk_size, 
                    low_memory=False)    


for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i+1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk,f,pickle.HIGHEST_PROTOCOL)

In the next step you read the pickles back in and concatenate them into your desired DataFrame.

import glob
pickle_path = "" #Same Path as out_path i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)


# pd.concat replaces the deprecated DataFrame.append-in-a-loop pattern
df = pd.concat([pd.read_pickle(p) for p in data_p_files], ignore_index=True)

The functions read_csv and read_table are almost the same, but you must assign the delimiter "," when you use read_table on a comma-separated file, because read_table defaults to a tab delimiter.
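A minimal illustration of that point (the file name is hypothetical):

import pandas as pd

# These two calls read the same comma-separated file identically:
df1 = pd.read_csv('data.csv')              # comma is the default delimiter
df2 = pd.read_table('data.csv', sep=',')   # read_table defaults to tab

The answer's own chunked reader then looks like this: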

def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            # Pull the next chunk and keep only the two columns of interest
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
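A usage sketch (the file name is hypothetical, and the function assumes the CSV really has user_id and type columns):

df_ac = get_from_action_data("actions.csv", chunk_size=100000)
print(df_ac.shape)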

Solution 1:

Using pandas with large data

Solution 2:

TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList,sort=False)

You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.
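A hedged sketch of what that looks like, assuming the standalone sframe package and its SFrame.read_csv entry point:

import sframe

# SFrame keeps the data on disk, so files larger than RAM are fine
sf = sframe.SFrame.read_csv('aphro.csv', delimiter=';')
print(sf.head())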


If you use pandas to read a large file in chunks and then yield it chunk by chunk, here is what I have done:

import pandas as pd

def chunk_generator(filename, chunk_size=10 ** 5):
    # Yield the file one DataFrame chunk at a time
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, chunk_size=10 ** 5):
    # Re-yield each chunk from the inner generator
    for chunk in chunk_generator(filename, chunk_size=chunk_size):
        yield chunk

if __name__ == "__main__":
    filename = r'file.csv'
    for chunk in _generator(filename):
        print(chunk)

Here follows an example:

chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>, iterator=True, low_memory=False):

    #REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns = {c: c.replace(' ', '') for c in chunk.columns})

    #YOU CAN EITHER: 
    #1)BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET 
    chunkTemp.append(chunk)

    #2)DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]   
    #BUFFERING PROCESSED DATA
    queryTemp.append(query)

#!  NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

#CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)

In addition to the answers above, for those who want to process a CSV and then export it to CSV, Parquet or SQL, d6tstack is another good option. You can load multiple files, and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.

import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # slow but flexible

In case someone is still looking for something like this, I found that a new library called modin can help. It uses distributed computing to speed up the read, and it essentially exposes the same functions as pandas. Here's a nice article comparing its functionality with pandas.

import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)

Reference URL: https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas
