How to Solve a pandas Memory Error

Many data analysis tasks are still performed on a laptop. This speeds up the analysis, as you have your familiar work environment prepared with all of the tools. But chances are your laptop is not “the latest beast” with x-GB of main memory.

Then a MemoryError surprises you! What should you do? Use Dask? You have never worked with it, and these tools usually have some quirks. Should you ask for a Spark cluster? Or is Spark an exaggerated choice at this point?

Before you think about using another tool, ask yourself the following question.

Do I need all rows and columns for the analysis?

pandas throws a MemoryError when reading the dataset

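For context, a minimal sketch of the kind of call that triggers it (dataset.csv is a placeholder file name):

import pandas as pd

# A CSV bigger than the available RAM fails with MemoryError here
df = pd.read_csv('dataset.csv')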

In case you don’t need all rows, you can read the dataset in chunks and filter out the unnecessary rows to reduce memory usage:

import pandas as pd

# 'constant' is a placeholder threshold for the row filter
iter_csv = pd.read_csv('dataset.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

Reading a dataset in chunks is slower than reading it all at once. I would recommend using this approach only with bigger-than-memory datasets.

In case you don’t need all columns, you can specify the required columns with the “usecols” argument when reading the dataset:

df = pd.read_csv('dataset.csv', usecols=['col1', 'col2'])

The great thing about these two approaches is that you can combine them.

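For illustration, a minimal sketch of the combination (reusing the placeholder column names and threshold from the snippets above):

import pandas as pd

# Read only the needed columns, in chunks, and keep only matching rows
iter_csv = pd.read_csv('dataset.csv', usecols=['field', 'col1'],
                       iterator=True, chunksize=1000)
df = pd.concat(chunk[chunk['field'] > constant] for chunk in iter_csv)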

But I need all the columns and rows for the analysis!

Then you should try Vaex!

Vaex is a high-performance Python library for lazy out-of-core DataFrames (similar to pandas) that lets you visualize and explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second. It supports multiple visualizations, allowing interactive exploration of big data.

Installing Vaex is as simple as installing any other Python package:

pip install vaex

By reading the whole CSV directly with Vaex you wouldn’t gain much, as the speed would be similar to pandas. You need to convert the CSV to HDF5 (the Hierarchical Data Format version 5) to see the benefit of Vaex.

Vaex has a function for the conversion, which even supports files bigger than main memory by converting them in smaller chunks.

import vaex

file_path = 'big_file.csv'
# convert=True writes HDF5 files chunk by chunk, so the CSV may be
# bigger than the available memory
dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000)

Next time you start the analysis, you can simply open the HDF5 files, which Vaex memory-maps instead of loading into memory.

dv = vaex.open('*.hdf5')

A few basic operations with Vaex

Display head:

dv.head()

Calculate the 10th percentile:

Note that Vaex has a percentile_approx function, which calculates an approximation of the percentile.

quantile = dv.percentile_approx('col1', 10)

Add a new column:

Vaex has a concept of virtual columns, which store an expression as a column. A virtual column does not take up any memory and is computed on the fly when needed. It is treated just like a normal column.

dv['col1_binary'] = dv.col1 > dv.percentile_approx('col1', 10)

Filter data:

Vaex has a concept of selections. The filter below is similar to filtering with pandas, except that Vaex does not copy the data.

dv = dv[dv.col2 > 10]
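
If you prefer to make the selection explicit, here is a minimal sketch of the selection API (assuming the same dv DataFrame; check the exact calls against the Vaex docs):

# Register a selection instead of materializing a filtered copy
dv.select(dv.col2 > 10)
# Aggregations can then be restricted to the active selection
mean_col1 = dv.mean(dv.col1, selection=True)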

Grouping and aggregating data:

The command below is slightly different from pandas, as it combines grouping and aggregation.

It groups the data by col1 and calculates the mean of col3 (instead of mean you can use sum, max, min...):

group_res = dv.groupby(by=dv.col1, agg={'col3_mean': vaex.agg.mean('col3')})

Visualize the histogram:

Visualization of bigger datasets is problematic, as traditional data analysis tools are not optimized to handle them.

Vaex can also visualize the data:

plot = dv.plot1d(dv.col3, what='count(*)', limits=[0, 100])

Vaex also fails to process my dataset!

OK, your dataset really seems too big to be processed on your laptop. Then it's time to ask for a server with more memory, or a Dask/Hadoop/Spark cluster.

Before you go

These are a few links that might interest you:

- Your First Machine Learning Model in the Cloud
- AI for Healthcare
- Parallels Desktop 50% off
- School of Autonomous Systems
- Data Science Nanodegree Program
- 5 lesser-known pandas tricks
- How NOT to write pandas code

Translated from: https://towardsdatascience.com/how-to-solve-pandas-memory-error-7a9d27214f7c
