AI for Academia | Solving Technical Problems: Claude vs. ChatGPT, Which Is Better?

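Of the two AI tools Claude and ChatGPT, which is actually better for academic work?

This follows on from the previous article, "AI for Academia | Writing a Research Proposal: Claude vs. ChatGPT, Which Is Better?"

I happened to run into a technical problem that involves both theory and coding practice, so let's see how each of them handles a concrete technical question.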

1. The Problem (Setting the Trap)

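Here is the scenario: we are preprocessing a dataset (Data Preprocessing) and modeling it with Machine Learning, with the goal of producing a high-accuracy classification model (Classification).

To get a more lightweight model, we need to reduce the dimensionality of the dataset. One way to do that is Feature Selection, and one common feature selection method is to use the Pearson correlation to pick the features that are highly correlated with the target class.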
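To make the idea concrete, here is a minimal sketch of Pearson-based selection against the target. It is an illustration under stated assumptions, not the code from my project: the toy data, the feature names, the helper `select_by_target_correlation`, and the threshold of 0.5 are all made up, and the Pearson coefficient is written out by hand so the example needs nothing beyond the standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_by_target_correlation(features, target, threshold):
    """Keep the features whose |r| with the target exceeds the threshold."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return [name for name, score in scores.items() if score > threshold]


# Hypothetical toy data: a binary target and three candidate features.
target = [0, 0, 0, 1, 1, 1]
features = {
    "f_signal": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],   # rises with the target
    "f_inverse": [5.0, 4.8, 5.1, 1.0, 1.2, 0.9],  # falls as the target rises
    "f_noise": [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],    # unrelated to the target
}

selected = select_by_target_correlation(features, target, threshold=0.5)
print(selected)
```

The absolute value matters here: a strongly negative correlation is just as informative as a strongly positive one, so `f_inverse` should survive the filter alongside `f_signal`.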

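So here comes the question (and the trap): suppose I have only considered the correlation among the features and ignored the correlation between the features and the target class. Let's see how Claude and ChatGPT respond.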

2. The Theory Conversation

We put the same question to both Claude and ChatGPT and compare their answers: “I’ve just implemented the correlation-based feature selection technique, which only considers the correlation between features but not the correlation between features and target values. I’m not sure if this method makes sense.”

Claude

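It said outright that considering only the correlation among features while ignoring their correlation with the target class is not a good choice.

Claude then went on to explain why the correlation with the target class should be taken into account.

It also pointed out that linear relationships alone are not enough and that nonlinear relationships need to be considered too, and it recommended some specific techniques, such as mutual information.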
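Why nonlinear relationships matter is easy to see with a small experiment. The sketch below is my own illustration, not Claude's output: it assumes a simple histogram estimate of mutual information with 5 equal-width bins (the bin count and the toy data are arbitrary choices), and compares it with the Pearson coefficient on a feature that determines the target exactly but not linearly. In practice a library routine such as scikit-learn's `mutual_info_classif` would be the more usual choice.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mutual_information(xs, ys, bins=5):
    """Histogram estimate of mutual information (in nats) between two sequences."""

    def discretize(values):
        # Map each value to one of `bins` equal-width bins.
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # guard against a constant column
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(xs), discretize(ys)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # I(X;Y) = sum over observed cells of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (px[i] * py[j])) for (i, j), c in joint.items()
    )


# Hypothetical toy data: y depends on x exactly, but not linearly.
xs = [i / 100 for i in range(-100, 101)]
ys = [x * x for x in xs]

r = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} nats")
```

Because the data is symmetric around zero, the Pearson coefficient comes out at essentially zero even though `ys` is fully determined by `xs`, while the mutual information estimate is clearly positive. A Pearson-only filter would throw this feature away.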

ChatGPT

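Likewise, ChatGPT also held that the correlation with the target class should be considered. However, it did not recommend any other, more specific techniques.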

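Judging from these answers, both Claude and ChatGPT can pinpoint where the problem lies: only A was considered and B was not; B is important, so an A+B strategy should be adopted.

(A denotes the correlation among the features, and B denotes the correlation between each feature and the target class. For convenience, we use A and B from here on.)

The difference is that Claude recommends more techniques, such as nonlinear alternatives, whereas ChatGPT stays focused on the Pearson correlation technique at hand.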

3. On to the Code

We then put the same follow-up to both Claude and ChatGPT and compare their answers: “OK, so would you please write code to implement it?”

Claude

As Claude's code shows, it took both A and B into account and produced the final feature list.

ChatGPT

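I suspect you will spot the problem as soon as you see the code below. That's right: ChatGPT still considered only one side. It looked only at B, the correlation between the features and the target class, and left out A, the correlation among the features.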

# Select features based on the correlation threshold

selected_features = correlation_with_target[abs(correlation_with_target) > correlation_threshold].index

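In case this was just an accidental slip on ChatGPT's part, we followed up: “Sorry, it seems you only considered the correlation with target class, where is the correlation between the features?”

ChatGPT was honest enough: it apologized and went on to revise the code above.

But look closely: is there still a problem?

There is! It added a piece for A (the correlation among the features), as follows: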

# Calculate the correlation matrix (features vs. features)

correlation_matrix = X_train.corr()

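But the features it finally selects are still exactly the same as before! The correlation matrix is computed and then never used: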

selected_features = correlation_with_target[abs(correlation_with_target) > correlation_threshold].index

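Giving ChatGPT the benefit of the doubt once more, we followed up again: “Sorry, the features fed into machine learning, do not consider the correlation values within feature, can you modify the code to consider it?”

So where is the problem this time?

Indeed, A (the correlation among the features) still has not been factored into the final feature list that is fed to the machine learning model!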

We kept asking: “Sorry, you still don’t understand my words, I mean can you select the features that consider both correlation within features and correlation with target class?”

This time A (the correlation among the features) finally made it into the final list.

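But what we actually need is to maximize B (the correlation with the target class) while minimizing A (the correlation among the features): the more strongly a feature correlates with the other features, the more likely it is redundant and can be dropped.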
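For reference, one simple way to express "strongly correlated with the target, weakly correlated with each other" is a greedy filter. This is a sketch of my own under stated assumptions, not the code either assistant produced: the function name `select_relevant_nonredundant`, the two thresholds (0.5 and 0.9), and the toy data are all hypothetical, and other strategies for combining the two criteria are equally valid.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_relevant_nonredundant(features, target,
                                 min_target_corr=0.5, max_feature_corr=0.9):
    """Greedy filter: relevant to the target, not redundant with kept features."""
    # Relevance: |r| between each feature and the target.
    relevance = {name: abs(pearson(vals, target)) for name, vals in features.items()}
    # Visit sufficiently relevant features from most to least relevant.
    ranked = sorted(
        (name for name in features if relevance[name] > min_target_corr),
        key=lambda name: relevance[name],
        reverse=True,
    )
    selected = []
    for name in ranked:
        # Redundancy: skip a feature that nearly duplicates one already kept.
        if all(abs(pearson(features[name], features[kept])) < max_feature_corr
               for kept in selected):
            selected.append(name)
    return selected


# Hypothetical toy data: a binary target and four candidate features.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f_strong": [1, 2, 1, 2, 4, 5, 4, 5],      # highly relevant
    "f_near_copy": [1, 2, 1, 3, 4, 5, 4, 5],   # relevant, but almost f_strong again
    "f_complement": [3, 0, 3, 0, 5, 2, 5, 2],  # moderately relevant, not redundant
    "f_noise": [2, 3, 2, 3, 2, 3, 2, 3],       # unrelated to the target
}

selected = select_relevant_nonredundant(features, target)
print(selected)
```

The key point is that the feature-to-feature correlations actually change what ends up in `selected`: `f_near_copy` passes the relevance check but is rejected because it nearly duplicates `f_strong`, which is exactly the step ChatGPT's earlier versions computed and then ignored.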

So we followed up once more: “However, the best feature sets should be the features that have high correlation with target class, and low correlation with each other, no?” (You can tell from the question that my patience was about to run out...)