Find files in git repo over x megabytes that don't exist in HEAD
I have a Git repository I store random things in. Mostly random scripts, text files, websites I've designed and so on.
There are some large binary files I have deleted over time (generally 1-5 MB) which are sitting around increasing the size of the repository and which I don't need in the revision history.
Basically I want to be able to do..
me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old
..then be able to go through each result, check whether it's no longer required, and then remove it (probably using filter-branch).
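For reference, removing such a path afterwards with filter-branch might look roughly like this (the path is simply the one from the sample output above, and the rewritten history still needs a reflog expire and gc before the repository actually shrinks):
$ git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch example/blah.psd' \
    --prune-empty -- --all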
This is an adaptation of the git-find-blob script I posted earlier:
#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;
sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }
@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();
my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp;
sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}
memoize 'walk_tree';
open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
or die "Couldn't open pipe to git-log: $!\n";
my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
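Going by the usage line at the top of the script, a hypothetical invocation (the script name is assumed) looks like this:
$ ./git-large-blob 1m          # blobs of 1 MB or more in the current branch's history
$ ./git-large-blob 500k --all  # any extra arguments are passed straight to git log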
A more compact Ruby script:
#!/usr/bin/env ruby -w
head, threshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
threshold = (threshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= threshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end
Usage:
ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M example/blah.psd (aad2981: 4 months ago)
1.1M another/big.file (6e73ca2: 2 weeks ago)
A Python script that does the same thing (based on this post):
#!/usr/bin/env python
import os, sys
def getOutput(cmd):
    return os.popen(cmd).read()

if len(sys.argv) != 2:
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])
    revisions = getOutput("git rev-list HEAD").split()
    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            splitdata = file.split(None, 4)
            commit = splitdata[2]
            if splitdata[3] == "-":
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if size > maxSize:
                bigfiles.add("%10d %s %s" % (size, commit, path))
    bigfiles = sorted(bigfiles, reverse=True)
    for f in bigfiles:
        print f
Ouch... that first script (by Aristotle) is pretty slow. Looking for files > 100k in the git.git repository, it chews the CPU for about six minutes.
It also appears to print several wrong SHAs: often a SHA is printed that has nothing to do with the filename mentioned on the next line.
Here is a faster version. Its output format is different, but it is very fast and, as far as I can tell, correct.
The program is a bit longer, but much of it is verbiage.
#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;
use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );
my $min = shift;
$min =~ /^\d+$/ or die "need a number";
# ----------------------------------------------------------------------
my @refs =qw(HEAD);
@refs = @ARGV if @ARGV;
# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";
my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);
# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";
my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}
my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;
say "
The size shown is the largest that file has ever attained. But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";
# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";
    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";
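A hypothetical invocation (the script name is an assumption), with the minimum blob size in bytes followed by the refs to inspect:
$ ./git-large-files.pl 524288 master next   # blobs of 512 KiB or more reachable from master or next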
You may want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch that is specifically designed for removing large files from Git repositories.
Download the BFG jar (requires Java 6 or above) and run this command:
$ java -jar bfg.jar --strip-blobs-bigger-than 1M my-repo.git
Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc
to clean away the dead data:
$ git gc --prune=now --aggressive
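For context, the BFG documentation suggests running it against a fresh mirror clone and expiring the reflog before the gc; a typical end-to-end pass looks roughly like this (the repository URL is just a placeholder):
$ git clone --mirror git://example.com/my-repo.git
$ java -jar bfg.jar --strip-blobs-bigger-than 1M my-repo.git
$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive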
The BFG is typically 10-50x faster than running git-filter-branch
and the options are tailored around these two common use-cases:
- Removing Crazy Big Files
- Removing Passwords, Credentials & other Private data
Full disclosure: I'm the author of the BFG Repo-Cleaner.
Aristotle's script will show you what you want. You also need to know that deleted files will still take up space in the repo.
By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:
$ git reflog expire --expire=1.minute refs/heads/master
# all deletions up to 1 minute ago available to be garbage-collected
$ git fsck --unreachable
# lists all the blobs(file contents) that will be garbage-collected
$ git prune
$ git gc
A side comment: While I am a big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes. And it doesn't track changes in binaries, so any revision stores another full copy in the repo.
Of course this use of Git is a perfectly good way to get familiar with how it works.
#!/bin/bash
if [ "$#" != 1 ]
then
echo 'git large.sh [size]'
exit
fi
declare -A big_files
big_files=()
echo printing results
while read commit
do
while read bits type sha size path
do
if [ "$size" -gt "$1" ]
then
big_files[$sha]="$sha $size $path"
fi
done < <(git ls-tree --abbrev -rl $commit)
done < <(git rev-list HEAD)
for file in "${big_files[@]}"
do
read sha size path <<< "$file"
if git ls-tree -r HEAD | grep -q $sha
then
echo $file
fi
done
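Invocation per the script's own usage message (saving the script as git-large.sh here), with the size threshold in bytes:
$ bash git-large.sh 1000000   # objects over 1000000 bytes that are still present in HEAD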
My python simplification of https://stackoverflow.com/a/10099633/131881
#!/usr/bin/env python
import os, sys
bigfiles = []

for revision in os.popen('git rev-list HEAD'):
    for f in os.popen('git ls-tree -zrl %s' % revision).read().split('\0'):
        if f:
            mode, type, commit, size, path = f.split(None, 4)
            if int(size) > int(sys.argv[1]):
                bigfiles.append((int(size), commit, path))

for f in sorted(set(bigfiles)):
    print f
This bash "one-liner" displays all blob objects in the repository that are larger than 10 MiB and are not present in HEAD
sorted from smallest to largest.
It's very fast, easy to copy & paste and only requires standard GNU utilities.
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v min_mb=10 '/^blob/ && $3 >= min_mb*2^20 {print substr($0,6)}' \
| grep -vF "$(git ls-tree -r HEAD | awk '{print $3}')" \
| sort --numeric-sort --key=2 \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
This will generate output like this:
2ba44098e28f 12MiB path/to/hires-image.png
bd1741ddce0d 63MiB path/to/some-video-1080p.mp4
For more information, including an output format more suitable for further script processing, see my original answer on a similar question.
A little late to the party, but git-fat has this functionality built in.
Just install it with pip and run git fat -a find 100000, where the number at the end is in bytes.
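A minimal session might look like this (assuming the package is published on PyPI as git-fat):
$ pip install git-fat
$ git fat -a find 100000   # report objects larger than 100000 bytes anywhere in the history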
ReferenceURL : https://stackoverflow.com/questions/298314/find-files-in-git-repo-over-x-megabytes-that-dont-exist-in-head